RE: [RFC] /dev/ioasid uAPI proposal

2021-10-27 Thread Tian, Kevin
> From: Paolo Bonzini 
> Sent: Wednesday, October 27, 2021 6:32 PM
> 
> On 27/10/21 08:18, Tian, Kevin wrote:
> >> I absolutely do *not* want an API that tells KVM to enable WBINVD.  This
> >> is not up for discussion.
> >>
> >> But really, let's stop calling the file descriptor a security proof or a
> >> capability.  It's overkill; all that we are doing here is kernel
> >> acceleration of the WBINVD ioctl.
> >>
> >> As a thought experiment, let's consider what would happen if wbinvd
> >> caused an unconditional exit from guest to userspace.  Userspace would
> >> react by invoking the ioctl on the ioasid.  The proposed functionality
> >> is just an acceleration of this same thing, avoiding the
> >> guest->KVM->userspace->IOASID->wbinvd trip.
> >
> > While the concept here makes sense, in reality implementing a wbinvd
> > ioctl for userspace would require iommufd (the previous /dev/ioasid is
> > now renamed to /dev/iommu) to track the dirty CPUs that a given process
> > has been running on, since wbinvd only flushes the local cache.
> >
> > Is it OK to omit the actual wbinvd ioctl here and just leverage the
> > vfio-kvm contract to manage whether guest wbinvd is emulated as a no-op?
> 
> Yes, it'd be okay for me.  As I wrote in the message, the concept of a
> wbinvd ioctl is mostly important as a thought experiment for what is
> security sensitive and what is not.  If a wbinvd ioctl would not be
> privileged on the iommufd, then WBINVD is not considered privileged in a
> guest either.
> 
> That does not imply a requirement to implement the wbinvd ioctl, though.
> Of course, non-KVM usage of iommufd systems/devices with non-coherent
> DMA would be less useful; but that's already the case for VFIO.

Thanks for confirming it!

> 
> > btw does the KVM community set a strict criterion that any operation
> > the guest can do must first be carried in host uAPI? Conceptually KVM
> > deals with the ISA level, covering both guest kernel and guest user,
> > while host uAPI is only for the host user. Introducing new uAPIs to
> > allow the host user to do whatever the guest kernel can do sounds
> > ideal, but is not strictly necessary imho.
> 
> I agree; however, it's the right mindset in my opinion because
> virtualization (in a perfect world) should not be a way to give
> processes privilege to do something that they cannot do.  If it does,
> it's usually a good idea to ask yourself "should this functionality be
> accessible outside KVM too?".
> 

Agree. It's always good to keep such a mindset in practice.

Thanks
Kevin


Re: [RFC] /dev/ioasid uAPI proposal

2021-10-27 Thread Paolo Bonzini

On 27/10/21 08:18, Tian, Kevin wrote:

>> I absolutely do *not* want an API that tells KVM to enable WBINVD.  This
>> is not up for discussion.
>>
>> But really, let's stop calling the file descriptor a security proof or a
>> capability.  It's overkill; all that we are doing here is kernel
>> acceleration of the WBINVD ioctl.
>>
>> As a thought experiment, let's consider what would happen if wbinvd
>> caused an unconditional exit from guest to userspace.  Userspace would
>> react by invoking the ioctl on the ioasid.  The proposed functionality
>> is just an acceleration of this same thing, avoiding the
>> guest->KVM->userspace->IOASID->wbinvd trip.
>
> While the concept here makes sense, in reality implementing a wbinvd
> ioctl for userspace would require iommufd (the previous /dev/ioasid is
> now renamed to /dev/iommu) to track the dirty CPUs that a given process
> has been running on, since wbinvd only flushes the local cache.
>
> Is it OK to omit the actual wbinvd ioctl here and just leverage the
> vfio-kvm contract to manage whether guest wbinvd is emulated as a no-op?


Yes, it'd be okay for me.  As I wrote in the message, the concept of a 
wbinvd ioctl is mostly important as a thought experiment for what is 
security sensitive and what is not.  If a wbinvd ioctl would not be 
privileged on the iommufd, then WBINVD is not considered privileged in a 
guest either.


That does not imply a requirement to implement the wbinvd ioctl, though. 
Of course, non-KVM usage of iommufd systems/devices with non-coherent 
DMA would be less useful; but that's already the case for VFIO.



> btw does the KVM community set a strict criterion that any operation
> the guest can do must first be carried in host uAPI? Conceptually KVM
> deals with the ISA level, covering both guest kernel and guest user,
> while host uAPI is only for the host user. Introducing new uAPIs to
> allow the host user to do whatever the guest kernel can do sounds
> ideal, but is not strictly necessary imho.


I agree; however, it's the right mindset in my opinion because 
virtualization (in a perfect world) should not be a way to give 
processes privilege to do something that they cannot do.  If it does, 
it's usually a good idea to ask yourself "should this functionality be 
accessible outside KVM too?".


Thanks,

Paolo



RE: [RFC] /dev/ioasid uAPI proposal

2021-10-27 Thread Tian, Kevin
Hi, Paolo,

> From: Paolo Bonzini 
> Sent: Wednesday, June 9, 2021 11:21 PM
> 
> On 09/06/21 16:45, Jason Gunthorpe wrote:
> > On Wed, Jun 09, 2021 at 08:31:34AM -0600, Alex Williamson wrote:
> >
> >> If we go back to the wbinvd ioctl mechanism, if I call that ioctl with
> >> an ioasidfd that contains no devices, then I shouldn't be able to
> >> generate a wbinvd on the processor, right?  If I add a device,
> >> especially in a configuration that can generate non-coherent DMA, now
> >> that ioctl should work.  If I then remove all devices from that ioasid,
> >> what then is the difference from the initial state.  Should the ioctl
> >> now work because it worked once in the past?
> >
> > The ioctl is fine, but telling KVM to enable WBINVD is very similar to
> > open, and then reconfiguring the ioasid_fd is very similar to chmod.
> > From a security perspective revoke is not strictly required, IMHO.
> 
> I absolutely do *not* want an API that tells KVM to enable WBINVD.  This
> is not up for discussion.
> 
> But really, let's stop calling the file descriptor a security proof or a
> capability.  It's overkill; all that we are doing here is kernel
> acceleration of the WBINVD ioctl.
> 
> As a thought experiment, let's consider what would happen if wbinvd
> caused an unconditional exit from guest to userspace.  Userspace would
> react by invoking the ioctl on the ioasid.  The proposed functionality
> is just an acceleration of this same thing, avoiding the
> guest->KVM->userspace->IOASID->wbinvd trip.

While the concept here makes sense, in reality implementing a wbinvd
ioctl for userspace would require iommufd (the previous /dev/ioasid is
now renamed to /dev/iommu) to track the dirty CPUs that a given process
has been running on, since wbinvd only flushes the local cache. KVM
tracks dirty CPUs by registering a preempt notifier on the current vCPU.
iommufd could do the same thing for the thread which opens /dev/iommu,
but per the discussion below one open question is whether it's worthwhile
adding such hassle for something which no real user except KVM is
interested in today:

https://lore.kernel.org/lkml/20211025233459.gm2744...@nvidia.com/
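To illustrate the hassle, a rough sketch of such tracking (everything
below is made up for illustration, nothing here is proposed uAPI):

/*
 * Sketch only: per-open cache-dirty CPU tracking, borrowing KVM's
 * preempt notifier trick.  Allocation of dirty_cpus and registration of
 * the notifier would happen when /dev/iommu is opened.
 */
#include <linux/preempt.h>
#include <linux/smp.h>
#include <linux/cpumask.h>
#include <linux/kernel.h>
#include <asm/special_insns.h>

struct iommufd_wbinvd_sketch {
	struct preempt_notifier pn;
	cpumask_var_t dirty_cpus;	/* CPUs this thread has run on */
};

/* called whenever the opening thread is scheduled in on a CPU */
static void iommufd_sketch_sched_in(struct preempt_notifier *pn, int cpu)
{
	struct iommufd_wbinvd_sketch *s =
		container_of(pn, struct iommufd_wbinvd_sketch, pn);

	cpumask_set_cpu(cpu, s->dirty_cpus);
}

static void iommufd_sketch_do_wbinvd(void *unused)
{
	wbinvd();	/* only flushes the local CPU's caches */
}

/* what a hypothetical IOMMU_WBINVD ioctl would boil down to */
static void iommufd_sketch_flush(struct iommufd_wbinvd_sketch *s)
{
	on_each_cpu_mask(s->dirty_cpus, iommufd_sketch_do_wbinvd, NULL, true);
	cpumask_clear(s->dirty_cpus);
}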

Is it OK to omit the actual wbinvd ioctl here and just leverage the
vfio-kvm contract to manage whether guest wbinvd is emulated as a no-op?
It is still iommufd which decides whether wbinvd is allowed (based on
IOAS and device attach information), just special-casing that the
operation can only be done via the KVM path...

btw does the KVM community set a strict criterion that any operation
the guest can do must first be carried in host uAPI? Conceptually KVM
deals with the ISA level, covering both guest kernel and guest user,
while host uAPI is only for the host user. Introducing new uAPIs to
allow the host user to do whatever the guest kernel can do sounds
ideal, but is not strictly necessary imho.

> 
> This is why the API that I want, and that is already exists for VFIO
> group file descriptors, informs KVM of which "ioctls" the guest should
> be able to do via privileged instructions[1].  Then the kernel works out
> with KVM how to ensure a 1:1 correspondence between the operation of the
> ioctls and the privileged operations.
> 
> One way to do it would be to always trap WBINVD and invoke the same
> kernel function that implements the ioctl.  The function would do either
> a wbinvd or nothing, based on whether the ioasid has any device.  The
> next logical step is a notification mechanism that enables WBINVD (by
> disabling the WBINVD intercept) when there are devices in the ioasidfd,
> and disables WBINVD (by enabling a no-op intercept) when there are none.
> 
> And in fact once all VFIO devices are gone, wbinvd is for all purposes a
> no-op as far as the guest kernel can tell.  So there's no reason to
> treat it as anything but a no-op.
> 
> Thanks,
> 
> Paolo
> 
> [1] As an aside, I must admit I didn't entirely understand the design of
> the KVM-VFIO device back when Alex added it.  But with this model it was
> absolutely the right thing to do, and it remains the right thing to do
> even if VFIO groups are replaced with IOASID file descriptors.



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-30 Thread Christoph Hellwig
On Mon, Jun 07, 2021 at 11:14:24AM -0300, Jason Gunthorpe wrote:
> "non-coherent DMA" is some general euphemism that evokes images of
> embedded platforms that don't have coherent DMA at all and have low
> cost ways to regain coherence. This is not at all what we are talking
> about here.

It literally is the same way of working.  And not just low-end embedded
platforms use this, but a lot of older server platforms did as well.


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-30 Thread Christoph Hellwig
On Fri, Jun 04, 2021 at 08:58:05AM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 04, 2021 at 09:11:03AM +0800, Jason Wang wrote:
> > > nor do any virtio drivers implement the required platform specific
> > > cache flushing to make no-snoop TLPs work.
> > 
> > I don't get why virtio drivers needs to do that. I think DMA API should hide
> > those arch/platform specific stuffs from us.
> 
> It is not arch/platform stuff. If the device uses no-snoop then a
> very platform specific recovery is required in the device driver.

Well, the proper way to support NO_SNOOP DMA would be to force the
DMA API into supporting the device as if the platform was not DMA
coherent, probably on a per-call basis.  It is just that no one bothered
to actually do the work and people just kept piling hacks over hacks.
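Roughly what a per-call opt-in could look like, just to illustrate the
idea (DMA_ATTR_NO_SNOOP below is made up; nothing like it exists today):

/*
 * Sketch: force the non-coherent DMA path for an individual mapping on
 * an otherwise coherent platform.  DMA_ATTR_NO_SNOOP is a hypothetical
 * attribute, shown only to illustrate the per-call idea.
 */
#include <linux/dma-mapping.h>
#include <linux/bits.h>

#define DMA_ATTR_NO_SNOOP	BIT(15)		/* hypothetical, not upstream */

static dma_addr_t map_no_snoop_buffer(struct device *dev, struct page *page,
				      size_t size)
{
	/*
	 * With such an attribute the DMA core would do the cache
	 * writeback/invalidate dance on map/sync/unmap even though the
	 * platform is nominally coherent, so no-snoop TLPs stay correct
	 * without driver-private wbinvd hacks.
	 */
	return dma_map_page_attrs(dev, page, 0, size, DMA_TO_DEVICE,
				  DMA_ATTR_NO_SNOOP);
}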


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-30 Thread Christoph Hellwig
On Wed, Jun 09, 2021 at 09:47:42AM -0300, Jason Gunthorpe wrote:
> I can vaguely understand this rationale for vfio, but not at all for
> the platform's iommu driver, sorry.

Agreed.  More importantly the dependency is not for the platform iommu
driver but just for the core iommu code, which is always built in if
enabled.


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-30 Thread Christoph Hellwig
On Mon, Jun 07, 2021 at 03:25:32AM +, Tian, Kevin wrote:
> 
> Possibly just a naming thing, but I feel it's better to just talk about
> no-snoop or non-coherent in the uAPI. Per Intel SDM wbinvd is a
> privileged instruction. A process on the host has no privilege to 
> execute it. Only when this process holds a VM does this instruction matter,
> as there are guest privilege levels. But having a VFIO uAPI (which is
> userspace oriented) to explicitly deal with a CPU instruction which
> makes sense only in a virtualization context sounds a bit weird...

More importantly the Intel instructions here are super weird.
Pretty much every other architecture just has plain old cache
writeback/invalidate/writeback+invalidate instructions without all these
weird implications.


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-30 Thread Christoph Hellwig
On Tue, Jun 08, 2021 at 09:20:29AM +0800, Jason Wang wrote:
> "
> 
> 6.2.17 _CCA (Cache Coherency Attribute) The _CCA object returns whether or
> not a bus-master device supports hardware managed cache coherency. Expected
> values are 0 to indicate it is not supported, and 1 to indicate that it is
> supported. All other values are reserved.
> 
> ...
> 
> On Intel platforms, if the _CCA object is not supplied, the OSPM will assume
> the devices are hardware cache coherent.
> 
> "

_CCA is mostly used on arm/arm64 platforms to figure out if a device
needs non-coherent DMA handling in the DMA API or not.  It is not
related to the NoSnoop TLPs that override the setting for an otherwise
coherent device.
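For reference, the _CCA-derived answer is what the driver core already
exposes through device_get_dma_attr(); a bus layer would consume it
roughly like this (sketch):

/*
 * Sketch: how firmware-described coherency (_CCA on ACPI, "dma-coherent"
 * in DT) feeds the DMA API decision.  This says nothing about NoSnoop
 * TLPs, which can still override coherency per transaction.
 */
#include <linux/device.h>
#include <linux/property.h>

static bool needs_noncoherent_dma_ops(struct device *dev)
{
	switch (device_get_dma_attr(dev)) {
	case DEV_DMA_NON_COHERENT:	/* _CCA = 0 */
		return true;
	case DEV_DMA_COHERENT:		/* _CCA = 1, or absent on x86 */
		return false;
	default:			/* DEV_DMA_NOT_SUPPORTED */
		return true;		/* be conservative */
	}
}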


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-30 Thread Christoph Hellwig
On Mon, Jun 07, 2021 at 04:08:02PM -0300, Jason Gunthorpe wrote:
> Compatibility is important, but when I look in the kernel code I see
> very few places that call wbinvd(). Basically all DRM for something
> relevant to qemu.
> 
> That tells me that the vast majority of PCI devices do not generate
> no-snoop traffic.

Part of it is that we have no general API for it, because the DRM folks
as usual just started piling up local hacks instead of introducing
a proper API..


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-23 Thread David Gibson
On Wed, Jun 23, 2021 at 07:59:21AM +, Tian, Kevin wrote:
> > From: David Gibson 
> > Sent: Thursday, June 17, 2021 11:48 AM
> > 
> > On Tue, Jun 08, 2021 at 10:17:56AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Jun 08, 2021 at 12:37:04PM +1000, David Gibson wrote:
> > >
> > > > > The PPC/SPAPR support allows KVM to associate a vfio group to an
> > IOMMU
> > > > > page table so that it can handle iotlb programming from pre-registered
> > > > > memory without trapping out to userspace.
> > > >
> > > > To clarify that's a guest side logical vIOMMU page table which is
> > > > partially managed by KVM.  This is an optimization - things can work
> > > > without it, but it means guest iomap/unmap becomes a hot path because
> > > > each map/unmap hypercall has to go
> > > > guest -> KVM -> qemu -> VFIO
> > > >
> > > > So there are multiple context transitions.
> > >
> > > Isn't this overhead true of many of the vIOMMUs?
> > 
> > Yes, but historically it bit much harder on POWER for a couple of reasons:
> > 
> > 1) POWER guests *always* have a vIOMMU - the platform has no concept
> >of passthrough mode.  We therefore had a vIOMMU implementation some
> >time before the AMD or Intel IOMMUs were implemented as vIOMMUs in
> >qemu.
> > 
> > 2) At the time we were implementing this the supported IOVA window for
> >the paravirtualized IOMMU was pretty small (1G, I think) making
> >vIOMMU maps and unmaps a pretty common operation.
> > 
> > > Can the fast path be
> > > generalized?
> > 
> > Not really.  This is a paravirtualized guest IOMMU, so it's a platform
> > specific group of hypercalls that's being interpreted by KVM and
> > passed through to the IOMMU side using essentially the same backend
> > that the userspace implementation would eventually get to after a
> > bunch more context switches.
> > 
> 
> Can virtio-iommu work on PPC? iirc Jean has a plan to implement
> a vhost-iommu which is supposed to implement the similar in-kernel
> acceleration...

I don't know - I'd have to research virtio-iommu a bunch to determine
that.

Even if we can, the platform IOMMU would still be there (it's a
platform requirement), so we couldn't completely ignore it.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-23 Thread David Gibson
On Fri, Jun 18, 2021 at 07:03:31PM +0200, Jean-Philippe Brucker wrote:
> On Thu, Jun 17, 2021 at 01:00:14PM +1000, David Gibson wrote:
> > On Thu, Jun 10, 2021 at 06:37:31PM +0200, Jean-Philippe Brucker wrote:
> > > On Tue, Jun 08, 2021 at 04:31:50PM +1000, David Gibson wrote:
> > > > For the qemu case, I would imagine a two stage fallback:
> > > > 
> > > > 1) Ask for the exact IOMMU capabilities (including pagetable
> > > >format) that the vIOMMU has.  If the host can supply, you're
> > > >good
> > > > 
> > > > 2) If not, ask for a kernel managed IOAS.  Verify that it can map
> > > >all the IOVA ranges the guest vIOMMU needs, and has an equal or
> > > >smaller pagesize than the guest vIOMMU presents.  If so,
> > > >software emulate the vIOMMU by shadowing guest io pagetable
> > > >updates into the kernel managed IOAS.
> > > > 
> > > > 3) You're out of luck, don't start.
> > > > 
> > > > For both (1) and (2) I'd expect it to be asking this question *after*
> > > > saying what devices are attached to the IOAS, based on the virtual
> > > > hardware configuration.  That doesn't cover hotplug, of course, for
> > > > that you have to just fail the hotplug if the new device isn't
> > > > supportable with the IOAS you already have.
> > > 
> > > Yes. So there is a point in time when the IOAS is frozen, and cannot take
> > > in new incompatible devices. I think that can support the usage I had in
> > > mind. If the VMM (non-QEMU, let's say) wanted to create one IOASID FD per
> > > feature set it could bind the first device, freeze the features, then bind
> > 
> > Are you thinking of this "freeze the features" as an explicitly
> > triggered action?  I have suggested that an explicit "ENABLE" step
> > might be useful, but that hasn't had much traction from what I've
> > seen.
> 
> Seems like we do need an explicit enable step for the flow you described
> above:
> 
> a) Bind all devices to an ioasid. Each bind succeeds.
> b) Ask for a specific set of features for this aggregate of device. Ask
>for (1), fall back to (2), or abort.
> c) Boot the VM
> d) Hotplug a device, bind it to the ioasid. We're long past negotiating
>features for the ioasid, so the host needs to reject the bind if the
>new device is incompatible with what was requested at (b)
> 
> So a successful request at (b) would be the point where we change the
> behavior of bind.
> 
> Since the kernel needs a form of feature check in any case, I still have a
> preference for aborting the bind at (a) if the device isn't exactly
> compatible with other devices already in the ioasid, because it might be
> simpler to implement in the host, but I don't feel strongly about this.
> 
> 
> > > I'd like to understand better where the difficulty lies, with migration.
> > > Is the problem, once we have a guest running on physical machine A, to
> > > make sure that physical machine B supports the same IOMMU properties
> > > before migrating the VM over to B?  Why can't QEMU (instead of the user)
> > > select a feature set on machine A, then when time comes to migrate, query
> > > all information from the host kernel on machine B and check that it
> > > matches what was picked for machine A?  Or is it only trying to
> > > accommodate different sets of features between A and B, that would be too
> > > difficult?
> > 
> > There are two problems
> > 
> > 1) Although it could be done in theory, it's hard, and it would need a
> > huge rewrite to qemu's whole migration infrastructure to do this.
> > We'd need a way of representing host features, working out which sets
> > are compatible with which others depending on what things the guest is
> > allowed to use, encoding the information in the migration stream and
> > reporting failure.  None of this exists now.
> > 
> > Indeed qemu requires that you create the (stopped) machine on the
> > destination (including virtual hardware configuration) before even
> > attempting to process the incoming migration.  It does not for the
> > most part transfer the machine configuration in the migration stream.
> > Now, that's generally considered a flaw with the design, but fixing it
> > is a huge project that no-one's really had the energy to begin despite
> > the idea being around for years.
> > 
> > 2) It makes behaviour really hard to predict for management layers
> > above.  Things like oVirt automatically migrate around a cluster for
> > load balancing.  At the moment the model which works is basically that
> > you if you request the same guest features on each end of the
> > migration, and qemu starts with that configuration on each end, the
> > migration should work (or only fail for transient reasons).  If you
> > can't know if the migration is possible until you get the incoming
> > stream, reporting and exposing what will and won't work to the layer
> > above also becomes an immensely fiddly problem.
> 
> That was really useful, thanks. One thing I'm 

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-23 Thread David Gibson
On Wed, Jun 23, 2021 at 08:00:49AM +, Tian, Kevin wrote:
> > From: David Gibson
> > Sent: Thursday, June 17, 2021 12:08 PM
> > 
> > On Thu, Jun 03, 2021 at 08:12:27AM +, Tian, Kevin wrote:
> > > > From: David Gibson 
> > > > Sent: Wednesday, June 2, 2021 2:15 PM
> > > >
> > > [...]
> > >
> > > > >
> > > > > /*
> > > > >   * Get information about an I/O address space
> > > > >   *
> > > > >   * Supported capabilities:
> > > > >   *   - VFIO type1 map/unmap;
> > > > >   *   - pgtable/pasid_table binding
> > > > >   *   - hardware nesting vs. software nesting;
> > > > >   *   - ...
> > > > >   *
> > > > >   * Related attributes:
> > > > >   *   - supported page sizes, reserved IOVA ranges (DMA mapping);
> > > >
> > > > Can I request we represent this in terms of permitted IOVA ranges,
> > > > rather than reserved IOVA ranges.  This works better with the "window"
> > > > model I have in mind for unifying the restrictions of the POWER IOMMU
> > > > with Type1 like mapping.
> > >
> > > Can you elaborate how permitted range work better here?
> > 
> > Pretty much just that MAP operations would fail if they don't entirely
> > lie within a permitted range.  So, for example if your IOMMU only
> > implements say, 45 bits of IOVA, then you'd have 0..0x1fff as
> > your only permitted range.  If, like the POWER paravirtual IOMMU (in
> > default configuration) you have a small (1G) 32-bit range and a large
> > (45-bit) 64-bit range at a high address, you'd have say:
> > 0x..0x3fff (32-bit range)
> > and
> > 0x800 .. 0x8001fff (64-bit range)
> > as your permitted ranges.
> > 
> > If your IOMMU supports truly full 64-bit addressing, but has a
> > reserved range (for MSIs or whatever) at 0x000..0x then
> > you'd have permitted ranges of 0..0xaaa9 and
> > 0x..0x.
> 
> I see. Has incorporated this comment in v2.
> 
> > 
> > [snip]
> > > > For debugging and certain hypervisor edge cases it might be useful to
> > > > have a call to allow userspace to look up a specific IOVA in a guest
> > > > managed pgtable.
> > >
> > > Since all the mapping metadata is from userspace, why would one
> > > rely on the kernel to provide such service? Or are you simply asking
> > > for some debugfs node to dump the I/O page table for a given
> > > IOASID?
> > 
> > I'm thinking of this as a debugging aid so you can make sure that the
> > kernel is interpreting that metadata in the same way that your
> > userspace expects it to be interpreted.
> > 
> 
> I'll not include it in this RFC. There is already too much stuff. The
> debugging aid can be added anyway when it's actually required.

Fair enough.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



RE: [RFC] /dev/ioasid uAPI proposal

2021-06-23 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Saturday, June 19, 2021 2:31 AM
> 
> On Fri, Jun 18, 2021 at 07:03:31PM +0200, Jean-Philippe Brucker wrote:
> 
> > configuration. The Arm SMMUs have a lot of small features that
> > implementations can mix and match and that a user shouldn't have to care
> > about, and there are lots of different implementations by various
> > vendors.
> 
> This is really something to think about carefully in this RFC - I do
> have a guess that a 'HW specific' channel is going to be useful here.

Agree.

> 
> If the goal is for qemu to provide all these fiddly things and they
> cannot be SW emulated, then direct access to the fiddly HW native
> stuff is possibly necessary.
> 
> We've kind of seen this mistake in DRM and RDMA
> historically. Attempting to generalize too early, or generalize
> something that is really a one off. Better for everyone to just keep
> it as a one off.
> 

Yes. Except for some generic info (e.g. addr_width), most format info can
be exposed via a vendor-specific data union region. There is no need
to define those vendor bits explicitly in the uAPI. The kernel IOMMU driver
and the vIOMMU emulation logic know how to interpret them.

Take VT-d for example: all format/cap info is carried in two registers
(cap_reg and ecap_reg), thus two u64 fields are sufficient...
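e.g. a sketch of how that could look in the device info (field names are
illustrative only, not a final layout):

/* Sketch of the device info layout; all names are illustrative only. */
#include <linux/types.h>

struct iommu_device_info_sketch {
	__u32	flags;
	__u32	addr_width;	/* generic info, vendor agnostic */
	__u32	format;		/* says which vendor union member is valid */
	__u32	__reserved;
	union {
		struct {
			__u64	cap_reg;	/* raw VT-d capability register */
			__u64	ecap_reg;	/* raw VT-d extended capability register */
		} vtd;
		__u64	vendor_raw[8];		/* room for other vendors */
	};
};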

Thanks
Kevin


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-23 Thread Tian, Kevin
> From: David Gibson
> Sent: Thursday, June 17, 2021 12:08 PM
> 
> On Thu, Jun 03, 2021 at 08:12:27AM +, Tian, Kevin wrote:
> > > From: David Gibson 
> > > Sent: Wednesday, June 2, 2021 2:15 PM
> > >
> > [...]
> >
> > > >
> > > > /*
> > > >   * Get information about an I/O address space
> > > >   *
> > > >   * Supported capabilities:
> > > >   * - VFIO type1 map/unmap;
> > > >   * - pgtable/pasid_table binding
> > > >   * - hardware nesting vs. software nesting;
> > > >   * - ...
> > > >   *
> > > >   * Related attributes:
> > > >   * - supported page sizes, reserved IOVA ranges (DMA mapping);
> > >
> > > Can I request we represent this in terms of permitted IOVA ranges,
> > > rather than reserved IOVA ranges.  This works better with the "window"
> > > model I have in mind for unifying the restrictions of the POWER IOMMU
> > > with Type1 like mapping.
> >
> > Can you elaborate how permitted range work better here?
> 
> Pretty much just that MAP operations would fail if they don't entirely
> lie within a permitted range.  So, for example if your IOMMU only
> implements say, 45 bits of IOVA, then you'd have 0..0x1fff as
> your only permitted range.  If, like the POWER paravirtual IOMMU (in
> default configuration) you have a small (1G) 32-bit range and a large
> (45-bit) 64-bit range at a high address, you'd have say:
>   0x..0x3fff (32-bit range)
> and
>   0x800 .. 0x8001fff (64-bit range)
> as your permitted ranges.
> 
> If your IOMMU supports truly full 64-bit addressing, but has a
> reserved range (for MSIs or whatever) at 0x000..0x then
> you'd have permitted ranges of 0..0xaaa9 and
> 0x..0x.

I see. Has incorporated this comment in v2.

> 
> [snip]
> > > For debugging and certain hypervisor edge cases it might be useful to
> > > have a call to allow userspace to look up a specific IOVA in a guest
> > > managed pgtable.
> >
> > Since all the mapping metadata is from userspace, why would one
> > rely on the kernel to provide such service? Or are you simply asking
> > for some debugfs node to dump the I/O page table for a given
> > IOASID?
> 
> I'm thinking of this as a debugging aid so you can make sure that the
> kernel is interpreting that metadata in the same way that your
> userspace expects it to be interpreted.
> 

I'll not include it in this RFC. There is already too much stuff. The
debugging aid can be added anyway when it's actually required. 

Thanks,
Kevin


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-23 Thread Tian, Kevin
> From: David Gibson 
> Sent: Thursday, June 17, 2021 11:48 AM
> 
> On Tue, Jun 08, 2021 at 10:17:56AM -0300, Jason Gunthorpe wrote:
> > On Tue, Jun 08, 2021 at 12:37:04PM +1000, David Gibson wrote:
> >
> > > > The PPC/SPAPR support allows KVM to associate a vfio group to an
> IOMMU
> > > > page table so that it can handle iotlb programming from pre-registered
> > > > memory without trapping out to userspace.
> > >
> > > To clarify that's a guest side logical vIOMMU page table which is
> > > partially managed by KVM.  This is an optimization - things can work
> > > without it, but it means guest iomap/unmap becomes a hot path because
> > > each map/unmap hypercall has to go
> > >   guest -> KVM -> qemu -> VFIO
> > >
> > > So there are multiple context transitions.
> >
> > Isn't this overhead true of many of the vIOMMUs?
> 
> Yes, but historically it bit much harder on POWER for a couple of reasons:
> 
> 1) POWER guests *always* have a vIOMMU - the platform has no concept
>of passthrough mode.  We therefore had a vIOMMU implementation some
>time before the AMD or Intel IOMMUs were implemented as vIOMMUs in
>qemu.
> 
> 2) At the time we were implementing this the supported IOVA window for
>the paravirtualized IOMMU was pretty small (1G, I think) making
>vIOMMU maps and unmaps a pretty common operation.
> 
> > Can the fast path be
> > generalized?
> 
> Not really.  This is a paravirtualized guest IOMMU, so it's a platform
> specific group of hypercalls that's being interpreted by KVM and
> passed through to the IOMMU side using essentially the same backend
> that that the userspace implementation would eventually get to after a
> bunch more context switches.
> 

Can virtio-iommu work on PPC? iirc Jean has a plan to implement
a vhost-iommu which is supposed to implement the similar in-kernel
acceleration...

Thanks
Kevin


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-23 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Saturday, June 19, 2021 1:04 AM
> 
> On Thu, Jun 17, 2021 at 01:00:14PM +1000, David Gibson wrote:
> > On Thu, Jun 10, 2021 at 06:37:31PM +0200, Jean-Philippe Brucker wrote:
> > > On Tue, Jun 08, 2021 at 04:31:50PM +1000, David Gibson wrote:
> > > > For the qemu case, I would imagine a two stage fallback:
> > > >
> > > > 1) Ask for the exact IOMMU capabilities (including pagetable
> > > >format) that the vIOMMU has.  If the host can supply, you're
> > > >good
> > > >
> > > > 2) If not, ask for a kernel managed IOAS.  Verify that it can map
> > > >all the IOVA ranges the guest vIOMMU needs, and has an equal or
> > > >smaller pagesize than the guest vIOMMU presents.  If so,
> > > >software emulate the vIOMMU by shadowing guest io pagetable
> > > >updates into the kernel managed IOAS.
> > > >
> > > > 3) You're out of luck, don't start.
> > > >
> > > > For both (1) and (2) I'd expect it to be asking this question *after*
> > > > saying what devices are attached to the IOAS, based on the virtual
> > > > hardware configuration.  That doesn't cover hotplug, of course, for
> > > > that you have to just fail the hotplug if the new device isn't
> > > > supportable with the IOAS you already have.
> > >
> > > Yes. So there is a point in time when the IOAS is frozen, and cannot take
> > > in new incompatible devices. I think that can support the usage I had in
> > > mind. If the VMM (non-QEMU, let's say) wanted to create one IOASID FD
> per
> > > feature set it could bind the first device, freeze the features, then bind
> >
> > Are you thinking of this "freeze the features" as an explicitly
> > triggered action?  I have suggested that an explicit "ENABLE" step
> > might be useful, but that hasn't had much traction from what I've
> > seen.
> 
> Seems like we do need an explicit enable step for the flow you described
> above:
> 
> a) Bind all devices to an ioasid. Each bind succeeds.

let's use consistent terms in this discussion. :)

'bind' the device to an IOMMU fd (container of I/O address spaces).

'attach' the device to an IOASID (representing an I/O address space 
within the IOMMU fd)

> b) Ask for a specific set of features for this aggregate of device. Ask
>for (1), fall back to (2), or abort.
> c) Boot the VM
> d) Hotplug a device, bind it to the ioasid. We're long past negotiating
>features for the ioasid, so the host needs to reject the bind if the
>new device is incompatible with what was requested at (b)
> 
> So a successful request at (b) would be the point where we change the
> behavior of bind.

Per Jason's recommendation v2 will move to a new model:

a) Bind all devices to an IOMMU fd:
   - The user should provide a 'device_cookie' to mark each bound
     device in the following uAPIs.

b) Successful binding allows the user to check the capability/format info
   per device_cookie (GET_DEVICE_INFO), before creating any IOASID:
   - Sample capability info:
     * VFIO type1 map: supported page sizes, permitted IOVA ranges, etc.;
     * IOASID nesting: hardware nesting vs. software nesting;
     * User-managed page table: vendor specific formats;
     * User-managed pasid table: vendor specific formats;
     * vPASID: whether delegated to user, if kernel-managed per-RID or global;
     * coherency: what's the kernel default policy, whether the user is
       allowed to change it;
     * ...
   - Actual logistics might be finalized when code is implemented;

c) When creating a new IOASID, the user should specify a format which
is compatible with the device(s) which will be attached to this
IOASID right after.

d) Attaching a device which has an incompatible format to this IOASID
is simply rejected. Whether it's hotplugged doesn't matter.

Qemu is expected to query capability/format information for all devices
according to what the specified vIOMMU model requires, then decide
whether to fail vIOMMU creation if there is no strict match or to fall
back to a hybrid model with software emulation bridging the gap. In any
case, before a new I/O address space is created, Qemu should have a clear
picture of what format is required for a given set of to-be-attached
devices, and whether multiple IOASIDs are needed if these devices have
incompatible formats.

With this model we don't need a separate 'enable' step.
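Roughly, the expected userspace sequence would be as below (all ioctl
names, numbers and struct layouts are placeholders, not the final uAPI):

/* Placeholder uAPI just to show the ordering; nothing here is final. */
#include <stdint.h>
#include <sys/ioctl.h>

struct iommu_bind_device   { int device_fd; uint64_t device_cookie; };
struct iommu_device_info   { uint64_t device_cookie; uint32_t pgtable_format; uint32_t flags; };
struct iommu_ioasid_alloc  { uint32_t format; };
struct iommu_device_attach { uint64_t device_cookie; uint32_t ioasid; };

enum {	/* made-up ioctl numbers */
	IOMMU_BIND_DEVICE = 0x3b00, IOMMU_GET_DEVICE_INFO,
	IOMMU_IOASID_ALLOC, IOMMU_DEVICE_ATTACH,
};

static int setup_one_device(int iommu_fd, int vfio_device_fd)
{
	/* a) bind: the device gets a user-chosen cookie for later calls */
	struct iommu_bind_device bind = {
		.device_fd = vfio_device_fd, .device_cookie = 1,
	};
	if (ioctl(iommu_fd, IOMMU_BIND_DEVICE, &bind) < 0)
		return -1;

	/* b) query capability/format info before creating any IOASID */
	struct iommu_device_info info = { .device_cookie = 1 };
	if (ioctl(iommu_fd, IOMMU_GET_DEVICE_INFO, &info) < 0)
		return -1;

	/* c) create an IOASID with a format compatible with the device(s) */
	struct iommu_ioasid_alloc alloc = { .format = info.pgtable_format };
	int ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc);
	if (ioasid < 0)
		return -1;

	/* d) attach; an incompatible device is simply rejected here */
	struct iommu_device_attach at = {
		.device_cookie = 1, .ioasid = (uint32_t)ioasid,
	};
	return ioctl(iommu_fd, IOMMU_DEVICE_ATTACH, &at);
}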

> 
> Since the kernel needs a form of feature check in any case, I still have a
> preference for aborting the bind at (a) if the device isn't exactly
> compatible with other devices already in the ioasid, because it might be
> simpler to implement in the host, but I don't feel strongly about this.

this is covered by d). Actually, with all the format information available,
Qemu should not even attempt to attach an incompatible device in the
first place, though the kernel will do this simple check under the hood.

Thanks
Kevin

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-18 Thread Jason Gunthorpe
On Fri, Jun 18, 2021 at 07:03:31PM +0200, Jean-Philippe Brucker wrote:

> configuration. The Arm SMMUs have a lot of small features that
> implementations can mix and match and that a user shouldn't have to care
> about, and there are lots of different implementations by various
> vendors.

This is really something to think about carefully in this RFC - I do
have a guess that a 'HW specific' channel is going to be useful here.

If the goal is for qemu to provide all these fiddly things and they
cannot be SW emulated, then direct access to the fiddly HW native
stuff is possibly necessary.

We've kind of seen this mistake in DRM and RDMA
historically. Attempting to generalize too early, or generalize
something that is really a one off. Better for everyone to just keep
it as a one off.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-18 Thread Jean-Philippe Brucker
On Thu, Jun 17, 2021 at 01:00:14PM +1000, David Gibson wrote:
> On Thu, Jun 10, 2021 at 06:37:31PM +0200, Jean-Philippe Brucker wrote:
> > On Tue, Jun 08, 2021 at 04:31:50PM +1000, David Gibson wrote:
> > > For the qemu case, I would imagine a two stage fallback:
> > > 
> > > 1) Ask for the exact IOMMU capabilities (including pagetable
> > >format) that the vIOMMU has.  If the host can supply, you're
> > >good
> > > 
> > > 2) If not, ask for a kernel managed IOAS.  Verify that it can map
> > >all the IOVA ranges the guest vIOMMU needs, and has an equal or
> > >smaller pagesize than the guest vIOMMU presents.  If so,
> > >software emulate the vIOMMU by shadowing guest io pagetable
> > >updates into the kernel managed IOAS.
> > > 
> > > 3) You're out of luck, don't start.
> > > 
> > > For both (1) and (2) I'd expect it to be asking this question *after*
> > > saying what devices are attached to the IOAS, based on the virtual
> > > hardware configuration.  That doesn't cover hotplug, of course, for
> > > that you have to just fail the hotplug if the new device isn't
> > > supportable with the IOAS you already have.
> > 
> > Yes. So there is a point in time when the IOAS is frozen, and cannot take
> > in new incompatible devices. I think that can support the usage I had in
> > mind. If the VMM (non-QEMU, let's say) wanted to create one IOASID FD per
> > feature set it could bind the first device, freeze the features, then bind
> 
> Are you thinking of this "freeze the features" as an explicitly
> triggered action?  I have suggested that an explicit "ENABLE" step
> might be useful, but that hasn't had much traction from what I've
> seen.

Seems like we do need an explicit enable step for the flow you described
above:

a) Bind all devices to an ioasid. Each bind succeeds.
b) Ask for a specific set of features for this aggregate of device. Ask
   for (1), fall back to (2), or abort.
c) Boot the VM
d) Hotplug a device, bind it to the ioasid. We're long past negotiating
   features for the ioasid, so the host needs to reject the bind if the
   new device is incompatible with what was requested at (b)

So a successful request at (b) would be the point where we change the
behavior of bind.

Since the kernel needs a form of feature check in any case, I still have a
preference for aborting the bind at (a) if the device isn't exactly
compatible with other devices already in the ioasid, because it might be
simpler to implement in the host, but I don't feel strongly about this.


> > I'd like to understand better where the difficulty lies, with migration.
> > Is the problem, once we have a guest running on physical machine A, to
> > make sure that physical machine B supports the same IOMMU properties
> > before migrating the VM over to B?  Why can't QEMU (instead of the user)
> > select a feature set on machine A, then when time comes to migrate, query
> > all information from the host kernel on machine B and check that it
> > matches what was picked for machine A?  Or is it only trying to
> > accommodate different sets of features between A and B, that would be too
> > difficult?
> 
> There are two problems
> 
> 1) Although it could be done in theory, it's hard, and it would need a
> huge rewrite to qemu's whole migration infrastructure to do this.
> We'd need a way of representing host features, working out which sets
> are compatible with which others depending on what things the guest is
> allowed to use, encoding the information in the migration stream and
> reporting failure.  None of this exists now.
> 
> Indeed qemu requires that you create the (stopped) machine on the
> destination (including virtual hardware configuration) before even
> attempting to process the incoming migration.  It does not for the
> most part transfer the machine configuration in the migration stream.
> Now, that's generally considered a flaw with the design, but fixing it
> is a huge project that no-one's really had the energy to begin despite
> the idea being around for years.
> 
> 2) It makes behaviour really hard to predict for management layers
> above.  Things like oVirt automatically migrate around a cluster for
> load balancing.  At the moment the model which works is basically that
> you if you request the same guest features on each end of the
> migration, and qemu starts with that configuration on each end, the
> migration should work (or only fail for transient reasons).  If you
> can't know if the migration is possible until you get the incoming
> stream, reporting and exposing what will and won't work to the layer
> above also becomes an immensely fiddly problem.

That was really useful, thanks. One thing I'm worried about is the user
having to know way too much detail about IOMMUs in order to pick a precise
configuration. The Arm SMMUs have a lot of small features that
implementations can mix and match and that a user shouldn't have to care
about, and there are lots of different implementations by various vendors.

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-17 Thread David Gibson
On Tue, Jun 08, 2021 at 10:17:56AM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 08, 2021 at 12:37:04PM +1000, David Gibson wrote:
> 
> > > The PPC/SPAPR support allows KVM to associate a vfio group to an IOMMU
> > > page table so that it can handle iotlb programming from pre-registered
> > > memory without trapping out to userspace.
> > 
> > To clarify that's a guest side logical vIOMMU page table which is
> > partially managed by KVM.  This is an optimization - things can work
> > without it, but it means guest iomap/unmap becomes a hot path because
> > each map/unmap hypercall has to go
> > guest -> KVM -> qemu -> VFIO
> > 
> > So there are multiple context transitions.
> 
> Isn't this overhead true of many of the vIOMMUs?

Yes, but historically it bit much harder on POWER for a couple of reasons:

1) POWER guests *always* have a vIOMMU - the platform has no concept
   of passthrough mode.  We therefore had a vIOMMU implementation some
   time before the AMD or Intel IOMMUs were implemented as vIOMMUs in
   qemu.

2) At the time we were implementing this the supported IOVA window for
   the paravirtualized IOMMU was pretty small (1G, I think) making
   vIOMMU maps and unmaps a pretty common operation.

> Can the fast path be
> generalized?

Not really.  This is a paravirtualized guest IOMMU, so it's a platform
specific group of hypercalls that's being interpreted by KVM and
passed through to the IOMMU side using essentially the same backend
that the userspace implementation would eventually get to after a
bunch more context switches.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-17 Thread David Gibson
On Thu, Jun 03, 2021 at 08:12:27AM +, Tian, Kevin wrote:
> > From: David Gibson 
> > Sent: Wednesday, June 2, 2021 2:15 PM
> >
> [...]
>  
> > >
> > > /*
> > >   * Get information about an I/O address space
> > >   *
> > >   * Supported capabilities:
> > >   *   - VFIO type1 map/unmap;
> > >   *   - pgtable/pasid_table binding
> > >   *   - hardware nesting vs. software nesting;
> > >   *   - ...
> > >   *
> > >   * Related attributes:
> > >   *   - supported page sizes, reserved IOVA ranges (DMA mapping);
> > 
> > Can I request we represent this in terms of permitted IOVA ranges,
> > rather than reserved IOVA ranges.  This works better with the "window"
> > model I have in mind for unifying the restrictions of the POWER IOMMU
> > with Type1 like mapping.
> 
> Can you elaborate how permitted range work better here?

Pretty much just that MAP operations would fail if they don't entirely
lie within a permitted range.  So, for example if your IOMMU only
implements say, 45 bits of IOVA, then you'd have 0..0x1fff as
your only permitted range.  If, like the POWER paravirtual IOMMU (in
default configuration) you have a small (1G) 32-bit range and a large
(45-bit) 64-bit range at a high address, you'd have say:
0x..0x3fff (32-bit range)
and
0x800 .. 0x8001fff (64-bit range)
as your permitted ranges.

If your IOMMU supports truly full 64-bit addressing, but has a
reserved range (for MSIs or whatever) at 0x000..0x then
you'd have permitted ranges of 0..0xaaa9 and
0x..0x.
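To make the MAP-time semantics concrete, the check I have in mind is
roughly the following (structure and names are made up):

/* Sketch of the MAP-time check against permitted windows; names invented. */
#include <stdbool.h>
#include <stdint.h>

struct ioas_iova_window {
	uint64_t start;
	uint64_t last;		/* inclusive */
};

static bool map_span_permitted(const struct ioas_iova_window *win,
			       unsigned int nr_win,
			       uint64_t iova, uint64_t size)
{
	uint64_t last = iova + size - 1;

	if (size == 0 || last < iova)		/* empty or wrapping span */
		return false;

	for (unsigned int i = 0; i < nr_win; i++) {
		if (iova >= win[i].start && last <= win[i].last)
			return true;		/* entirely inside one window */
	}
	return false;				/* MAP is rejected */
}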

[snip]
> > For debugging and certain hypervisor edge cases it might be useful to
> > have a call to allow userspace to look up a specific IOVA in a guest
> > managed pgtable.
> 
> Since all the mapping metadata is from userspace, why would one 
> rely on the kernel to provide such service? Or are you simply asking
> for some debugfs node to dump the I/O page table for a given 
> IOASID?

I'm thinking of this as a debugging aid so you can make sure that the
kernel is interpreting that metadata in the same way that your
userspace expects it to be interpreted.
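Something as small as this would do for the lookup (purely illustrative):

/* Illustrative only: a debug lookup of one IOVA in a user-managed pgtable. */
#include <stdint.h>

struct ioasid_iova_lookup {
	uint32_t ioasid;	/* in: which I/O address space */
	uint32_t flags;
	uint64_t iova;		/* in: address to translate */
	uint64_t out_addr;	/* out: address the kernel resolved it to */
	uint32_t out_perm;	/* out: read/write permission bits */
	uint32_t out_level;	/* out: page table level / page size it hit */
};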


-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-17 Thread David Gibson
On Tue, Jun 08, 2021 at 04:04:06PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 08, 2021 at 10:53:02AM +1000, David Gibson wrote:
> > On Thu, Jun 03, 2021 at 08:52:24AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Jun 03, 2021 at 03:13:44PM +1000, David Gibson wrote:
> > > 
> > > > > We can still consider it a single "address space" from the IOMMU
> > > > > perspective. What has happened is that the address table is not just a
> > > > > 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".
> > > > 
> > > > True.  This does complexify how we represent what IOVA ranges are
> > > > valid, though.  I'll bet you most implementations don't actually
> > > > implement a full 64-bit IOVA, which means we effectively have a large
> > > > number of windows from (0..max IOVA) for each valid pasid.  This adds
> > > > another reason I don't think my concept of IOVA windows is just a
> > > > power specific thing.
> > > 
> > > Yes
> > > 
> > > Things rapidly get into weird hardware specific stuff though, the
> > > request will be for things like:
> > >   "ARM PASID page table format from SMMU IP block vXX"
> > 
> > So, I'm happy enough for picking a user-managed pagetable format to
> > imply the set of valid IOVA ranges (though a query might be nice).
> 
> I think a query is mandatory, and optionally asking for ranges seems
> generally useful as a HW property.
> 
> The danger is things can get really tricky as the app can ask for
> ranges some HW needs but other HW can't provide. 
> 
> I would encourage a flow where "generic" apps like DPDK can somehow
> just ignore this, or at least be very, very simplified "I want around
> XX GB of IOVA space"
> 
> dpdk type apps vs qemu apps are really quite different and we should
> be careful that the needs of HW accelerated vIOMMU emulation do not
> trump the needs of simple universal control over a DMA map.

Agreed.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-17 Thread David Gibson
On Thu, Jun 10, 2021 at 06:37:31PM +0200, Jean-Philippe Brucker wrote:
> On Tue, Jun 08, 2021 at 04:31:50PM +1000, David Gibson wrote:
> > For the qemu case, I would imagine a two stage fallback:
> > 
> > 1) Ask for the exact IOMMU capabilities (including pagetable
> >format) that the vIOMMU has.  If the host can supply, you're
> >good
> > 
> > 2) If not, ask for a kernel managed IOAS.  Verify that it can map
> >all the IOVA ranges the guest vIOMMU needs, and has an equal or
> >smaller pagesize than the guest vIOMMU presents.  If so,
> >software emulate the vIOMMU by shadowing guest io pagetable
> >updates into the kernel managed IOAS.
> > 
> > 3) You're out of luck, don't start.
> > 
> > For both (1) and (2) I'd expect it to be asking this question *after*
> > saying what devices are attached to the IOAS, based on the virtual
> > hardware configuration.  That doesn't cover hotplug, of course, for
> > that you have to just fail the hotplug if the new device isn't
> > supportable with the IOAS you already have.
> 
> Yes. So there is a point in time when the IOAS is frozen, and cannot take
> in new incompatible devices. I think that can support the usage I had in
> mind. If the VMM (non-QEMU, let's say) wanted to create one IOASID FD per
> feature set it could bind the first device, freeze the features, then bind

Are you thinking of this "freeze the features" as an explicitly
triggered action?  I have suggested that an explicit "ENABLE" step
might be useful, but that hasn't had much traction from what I've
seen.

> the second device. If the second bind fails it creates a new FD, allowing
> to fall back to (2) for the second device while keeping (1) for the first
> device. A paravirtual IOMMU like virtio-iommu could easily support this as
> it describes pIOMMU properties for each device to the guest. An emulated
> vIOMMU could also support some hybrid cases as you describe below.

Eh.. in some cases.  The vIOMMU model will often dictate what guest
side devices need to share an address space, which may make it very
impractical to have them in different address spaces on the host side.

> > One can imagine optimizations where for certain intermediate cases you
> > could do a lighter SW emu if the host supports a model that's close to
> > the vIOMMU one, and you're able to trap and emulate the differences.
> > In practice I doubt anyone's going to have time to look for such cases
> > and implement the logic for it.
> > 
> > > For example depending whether the hardware IOMMU is SMMUv2 or SMMUv3, that
> > > completely changes the capabilities offered to the guest (some v2
> > > implementations support nesting page tables, but never PASID nor PRI
> > > unlike v3.) The same vIOMMU could support either, presenting different
> > > capabilities to the guest, even multiple page table formats if we wanted
> > > to be exhaustive (SMMUv2 supports the older 32-bit descriptor), but it
> > > needs to know early on what the hardware is precisely. Then some new page
> > > table format shows up and, although the vIOMMU can support that in
> > > addition to older ones, QEMU will have to pick a single one, that it
> > > assumes the guest knows how to drive?
> > > 
> > > I think once it binds a device to an IOASID fd, QEMU will want to probe
> > > what hardware features are available before going further with the vIOMMU
> > > setup (is there PASID, PRI, which page table formats are supported,
> > > address size, page granule, etc). Obtaining precise information about the
> > > hardware would be less awkward than trying different configurations until
> > > one succeeds. Binding an additional device would then fail if its pIOMMU
> > > doesn't support exactly the features supported for the first device,
> > > because we don't know which ones the guest will choose. QEMU will have to
> > > open a new IOASID fd for that device.
> > 
> > No, this fundamentally misunderstands the qemu model.  The user
> > *chooses* the guest visible platform, and qemu supplies it or fails.
> > There is no negotiation with the guest, because this makes managing
> > migration impossibly difficult.
> 
> I'd like to understand better where the difficulty lies, with migration.
> Is the problem, once we have a guest running on physical machine A, to
> make sure that physical machine B supports the same IOMMU properties
> before migrating the VM over to B?  Why can't QEMU (instead of the user)
> select a feature set on machine A, then when time comes to migrate, query
> all information from the host kernel on machine B and check that it
> matches what was picked for machine A?  Or is it only trying to
> accommodate different sets of features between A and B, that would be too
> difficult?

There are two problems

1) Although it could be done in theory, it's hard, and it would need a
huge rewrite to qemu's whole migration infrastructure to do this.
We'd need a way of representing host features, 

RE: [RFC] /dev/ioasid uAPI proposal

2021-06-15 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, June 16, 2021 7:59 AM
> 
> On Tue, Jun 15, 2021 at 11:56:28PM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Wednesday, June 16, 2021 7:41 AM
> > >
> > > On Tue, Jun 15, 2021 at 11:09:37PM +, Tian, Kevin wrote:
> > >
> > > > which information can you elaborate? This is the area which I'm not
> > > > familiar with thus would appreciate if you can help explain how this
> > > > bus specific information is utilized within the attach function or
> > > > sometime later.
> > >
> > > This is the idea that the device driver needs to specify which bus
> > > specific protocol it uses to issue DMA's when it attaches itself to an
> > > IOASID. For PCI:
> >
> > What about defining some general attributes instead of asking iommu
> > fd to understand those bus specific detail?
> 
> I prefer the API be very clear and intent driven, otherwise things
> just get confused.
> 
> The whole WBINVD/no-snoop discussion I think is proof of that :\
> 
> > from iommu p.o.v there is no difference from last one. In v2 the device
> > driver just needs to communicate the PASID virtualization policy at
> > device binding time,
> 
> I want it documented in the kernel source WTF is happening, because
> otherwise we are going to be completely lost in a few years. And your
> RFC did have device driver specific differences here
> 
> > > The device knows what it is going to do, we need to convey that to the
> > > IOMMU layer so it is prepared properly.
> >
> > Yes, but it's not necessary to have the iommu fd understand bus-specific
> > attributes. In the end, when the /dev/iommu uAPI calls the iommu layer
> > interface, it's all bus agnostic.
> 
> Why not? Just put some inline wrappers to translate the bus specific
> language to your generic language if that is what makes the most
> sense.
> 

I can do this. Thanks
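Something along these lines, perhaps (a sketch only; none of these
symbols are final):

/*
 * Sketch: a thin PCI wrapper so device drivers keep talking in PCI terms
 * while the iommu fd core only sees generic attach attributes.
 */
#include <linux/pci.h>
#include <linux/types.h>

enum iommufd_dma_mode {
	IOMMUFD_DMA_RID,	/* plain requester-ID tagged DMA */
	IOMMUFD_DMA_PASID,	/* PASID/SSID tagged DMA (incl. ENQCMD) */
};

struct iommufd_attach_data {
	enum iommufd_dma_mode	mode;
	u32			pasid;		/* valid for PASID mode */
	bool			iopf_capable;	/* e.g. PCI ATS/PRI enabled */
};

/* bus-agnostic core entry point */
int iommufd_device_attach(struct device *dev, u32 ioasid,
			  const struct iommufd_attach_data *data);

/* PCI-specific language translated into the generic one */
static inline int iommufd_pci_attach_pasid(struct pci_dev *pdev, u32 ioasid,
					   u32 pasid, bool pri_enabled)
{
	struct iommufd_attach_data data = {
		.mode		= IOMMUFD_DMA_PASID,
		.pasid		= pasid,
		.iopf_capable	= pri_enabled,
	};

	return iommufd_device_attach(&pdev->dev, ioasid, &data);
}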


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-15 Thread Jason Gunthorpe
On Tue, Jun 15, 2021 at 11:56:28PM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, June 16, 2021 7:41 AM
> > 
> > On Tue, Jun 15, 2021 at 11:09:37PM +, Tian, Kevin wrote:
> > 
> > > which information can you elaborate? This is the area which I'm not
> > > familiar with thus would appreciate if you can help explain how this
> > > bus specific information is utilized within the attach function or
> > > sometime later.
> > 
> > This is the idea that the device driver needs to specify which bus
> > specific protocol it uses to issue DMA's when it attaches itself to an
> > IOASID. For PCI:
> 
> What about defining some general attributes instead of asking iommu
> fd to understand those bus specific detail?

I prefer the API be very clear and intent driven, otherwise things
just get confused.

The whole WBINVD/no-snoop discussion I think is proof of that :\

> from iommu p.o.v there is no difference from last one. In v2 the device
> driver just needs to communicate the PASID virtualization policy at
> device binding time, 

I want it documented in the kernel source WTF is happening, because
otherwise we are going to be completely lost in a few years. And your
RFC did have device driver specific differences here

> > The device knows what it is going to do, we need to convey that to the
> > IOMMU layer so it is prepared properly.
> 
> Yes, but it's not necessarily to have iommu fd understand bus specific
> attributes. In the end when /dev/iommu uAPI calls iommu layer interface,
> it's all bus agnostic. 

Why not? Just put some inline wrappers to translate the bus specific
language to your generic language if that is what makes the most
sense.

Jason
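
A minimal sketch of what such inline wrappers could look like, assuming the
ioasid_device_attach()/struct attach_info interface quoted elsewhere in this
thread; the wrapper names follow the earlier suggestion in the thread, and
using INVALID_IOASID as the "no PASID" marker is an assumption, not part of
the proposal:

	/* Plain RID DMA: the device does not tag its TLPs with a PASID. */
	static inline int ioasid_pci_device_attach(struct pci_device *pdev,
						   struct ioasid_dev *idev,
						   u32 ioasid)
	{
		struct attach_info info = {
			.ioasid = ioasid,
			.pasid  = INVALID_IOASID,   /* assumed "no PASID" value */
		};

		return ioasid_device_attach(idev, info);
	}

	/* PASID (or ENQCMD-triggered) DMA: TLPs carry the given physical PASID. */
	static inline int ioasid_pci_device_pasid_attach(struct pci_device *pdev,
							 u32 physical_pasid,
							 struct ioasid_dev *idev,
							 u32 ioasid)
	{
		struct attach_info info = {
			.ioasid = ioasid,
			.pasid  = physical_pasid,
		};

		return ioasid_device_attach(idev, info);
	}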


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-15 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, June 16, 2021 7:41 AM
> 
> On Tue, Jun 15, 2021 at 11:09:37PM +, Tian, Kevin wrote:
> 
> > which information can you elaborate? This is the area which I'm not
> > familiar with thus would appreciate if you can help explain how this
> > bus specific information is utilized within the attach function or
> > sometime later.
> 
> This is the idea that the device driver needs to specify which bus
> specific protocol it uses to issue DMA's when it attaches itself to an
> IOASID. For PCI:

What about defining some general attributes instead of asking the iommu
fd to understand those bus-specific details?

> 
> - Normal RID DMA

this is the struct device pointer

> - PASID DMA

this is the PASID or SSID, which is understood by the underlying iommu driver

> - ENQCMD triggered PASID DMA

From the iommu p.o.v. there is no difference from the last one. In v2 the device
driver just needs to communicate the PASID virtualization policy at device
binding time, e.g. whether vPASID is allowed; if yes, whether vPASID must be
registered to the kernel; and if via the kernel, whether it is per-RID or
global, etc. This policy is then conveyed to userspace via the device
capability query interface on the iommu fd.

> - ATS/PRI enabled or not

Just a generic attribute: supports I/O page faults or not

> 
> And maybe more. Eg CXL has some other operating modes, I think
> 
> The device knows what it is going to do, we need to convey that to the
> IOMMU layer so it is prepared properly.
> 

Yes, but it's not necessary for the iommu fd to understand bus-specific
attributes. In the end, when the /dev/iommu uAPI calls the iommu layer
interface, it's all bus agnostic.

Thanks
Kevin
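
A rough sketch of the bus-agnostic attach attributes implied above, collected
into one structure; every field and flag name here is illustrative and not
part of the RFC:

	/* Generic, bus-agnostic description of how the device will issue DMA. */
	#define IOASID_ATTACH_PASID	(1u << 0)  /* DMA is tagged with .pasid (PASID/SSID) */
	#define IOASID_ATTACH_IOPF	(1u << 1)  /* device can take I/O page faults (e.g. ATS/PRI) */

	struct ioasid_attach_data {
		u32	ioasid;	/* target I/O address space */
		u32	pasid;	/* valid only if IOASID_ATTACH_PASID is set */
		u32	flags;
	};

	int ioasid_device_attach(struct ioasid_dev *idev,
				 const struct ioasid_attach_data *data);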


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-15 Thread Jason Gunthorpe
On Tue, Jun 15, 2021 at 11:09:37PM +, Tian, Kevin wrote:

> which information can you elaborate? This is the area which I'm not
> familiar with thus would appreciate if you can help explain how this
> bus specific information is utilized within the attach function or 
> sometime later.

This is the idea that the device driver needs to specify which bus
specific protocol it uses to issue DMA's when it attaches itself to an
IOASID. For PCI:

- Normal RID DMA
- PASID DMA
- ENQCMD triggered PASID DMA
- ATS/PRI enabled or not

And maybe more. Eg CXL has some other operating modes, I think

The device knows what it is going to do, we need to convey that to the
IOMMU layer so it is prepared properly.

Jason


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-15 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, June 16, 2021 7:02 AM
> 
> On Tue, Jun 15, 2021 at 10:59:06PM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Tuesday, June 15, 2021 11:07 PM
> > >
> > > On Tue, Jun 15, 2021 at 08:59:25AM +, Tian, Kevin wrote:
> > > > Hi, Jason,
> > > >
> > > > > From: Jason Gunthorpe
> > > > > Sent: Thursday, June 3, 2021 9:05 PM
> > > > >
> > > > > On Thu, Jun 03, 2021 at 06:39:30AM +, Tian, Kevin wrote:
> > > > > > > > Two helper functions are provided to support
> VFIO_ATTACH_IOASID:
> > > > > > > >
> > > > > > > > struct attach_info {
> > > > > > > > u32 ioasid;
> > > > > > > > // If valid, the PASID to be used physically
> > > > > > > > u32 pasid;
> > > > > > > > };
> > > > > > > > int ioasid_device_attach(struct ioasid_dev *dev,
> > > > > > > > struct attach_info info);
> > > > > > > > int ioasid_device_detach(struct ioasid_dev *dev, u32 
> > > > > > > > ioasid);
> > > > > > >
> > > > > > > Honestly, I still prefer this to be highly explicit as this is 
> > > > > > > where
> > > > > > > all device driver authors get involved:
> > > > > > >
> > > > > > > ioasid_pci_device_attach(struct pci_device *pdev, struct
> ioasid_dev
> > > *dev,
> > > > > > > u32 ioasid);
> > > > > > > ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32
> > > > > *physical_pasid,
> > > > > > > struct ioasid_dev *dev, u32 ioasid);
> > > > > >
> > > > > > Then better naming it as pci_device_attach_ioasid since the 1st
> > > parameter
> > > > > > is struct pci_device?
> > > > >
> > > > > No, the leading tag indicates the API's primary subsystem, in this case
> > > > > it is iommu (and if you prefer list the iommu related arguments first)
> > > > >
> > > >
> > > > I have a question on this suggestion when working on v2.
> > > >
> > > > Within IOMMU fd it uses only the generic struct device pointer, which
> > > > is already saved in struct ioasid_dev at device bind time:
> > > >
> > > > struct ioasid_dev *ioasid_register_device(struct ioasid_ctx 
> > > > *ctx,
> > > > struct device *device, u64 device_label);
> > > >
> > > > What does this additional struct pci_device bring when it's specified in
> > > > the attach call? If we save it in attach_data, at which point will it be
> > > > used or checked?
> > >
> > > The above was for attaching to an ioasid not the register path
> >
> > Yes, I know. and this is my question. When receiving a struct pci_device
> > at attach time, what should IOMMU fd do with it? Just verify whether
> > pci_device->device is same as ioasid_dev->device? if saving it to per-device
> > attach data under ioasid then when will it be further used?
> >
> > I feel once ioasid_dev is returned in the register path, following 
> > operations
> > (unregister, attach, detach) just uses ioasid_dev as the main object.
> 
> The point of having the pci_device specific API was to convey bus
> specific information during the attachment to the IOASID.

Which information? Can you elaborate? This is an area I'm not familiar
with, thus I would appreciate it if you could help explain how this
bus-specific information is utilized within the attach function or
sometime later.

> 
> The registration of the device to the iommu_fd doesn't need bus
> specific information, AFAIK? So just use a normal struct device
> pointer
> 

yes.

Thanks
Kevin


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-15 Thread Jason Gunthorpe
On Tue, Jun 15, 2021 at 10:59:06PM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Tuesday, June 15, 2021 11:07 PM
> > 
> > On Tue, Jun 15, 2021 at 08:59:25AM +, Tian, Kevin wrote:
> > > Hi, Jason,
> > >
> > > > From: Jason Gunthorpe
> > > > Sent: Thursday, June 3, 2021 9:05 PM
> > > >
> > > > On Thu, Jun 03, 2021 at 06:39:30AM +, Tian, Kevin wrote:
> > > > > > > Two helper functions are provided to support VFIO_ATTACH_IOASID:
> > > > > > >
> > > > > > >   struct attach_info {
> > > > > > >   u32 ioasid;
> > > > > > >   // If valid, the PASID to be used physically
> > > > > > >   u32 pasid;
> > > > > > >   };
> > > > > > >   int ioasid_device_attach(struct ioasid_dev *dev,
> > > > > > >   struct attach_info info);
> > > > > > >   int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
> > > > > >
> > > > > > Honestly, I still prefer this to be highly explicit as this is where
> > > > > > all device driver authors get involved:
> > > > > >
> > > > > > ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev
> > *dev,
> > > > > > u32 ioasid);
> > > > > > ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32
> > > > *physical_pasid,
> > > > > > struct ioasid_dev *dev, u32 ioasid);
> > > > >
> > > > > Then better naming it as pci_device_attach_ioasid since the 1st
> > parameter
> > > > > is struct pci_device?
> > > >
> > > > No, the leading tag indicates the API's primary subsystem, in this case
> > > > it is iommu (and if you prefer list the iommu related arguments first)
> > > >
> > >
> > > I have a question on this suggestion when working on v2.
> > >
> > > Within IOMMU fd it uses only the generic struct device pointer, which
> > > is already saved in struct ioasid_dev at device bind time:
> > >
> > >   struct ioasid_dev *ioasid_register_device(struct ioasid_ctx *ctx,
> > >   struct device *device, u64 device_label);
> > >
> > > What does this additional struct pci_device bring when it's specified in
> > > the attach call? If we save it in attach_data, at which point will it be
> > > used or checked?
> > 
> > The above was for attaching to an ioasid not the register path
> 
> Yes, I know. and this is my question. When receiving a struct pci_device
> at attach time, what should IOMMU fd do with it? Just verify whether 
> pci_device->device is same as ioasid_dev->device? if saving it to per-device
> attach data under ioasid then when will it be further used?
> 
> I feel once ioasid_dev is returned in the register path, following operations
> (unregister, attach, detach) just uses ioasid_dev as the main object.

The point of having the pci_device specific API was to convey bus
specific information during the attachment to the IOASID.

The registration of the device to the iommu_fd doesn't need bus
specific information, AFAIK? So just use a normal struct device
pointer

Jason


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-15 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, June 15, 2021 11:07 PM
> 
> On Tue, Jun 15, 2021 at 08:59:25AM +, Tian, Kevin wrote:
> > Hi, Jason,
> >
> > > From: Jason Gunthorpe
> > > Sent: Thursday, June 3, 2021 9:05 PM
> > >
> > > On Thu, Jun 03, 2021 at 06:39:30AM +, Tian, Kevin wrote:
> > > > > > Two helper functions are provided to support VFIO_ATTACH_IOASID:
> > > > > >
> > > > > > struct attach_info {
> > > > > > u32 ioasid;
> > > > > > // If valid, the PASID to be used physically
> > > > > > u32 pasid;
> > > > > > };
> > > > > > int ioasid_device_attach(struct ioasid_dev *dev,
> > > > > > struct attach_info info);
> > > > > > int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
> > > > >
> > > > > Honestly, I still prefer this to be highly explicit as this is where
> > > > > all device driver authors get involved:
> > > > >
> > > > > ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev
> *dev,
> > > > > u32 ioasid);
> > > > > ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32
> > > *physical_pasid,
> > > > > struct ioasid_dev *dev, u32 ioasid);
> > > >
> > > > Then better naming it as pci_device_attach_ioasid since the 1st
> parameter
> > > > is struct pci_device?
> > >
> > > No, the leading tag indicates the API's primary subsystem, in this case
> > > it is iommu (and if you prefer list the iommu related arguments first)
> > >
> >
> > I have a question on this suggestion when working on v2.
> >
> > Within IOMMU fd it uses only the generic struct device pointer, which
> > is already saved in struct ioasid_dev at device bind time:
> >
> > struct ioasid_dev *ioasid_register_device(struct ioasid_ctx *ctx,
> > struct device *device, u64 device_label);
> >
> > What does this additional struct pci_device bring when it's specified in
> > the attach call? If we save it in attach_data, at which point will it be
> > used or checked?
> 
> The above was for attaching to an ioasid not the register path

Yes, I know, and this is my question. When receiving a struct pci_device
at attach time, what should the IOMMU fd do with it? Just verify whether
pci_device->device is the same as ioasid_dev->device? If we save it in the
per-device attach data under the ioasid, when will it be used further?

I feel that once ioasid_dev is returned in the register path, the following
operations (unregister, attach, detach) just use ioasid_dev as the main object.

> 
> You should call 'device_label' 'device_cookie' if it is a user
> provided u64
> 

will do.

Thanks
Kevin


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-15 Thread Jason Gunthorpe
On Tue, Jun 15, 2021 at 08:59:25AM +, Tian, Kevin wrote:
> Hi, Jason,
> 
> > From: Jason Gunthorpe
> > Sent: Thursday, June 3, 2021 9:05 PM
> > 
> > On Thu, Jun 03, 2021 at 06:39:30AM +, Tian, Kevin wrote:
> > > > > Two helper functions are provided to support VFIO_ATTACH_IOASID:
> > > > >
> > > > >   struct attach_info {
> > > > >   u32 ioasid;
> > > > >   // If valid, the PASID to be used physically
> > > > >   u32 pasid;
> > > > >   };
> > > > >   int ioasid_device_attach(struct ioasid_dev *dev,
> > > > >   struct attach_info info);
> > > > >   int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
> > > >
> > > > Honestly, I still prefer this to be highly explicit as this is where
> > > > all device driver authors get involved:
> > > >
> > > > ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev 
> > > > *dev,
> > > > u32 ioasid);
> > > > ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32
> > *physical_pasid,
> > > > struct ioasid_dev *dev, u32 ioasid);
> > >
> > > Then better naming it as pci_device_attach_ioasid since the 1st parameter
> > > is struct pci_device?
> > 
> > No, the leading tag indicates the API's primary subsystem, in this case
> > it is iommu (and if you prefer list the iommu related arguments first)
> > 
> 
> I have a question on this suggestion when working on v2.
> 
> Within IOMMU fd it uses only the generic struct device pointer, which
> is already saved in struct ioasid_dev at device bind time:
> 
>   struct ioasid_dev *ioasid_register_device(struct ioasid_ctx *ctx,
>   struct device *device, u64 device_label);
> 
> What does this additional struct pci_device bring when it's specified in
> the attach call? If we save it in attach_data, at which point will it be
> used or checked? 

The above was for attaching to an ioasid not the register path

You should call 'device_label' 'device_cookie' if it is a user
provided u64

Jason


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-15 Thread Tian, Kevin
Hi, Jason,

> From: Jason Gunthorpe
> Sent: Thursday, June 3, 2021 9:05 PM
> 
> On Thu, Jun 03, 2021 at 06:39:30AM +, Tian, Kevin wrote:
> > > > Two helper functions are provided to support VFIO_ATTACH_IOASID:
> > > >
> > > > struct attach_info {
> > > > u32 ioasid;
> > > > // If valid, the PASID to be used physically
> > > > u32 pasid;
> > > > };
> > > > int ioasid_device_attach(struct ioasid_dev *dev,
> > > > struct attach_info info);
> > > > int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
> > >
> > > Honestly, I still prefer this to be highly explicit as this is where
> > > all device driver authors get involved:
> > >
> > > ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev *dev,
> > > u32 ioasid);
> > > ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32
> *physical_pasid,
> > > struct ioasid_dev *dev, u32 ioasid);
> >
> > Then better naming it as pci_device_attach_ioasid since the 1st parameter
> > is struct pci_device?
> 
> No, the leading tag indicates the API's primary subsystem, in this case
> it is iommu (and if you prefer list the iommu related arguments first)
> 

I have a question on this suggestion when working on v2.

Within IOMMU fd it uses only the generic struct device pointer, which
is already saved in struct ioasid_dev at device bind time:

struct ioasid_dev *ioasid_register_device(struct ioasid_ctx *ctx,
struct device *device, u64 device_label);

What does this additional struct pci_device bring when it's specified in
the attach call? If we save it in attach_data, at which point will it be
used or checked? 

Can you help elaborate?

Thanks
Kevin


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-10 Thread Jason Wang


On 2021/6/10 7:47 PM, Jason Gunthorpe wrote:

On Thu, Jun 10, 2021 at 10:00:01AM +0800, Jason Wang wrote:

On 2021/6/8 9:20 PM, Jason Gunthorpe wrote:

On Tue, Jun 08, 2021 at 09:10:42AM +0800, Jason Wang wrote:


Well, this sounds like a re-invention of io_uring which has already worked
for multifds.

How so? io_uring is about sending work to the kernel, not getting
structured events back?


Actually it can. Userspace can poll multiple fds via preparing multiple sqes
with IORING_OP_ADD flag.

Poll is only a part of what is needed here, the main issue is
transferring the PRI events to userspace quickly.



Do we really care about, e.g., at most one more syscall in this case? I think
the time spent on demand paging is much more than that spent transferring the
#PF to userspace. What's more, well-designed vIOMMU-capable IOMMU hardware
should have the ability to inject such events directly into the guest if the
#PF happens on L1.






This means another ring and we need introduce ioctl() to add or remove
ioasids from the poll. And it still need a kind of fallback like a list if
the ring is full.

The max size of the ring should be determinable based on the PRI
concurrency of each device and the number of devices sharing the ring



This makes at least one assumption: that the #PF event is the only event for
the ring. I'm not sure this is the case.


Thanks




In any event, I'm not entirely convinced eliding the PRI user/kernel
copy is the main issue here.. If we want this to be low latency I
think it ends up with some kernel driver component assisting the
vIOMMU emulation and avoiding the round trip to userspace

Jason




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-10 Thread Jean-Philippe Brucker
On Tue, Jun 08, 2021 at 04:31:50PM +1000, David Gibson wrote:
> For the qemu case, I would imagine a two stage fallback:
> 
> 1) Ask for the exact IOMMU capabilities (including pagetable
>    format) that the vIOMMU has.  If the host can supply, you're
>    good
> 
> 2) If not, ask for a kernel managed IOAS.  Verify that it can map
>    all the IOVA ranges the guest vIOMMU needs, and has an equal or
>    smaller pagesize than the guest vIOMMU presents.  If so,
>    software emulate the vIOMMU by shadowing guest io pagetable
>    updates into the kernel managed IOAS.
> 
> 3) You're out of luck, don't start.
> 
> For both (1) and (2) I'd expect it to be asking this question *after*
> saying what devices are attached to the IOAS, based on the virtual
> hardware configuration.  That doesn't cover hotplug, of course, for
> that you have to just fail the hotplug if the new device isn't
> supportable with the IOAS you already have.

Yes. So there is a point in time when the IOAS is frozen, and cannot take
in new incompatible devices. I think that can support the usage I had in
mind. If the VMM (non-QEMU, let's say) wanted to create one IOASID FD per
feature set it could bind the first device, freeze the features, then bind
the second device. If the second bind fails it creates a new FD, allowing
to fall back to (2) for the second device while keeping (1) for the first
device. A paravirtual IOMMU like virtio-iommu could easily support this as
it describes pIOMMU properties for each device to the guest. An emulated
vIOMMU could also support some hybrid cases as you describe below.

> One can imagine optimizations where for certain intermediate cases you
> could do a lighter SW emu if the host supports a model that's close to
> the vIOMMU one, and you're able to trap and emulate the differences.
> In practice I doubt anyone's going to have time to look for such cases
> and implement the logic for it.
> 
> > For example depending whether the hardware IOMMU is SMMUv2 or SMMUv3, that
> > completely changes the capabilities offered to the guest (some v2
> > implementations support nesting page tables, but never PASID nor PRI
> > unlike v3.) The same vIOMMU could support either, presenting different
> > capabilities to the guest, even multiple page table formats if we wanted
> > to be exhaustive (SMMUv2 supports the older 32-bit descriptor), but it
> > needs to know early on what the hardware is precisely. Then some new page
> > table format shows up and, although the vIOMMU can support that in
> > addition to older ones, QEMU will have to pick a single one, that it
> > assumes the guest knows how to drive?
> > 
> > I think once it binds a device to an IOASID fd, QEMU will want to probe
> > what hardware features are available before going further with the vIOMMU
> > setup (is there PASID, PRI, which page table formats are supported,
> > address size, page granule, etc). Obtaining precise information about the
> > hardware would be less awkward than trying different configurations until
> > one succeeds. Binding an additional device would then fail if its pIOMMU
> > doesn't support exactly the features supported for the first device,
> > because we don't know which ones the guest will choose. QEMU will have to
> > open a new IOASID fd for that device.
> 
> No, this fundamentally misunderstands the qemu model.  The user
> *chooses* the guest visible platform, and qemu supplies it or fails.
> There is no negotiation with the guest, because this makes managing
> migration impossibly difficult.

I'd like to understand better where the difficulty lies, with migration.
Is the problem, once we have a guest running on physical machine A, to
make sure that physical machine B supports the same IOMMU properties
before migrating the VM over to B?  Why can't QEMU (instead of the user)
select a feature set on machine A, then when time comes to migrate, query
all information from the host kernel on machine B and check that it
matches what was picked for machine A?  Or is it only trying to
accommodate different sets of features between A and B, that would be too
difficult?

Thanks,
Jean

> 
> -cpu host is an exception, which is used because it is so useful, but
> it's kind of a pain on the qemu side.  Virt management systems like
> oVirt/RHV almost universally *do not use* -cpu host, precisely because
> it cannot support predictable migration.
> 
> -- 
> David Gibson                  | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au| minimalist, thank you.  NOT _the_ _other_
>                               | _way_ _around_!
> http://www.ozlabs.org/~dgibson
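
A pseudo-code sketch of the two-stage fallback quoted at the top of this
message; every type and function name below is a placeholder for
illustration, not an existing QEMU or kernel interface:

	/* Try hardware nesting first, then shadowing, then give up. */
	static int viommu_backend_setup(VFIODevice *vdev, VIOMMUConfig *viommu)
	{
		/* 1) Ask whether the host IOMMU offers exactly the capabilities
		 *    (page table format, etc.) that the vIOMMU presents. */
		if (host_supports_viommu_format(vdev, viommu))
			return attach_nested_ioas(vdev, viommu);

		/* 2) Otherwise fall back to a kernel-managed IOAS and shadow the
		 *    guest I/O page tables, provided the IOVA ranges fit and the
		 *    host page size is no larger than the guest's. */
		if (host_ioas_covers_ranges(vdev, viommu) &&
		    host_ioas_pagesize(vdev) <= viommu_min_pagesize(viommu))
			return attach_shadowed_ioas(vdev, viommu);

		/* 3) Out of luck: refuse to start (or to hotplug the device). */
		return -ENOTSUP;
	}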




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-10 Thread Jason Gunthorpe
On Thu, Jun 10, 2021 at 10:00:01AM +0800, Jason Wang wrote:
> 
> 在 2021/6/8 下午9:20, Jason Gunthorpe 写道:
> > On Tue, Jun 08, 2021 at 09:10:42AM +0800, Jason Wang wrote:
> > 
> > > Well, this sounds like a re-invention of io_uring which has already worked
> > > for multifds.
> > How so? io_uring is about sending work to the kernel, not getting
> > structured events back?
> 
> 
> Actually it can. Userspace can poll multiple fds via preparing multiple sqes
> with IORING_OP_ADD flag.

Poll is only a part of what is needed here, the main issue is
transferring the PRI events to userspace quickly.

> This means another ring and we need introduce ioctl() to add or remove
> ioasids from the poll. And it still need a kind of fallback like a list if
> the ring is full.

The max size of the ring should be determinable based on the PRI
concurrency of each device and the number of devices sharing the ring

In any event, I'm not entirely convinced eliding the PRI user/kernel
copy is the main issue here.. If we want this to be low latency I
think it ends up with some kernel driver component assisting the
vIOMMU emulation and avoiding the round trip to userspace

Jason
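
An illustrative sketch of sizing such a shared ring from per-device PRI
capacity; dev_iopf_capacity() is an assumed helper (for PCIe it would report
the device's advertised outstanding page request allocation), not an
existing API:

	/* Size the shared page-request ring for every device's worst case. */
	static unsigned int iopf_ring_entries(struct device **devs, int ndev)
	{
		unsigned int total = 0;
		int i;

		for (i = 0; i < ndev; i++)
			total += dev_iopf_capacity(devs[i]);

		return roundup_pow_of_two(total);
	}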

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-09 Thread Jason Wang


On 2021/6/10 10:00 AM, Jason Wang wrote:


On 2021/6/8 9:20 PM, Jason Gunthorpe wrote:

On Tue, Jun 08, 2021 at 09:10:42AM +0800, Jason Wang wrote:

Well, this sounds like a re-invention of io_uring which has already 
worked

for multifds.

How so? io_uring is about sending work to the kernel, not getting
structured events back?



Actually it can. Userspace can poll multiple fds via preparing 
multiple sqes with IORING_OP_ADD flag.



IORING_OP_POLL_ADD actually.

Thanks







It is more like one of the perf rings



This means another ring and we need introduce ioctl() to add or remove 
ioasids from the poll. And it still need a kind of fallback like a 
list if the ring is full.


Thanks




Jason
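
A small userspace sketch of the IORING_OP_POLL_ADD idea, using liburing; it
assumes the watched fds are pollable, which is exactly the interface question
being debated here:

	#include <liburing.h>
	#include <poll.h>
	#include <stdint.h>

	/* Block until any one of the fds becomes readable; returns that fd. */
	static int wait_any_readable(const int *fds, unsigned int nfds)
	{
		struct io_uring ring;
		struct io_uring_cqe *cqe;
		unsigned int i;
		int ready_fd = -1;

		if (io_uring_queue_init(nfds, &ring, 0) < 0)
			return -1;

		for (i = 0; i < nfds; i++) {
			struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

			/* One-shot poll request per fd. */
			io_uring_prep_poll_add(sqe, fds[i], POLLIN);
			io_uring_sqe_set_data(sqe, (void *)(intptr_t)fds[i]);
		}
		io_uring_submit(&ring);

		if (io_uring_wait_cqe(&ring, &cqe) == 0) {
			ready_fd = (int)(intptr_t)io_uring_cqe_get_data(cqe);
			io_uring_cqe_seen(&ring, cqe);
		}
		io_uring_queue_exit(&ring);
		return ready_fd;
	}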




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-09 Thread Jason Wang


On 2021/6/8 6:45 PM, Enrico Weigelt, metux IT consult wrote:

On 07.06.21 20:01, Jason Gunthorpe wrote:

 it is what it is, select has a fixed size bitmap of FD #s and
a hard upper bound on that size as part of the glibc ABI - can't be
fixed.


in glibc ABI ? Uuuuh!



Note that dealing with select(), or trying to overcome the limitation via
epoll() directly in the application, is not good practice (and it's not
portable).


It's suggested to use the building blocks provided by glib, e.g. the main
event loop [1]. That is how QEMU solves the issue of dealing with a lot
of file descriptors.


Thanks

[1] https://developer.gnome.org/glib/stable/glib-The-Main-Event-Loop.html
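
A minimal sketch of that approach: each fd gets a source via g_unix_fd_add()
and the GLib main loop dispatches callbacks, with no FD_SETSIZE-style limit
(error handling omitted):

	#include <glib.h>
	#include <glib-unix.h>

	static gboolean on_fd_ready(gint fd, GIOCondition cond, gpointer user_data)
	{
		/* Handle the event on 'fd', e.g. read a page-request record. */
		return G_SOURCE_CONTINUE;   /* keep watching this fd */
	}

	static void watch_fds(const int *fds, unsigned int nfds)
	{
		GMainLoop *loop = g_main_loop_new(NULL, FALSE);
		unsigned int i;

		for (i = 0; i < nfds; i++)
			g_unix_fd_add(fds[i], G_IO_IN, on_fd_ready, NULL);

		g_main_loop_run(loop);  /* dispatches callbacks as fds become ready */
	}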








Re: [RFC] /dev/ioasid uAPI proposal

2021-06-09 Thread Jason Wang


On 2021/6/8 9:20 PM, Jason Gunthorpe wrote:

On Tue, Jun 08, 2021 at 09:10:42AM +0800, Jason Wang wrote:


Well, this sounds like a re-invention of io_uring which has already worked
for multifds.

How so? io_uring is about sending work to the kernel, not getting
structured events back?



Actually it can. Userspace can poll multiple fds by preparing multiple
sqes with the IORING_OP_ADD flag.





It is more like one of the perf rings



This means another ring, and we need to introduce ioctl()s to add or remove
ioasids from the poll. And it still needs a kind of fallback, like a list,
if the ring is full.


Thanks




Jason




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-09 Thread Alex Williamson
On Wed, 9 Jun 2021 02:49:32 +
"Tian, Kevin"  wrote:

> > From: Alex Williamson 
> > Sent: Wednesday, June 9, 2021 2:47 AM
> > 
> > On Tue, 8 Jun 2021 15:44:26 +0200
> > Paolo Bonzini  wrote:
> >   
> > > On 08/06/21 15:15, Jason Gunthorpe wrote:  
> > > > On Tue, Jun 08, 2021 at 09:56:09AM +0200, Paolo Bonzini wrote:
> > > >  
> > >  Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of  
> > ioctls. But it  
> > >  seems useless complication compared to just using what we have  
> > now, at least  
> > >  while VMs only use IOASIDs via VFIO.  
> > > >>>
> > > >>> The simplest is KVM_ENABLE_WBINVD() and be  
> > done  
> > > >>> with it.  
> > 
> > Even if we were to relax wbinvd access to any device (capable of
> > no-snoop or not) in any IOMMU configuration (blocking no-snoop or not),
> > I think as soon as we say "proof" is required to gain this access then
> > that proof should be ongoing for the life of the access.
> > 
> > That alone makes this more than a "I want this feature, here's my
> > proof", one-shot ioctl.  Like the groupfd enabling a path for KVM to
> > ask if that group is non-coherent and holding a group reference to
> > prevent the group from being used to authorize multiple KVM instances,
> > the ioasidfd proof would need to minimally validate that devices are
> > present and provide some reference to enforce that model as ongoing, or
> > notifier to indicate an end of that authorization.  I don't think we
> > can simplify that out of the equation or we've essentially invalidated
> > that the proof is really required.
> >   
> > > >>
> > > >> The simplest one is KVM_DEV_VFIO_GROUP_ADD/DEL, that already  
> > exists and also  
> > > >> covers hot-unplug.  The second simplest one is  
> > KVM_DEV_IOASID_ADD/DEL.  
> > > >
> > > > This isn't the same thing, this is back to trying to have the kernel
> > > > set policy for userspace.  
> > >
> > > If you want a userspace policy then there would be three states:
> > >
> > > * WBINVD enabled because a WBINVD-enabled VFIO device is attached.
> > >
> > > * WBINVD potentially enabled but no WBINVD-enabled VFIO device  
> > attached  
> > >
> > > * WBINVD forcefully disabled
> > >
> > > KVM_DEV_VFIO_GROUP_ADD/DEL can still be used to distinguish the first
> > > two.  Due to backwards compatibility, those two describe the default
> > > behavior; disabling wbinvd can be done easily with a new sub-ioctl of
> > > KVM_ENABLE_CAP and doesn't require any security proof.  
> > 
> > That seems like a good model, use the kvm-vfio device for the default
> > behavior and extend an existing KVM ioctl if QEMU still needs a way to
> > tell KVM to assume all DMA is coherent, regardless of what the kvm-vfio
> > device reports.
> > 
> > It feels like we should be able to support a backwards compatibility
> > mode using the vfio group, but I expect long term we'll want to
> > transition the kvm-vfio device from a groupfd to an ioasidfd.
> >   
> > > The meaning of WBINVD-enabled is "won't return -ENXIO for the wbinvd
> > > ioctl", nothing more nothing less.  If all VFIO devices are going to be
> > > WBINVD-enabled, then that will reflect on KVM as well, and I won't have
> > > anything to object if there's consensus on the device assignment side of
> > > things that the wbinvd ioctl won't ever fail.  
> > 
> > If we create the IOMMU vs device coherency matrix:
> > 
> >                 \ Device supports
> >  IOMMU blocks    \   no-snoop
> >    no-snoop       \  yes | no  |
> >  ------------------+-----+-----+
> >         yes        |  1  |  2  |
> >  ------------------+-----+-----+
> >         no         |  3  |  4  |
> >  ------------------+-----+-----+
> > 
> > DMA is always coherent in boxes {1,2,4} (wbinvd emulation is not
> > needed).  VFIO will currently always configure the IOMMU for {1,2} when
> > the feature is supported.  Boxes {3,4} are where we'll currently
> > emulate wbinvd.  The best we could do, not knowing the guest or
> > insights into the guest driver would be to only emulate wbinvd for {3}.
> > 
> > The majority of devices appear to report no-snoop support {1,3}, but
> > the claim is that it's mostly unused outside of GPUs, effectively {2,4}.
> > I'll speculate that most IOMMUs support enforcing coherency (amd, arm,
> > fsl unconditionally, intel conditionally) {1,2}.
> > 
> > I think that means we're currently operating primarily in Box {1},
> > which does not seem to lean towards unconditional wbinvd access with
> > device ownership.
> > 
> > I think we have a desire with IOASID to allow user policy to operate
> > certain devices in {3} and I think the argument there is that a
> > specific software enforced coherence sync is more efficient on the bus
> > than the constant coherence enforcement by the IOMMU.
> > 
> > I think that the disable mode Jason proposed is essentially just a way
> > to move a device from {3} to {4}, ie. the IOASID support or
> > configuration does not block no-snoop and the device claims to support
> > no-snoop, but doesn't use 

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-09 Thread Paolo Bonzini

On 09/06/21 16:45, Jason Gunthorpe wrote:

On Wed, Jun 09, 2021 at 08:31:34AM -0600, Alex Williamson wrote:


If we go back to the wbinvd ioctl mechanism, if I call that ioctl with
an ioasidfd that contains no devices, then I shouldn't be able to
generate a wbinvd on the processor, right?  If I add a device,
especially in a configuration that can generate non-coherent DMA, now
that ioctl should work.  If I then remove all devices from that ioasid,
what then is the difference from the initial state.  Should the ioctl
now work because it worked once in the past?


The ioctl is fine, but telling KVM to enable WBINVD is very similar to
open and then reconfiguring the ioasid_fd is very similar to
chmod. From a security perspective revoke is not strictly required,
IMHO.


I absolutely do *not* want an API that tells KVM to enable WBINVD.  This 
is not up for discussion.


But really, let's stop calling the file descriptor a security proof or a 
capability.  It's overkill; all that we are doing here is kernel 
acceleration of the WBINVD ioctl.


As a thought experiment, let's consider what would happen if wbinvd 
caused an unconditional exit from guest to userspace.  Userspace would 
react by invoking the ioctl on the ioasid.  The proposed functionality 
is just an acceleration of this same thing, avoiding the 
guest->KVM->userspace->IOASID->wbinvd trip.


This is why the API that I want, and that is already exists for VFIO 
group file descriptors, informs KVM of which "ioctls" the guest should 
be able to do via privileged instructions[1].  Then the kernel works out 
with KVM how to ensure a 1:1 correspondence between the operation of the 
ioctls and the privileged operations.


One way to do it would be to always trap WBINVD and invoke the same 
kernel function that implements the ioctl.  The function would do either 
a wbinvd or nothing, based on whether the ioasid has any device.  The 
next logical step is a notification mechanism that enables WBINVD (by 
disabling the WBINVD intercept) when there are devices in the ioasidfd, 
and disables WBINVD (by enabling a no-op intercept) when there are none.


And in fact once all VFIO devices are gone, wbinvd is for all purposes a 
no-op as far as the guest kernel can tell.  So there's no reason to 
treat it as anything but a no-op.


Thanks,

Paolo

[1] As an aside, I must admit I didn't entirely understand the design of 
the KVM-VFIO device back when Alex added it.  But with this model it was 
absolutely the right thing to do, and it remains the right thing to do 
even if VFIO groups are replaced with IOASID file descriptors.
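
A rough sketch of the two steps described above, always trapping WBINVD and
sharing the ioctl's logic, then toggling the intercept on notifications; the
helper names and the ioasidfd callback are assumptions, not existing KVM or
iommufd interfaces:

	/* Step 1: always intercept WBINVD and share the ioctl's implementation. */
	static void kvm_emulate_wbinvd(struct kvm *kvm)
	{
		if (ioasidfd_has_noncoherent_device(kvm->ioasid_fd))
			wbinvd_on_all_cpus();
		/* else: no non-coherent DMA is possible, so WBINVD is a no-op */
	}

	/* Step 2: let the ioasidfd notify KVM so the intercept can be toggled. */
	static void kvm_ioasid_noncoherent_notify(struct kvm *kvm, bool noncoherent)
	{
		if (noncoherent)
			kvm_arch_passthrough_wbinvd(kvm);  /* drop the intercept */
		else
			kvm_arch_noop_wbinvd(kvm);         /* intercept, do nothing */
	}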




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-09 Thread Jason Gunthorpe
On Wed, Jun 09, 2021 at 08:31:34AM -0600, Alex Williamson wrote:

> If we go back to the wbinvd ioctl mechanism, if I call that ioctl with
> an ioasidfd that contains no devices, then I shouldn't be able to
> generate a wbinvd on the processor, right?  If I add a device,
> especially in a configuration that can generate non-coherent DMA, now
> that ioctl should work.  If I then remove all devices from that ioasid,
> what then is the difference from the initial state.  Should the ioctl
> now work because it worked once in the past?

The ioctl is fine, but telling KVM to enable WBINVD is very similar to
open and then reconfiguring the ioasid_fd is very similar to
chmod. From a security perspective revoke is not strictly required,
IMHO.

> access.  This is no different than starting a shell via sudo (ie. an
> ongoing reference) or having the previous authentication time out, or
> in our case be notified it has expired.

Those are all authentication gates as well, yes sudo has a timer, but
once the timer expires it doesn't forcibly revoke & close all the
existing sudo sessions. It just means you can't create new ones
without authenticating.

> > > That's already more or less meaningless for both KVM and VFIO, since they
> > > are tied to an mm.  
> > 
> > vfio isn't supposed to be tied to a mm.
> 
> vfio does accounting against an mm, why shouldn't it be tied to an mm?

It looks like vfio type 1 is doing it properly: each range of user
VA is stuffed into a struct vfio_dma, and that contains a struct task
(which can be an mm_struct these days) that refers to the owning mm.

It looks like a single fd can hold multiple vfio_dma's, and I don't see any
enforcement that current is locked to any specific process.

When the accounting is done it is done via the mm obtained through the
vfio_dma struct, not a global FD wide mm.

This all appears fine for something using pin_user_pages(). We don't
expect FDs to become locked to a single process on the first call to
pin_user_pages(); that is un-unixy.

kvm is special in this regard.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-09 Thread Jason Gunthorpe
On Wed, Jun 09, 2021 at 03:24:11PM +0200, Paolo Bonzini wrote:
> On 09/06/21 14:47, Jason Gunthorpe wrote:
> > On Wed, Jun 09, 2021 at 02:46:05PM +0200, Paolo Bonzini wrote:
> > > On 09/06/21 13:57, Jason Gunthorpe wrote:
> > > > On Wed, Jun 09, 2021 at 02:49:32AM +, Tian, Kevin wrote:
> > > > 
> > > > > Last unclosed open. Jason, you dislike symbol_get in this contract per
> > > > > earlier comment. As Alex explained, looks it's more about module
> > > > > dependency which is orthogonal to how this contract is designed. What
> > > > > is your opinion now?
> > > > 
> > > > Generally when you see symbol_get like this it suggests something is
> > > > wrong in the layering..
> > > > 
> > > > Why shouldn't kvm have a normal module dependency on drivers/iommu?
> > > 
> > > It allows KVM to load even if there's an "install /bin/false" for vfio
> > > (typically used together with the blacklist directive) in modprobe.conf.
> > > This rationale should apply to iommu as well.
> > 
> > I can vaguely understand this rationale for vfio, but not at all for
> > the platform's iommu driver, sorry.
> 
> Sorry, should apply to ioasid, not iommu (assuming that /dev/ioasid support
> would be modular).

/dev/ioasid and drivers/iommu are tightly coupled things; I think we can put
a key symbol or two in a place where it will not be a problem to
depend on.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-09 Thread Alex Williamson
On Wed, 9 Jun 2021 08:54:45 -0300
Jason Gunthorpe  wrote:

> On Wed, Jun 09, 2021 at 11:11:17AM +0200, Paolo Bonzini wrote:
> > On 09/06/21 10:51, Enrico Weigelt, metux IT consult wrote:  
> > > On 08.06.21 21:00, Jason Gunthorpe wrote:
> > >   
> > > > Eg I can do open() on a file and I get to keep that FD. I get to keep
> > > > that FD even if someone later does chmod() on that file so I can't
> > > > open it again.
> > > > 
> > > > There are lots of examples where a one time access control check
> > > > provides continuing access to a resource. I feel the ongoing proof is
> > > > the rarity in Unix.. 'revoke' is an uncommon concept in Unix..  
> > > 
> > > Yes, it's even possible that somebody w/ privileges opens an fd and
> > > hands it over to somebody unprivileged (eg. via unix socket). This is
> > > a very basic unix concept. If some (already opened) fd now suddenly
> > > behaves differently based on the current caller, that would be a break
> > > with traditional unix semantics.  

That's not the scenario here; this would work as expected.  It's not a
matter of who uses the fd, it's that a property of the fd that provided
elevated access has been removed.  I only need to look as far as sudo
to find examples of protocols where having access at some point in time
does not guarantee ongoing access.

If we go back to the wbinvd ioctl mechanism, if I call that ioctl with
an ioasidfd that contains no devices, then I shouldn't be able to
generate a wbinvd on the processor, right?  If I add a device,
especially in a configuration that can generate non-coherent DMA, now
that ioctl should work.  If I then remove all devices from that ioasid,
what then is the difference from the initial state.  Should the ioctl
now work because it worked once in the past?

If we equate KVM to the process access to the ioctl, then a reference
or notification method should be used to retain or mark the end of that
access.  This is no different than starting a shell via sudo (ie. an
ongoing reference) or having the previous authentication time out, or
in our case be notified it has expired.

> > That's already more or less meaningless for both KVM and VFIO, since they
> > are tied to an mm.  
> 
> vfio isn't supposed to be tied to a mm.

vfio does accounting against an mm, why shouldn't it be tied to an mm?
Maybe you mean in the new model where vfio is just a device driver?
Thanks,

Alex



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-09 Thread Paolo Bonzini

On 09/06/21 14:47, Jason Gunthorpe wrote:

On Wed, Jun 09, 2021 at 02:46:05PM +0200, Paolo Bonzini wrote:

On 09/06/21 13:57, Jason Gunthorpe wrote:

On Wed, Jun 09, 2021 at 02:49:32AM +, Tian, Kevin wrote:


Last unclosed open. Jason, you dislike symbol_get in this contract per
earlier comment. As Alex explained, looks it's more about module
dependency which is orthogonal to how this contract is designed. What
is your opinion now?


Generally when you see symbol_get like this it suggests something is
wrong in the layering..

Why shouldn't kvm have a normal module dependency on drivers/iommu?


It allows KVM to load even if there's an "install /bin/false" for vfio
(typically used together with the blacklist directive) in modprobe.conf.
This rationale should apply to iommu as well.


I can vaguely understand this rationale for vfio, but not at all for
the platform's iommu driver, sorry.


Sorry, should apply to ioasid, not iommu (assuming that /dev/ioasid 
support would be modular).


Paolo



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-09 Thread Jason Gunthorpe
On Wed, Jun 09, 2021 at 02:46:05PM +0200, Paolo Bonzini wrote:
> On 09/06/21 13:57, Jason Gunthorpe wrote:
> > On Wed, Jun 09, 2021 at 02:49:32AM +, Tian, Kevin wrote:
> > 
> > > Last unclosed open. Jason, you dislike symbol_get in this contract per
> > > earlier comment. As Alex explained, looks it's more about module
> > > dependency which is orthogonal to how this contract is designed. What
> > > is your opinion now?
> > 
> > Generally when you see symbol_get like this it suggests something is
> > wrong in the layering..
> > 
> > Why shouldn't kvm have a normal module dependency on drivers/iommu?
> 
> It allows KVM to load even if there's an "install /bin/false" for vfio
> (typically used together with the blacklist directive) in modprobe.conf.
> This rationale should apply to iommu as well.

I can vaguely understand this rationale for vfio, but not at all for
the platform's iommu driver, sorry.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-09 Thread Paolo Bonzini

On 09/06/21 13:57, Jason Gunthorpe wrote:

On Wed, Jun 09, 2021 at 02:49:32AM +, Tian, Kevin wrote:


Last unclosed open. Jason, you dislike symbol_get in this contract per
earlier comment. As Alex explained, looks it's more about module
dependency which is orthogonal to how this contract is designed. What
is your opinion now?


Generally when you see symbol_get like this it suggests something is
wrong in the layering..

Why shouldn't kvm have a normal module dependency on drivers/iommu?


It allows KVM to load even if there's an "install /bin/false" for vfio 
(typically used together with the blacklist directive) in modprobe.conf. 
 This rationale should apply to iommu as well.


Paolo



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-09 Thread Jason Gunthorpe
On Wed, Jun 09, 2021 at 02:49:32AM +, Tian, Kevin wrote:

> Last unclosed open. Jason, you dislike symbol_get in this contract per
> earlier comment. As Alex explained, looks it's more about module
> dependency which is orthogonal to how this contract is designed. What
> is your opinion now?

Generally when you see symbol_get like this it suggests something is
wrong in the layering..

Why shouldn't kvm have a normal module dependency on drivers/iommu?

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-09 Thread Jason Gunthorpe
On Wed, Jun 09, 2021 at 11:11:17AM +0200, Paolo Bonzini wrote:
> On 09/06/21 10:51, Enrico Weigelt, metux IT consult wrote:
> > On 08.06.21 21:00, Jason Gunthorpe wrote:
> > 
> > > Eg I can do open() on a file and I get to keep that FD. I get to keep
> > > that FD even if someone later does chmod() on that file so I can't
> > > open it again.
> > > 
> > > There are lots of examples where a one time access control check
> > > provides continuing access to a resource. I feel the ongoing proof is
> > > the rarity in Unix.. 'revoke' is an uncommon concept in Unix..
> > 
> > Yes, it's even possible that somebody w/ privileges opens an fd and
> > hands it over to somebody unprivileged (eg. via unix socket). This is
> > a very basic unix concept. If some (already opened) fd now suddenly
> > behaves differently based on the current caller, that would be a break
> > with traditional unix semantics.
> 
> That's already more or less meaningless for both KVM and VFIO, since they
> are tied to an mm.

vfio isn't supposed to be tied to a mm.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-09 Thread Paolo Bonzini

On 09/06/21 10:51, Enrico Weigelt, metux IT consult wrote:

On 08.06.21 21:00, Jason Gunthorpe wrote:


Eg I can do open() on a file and I get to keep that FD. I get to keep
that FD even if someone later does chmod() on that file so I can't
open it again.

There are lots of examples where a one time access control check
provides continuing access to a resource. I feel the ongoing proof is
the rarity in Unix.. 'revoke' is an uncommon concept in Unix..


Yes, it's even possible that somebody w/ privileges opens an fd and
hands it over to somebody unprivileged (eg. via unix socket). This is
a very basic unix concept. If some (already opened) fd now suddenly
behaves differently based on the current caller, that would be a break
with traditional unix semantics.


That's already more or less meaningless for both KVM and VFIO, since 
they are tied to an mm.


Paolo



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-09 Thread Enrico Weigelt, metux IT consult

On 08.06.21 21:00, Jason Gunthorpe wrote:


Eg I can do open() on a file and I get to keep that FD. I get to keep
that FD even if someone later does chmod() on that file so I can't
open it again.

There are lots of examples where a one time access control check
provides continuing access to a resource. I feel the ongoing proof is
the rarity in Unix.. 'revoke' is an uncommon concept in Unix..


Yes, it's even possible that somebody w/ privileges opens an fd and
hands it over to somebody unprivileged (eg. via unix socket). This is
a very basic unix concept. If some (already opened) fd now suddenly
behaves differently based on the current caller, that would be a break
with traditional unix semantics.


--mtx

--
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287

RE: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread Tian, Kevin
> From: David Gibson 
> Sent: Tuesday, June 8, 2021 8:50 AM
> 
> On Thu, Jun 03, 2021 at 06:49:20AM +, Tian, Kevin wrote:
> > > From: David Gibson
> > > Sent: Thursday, June 3, 2021 1:09 PM
> > [...]
> > > > > In this way the SW mode is the same as a HW mode with an infinite
> > > > > cache.
> > > > >
> > > > > The collaposed shadow page table is really just a cache.
> > > > >
> > > >
> > > > OK. One additional thing is that we may need a 'caching_mode'
> > > > thing reported by /dev/ioasid, indicating whether invalidation is
> > > > required when changing non-present to present. For hardware
> > > > nesting it's not reported as the hardware IOMMU will walk the
> > > > guest page table in cases of iotlb miss. For software nesting
> > > > caching_mode is reported so the user must issue invalidation
> > > > upon any change in guest page table so the kernel can update
> > > > the shadow page table timely.
> > >
> > > For the first cut, I'd have the API assume that invalidates are
> > > *always* required.  Some bypass to avoid them in cases where they're
> > > not needed can be an additional extension.
> > >
> >
> > Isn't the typical TLB semantics that non-present entries are not
> > cached, thus invalidation is not required when going from non-present
> > to present?
> 
> Usually, but not necessarily.
> 
> > It's true to both CPU TLB and IOMMU TLB.
> 
> I don't think it's entirely true of the CPU TLB on all ppc MMU models
> (of which there are far too many).
> 
> > In reality
> > I feel there are more usages built on hardware nesting than software
> > nesting thus making default following hardware TLB behavior makes
> > more sense...
> 
> I'm arguing for always-require-invalidate because it's strictly more
> general.  Requiring the invalidate will support models that don't
> require it in all cases; we just make the invalidate a no-op.  The
> reverse is not true, so we should tackle the general case first, then
> optimize.
> 

It makes sense. Will adopt this way.
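
An illustrative sketch of that contract: userspace always issues the
invalidation after updating the guest I/O page table, and the kernel turns it
into a cheap no-op when hardware nesting walks the guest tables itself
(structure and helper names are assumptions, not part of the proposal):

	struct ioasid_invalidate_args {
		u32	ioasid;
		u64	iova;
		u64	size;
	};

	static int ioasid_invalidate(struct ioasid_ctx *ctx,
				     struct ioasid_invalidate_args *args)
	{
		struct ioasid_data *data = ioasid_find(ctx, args->ioasid);

		if (!data)
			return -ENOENT;

		/* Hardware nesting: the IOMMU walks the guest page table on a
		 * TLB miss, so there is nothing to synchronize here. */
		if (!data->sw_nesting)
			return 0;

		/* Software nesting: re-shadow the affected range into the
		 * kernel-managed I/O page table. */
		return ioasid_shadow_range(data, args->iova, args->size);
	}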


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Wednesday, June 9, 2021 2:47 AM
> 
> On Tue, 8 Jun 2021 15:44:26 +0200
> Paolo Bonzini  wrote:
> 
> > On 08/06/21 15:15, Jason Gunthorpe wrote:
> > > On Tue, Jun 08, 2021 at 09:56:09AM +0200, Paolo Bonzini wrote:
> > >
> >  Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of
> ioctls. But it
> >  seems useless complication compared to just using what we have
> now, at least
> >  while VMs only use IOASIDs via VFIO.
> > >>>
> > >>> The simplest is KVM_ENABLE_WBINVD() and be
> done
> > >>> with it.
> 
> Even if we were to relax wbinvd access to any device (capable of
> no-snoop or not) in any IOMMU configuration (blocking no-snoop or not),
> I think as soon as we say "proof" is required to gain this access then
> that proof should be ongoing for the life of the access.
> 
> That alone makes this more than a "I want this feature, here's my
> proof", one-shot ioctl.  Like the groupfd enabling a path for KVM to
> ask if that group is non-coherent and holding a group reference to
> prevent the group from being used to authorize multiple KVM instances,
> the ioasidfd proof would need to minimally validate that devices are
> present and provide some reference to enforce that model as ongoing, or
> notifier to indicate an end of that authorization.  I don't think we
> can simplify that out of the equation or we've essentially invalidated
> that the proof is really required.
> 
> > >>
> > >> The simplest one is KVM_DEV_VFIO_GROUP_ADD/DEL, that already
> exists and also
> > >> covers hot-unplug.  The second simplest one is
> KVM_DEV_IOASID_ADD/DEL.
> > >
> > > This isn't the same thing, this is back to trying to have the kernel
> > > set policy for userspace.
> >
> > If you want a userspace policy then there would be three states:
> >
> > * WBINVD enabled because a WBINVD-enabled VFIO device is attached.
> >
> > * WBINVD potentially enabled but no WBINVD-enabled VFIO device
> attached
> >
> > * WBINVD forcefully disabled
> >
> > KVM_DEV_VFIO_GROUP_ADD/DEL can still be used to distinguish the first
> > two.  Due to backwards compatibility, those two describe the default
> > behavior; disabling wbinvd can be done easily with a new sub-ioctl of
> > KVM_ENABLE_CAP and doesn't require any security proof.
> 
> That seems like a good model, use the kvm-vfio device for the default
> behavior and extend an existing KVM ioctl if QEMU still needs a way to
> tell KVM to assume all DMA is coherent, regardless of what the kvm-vfio
> device reports.
> 
> It feels like we should be able to support a backwards compatibility
> mode using the vfio group, but I expect long term we'll want to
> transition the kvm-vfio device from a groupfd to an ioasidfd.
> 
> > The meaning of WBINVD-enabled is "won't return -ENXIO for the wbinvd
> > ioctl", nothing more nothing less.  If all VFIO devices are going to be
> > WBINVD-enabled, then that will reflect on KVM as well, and I won't have
> > anything to object if there's consensus on the device assignment side of
> > things that the wbinvd ioctl won't ever fail.
> 
> If we create the IOMMU vs device coherency matrix:
> 
>                 \ Device supports
>  IOMMU blocks    \   no-snoop
>    no-snoop       \  yes | no  |
>  ------------------+-----+-----+
>         yes        |  1  |  2  |
>  ------------------+-----+-----+
>         no         |  3  |  4  |
>  ------------------+-----+-----+
> 
> DMA is always coherent in boxes {1,2,4} (wbinvd emulation is not
> needed).  VFIO will currently always configure the IOMMU for {1,2} when
> the feature is supported.  Boxes {3,4} are where we'll currently
> emulate wbinvd.  The best we could do, not knowing the guest or
> insights into the guest driver would be to only emulate wbinvd for {3}.
> 
> The majority of devices appear to report no-snoop support {1,3}, but
> the claim is that it's mostly unused outside of GPUs, effectively {2,4}.
> I'll speculate that most IOMMUs support enforcing coherency (amd, arm,
> fsl unconditionally, intel conditionally) {1,2}.
> 
> I think that means we're currently operating primarily in Box {1},
> which does not seem to lean towards unconditional wbinvd access with
> device ownership.
> 
> I think we have a desire with IOASID to allow user policy to operate
> certain devices in {3} and I think the argument there is that a
> specific software enforced coherence sync is more efficient on the bus
> than the constant coherence enforcement by the IOMMU.
> 
> I think that the disable mode Jason proposed is essentially just a way
> to move a device from {3} to {4}, ie. the IOASID support or
> configuration does not block no-snoop and the device claims to support
> no-snoop, but doesn't use it.  How we'd determine this to be true for
> a device without a crystal ball of driver development or hardware
> errata that no-snoop transactions are not possible regardless of the
> behavior of the enable bit, I'm not sure.  If we're operating a device
> in {3}, but the device does not generate no-snoop 

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread Jason Gunthorpe
On Tue, Jun 08, 2021 at 04:04:26PM +1000, David Gibson wrote:

> > What I would like is that the /dev/iommu side managing the IOASID
> > doesn't really care much, but the device driver has to tell
> > drivers/iommu what it is going to do when it attaches.
> 
> By the device driver, do you mean the userspace or guest device
> driver?  Or do you mean the vfio_pci or mdev "shim" device driver"?

I mean vfio_pci, mdev "shim", vdpa, etc. Some kernel driver that is
allowing qemu access to a HW resource.

> > It makes sense, in PCI terms, only the driver knows what TLPs the
> > device will generate. The IOMMU needs to know what TLPs it will
> > recieve to configure properly.
> > 
> > PASID or not is major device specific variation, as is the ENQCMD/etc
> > 
> > Having the device be explicit when it tells the IOMMU what it is going
> > to be sending is a major plus to me. I actually don't want to see this
> > part of the interface be made less strong.
> 
> Ok, if I'm understanding this right a PASID capable IOMMU will be able
> to process *both* transactions with just a RID and transactions with a
> RID+PASID.

Yes

> So if we're thinking of this notional 84ish-bit address space, then
> that includes "no PASID" as well as all the possible PASID values.
> Yes?  Or am I confused?

Right, though I expect how to model 'no pasid' vs all the pasids is
some micro-detail someone would need to work on a real vIOMMU
implementation to decide..
 
> > /dev/iommu is concerned with setting up the IOAS and filling the IO
> > page tables with information
> > 
> > The driver behind "struct vfio_device" is responsible to "route" its
> > HW into that IOAS.
> > 
> > They are two halves of the problem, one is only the io page table, and the
> > other is the connection of a PCI TLP to a specific io page table.
> > 
> > Only the driver knows what format of TLPs the device will generate so
> > only the driver can specify the "route"
> 
> Ok.  I'd really like if we can encode this in a way that doesn't build
> PCI-specific structure into the API, though.

I think we should at least have bus specific "convenience" APIs for
the popular cases. It is complicated enough already, trying to force
people to figure out the kernel synonym for a PCI standard name gets
pretty rough... Plus the RID is inherently a hardware specific
concept.
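
As a rough sketch of what such a PCI "convenience" attach could look like
(every name below is invented for illustration and is not part of the
proposal), the point being that userspace gets to speak in RID/PASID terms
directly instead of a kernel-invented synonym:

/*
 * Illustration only: IOASID_ATTACH_PCI_DEVICE and struct ioasid_attach_pci
 * are invented names, not proposed uAPI.
 */
#include <linux/types.h>
#include <sys/ioctl.h>

struct ioasid_attach_pci {
	__u32	argsz;
	__u32	flags;
#define IOASID_ATTACH_PCI_PASID		(1 << 0)	/* route by RID+PASID */
	__u32	ioasid;		/* I/O address space to route into */
	__u32	pasid;		/* valid only with the PASID flag */
};
#define IOASID_ATTACH_PCI_DEVICE  _IOW('i', 0x40, struct ioasid_attach_pci) /* number invented */

/* Route a vfio-pci style device FD into an existing IOASID by plain RID */
int attach_pci_rid(int device_fd, __u32 ioasid)
{
	struct ioasid_attach_pci attach = {
		.argsz  = sizeof(attach),
		.ioasid = ioasid,
	};

	return ioctl(device_fd, IOASID_ATTACH_PCI_DEVICE, &attach);
}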

> > Inability to match the RID is rare, certainly I would expect any IOMMU
> > HW that can do PCIe PASID matching can also do RID matching.
> 
> It's not just up to the IOMMU.  The obvious case is a PCIe-to-PCI
> bridge.  

Yes.. but PCI is *really* old at this point. Even PCI-X sustains the
originating RID.

The general case here is that each device can route to its own
IOAS. The specialty case is that only one IOAS in a group can be
used. We should make room in the API for the special case without
destroying the general case.

> > Oh, I hadn't spent time thinking about any of those.. It is messy but
> > it can still be forced to work, I guess. A device centric model means
> > all the devices using the same routing ID have to be connected to the
> > same IOASID by userspace. So some of the connections will be NOPs.
> 
> See, that's exactly what I thought the group checks were enforcing.
> I'm really hoping we don't need two levels of granularity here: groups
> of devices that can't be identified from each other, and then groups
> of those that can't be isolated from each other.  That introduces a
> huge amount of extra conceptual complexity.

We've got this far with groups that mean all those things together, I
wouldn't propose to do a bunch of kernel work to change that
significantly.

I just want to have a device centric uAPI so we are not trapped
forever in groups being 1:1 with an IOASID model, which is clearly not
accurately modeling what today's systems are actually able to do,
especially with PASID.

We can report some fixed info to user space 'all these devices share
one ioasid' and leave it for now/ever
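
A sketch of that "fixed info" idea, again with invented names (the struct and
ioctl below are illustrations, not proposed uAPI): the kernel reports an
opaque token per device, and devices returning the same token must be
attached to the same IOASID, i.e. they share a group today.

#include <linux/types.h>
#include <sys/ioctl.h>

struct ioasid_device_info {
	__u32	argsz;
	__u32	flags;
	__u32	group_token;	/* same token => must share one IOASID */
	__u32	pasid_width;	/* 0 if PASID routing is not available */
};
#define IOASID_GET_DEVICE_INFO  _IOR('i', 0x41, struct ioasid_device_info) /* number invented */

int query_device(int device_fd, struct ioasid_device_info *info)
{
	info->argsz = sizeof(*info);
	return ioctl(device_fd, IOASID_GET_DEVICE_INFO, info);
}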

Jason



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread Jason Gunthorpe
On Tue, Jun 08, 2021 at 10:53:02AM +1000, David Gibson wrote:
> On Thu, Jun 03, 2021 at 08:52:24AM -0300, Jason Gunthorpe wrote:
> > On Thu, Jun 03, 2021 at 03:13:44PM +1000, David Gibson wrote:
> > 
> > > > We can still consider it a single "address space" from the IOMMU
> > > > perspective. What has happened is that the address table is not just a
> > > > 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".
> > > 
> > > True.  This does complexify how we represent what IOVA ranges are
> > > valid, though.  I'll bet you most implementations don't actually
> > > implement a full 64-bit IOVA, which means we effectively have a large
> > > number of windows from (0..max IOVA) for each valid pasid.  This adds
> > > another reason I don't think my concept of IOVA windows is just a
> > > power specific thing.
> > 
> > Yes
> > 
> > Things rapidly get into weird hardware specific stuff though, the
> > request will be for things like:
> >   "ARM PASID page table format from SMMU IP block vXX"
> 
> So, I'm happy enough for picking a user-managed pagetable format to
> imply the set of valid IOVA ranges (though a query might be nice).

I think a query is mandatory, and optionally asking for ranges seems
generally useful as a HW property.

The danger is things can get really tricky as the app can ask for
ranges some HW needs but other HW can't provide. 

I would encourage a flow where "generic" apps like DPDK can somehow
just ignore this, or at least be very, very simplified "I want around
XX GB of IOVA space"

dpdk type apps vs qemu apps are really quite different and we should
be careful that the needs of HW accelerated vIOMMU emulation do not
trump the needs of simple universal control over a DMA map.
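
To make the two consumer styles concrete, here is a sketch with invented
names only (neither structure below is real uAPI): a DPDK-like app just
wants "some IOVA space" and lets the kernel pick addresses, while a vIOMMU
emulation needs to learn the exact legal ranges so it can mirror them to the
guest.

#include <linux/types.h>

/* Simple consumer: map a buffer "anywhere", kernel returns the IOVA it chose */
struct ioas_map_anywhere {
	__u32	argsz;
	__u32	ioasid;
	__u64	vaddr;		/* process virtual address to map */
	__u64	length;
	__u64	iova;		/* output: where the kernel placed it */
};

/* vIOMMU-style consumer: query what the hardware can actually translate */
struct ioas_iova_range {
	__u64	start;
	__u64	last;
};

struct ioas_iova_ranges {
	__u32	argsz;
	__u32	ioasid;
	__u32	nr_ranges;	/* in: array capacity, out: ranges available */
	__u32	__reserved;
	struct ioas_iova_range ranges[];
};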

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread Jason Gunthorpe
On Tue, Jun 08, 2021 at 12:47:00PM -0600, Alex Williamson wrote:
> On Tue, 8 Jun 2021 15:44:26 +0200
> Paolo Bonzini  wrote:
> 
> > On 08/06/21 15:15, Jason Gunthorpe wrote:
> > > On Tue, Jun 08, 2021 at 09:56:09AM +0200, Paolo Bonzini wrote:
> > >   
> >  Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls. 
> >  But it
> >  seems useless complication compared to just using what we have now, at 
> >  least
> >  while VMs only use IOASIDs via VFIO.  
> > >>>
> > >>> The simplest is KVM_ENABLE_WBINVD() and be done
> > >>> with it.  
> 
> Even if we were to relax wbinvd access to any device (capable of
> no-snoop or not) in any IOMMU configuration (blocking no-snoop or not),
> I think as soon as we say "proof" is required to gain this access then
> that proof should be ongoing for the life of the access.

This idea is not entirely consistent with the usual Unix access
control model..

Eg I can do open() on a file and I get to keep that FD. I get to keep
that FD even if someone later does chmod() on that file so I can't
open it again.

There are lots of examples where a one time access control check
provides continuing access to a resource. I feel the ongoing proof is
the rarity in Unix.. 'revoke' is an uncommon concept in Unix..
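
Concretely, a minimal userspace demonstration of that behavior (it assumes a
regular readable file ./data.txt owned by the caller):

/*
 * The access check happens once, at open() time; tightening the file mode
 * afterwards does not revoke the already-open FD.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	char buf[16];
	int fd = open("data.txt", O_RDONLY);	/* permission checked here */

	if (fd < 0)
		return 1;

	chmod("data.txt", 0000);		/* "revoke" the mode later...   */

	ssize_t n = read(fd, buf, sizeof(buf));	/* ...the FD still works        */
	printf("read %zd bytes after chmod 0000\n", n);

	close(fd);
	return 0;
}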

That said, I don't feel strongly either way, would just like to see
something implementable. Even the various options to change the
feature are more thought explorations to try to understand how to
model the feature than any requirements I am aware of.

> notifier to indicate an end of that authorization.  I don't think we
> can simplify that out of the equation or we've essentially invalidated
> that the proof is really required.

The proof is like the chmod in the above open() example. Once kvm is
authorized it keeps working even if a new authorization could not be
obtained. It is not very different from chmod'ing a file after
something opened it.

Inability to revoke doesn't invalidate the value of the initial
one-time access control check.

Generally agree on the rest of your message

Regards,
Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread Alex Williamson
On Tue, 8 Jun 2021 15:44:26 +0200
Paolo Bonzini  wrote:

> On 08/06/21 15:15, Jason Gunthorpe wrote:
> > On Tue, Jun 08, 2021 at 09:56:09AM +0200, Paolo Bonzini wrote:
> >   
>  Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls. But 
>  it
>  seems useless complication compared to just using what we have now, at 
>  least
>  while VMs only use IOASIDs via VFIO.  
> >>>
> >>> The simplest is KVM_ENABLE_WBINVD() and be done
> >>> with it.  

Even if we were to relax wbinvd access to any device (capable of
no-snoop or not) in any IOMMU configuration (blocking no-snoop or not),
I think as soon as we say "proof" is required to gain this access then
that proof should be ongoing for the life of the access.

That alone makes this more than an "I want this feature, here's my
proof", one-shot ioctl.  Like the groupfd enabling a path for KVM to
ask if that group is non-coherent and holding a group reference to
prevent the group from being used to authorize multiple KVM instances,
the ioasidfd proof would need to minimally validate that devices are
present and provide some reference to enforce that model as ongoing, or
notifier to indicate an end of that authorization.  I don't think we
can simplify that out of the equation or we've essentially invalidated
that the proof is really required.

> >>
> >> The simplest one is KVM_DEV_VFIO_GROUP_ADD/DEL, that already exists and 
> >> also
> >> covers hot-unplug.  The second simplest one is KVM_DEV_IOASID_ADD/DEL.  
> > 
> > This isn't the same thing, this is back to trying to have the kernel
> > set policy for userspace.  
> 
> If you want a userspace policy then there would be three states:
> 
> * WBINVD enabled because a WBINVD-enabled VFIO device is attached.
> 
> * WBINVD potentially enabled but no WBINVD-enabled VFIO device attached
> 
> * WBINVD forcefully disabled
> 
> KVM_DEV_VFIO_GROUP_ADD/DEL can still be used to distinguish the first 
> two.  Due to backwards compatibility, those two describe the default 
> behavior; disabling wbinvd can be done easily with a new sub-ioctl of 
> KVM_ENABLE_CAP and doesn't require any security proof.

That seems like a good model, use the kvm-vfio device for the default
behavior and extend an existing KVM ioctl if QEMU still needs a way to
tell KVM to assume all DMA is coherent, regardless of what the kvm-vfio
device reports.

It feels like we should be able to support a backwards compatibility
mode using the vfio group, but I expect long term we'll want to
transition the kvm-vfio device from a groupfd to an ioasidfd.

> The meaning of WBINVD-enabled is "won't return -ENXIO for the wbinvd 
> ioctl", nothing more nothing less.  If all VFIO devices are going to be 
> WBINVD-enabled, then that will reflect on KVM as well, and I won't have 
> anything to object if there's consensus on the device assignment side of 
> things that the wbinvd ioctl won't ever fail.

If we create the IOMMU vs device coherency matrix:

             \ Device supports
IOMMU blocks  \   no-snoop
  no-snoop     \  yes | no  |
----------------+-----+-----+
            yes |  1  |  2  |
----------------+-----+-----+
            no  |  3  |  4  |
----------------+-----+-----+

DMA is always coherent in boxes {1,2,4} (wbinvd emulation is not
needed).  VFIO will currently always configure the IOMMU for {1,2} when
the feature is supported.  Boxes {3,4} are where we'll currently
emulate wbinvd.  The best we could do, not knowing the guest or
insights into the guest driver would be to only emulate wbinvd for {3}.
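
Read as code, the matrix boils down to something like the sketch below (a
paraphrase of the decision only, not code from any tree; box numbers refer to
the table above):

#include <stdbool.h>

/*
 * Boxes {1,2}: IOMMU forces coherency, no-snoop never reaches memory.
 * Box {4}:     device cannot generate no-snoop at all.
 * Box {3}:     non-coherent DMA is possible, so guest wbinvd matters.
 */
static bool wbinvd_emulation_needed(bool iommu_blocks_nosnoop,
				    bool device_supports_nosnoop)
{
	if (iommu_blocks_nosnoop)
		return false;		/* boxes {1,2} */
	if (!device_supports_nosnoop)
		return false;		/* box {4} */
	return true;			/* box {3} */
}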

The majority of devices appear to report no-snoop support {1,3}, but
the claim is that it's mostly unused outside of GPUs, effectively {2,4}.
I'll speculate that most IOMMUs support enforcing coherency (amd, arm,
fsl unconditionally, intel conditionally) {1,2}.

I think that means we're currently operating primarily in Box {1},
which does not seem to lean towards unconditional wbinvd access with
device ownership.

I think we have a desire with IOASID to allow user policy to operate
certain devices in {3} and I think the argument there is that a
specific software enforced coherence sync is more efficient on the bus
than the constant coherence enforcement by the IOMMU.

I think that the disable mode Jason proposed is essentially just a way
to move a device from {3} to {4}, ie. the IOASID support or
configuration does not block no-snoop and the device claims to support
no-snoop, but doesn't use it.  How we'd determine this to be true for
a device without a crystal ball of driver development or hardware
errata that no-snoop transactions are not possible regardless of the
behavior of the enable bit, I'm not sure.  If we're operating a device
in {3}, but the device does not generate no-snoop transactions, then
presumably the guest driver isn't generating wbinvd instructions for us
to emulate, so where are the wbinvd instructions that this feature
would prevent being emulated coming from?  Thanks,

Alex


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread Paolo Bonzini

On 08/06/21 15:15, Jason Gunthorpe wrote:

On Tue, Jun 08, 2021 at 09:56:09AM +0200, Paolo Bonzini wrote:


Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls. But it
seems useless complication compared to just using what we have now, at least
while VMs only use IOASIDs via VFIO.


The simplest is KVM_ENABLE_WBINVD() and be done
with it.


The simplest one is KVM_DEV_VFIO_GROUP_ADD/DEL, that already exists and also
covers hot-unplug.  The second simplest one is KVM_DEV_IOASID_ADD/DEL.


This isn't the same thing, this is back to trying to have the kernel
set policy for userspace.


If you want a userspace policy then there would be three states:

* WBINVD enabled because a WBINVD-enabled VFIO device is attached.

* WBINVD potentially enabled but no WBINVD-enabled VFIO device attached

* WBINVD forcefully disabled

KVM_DEV_VFIO_GROUP_ADD/DEL can still be used to distinguish the first 
two.  Due to backwards compatibility, those two describe the default 
behavior; disabling wbinvd can be done easily with a new sub-ioctl of 
KVM_ENABLE_CAP and doesn't require any security proof.
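
On the QEMU side that could look roughly like the sketch below.
KVM_ENABLE_CAP and struct kvm_enable_cap exist today, but the capability
number here is invented purely to show the shape of a "forcefully disabled"
knob:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Hypothetical capability number, not a real KVM_CAP_* value */
#define KVM_CAP_DISABLE_WBINVD	9999

/*
 * Default behavior (the first two states) still comes from the kvm-vfio
 * device via GROUP_ADD/DEL; this only covers the "forcefully disabled" case.
 */
static int kvm_force_wbinvd_noop(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_DISABLE_WBINVD,
	};

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}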


The meaning of WBINVD-enabled is "won't return -ENXIO for the wbinvd 
ioctl", nothing more nothing less.  If all VFIO devices are going to be 
WBINVD-enabled, then that will reflect on KVM as well, and I won't have 
anything to object if there's consensus on the device assignment side of 
things that the wbinvd ioctl won't ever fail.


Paolo



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread Jason Gunthorpe
On Tue, Jun 08, 2021 at 09:10:42AM +0800, Jason Wang wrote:

> Well, this sounds like a re-invention of io_uring which has already worked
> for multifds.

How so? io_uring is about sending work to the kernel, not getting
structured events back?

It is more like one of the perf rings

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread Jason Gunthorpe
On Tue, Jun 08, 2021 at 12:37:04PM +1000, David Gibson wrote:

> > The PPC/SPAPR support allows KVM to associate a vfio group to an IOMMU
> > page table so that it can handle iotlb programming from pre-registered
> > memory without trapping out to userspace.
> 
> To clarify that's a guest side logical vIOMMU page table which is
> partially managed by KVM.  This is an optimization - things can work
> without it, but it means guest iomap/unmap becomes a hot path because
> each map/unmap hypercall has to go
>   guest -> KVM -> qemu -> VFIO
> 
> So there are multiple context transitions.

Isn't this overhead true of many of the vIOMMUs? Can the fast path be
generalized?

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread Jason Gunthorpe
On Tue, Jun 08, 2021 at 09:56:09AM +0200, Paolo Bonzini wrote:

> > > Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls. But 
> > > it
> > > seems useless complication compared to just using what we have now, at 
> > > least
> > > while VMs only use IOASIDs via VFIO.
> > 
> > The simplest is KVM_ENABLE_WBINVD() and be done
> > with it.
> 
> The simplest one is KVM_DEV_VFIO_GROUP_ADD/DEL, that already exists and also
> covers hot-unplug.  The second simplest one is KVM_DEV_IOASID_ADD/DEL.

This isn't the same thing, this is back to trying to have the kernel
set policy for userspace.

qemu needs to be in direct control of this specific KVM emulation
feature if it is going to support the full range of options.

IMHO obfuscating it with some ADD/DEL doesn't achieve that.

Especially since even today GROUP_ADD/DEL is not just about
controlling wbinvd but also about linking mdevs to the kvm struct - it
is both not optional to call from qemu and triggers behavior that is
against the userspace policy.

This is why I prefer a direct and obvious KVM_ENABLE_WBINVD approach

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread Jason Gunthorpe
On Tue, Jun 08, 2021 at 12:43:38PM +0200, Enrico Weigelt, metux IT consult 
wrote:

> > It is two devices, thus two files.
> 
> Two separate real (hardware) devices or just two logical device nodes ?

A real PCI device and a real IOMMU block integrated into the CPU

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread Jason Gunthorpe
On Tue, Jun 08, 2021 at 10:54:59AM +0200, Enrico Weigelt, metux IT consult 
wrote:

> Maybe the device as well as the transport could announce their
> capability (which IMHO should go via the virtio protocol), and if both
> are capable, the (guest's) virtio subsys tells the driver whether it's
> usable for a specific device. Perhaps we should also have a mechanism
> to tell the device that it's actually used.

The usage should be extremely narrow, like 

"If the driver issues a GPU command with flag X then the resulting
DMAs will be no-snoop and the driver must re-establish coherency at
the right moment"

It is not a general idea, but something baked directly into the device
protocol that virtio carries.

The general notion of no-snoop should ideally be carried in the PCIe
extended config space flag. If 0 then no-snoop should never be issued,
expected, or requested.
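
For reference, the config-space knob in question is the "Enable No Snoop" bit
in the PCIe Device Control register; a kernel driver that never wants such
TLPs can clear it with existing helpers (a minimal sketch, and it only helps
for devices that actually honor the bit, which is exactly the caveat
discussed elsewhere in the thread):

#include <linux/pci.h>

/* Clear "Enable No Snoop" so a well-behaved device stops issuing such TLPs */
static int disallow_no_snoop(struct pci_dev *pdev)
{
	return pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL,
					  PCI_EXP_DEVCTL_NOSNOOP_EN);
}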

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread Enrico Weigelt, metux IT consult

On 07.06.21 20:01, Jason Gunthorpe wrote:

 it is what it is, select has a fixed size bitmap of FD #s and
a hard upper bound on that size as part of the glibc ABI - can't be
fixed.


in glibc ABI ? Uuuuh!


--mtx


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread Enrico Weigelt, metux IT consult

On 04.06.21 14:30, Jason Gunthorpe wrote:

Hi,


Containers already needed to do this today. Container orchestration is
hard.


Yes, but I hate to see even more work upcoming here.


Yes, /dev/ioasid shouldn't do anything unless you have a device to
connect it with. In this way it is probably safe to stuff it into
every container.


Okay, if we can guarantee that, I'm completely fine.


Having FDs spawn other FDs is pretty ugly, it defeats the "everything
is a file" model of UNIX.


Unfortunately, this is already defeated in many other places :(
(I'd even claim that ioctls already break it :p)


I think you are reaching a bit :)


It seems your approach also breaks this, since we now need to open two
files in order to talk to one device.


It is two devices, thus two files.


Two separate real (hardware) devices or just two logical device nodes ?


--mtx


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread Enrico Weigelt, metux IT consult

On 08.06.21 03:00, Jason Wang wrote:

Hi folks,

Just to make sure we are on the same page. What I meant is, if the DMA
behavior (like no-snoop) is device specific, there's no need to mandate
a general virtio attribute. We can describe it per device. The fact that
devices implemented in the current spec do not use non-coherent DMA
doesn't mean any future devices won't do that. The driver could choose a
transport (e.g. PCI), platform (ACPI) or device specific (general virtio
command) way to detect and flush the cache when necessary.


Maybe I've totally misunderstood the whole issue, but what I've learned
so far:

* it's a performance improvement for certain scenarios
* whether it can be used depends on the devices as well as the
  underlying transport (combination of both)
* whether it should be used (when possible) can only be decided by the
  driver

Correct ?

I tend to believe that's something that virtio infrastructure should
handle in a generic way.

Maybe the device as well as the transport could announce their
capability (which IMHO should go via the virtio protocol), and if both
are capable, the (guest's) virtio subsys tells the driver whether it's
usable for a specific device. Perhaps we should also have a mechanism
to tell the device that it's actually used.


Sorry if I'm completely on the wrong page and just talking junk here :o


--mtx


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread Paolo Bonzini

On 07/06/21 19:59, Jason Gunthorpe wrote:

The KVM interface is the same kvm-vfio device that exists already.  The
userspace API does not need to change at all: adding one VFIO file
descriptor with WBINVD enabled to the kvm-vfio device lets the VM use WBINVD
functionality (see kvm_vfio_update_coherency).


The problem is we are talking about adding a new /dev/ioasid FD and it
won't fit into the existing KVM VFIO FD interface. There are lots of
options here, one is to add new ioctls that specifically use the new
FD, the other is to somehow use VFIO as a proxy to carry things to the
/dev/ioasid FD code.


Exactly.


Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls. But it
seems useless complication compared to just using what we have now, at least
while VMs only use IOASIDs via VFIO.


The simplest is KVM_ENABLE_WBINVD() and be done
with it.


The simplest one is KVM_DEV_VFIO_GROUP_ADD/DEL, that already exists and 
also covers hot-unplug.  The second simplest one is KVM_DEV_IOASID_ADD/DEL.


It need not be limited to wbinvd support, it's just a generic "let VMs 
do what userspace can do if it has access to this file descriptor". 
That it enables guest WBINVD is an implementation detail.



Either way, there should be no policy attached to the add/delete operations.
KVM users want to add the VFIO (or IOASID) file descriptors to the device
independent of WBINVD.  If userspace wants/needs to apply its own policy on
whether to enable WBINVD or not, they can do it on the VFIO/IOASID side:


Why does KVM need to know about IOASID's? I don't think it can do
anything with this general information.


Indeed, it only uses them as the security proofs---either VFIO or IOASID 
file descriptors can be used as such.


Paolo



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread David Gibson
On Thu, Jun 03, 2021 at 09:11:05AM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 03, 2021 at 03:45:09PM +1000, David Gibson wrote:
> > On Wed, Jun 02, 2021 at 01:58:38PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > > > /* Bind guest I/O page table  */
> > > > > > bind_data = {
> > > > > > .ioasid = gva_ioasid;
> > > > > > .addr   = gva_pgtable1;
> > > > > > // and format information
> > > > > > };
> > > > > > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > > > > 
> > > > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > > > there any reason to split these things? The only advantage to the
> > > > > split is the device is known, but the device shouldn't impact
> > > > > anything..
> > > > 
> > > > I'm pretty sure the device(s) could matter, although they probably
> > > > won't usually. 
> > > 
> > > It is a bit subtle, but the /dev/iommu fd itself is connected to the
> > > devices first. This prevents wildly incompatible devices from being
> > > joined together, and allows some "get info" to report the capability
> > > union of all devices if we want to do that.
> > 
> > Right.. but I've not been convinced that having a /dev/iommu fd
> > instance be the boundary for these types of things actually makes
> > sense.  For example if we were doing the preregistration thing
> > (whether by child ASes or otherwise) then that still makes sense
> > across wildly different devices, but we couldn't share that layer if
> > we have to open different instances for each of them.
> 
> It is something that still seems up in the air.. What seems clear for
> /dev/iommu is that it
>  - holds a bunch of IOASID's organized into a tree
>  - holds a bunch of connected devices

Right, and it's still not really clear to me what devices connected to
the same /dev/iommu instance really need to have in common, as
distinct from what devices connected to the same specific ioasid need
to have in common.

>  - holds a pinned memory cache
> 
> One thing it must do is enforce IOMMU group security. A device cannot
> be attached to an IOASID unless all devices in its IOMMU group are
> part of the same /dev/iommu FD.

Well, you can't attach a device to an individual IOASID unless all
devices in its group are attached to the same individual IOASID
either, so I'm not clear what benefit there is to enforcing it at the
/dev/iommu instance as well as at the individual ioasid level.

> The big open question is what parameters govern allowing devices to
> connect to the /dev/iommu:
>  - all devices can connect and we model the differences inside the API
>somehow.
>  - Only sufficiently "similar" devices can be connected
>  - The FD's capability is the minimum of all the connected devices
> 
> There are some practical problems here, when an IOASID is created the
> kernel does need to allocate a page table for it, and that has to be
> in some definite format.
> 
> It may be that we had a false start thinking the FD container should
> be limited. Perhaps creating an IOASID should pass in a list
> of the "device labels" that the IOASID will be used with and that can
> guide the kernel what to do?
> 
> > Right, but at this stage I'm just not seeing a really clear (across
> > platforms and device typpes) boundary for what things have to be per
> > IOASID container and what have to be per IOASID, so I'm just not sure
> > the /dev/iommu instance grouping makes any sense.
> 
> I would push as much stuff as possible to be per-IOASID..

I agree.  But the question is what's *not* possible to be per-IOASID,
so what's the semantic boundary that defines when things have to be in
the same /dev/iommu instance, but not the same IOASID.

> > > I don't know if that small advantage is worth the extra complexity
> > > though.
> > > 
> > > > But it would certainly be possible for a system to have two
> > > > different host bridges with two different IOMMUs with different
> > > > pagetable formats.  Until you know which devices (and therefore
> > > > which host bridge) you're talking about, you don't know what formats
> > > > of pagetable to accept.  And if you have devices from *both* bridges
> > > > you can't bind a page table at all - you could theoretically support
> > > > a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> > > > in both formats, but it would be pretty reasonable not to support
> > > > that.
> > > 
> > > The basic process for a user space owned pgtable mode would be:
> > > 
> > >  1) qemu has to figure out what format of pgtable to use
> > > 
> > > Presumably it uses query functions using the device label.
> > 
> > No... in the qemu case it would always select the page table format
> > that it needs to present to the guest.  That's part of the
> > guest-visible platform that's selected by qemu's configuration.
> 
> I should have said "vfio user" here because apps like DPDK might use
> this 

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread David Gibson
On Thu, Jun 03, 2021 at 07:17:23AM +, Tian, Kevin wrote:
> > From: David Gibson 
> > Sent: Wednesday, June 2, 2021 2:15 PM
> > 
> [...] 
> > > An I/O address space takes effect in the IOMMU only after it is attached
> > > to a device. The device in the /dev/ioasid context always refers to a
> > > physical one or 'pdev' (PF or VF).
> > 
> > What you mean by "physical" device here isn't really clear - VFs
> > aren't really physical devices, and the PF/VF terminology also doesn't
> > extend to non-PCI devices (which I think we want to consider for the
> > API, even if we're not implementing it any time soon).
> 
> Yes, it's not very clear, and more in PCI context to simplify the 
> description. A "physical" one here means an PCI endpoint function
> which has a unique RID. It's more to differentiate with later mdev/
> subdevice which uses both RID+PASID. Naming is always a hard
> exercise to me... Possibly I'll just use device vs. subdevice in future
> versions.
> 
> > 
> > Now, it's clear that we can't program things into the IOMMU before
> > attaching a device - we might not even know which IOMMU to use.
> 
> yes
> 
> > However, I'm not sure if its wise to automatically make the AS "real"
> > as soon as we attach a device:
> > 
> >  * If we're going to attach a whole bunch of devices, could we (for at
> >least some IOMMU models) end up doing a lot of work which then has
> >to be re-done for each extra device we attach?
> 
> which extra work did you specifically refer to? each attach just implies
> writing the base address of the I/O page table to the IOMMU structure
> corresponding to this device (either being a per-device entry, or per
> device+PASID entry).
> 
> and generally device attach should not be in a hot path.
> 
> > 
> >  * With kernel managed IO page tables could attaching a second device
> >(at least on some IOMMU models) require some operation which would
> >require discarding those tables?  e.g. if the second device somehow
> >forces a different IO page size
> 
> Then the attach should fail and the user should create another IOASID
> for the second device.

Couldn't this make things weirdly order dependent though?  If device A
has strictly more capabilities than device B, then attaching A then B
will be fine, but B then A will trigger a new ioasid fd.

> > For that reason I wonder if we want some sort of explicit enable or
> > activate call.  Device attaches would only be valid before, map or
> > attach pagetable calls would only be valid after.
> 
> I'm interested in learning a real example requiring explicit enable...
> 
> > 
> > > One I/O address space could be attached to multiple devices. In this case,
> > > /dev/ioasid uAPI applies to all attached devices under the specified 
> > > IOASID.
> > >
> > > Based on the underlying IOMMU capability one device might be allowed
> > > to attach to multiple I/O address spaces, with DMAs accessing them by
> > > carrying different routing information. One of them is the default I/O
> > > address space routed by PCI Requestor ID (RID) or ARM Stream ID. The
> > > remaining are routed by RID + Process Address Space ID (PASID) or
> > > Stream+Substream ID. For simplicity the following context uses RID and
> > > PASID when talking about the routing information for I/O address spaces.
> > 
> > I'm not really clear on how this interacts with nested ioasids.  Would
> > you generally expect the RID+PASID IOASes to be children of the base
> > RID IOAS, or not?
> 
> No. With Intel SIOV both parent/children could be RID+PASID, e.g.
> when one enables vSVA on a mdev.

Hm, ok.  I really haven't understood how the PASIDs fit into this
then.  I'll try again on v2.

> > If the PASID ASes are children of the RID AS, can we consider this not
> > as the device explicitly attaching to multiple IOASIDs, but instead
> > attaching to the parent IOASID with awareness of the child ones?
> > 
> > > Device attachment is initiated through passthrough framework uAPI (use
> > > VFIO for simplicity in following context). VFIO is responsible for 
> > > identifying
> > > the routing information and registering it to the ioasid driver when 
> > > calling
> > > ioasid attach helper function. It could be RID if the assigned device is
> > > pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition,
> > > user might also provide its view of virtual routing information (vPASID) 
> > > in
> > > the attach call, e.g. when multiple user-managed I/O address spaces are
> > > attached to the vfio_device. In this case VFIO must figure out whether
> > > vPASID should be directly used (for pdev) or converted to a kernel-
> > > allocated one (pPASID, for mdev) for physical routing (see section 4).
> > >
> > > Device must be bound to an IOASID FD before attach operation can be
> > > conducted. This is also through VFIO uAPI. In this proposal one device
> > > should not be bound to multiple FD's. Not sure about the gain of
> > > allowing it except adding 

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread David Gibson
On Fri, Jun 04, 2021 at 12:24:08PM +0200, Jean-Philippe Brucker wrote:
> On Thu, Jun 03, 2021 at 03:45:09PM +1000, David Gibson wrote:
> > > > But it would certainly be possible for a system to have two
> > > > different host bridges with two different IOMMUs with different
> > > > pagetable formats.  Until you know which devices (and therefore
> > > > which host bridge) you're talking about, you don't know what formats
> > > > of pagetable to accept.  And if you have devices from *both* bridges
> > > > you can't bind a page table at all - you could theoretically support
> > > > a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> > > > in both formats, but it would be pretty reasonable not to support
> > > > that.
> > > 
> > > The basic process for a user space owned pgtable mode would be:
> > > 
> > >  1) qemu has to figure out what format of pgtable to use
> > > 
> > > Presumably it uses query functions using the device label.
> > 
> > No... in the qemu case it would always select the page table format
> > that it needs to present to the guest.  That's part of the
> > guest-visible platform that's selected by qemu's configuration.
> > 
> > There's no negotiation here: either the kernel can supply what qemu
> > needs to pass to the guest, or it can't.  If it can't qemu, will have
> > to either emulate in SW (if possible, probably using a kernel-managed
> > IOASID to back it) or fail outright.
> > 
> > > The
> > > kernel code should look at the entire device path through all the
> > > IOMMU HW to determine what is possible.
> > > 
> > > Or it already knows because the VM's vIOMMU is running in some
> > > fixed page table format, or the VM's vIOMMU already told it, or
> > > something.
> > 
> > Again, I think you have the order a bit backwards.  The user selects
> > the capabilities that the vIOMMU will present to the guest as part of
> > the qemu configuration.  Qemu then requests that of the host kernel,
> > and either the host kernel supplies it, qemu emulates it in SW, or
> > qemu fails to start.
> 
> Hm, how fine a capability are we talking about?  If it's just "give me
> VT-d capabilities" or "give me Arm capabilities" that would work, but
> probably isn't useful. Anything finer will be awkward because userspace
> will have to try combinations of capabilities to see what sticks, and
> supporting new hardware will drop compatibility for older one.

For the qemu case, I would imagine a two stage fallback:

1) Ask for the exact IOMMU capabilities (including pagetable
   format) that the vIOMMU has.  If the host can supply, you're
   good

2) If not, ask for a kernel managed IOAS.  Verify that it can map
   all the IOVA ranges the guest vIOMMU needs, and has an equal or
   smaller pagesize than the guest vIOMMU presents.  If so,
   software emulate the vIOMMU by shadowing guest io pagetable
   updates into the kernel managed IOAS.

3) You're out of luck, don't start.

For both (1) and (2) I'd expect it to be asking this question *after*
saying what devices are attached to the IOAS, based on the virtual
hardware configuration.  That doesn't cover hotplug, of course, for
that you have to just fail the hotplug if the new device isn't
supportable with the IOAS you already have.

One can imagine optimizations where for certain intermediate cases you
could do a lighter SW emu if the host supports a model that's close to
the vIOMMU one, and you're able to trap and emulate the differences.
In practice I doubt anyone's going to have time to look for such cases
and implement the logic for it.
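
In pseudocode the fallback might look like the sketch below; every helper
name is invented and only stands in for whatever the final uAPI ends up
providing:

#include <stdint.h>

/* All ioas_*() and viommu_*() helpers below are placeholders, not real APIs */
static int setup_viommu_backing(int iommu_fd, uint32_t guest_pgtable_format,
				uint64_t guest_page_size)
{
	/* 1) Ask for exactly the pagetable format the vIOMMU presents */
	if (ioas_bind_user_pgtable(iommu_fd, guest_pgtable_format) == 0)
		return 0;	/* hardware nesting: guest tables used directly */

	/* 2) Fall back to a kernel-managed IOAS and shadow guest updates */
	if (ioas_alloc_kernel_managed(iommu_fd) == 0 &&
	    ioas_covers_guest_iova_ranges(iommu_fd) &&
	    ioas_min_page_size(iommu_fd) <= guest_page_size)
		return viommu_enable_shadowing(iommu_fd);

	/* 3) Out of luck: refuse to start with this configuration */
	return -1;
}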

> For example depending whether the hardware IOMMU is SMMUv2 or SMMUv3, that
> completely changes the capabilities offered to the guest (some v2
> implementations support nesting page tables, but never PASID nor PRI
> unlike v3.) The same vIOMMU could support either, presenting different
> capabilities to the guest, even multiple page table formats if we wanted
> to be exhaustive (SMMUv2 supports the older 32-bit descriptor), but it
> needs to know early on what the hardware is precisely. Then some new page
> table format shows up and, although the vIOMMU can support that in
> addition to older ones, QEMU will have to pick a single one, that it
> assumes the guest knows how to drive?
> 
> I think once it binds a device to an IOASID fd, QEMU will want to probe
> what hardware features are available before going further with the vIOMMU
> setup (is there PASID, PRI, which page table formats are supported,
> address size, page granule, etc). Obtaining precise information about the
> hardware would be less awkward than trying different configurations until
> one succeeds. Binding an additional device would then fail if its pIOMMU
> doesn't support exactly the features supported for the first device,
> because we don't know which ones the guest will choose. QEMU will have to
> open a new IOASID fd for that device.

No, this fundamentally 

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread David Gibson
On Thu, Jun 03, 2021 at 09:28:32AM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 03, 2021 at 03:23:17PM +1000, David Gibson wrote:
> > On Wed, Jun 02, 2021 at 01:37:53PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Jun 02, 2021 at 04:57:52PM +1000, David Gibson wrote:
> > > 
> > > > I don't think presence or absence of a group fd makes a lot of
> > > > difference to this design.  Having a group fd just means we attach
> > > > groups to the ioasid instead of individual devices, and we no longer
> > > > need the bookkeeping of "partial" devices.
> > > 
> > > Oh, I think we really don't want to attach the group to an ioasid, or
> > > at least not as a first-class idea.
> > > 
> > > The fundamental problem that got us here is we now live in a world
> > > where there are many ways to attach a device to an IOASID:
> > 
> > I'm not seeing that that's necessarily a problem.
> > 
> > >  - A RID binding
> > >  - A RID,PASID binding
> > >  - A RID,PASID binding for ENQCMD
> > 
> > I have to admit I haven't fully grasped the differences between these
> > modes.  I'm hoping we can consolidate at least some of them into the
> > same sort of binding onto different IOASIDs (which may be linked in
> > parent/child relationships).
> 
> What I would like is that the /dev/iommu side managing the IOASID
> doesn't really care much, but the device driver has to tell
> drivers/iommu what it is going to do when it attaches.

By the device driver, do you mean the userspace or guest device
driver?  Or do you mean the vfio_pci or mdev "shim" device driver"?

> It makes sense, in PCI terms, only the driver knows what TLPs the
> device will generate. The IOMMU needs to know what TLPs it will
> receive to configure properly.
> 
> PASID or not is major device specific variation, as is the ENQCMD/etc
> 
> Having the device be explicit when it tells the IOMMU what it is going
> to be sending is a major plus to me. I actually don't want to see this
> part of the interface be made less strong.

Ok, if I'm understanding this right a PASID capable IOMMU will be able
to process *both* transactions with just a RID and transactions with a
RID+PASID.

So if we're thinking of this notional 84ish-bit address space, then
that includes "no PASID" as well as all the possible PASID values.
Yes?  Or am I confused?

> 
> > > The selection of which mode to use is based on the specific
> > > driver/device operation. Ie the thing that implements the 'struct
> > > vfio_device' is the thing that has to select the binding mode.
> > 
> > I thought userspace selected the binding mode - although not all modes
> > will be possible for all devices.
> 
> /dev/iommu is concerned with setting up the IOAS and filling the IO
> page tables with information
> 
> The driver behind "struct vfio_device" is responsible to "route" its
> HW into that IOAS.
> 
> They are two halves of the problem, one is only the io page table, and the
> other is the connection of a PCI TLP to a specific io page table.
> 
> Only the driver knows what format of TLPs the device will generate so
> only the driver can specify the "route"

Ok.  I'd really like if we can encode this in a way that doesn't build
PCI-specific structure into the API, though.

>  
> > > eg if two PCI devices are in a group then it is perfectly fine that
> > > one device uses RID binding and the other device uses RID,PASID
> > > binding.
> > 
> > U... I don't see how that can be.  They could well be in the same
> > group because their RIDs cannot be distinguished from each other.
> 
> Inability to match the RID is rare, certainly I would expect any IOMMU
> HW that can do PCIe PASID matching can also do RID matching.

It's not just up to the IOMMU.  The obvious case is a PCIe-to-PCI
bridge.  All transactions show the RID of the bridge, because vanilla
PCI doesn't have them.  Same situation with a buggy multifunction
device which uses function 0's RID for all functions.

It may be rare, but we still have to deal with it one way or another.

I really don't think we want to support multiple binding types for a
single group.

> With
> such HW the above is perfectly fine - the group may not be secure
> between members (eg !ACS), but the TLPs still carry valid RIDs and
> PASID and the IOMMU can still discriminate.

They carry RIDs, whether they're valid depends on how buggy your
hardware is.

> I think you are talking about really old IOMMU's that could only
> isolate based on ingress port or something.. I suppose modern PCIe has
> some cases like this in the NTB stuff too.

Depends what you mean by really old.  They may seem really old to
those working on new fancy IOMMU technology.  But I hit problems in
practice not long ago with awkwardly multi-device groups because it
was on a particular Dell server without ACS implementation.  Likewise
I strongly suspect non-PASID IOMMUs will remain common on low end
hardware (like peoples' laptops) for some time.

> Oh, I hadn't spent time thinking about any of those.. It is messy but
> it can 

RE: [RFC] /dev/ioasid uAPI proposal

2021-06-08 Thread Parav Pandit
Hi Jacob,

Sorry for the late response. Was on PTO on Friday last week.
Please see comments below.

> From: Jacob Pan 
> Sent: Friday, June 4, 2021 2:28 AM
> 
> Hi Parav,
> 
> On Tue, 1 Jun 2021 17:30:51 +, Parav Pandit  wrote:
> 
> > > From: Tian, Kevin 
> > > Sent: Thursday, May 27, 2021 1:28 PM
> >
> > > 5.6. I/O page fault
> > > +++
> > >
> > > (uAPI is TBD. Here is just about the high-level flow from host IOMMU
> > > driver to guest IOMMU driver and backwards).
> > >
> > > -   Host IOMMU driver receives a page request with raw fault_data {rid,
> > > pasid, addr};
> > >
> > > -   Host IOMMU driver identifies the faulting I/O page table according
> > > to information registered by IOASID fault handler;
> > >
> > > -   IOASID fault handler is called with raw fault_data (rid, pasid,
> > > addr), which is saved in ioasid_data->fault_data (used for
> > > response);
> > >
> > > -   IOASID fault handler generates an user fault_data (ioasid, addr),
> > > links it to the shared ring buffer and triggers eventfd to
> > > userspace;
> > >
> > > -   Upon received event, Qemu needs to find the virtual routing
> > > information (v_rid + v_pasid) of the device attached to the faulting
> > > ioasid. If there are multiple, pick a random one. This should be
> > > fine since the purpose is to fix the I/O page table on the guest;
> > >
> > > -   Qemu generates a virtual I/O page fault through vIOMMU into guest,
> > > carrying the virtual fault data (v_rid, v_pasid, addr);
> > >
> > Why does it have to be through vIOMMU?
> I think this flow is for fully emulated IOMMU, the same IOMMU and device
> drivers run in the host and guest. Page request interrupt is reported by the
> IOMMU, thus reporting to vIOMMU in the guest.
In the non-emulated case, how will the guest's page fault be handled?
If I take the Intel example, I thought FL page table entries still need to be
handled by the guest, which in turn fills up the 2nd level page table entries.
No?

> 
> > For a VFIO PCI device, have you considered to reuse the same PRI
> > interface to inject page fault in the guest? This eliminates any new
> > v_rid. It will also route the page fault request and response through
> > the right vfio device.
> >
> I am curious how PCI PRI can be used to inject a fault. Are you talking
> about PCI config PRI extended capability structure? 
PCI PRI capability is only to expose page fault support.
Page fault injection/response cannot happen through the pci cap anyway.
This requires a side channel.
I was suggesting to emulate pci_endpoint->rc->iommu->iommu_irq path of 
hypervisor, as

vmm->guest_emuated_pri_device->pri_req/rsp queue(s).


> The control is very
> limited, only enable and reset. Can you explain how page faults would be
> handled in a generic PCI cap?
Not via pci cap.
Through more generic interface without attaching to viommu.

> Some devices may have device specific way to handle page faults, but I guess
> this is not the PCI PRI method you are referring to?
This was my next question: if the page fault reporting and response interface
is generic, it will be more scalable, given that PCI PRI is limited to single
page requests.
Additionally, VT-d seems to funnel all the page fault interrupts via a single
IRQ.
And thirdly, there is the requirement to always come through the hypervisor
intermediary.

Having a generic mechanism will help to overcome the above limitations, as Jean
already pointed out that page fault handling is a hot path.
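
As a rough sketch of what such a generic (non-vIOMMU-specific) fault channel
could carry, loosely following the flow in the proposal; every name below is
invented:

#include <linux/types.h>

/*
 * Userspace reads fault records from a ring/eventfd pair and answers by
 * (ioasid, tag), so no virtual RID has to be minted just for fault delivery.
 */
struct ioas_fault_event {
	__u32	ioasid;
	__u32	tag;		/* echoed back in the response */
	__u64	addr;		/* faulting IOVA */
	__u32	perm;		/* read/write/exec access that faulted */
	__u32	__reserved;
};

struct ioas_fault_response {
	__u32	ioasid;
	__u32	tag;
	__u32	response_code;	/* success / invalid / failure */
	__u32	__reserved;
};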

> 
> > > -   Guest IOMMU driver fixes up the fault, updates the I/O page table,
> > > and then sends a page response with virtual completion data (v_rid,
> > > v_pasid, response_code) to vIOMMU;
> > >
> > What about fixing up the fault for mmu page table as well in guest?
> > Or you meant both when above you said "updates the I/O page table"?
> >
> > It is unclear to me that if there is single nested page table
> > maintained or two (one for cr3 references and other for iommu). Can
> > you please clarify?
> >
> I think it is just one, at least for VT-d, guest cr3 in GPA is stored in the 
> host
> iommu. Guest iommu driver calls handle_mm_fault to fix the mmu page
> tables which is shared by the iommu.
> 
So if the guest has touched the page data, the FL and SL entries of the mmu
should be populated and the IOMMU side should not even reach the point of
raising a PRI (ATS should be enough), because the IOMMU side shares the same
FL and SL table entries referred to by the scalable-mode PASID-table entry
format described in Section 9.6.
Is that correct?

> > > -   Qemu finds the pending fault event, converts virtual completion data
> > > into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> > > complete the pending fault;
> > >
> > For VFIO PCI device a virtual PRI request response interface is done,
> > it can be generic interface among multiple vIOMMUs.
> >
> same question above, not sure how this works in terms of interrupts and
> response queuing etc.
> 
Citing "VFIO PCI device" was wrong on my part.
Was 

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread David Gibson
On Tue, Jun 01, 2021 at 04:22:25PM -0600, Alex Williamson wrote:
> On Tue, 1 Jun 2021 07:01:57 +
> "Tian, Kevin"  wrote:
> > 
> > I summarized five opens here, about:
> > 
> > 1)  Finalizing the name to replace /dev/ioasid;
> > 2)  Whether one device is allowed to bind to multiple IOASID fd's;
> > 3)  Carry device information in invalidation/fault reporting uAPI;
> > 4)  What should/could be specified when allocating an IOASID;
> > 5)  The protocol between vfio group and kvm;
> > 
> ...
> > 
> > For 5), I'd expect Alex to chime in. Per my understanding looks the
> > original purpose of this protocol is not about I/O address space. It's
> > for KVM to know whether any device is assigned to this VM and then
> > do something special (e.g. posted interrupt, EPT cache attribute, etc.).
> 
> Right, the original use case was for KVM to determine whether it needs
> to emulate invlpg, so it needs to be aware when an assigned device is
> present and be able to test if DMA for that device is cache coherent.
> The user, QEMU, creates a KVM "pseudo" device representing the vfio
> group, providing the file descriptor of that group to show ownership.
> The ugly symbol_get code is to avoid hard module dependencies, ie. the
> kvm module should not pull in or require the vfio module, but vfio will
> be present if attempting to register this device.
> 
> With kvmgt, the interface also became a way to register the kvm pointer
> with vfio for the translation mentioned elsewhere in this thread.
> 
> The PPC/SPAPR support allows KVM to associate a vfio group to an IOMMU
> page table so that it can handle iotlb programming from pre-registered
> memory without trapping out to userspace.

To clarify that's a guest side logical vIOMMU page table which is
partially managed by KVM.  This is an optimization - things can work
without it, but it means guest iomap/unmap becomes a hot path because
each map/unmap hypercall has to go
guest -> KVM -> qemu -> VFIO

So there are multiple context transitions.

> > Because KVM deduces some policy based on the fact of assigned device, 
> > it needs to hold a reference to related vfio group. this part is irrelevant
> > to this RFC. 
> 
> All of these use cases are related to the IOMMU, whether DMA is
> coherent, translating device IOVA to GPA, and an acceleration path to
> emulate IOMMU programming in kernel... they seem pretty relevant.
> 
> > But ARM's VMID usage is related to I/O address space thus needs some
> > consideration. Another strange thing is about PPC. Looks it also leverages
> > this protocol to do iommu group attach: kvm_spapr_tce_attach_iommu_
> > group. I don't know why it's done through KVM instead of VFIO uAPI in
> > the first place.
> 
> AIUI, IOMMU programming on PPC is done through hypercalls, so KVM needs
> to know how to handle those for in-kernel acceleration.  Thanks,

For PAPR guests, which is the common case, yes.  Bare metal POWER
hosts have their own page table format.  And probably some of the
newer embedded ppc models have some different IOMMU model entirely,
but I'm not familiar with it.


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread David Gibson
On Thu, Jun 03, 2021 at 06:49:20AM +, Tian, Kevin wrote:
> > From: David Gibson
> > Sent: Thursday, June 3, 2021 1:09 PM
> [...]
> > > > In this way the SW mode is the same as a HW mode with an infinite
> > > > cache.
> > > >
> > > > The collaposed shadow page table is really just a cache.
> > > >
> > >
> > > OK. One additional thing is that we may need a 'caching_mode"
> > > thing reported by /dev/ioasid, indicating whether invalidation is
> > > required when changing non-present to present. For hardware
> > > nesting it's not reported as the hardware IOMMU will walk the
> > > guest page table in cases of iotlb miss. For software nesting
> > > caching_mode is reported so the user must issue invalidation
> > > upon any change in guest page table so the kernel can update
> > > the shadow page table timely.
> > 
> > For the fist cut, I'd have the API assume that invalidates are
> > *always* required.  Some bypass to avoid them in cases where they're
> > not needed can be an additional extension.
> > 
> 
> Isn't the typical TLB semantics that non-present entries are not
> cached, thus invalidation is not required when making non-present
> to present?

Usually, but not necessarily.

> It's true to both CPU TLB and IOMMU TLB.

I don't think it's entirely true of the CPU TLB on all ppc MMU models
(of which there are far too many).

> In reality
> I feel there are more usages built on hardware nesting than software
> nesting thus making default following hardware TLB behavior makes
> more sense...

I'm arguing for always-require-invalidate because it's strictly more
general.  Requiring the invalidate will support models that don't
require it in all cases; we just make the invalidate a no-op.  The
reverse is not true, so we should tackle the general case first, then
optimize.


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread David Gibson
On Fri, Jun 04, 2021 at 09:30:54AM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 04, 2021 at 12:44:28PM +0200, Enrico Weigelt, metux IT consult 
> wrote:
> > On 02.06.21 19:24, Jason Gunthorpe wrote:
> > 
> > Hi,
> > 
> > >> If I understand this correctly, /dev/ioasid is a kind of "common
> > supplier"
> > >> to other APIs / devices. Why can't the fd be acquired by the
> > >> consumer APIs (eg. kvm, vfio, etc) ?
> > >
> > > /dev/ioasid would be similar to /dev/vfio, and everything already
> > > deals with exposing /dev/vfio and /dev/vfio/N together
> > >
> > > I don't see it as a problem, just more work.
> > 
> > One of the problems I'm seeing is in container environments: when
> > passing in an vfio device, we now also need to pass in /dev/ioasid,
> > thus increasing the complexity in container setup (or orchestration).
> 
> Containers already needed to do this today. Container orchestration is
> hard.

Right to use VFIO a container already needs both /dev/vfio and one or
more /dev/vfio/NNN group devices.


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread David Gibson
On Thu, Jun 03, 2021 at 08:52:24AM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 03, 2021 at 03:13:44PM +1000, David Gibson wrote:
> 
> > > We can still consider it a single "address space" from the IOMMU
> > > perspective. What has happened is that the address table is not just a
> > > 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".
> > 
> > True.  This does complexify how we represent what IOVA ranges are
> > valid, though.  I'll bet you most implementations don't actually
> > implement a full 64-bit IOVA, which means we effectively have a large
> > number of windows from (0..max IOVA) for each valid pasid.  This adds
> > another reason I don't think my concept of IOVA windows is just a
> > power specific thing.
> 
> Yes
> 
> Things rapidly get into weird hardware specific stuff though, the
> request will be for things like:
>   "ARM PASID page table format from SMMU IP block vXX"

So, I'm happy enough for picking a user-managed pagetable format to
imply the set of valid IOVA ranges (though a query might be nice).

I'm mostly thinking of representing (and/or choosing) valid IOVA
ranges as something for the kernel-managed pagetable style
(MAP/UNMAP).

> Which may have a bunch of (possibly very weird!) format specific data
> to describe and/or configure it.
> 
> The uAPI needs to be suitably general here. :(
> 
> > > If we are already going in the direction of having the IOASID specify
> > > the page table format and other details, specifying that the page
> > > table format is the 80 bit "PASID, IOVA" format is a fairly small
> > > step.
> > 
> > Well, rather I think userspace needs to request what page table format
> > it wants and the kernel tells it whether it can oblige or not.
> 
> Yes, this is what I meant.
> 
> Jason
> 

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Wang


On 2021/6/8 3:41 AM, Alex Williamson wrote:

On Mon, 7 Jun 2021 16:08:02 -0300
Jason Gunthorpe  wrote:


On Mon, Jun 07, 2021 at 12:59:46PM -0600, Alex Williamson wrote:


It is up to qemu if it wants to proceed or not. There is no issue with
allowing the use of no-snoop and blocking wbinvd, other than some
drivers may malfunction. If the user is certain they don't have
malfunctioning drivers then no issue to go ahead.

A driver that knows how to use the device in a coherent way can
certainly proceed, but I suspect that's not something we can ask of
QEMU.  QEMU has no visibility to the in-use driver and sketchy ability
to virtualize the no-snoop enable bit to prevent non-coherent DMA from
the device.  There might be an experimental ("x-" prefixed) QEMU device
option to allow user override, but QEMU should disallow the possibility
of malfunctioning drivers by default.  If we have devices that probe as
supporting no-snoop, but actually can't generate such traffic, we might
need a quirk list somewhere.

Compatibility is important, but when I look in the kernel code I see
very few places that call wbinvd(). Basically all DRM for something
relevant to qemu.

That tells me that the vast majority of PCI devices do not generate
no-snoop traffic.

Unfortunately, even just looking at devices across a couple laptops
most devices do support and have NoSnoop+ set by default.  I don't
notice anything in the kernel that actually tries to set this enable (a
handful that actively disable), so I assume it's done by the firmware.



I wonder whether or not it was done via ACPI:

"

6.2.17 _CCA (Cache Coherency Attribute) The _CCA object returns whether 
or not a bus-master device supports hardware managed cache coherency. 
Expected values are 0 to indicate it is not supported, and 1 to indicate 
that it is supported. All other values are reserved.


...

On Intel platforms, if the _CCA object is not supplied, the OSPM will 
assume the devices are hardware cache coherent.


"

Thanks



It's not safe for QEMU to make an assumption that only GPUs will
actually make use of it.


I think it makes the software design much simpler if the security
check is very simple. Possessing a suitable device in an ioasid fd
container is enough to flip on the feature and we don't need to track
changes from that point on. We don't need to revoke wbinvd if the
ioasid fd changes, for instance. Better to keep the kernel very simple
in this regard.

You're suggesting that a user isn't forced to give up wbinvd emulation
if they lose access to their device?

Sure, why do we need to be stricter? It is the same logic I gave
earlier, once an attacker process has access to wbinvd an attacker can
just keep its access indefinitely.

The main use case for revocation assumes that qemu would be
compromised after a device is hot-unplugged and you want to block off
wbinvd. But I have a hard time seeing that as useful enough to justify
all the complicated code to do it...

It's currently just a matter of the kvm-vfio device holding a reference
to the group so that it cannot be used elsewhere so long as it's being
used to elevate privileges on a given KVM instance.  If we conclude that
access to a device with the right capability is required to gain a
privilege, I don't really see how we can wave aside that the privilege
isn't lost with the device.


For KVM, qemu can turn it on/off on hot plug events as required to give
VM security. It doesn't need to rely on the kernel to control this.

Yes, QEMU can reject a hot-unplug event, but then QEMU retains the
privilege that the device grants it.  Releasing the device and
retaining the privilege gained by it seems wrong.  Thanks,

Alex




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Wang


On 2021/6/3 1:21 AM, Jason Gunthorpe wrote:

On Wed, Jun 02, 2021 at 04:54:26PM +0800, Jason Wang wrote:

On 2021/6/2 1:31 AM, Jason Gunthorpe wrote:

On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:

We can open up to ~0U file descriptors, I don't see why we need to restrict
it in uAPI.

There are significant problems with such large file descriptor
tables. High FD numbers mean things like select don't work at all
anymore and IIRC there are more complications.


I don't see much difference between IOASID and other types of fds. People can
choose to use poll or epoll.

Not really, once one thing in an application uses a large number of FDs the
entire application is affected. If any open() can return 'very big
number' then nothing in the process is allowed to ever use select.

It is not a trivial thing to ask for


And with the current proposal (assuming there's an N:1 ioasid to ioasid fd
relationship), I wonder how select can work for the specific ioasid.

pagefault events are one thing that comes to mind. Bundling them all
together into a single ring buffer is going to be necessary. Multifds
just complicate this too

Jason



Well, this sounds like a re-invention of io_uring which has already 
worked for multifds.


Thanks



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Shenming Lu
On 2021/6/7 20:19, Liu, Yi L wrote:
>> From: Shenming Lu 
>> Sent: Friday, June 4, 2021 10:03 AM
>>
>> On 2021/6/4 2:19, Jacob Pan wrote:
>>> Hi Shenming,
>>>
>>> On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu
>> 
>>> wrote:
>>>
 On 2021/6/2 1:33, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
>
>> The drivers register per page table fault handlers to /dev/ioasid which
>> will then register itself to iommu core to listen and route the per-
>> device I/O page faults.
>
> I'm still confused why drivers need fault handlers at all?

 Essentially it is the userspace that needs the fault handlers,
 one case is to deliver the faults to the vIOMMU, and another
 case is to enable IOPF on the GPA address space for on-demand
 paging, it seems that both could be specified in/through the
 IOASID_ALLOC ioctl?

>>> I would think IOASID_BIND_PGTABLE is where fault handler should be
>>> registered. There wouldn't be any IO page fault without the binding
>> anyway.
>>
>> Yeah, I also proposed this before, registering the handler in the
>> BIND_PGTABLE
>> ioctl does make sense for the guest page faults. :-)
>>
>> But how about the page faults from the GPA address space (its page table is
>> mapped through the MAP_DMA ioctl)? From your point of view, it seems
>> that we should register the handler for the GPA address space in the (first)
>> MAP_DMA ioctl.
> 
> under new proposal, I think the page fault handler is also registered
> per ioasid object. The difference compared with guest page table case
> is there is no need to inject the fault to VM.

Yeah.  And there are some issues specific to the GPA address space case
which have been discussed with Alex..  Thanks,

Shenming


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Wang


On 2021/6/7 10:14 PM, Jason Gunthorpe wrote:

On Mon, Jun 07, 2021 at 11:18:33AM +0800, Jason Wang wrote:


Note that the fact that no drivers call these things doesn't mean it was not
supported by the spec.

Of course it does. If the spec doesn't define exactly when the driver
should call the cache flushes for no-snoop transactions then the
protocol doesn't support no-snoop.



Just to make sure we are on the same page: what I meant is, if DMA
behavior like no-snoop is device specific, there's no need to mandate
a general virtio attribute; we can describe it per device. That the devices
in the current spec do not use non-coherent DMA doesn't
mean any future devices won't do that. The driver could choose a
transport (e.g. PCI), platform (ACPI) or device specific (general virtio
command) way to detect and flush caches when necessary.





no-snoop is only used in very specific sequences of operations, like
certain GPU usages, because regaining coherence on x86 is incredibly
expensive.

ie I wouldn't ever expect a NIC to use no-snoop because NIC's expect
packets to be processed by the CPU.



For a NIC, yes. But virtio is more than just NICs. We've already supported
GPU and crypto devices. In this case, no-snoop will be useful since the
data is not necessarily expected to be processed by the CPU.


And a lot of other types of devices are being proposed.

Thanks




"non-coherent DMA" is some general euphemism that evokes images of
embedded platforms that don't have coherent DMA at all and have low
cost ways to regain coherence. This is not at all what we are talking
about here at all.
  
Jason





Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Alex Williamson
On Mon, 7 Jun 2021 20:03:53 -0300
Jason Gunthorpe  wrote:

> On Mon, Jun 07, 2021 at 01:41:28PM -0600, Alex Williamson wrote:
> 
> > > Compatibility is important, but when I look in the kernel code I see
> > > very few places that call wbinvd(). Basically all DRM for something
> > > relevant to qemu.  
> > > 
> > > That tells me that the vast majority of PCI devices do not generate
> > > no-snoop traffic.  
> > 
> > Unfortunately, even just looking at devices across a couple laptops
> > most devices do support and have NoSnoop+ set by default.
> 
> Yes, mine too, but that doesn't mean the device is issuing nosnoop
> transactions, it just means the OS is allowing it to do so if it wants.
> 
> As I said, without driver support the feature cannot be used, and
> there is no driver support in Linux outside DRM, unless it is
> hidden.. Certainly I've never run into it..
> 
> Even mlx5 is setting the nosnoop bit, but I have a fairly high
> confidence that we don't set the TLP bit for anything Linux does.
> 
> > It's not safe for QEMU to make an assumption that only GPUs will
> > actually make use of it.  
> 
> Not 100% safe, but if you know you are running Linux OS in the VM you
> can look at the drivers the devices need and make a determination.

QEMU doesn't know what guest it's running or what driver the guest is
using.  QEMU can only create safe configurations by default, the same
as done now using vfio.  Anything outside of that scope would require
experimental opt-in support by the user or a guarantee from the device
vendor that the device cannot ever (not just for the existing drivers)
create non-coherent TLPs. Thanks,

Alex

> > Yes, QEMU can reject a hot-unplug event, but then QEMU retains the
> > privilege that the device grants it.  Releasing the device and
> > retaining the privilege gained by it seems wrong.  Thanks,  
> 
> It is not completely ideal, but it is such a simplification, and I
> can't really see a drawback..
> 
> Jason
> 



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Gunthorpe
On Mon, Jun 07, 2021 at 01:41:28PM -0600, Alex Williamson wrote:

> > Compatibility is important, but when I look in the kernel code I see
> > very few places that call wbinvd(). Basically all DRM for something
> > relevant to qemu.
> > 
> > That tells me that the vast majority of PCI devices do not generate
> > no-snoop traffic.
> 
> Unfortunately, even just looking at devices across a couple laptops
> most devices do support and have NoSnoop+ set by default.  

Yes, mine too, but that doesn't mean the device is issuing nosnoop
transactions, it just means the OS is allowing it to do so if it wants.

As I said, without driver support the feature cannot be used, and
there is no driver support in Linux outside DRM, unless it is
hidden.. Certainly I've never run into it..

Even mlx5 is setting the nosnoop bit, but I have a fairly high
confidence that we don't set the TLP bit for anything Linux does.

> It's not safe for QEMU to make an assumption that only GPUs will
> actually make use of it.

Not 100% safe, but if you know you are running Linux OS in the VM you
can look at the drivers the devices need and make a determination.

> Yes, QEMU can reject a hot-unplug event, but then QEMU retains the
> privilege that the device grants it.  Releasing the device and
> retaining the privilege gained by it seems wrong.  Thanks,

It is not completely ideal, but it is such a simplification, and I
can't really see a drawback..

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Alex Williamson
On Mon, 7 Jun 2021 16:08:02 -0300
Jason Gunthorpe  wrote:

> On Mon, Jun 07, 2021 at 12:59:46PM -0600, Alex Williamson wrote:
> 
> > > It is up to qemu if it wants to proceed or not. There is no issue with
> > > allowing the use of no-snoop and blocking wbinvd, other than some
> > > drivers may malfunction. If the user is certain they don't have
> > > malfunctioning drivers then no issue to go ahead.  
> > 
> > A driver that knows how to use the device in a coherent way can
> > certainly proceed, but I suspect that's not something we can ask of
> > QEMU.  QEMU has no visibility to the in-use driver and sketchy ability
> > to virtualize the no-snoop enable bit to prevent non-coherent DMA from
> > the device.  There might be an experimental ("x-" prefixed) QEMU device
> > option to allow user override, but QEMU should disallow the possibility
> > of malfunctioning drivers by default.  If we have devices that probe as
> > supporting no-snoop, but actually can't generate such traffic, we might
> > need a quirk list somewhere.  
> 
> Compatibility is important, but when I look in the kernel code I see
> very few places that call wbinvd(). Basically all DRM for something
> relevant to qemu.
> 
> That tells me that the vast majority of PCI devices do not generate
> no-snoop traffic.

Unfortunately, even just looking at devices across a couple laptops
most devices do support and have NoSnoop+ set by default.  I don't
notice anything in the kernel that actually tries to set this enable (a
handful that actively disable), so I assume it's done by the firmware.
It's not safe for QEMU to make an assumption that only GPUs will
actually make use of it.
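
(For reference, the enable in question is the "Enable No-snoop" bit in the
PCIe Device Control register; a sketch of checking or clearing it follows.
The helper names are illustrative, not existing kernel functions:)

#include <linux/pci.h>

/* Is the device currently allowed to emit no-snoop TLPs?  Note this only
 * says the OS/firmware permits it, not that the driver ever uses it. */
static bool nosnoop_enabled(struct pci_dev *pdev)
{
	u16 devctl = 0;

	pcie_capability_read_word(pdev, PCI_EXP_DEVCTL, &devctl);
	return devctl & PCI_EXP_DEVCTL_NOSNOOP_EN;
}

/* One way a quirk could hard-disable no-snoop for a suspect device. */
static void force_snoop(struct pci_dev *pdev)
{
	pcie_capability_clear_word(pdev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_NOSNOOP_EN);
}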

> > > I think it makes the software design much simpler if the security
> > > check is very simple. Possessing a suitable device in an ioasid fd
> > > container is enough to flip on the feature and we don't need to track
> > > changes from that point on. We don't need to revoke wbinvd if the
> > > ioasid fd changes, for instance. Better to keep the kernel very simple
> > > in this regard.  
> > 
> > You're suggesting that a user isn't forced to give up wbinvd emulation
> > if they lose access to their device?
> 
> Sure, why do we need to be stricter? It is the same logic I gave
> earlier, once an attacker process has access to wbinvd an attacker can
> just keep its access indefinitely.
> 
> The main use case for revocation assumes that qemu would be
> compromised after a device is hot-unplugged and you want to block off
> wbinvd. But I have a hard time seeing that as useful enough to justify
> all the complicated code to do it...

It's currently just a matter of the kvm-vfio device holding a reference
to the group so that it cannot be used elsewhere so long as it's being
used to elevate privileges on a given KVM instance.  If we conclude that
access to a device with the right capability is required to gain a
privilege, I don't really see how we can wave aside that the privilege
isn't lost with the device.

> For KVM, qemu can turn it on/off on hot plug events as required to give
> VM security. It doesn't need to rely on the kernel to control this.

Yes, QEMU can reject a hot-unplug event, but then QEMU retains the
privilege that the device grants it.  Releasing the device and
retaining the privilege gained by it seems wrong.  Thanks,

Alex



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Gunthorpe
On Mon, Jun 07, 2021 at 12:59:46PM -0600, Alex Williamson wrote:

> > It is up to qemu if it wants to proceed or not. There is no issue with
> > allowing the use of no-snoop and blocking wbinvd, other than some
> > drivers may malfunction. If the user is certain they don't have
> > malfunctioning drivers then no issue to go ahead.
> 
> A driver that knows how to use the device in a coherent way can
> certainly proceed, but I suspect that's not something we can ask of
> QEMU.  QEMU has no visibility to the in-use driver and sketchy ability
> to virtualize the no-snoop enable bit to prevent non-coherent DMA from
> the device.  There might be an experimental ("x-" prefixed) QEMU device
> option to allow user override, but QEMU should disallow the possibility
> of malfunctioning drivers by default.  If we have devices that probe as
> supporting no-snoop, but actually can't generate such traffic, we might
> need a quirk list somewhere.

Compatibility is important, but when I look in the kernel code I see
very few places that call wbinvd(). Basically all DRM for something
relevant to qemu.

That tells me that the vast majority of PCI devices do not generate
no-snoop traffic.

> > I think it makes the software design much simpler if the security
> > check is very simple. Possessing a suitable device in an ioasid fd
> > container is enough to flip on the feature and we don't need to track
> > changes from that point on. We don't need to revoke wbinvd if the
> > ioasid fd changes, for instance. Better to keep the kernel very simple
> > in this regard.
> 
> You're suggesting that a user isn't forced to give up wbinvd emulation
> if they lose access to their device?  

Sure, why do we need to be stricter? It is the same logic I gave
earlier, once an attacker process has access to wbinvd an attacker can
just keep its access indefinitely.

The main use case for revocation assumes that qemu would be
compromised after a device is hot-unplugged and you want to block off
wbinvd. But I have a hard time seeing that as useful enough to justify
all the complicated code to do it...

For KVM, qemu can turn it on/off on hot plug events as required to give
VM security. It doesn't need to rely on the kernel to control this.

But I think it is all fine tuning, the basic idea seems like it could
work, so we are not blocked here on kvm interactions.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Alex Williamson
On Mon, 7 Jun 2021 15:18:58 -0300
Jason Gunthorpe  wrote:

> On Mon, Jun 07, 2021 at 09:41:48AM -0600, Alex Williamson wrote:
> > You're calling this an admin knob, which to me suggests a global module
> > option, so are you trying to implement both an administrator and a user
> > policy?  ie. the user can create scenarios where access to wbinvd might
> > be justified by hardware/IOMMU configuration, but can be limited by the
> > admin?  
> 
> Could be a per-device sysfs too. I'm not really sure what is useful
> here.
> 
> > For example I proposed that the ioasidfd would bear the responsibility
> > of a wbinvd ioctl and therefore validate the user's access to enable
> > wbinvd emulation w/ KVM, so I'm assuming this module option lives
> > there.
> 
> Right, this is what I was thinking
> 
> > What then is "automatic" mode?  The user cannot create a non-coherent
> > IOASID with a non-coherent device if the IOMMU supports no-snoop
> > blocking?  Do they get a failure?  Does it get silently promoted to
> > coherent?  
> 
> "automatic" was just a way to keep the API the same as today. Today if
> the IOMMU can block no-snoop then vfio disables wbinvd. To get the
> same level of security automatic mode would detect that vfio would
> have blocked wbinvd because the IOMMU can do it, and then always block
> it.
> 
> It makes sense if there is an admin knob, as the admin could then move
> to an explicit enable/disable to get functionality they can't get
> today.
> 
> > In "disable" mode, I think we're just narrowing the restriction
> > further, a non-coherent capable device cannot be used except in a
> > forced coherent IOASID.  
> 
> I wouldn't say "cannot be used" - just you can't get access to
> wbinvd. 
> 
> It is up to qemu if it wants to proceed or not. There is no issue with
> allowing the use of no-snoop and blocking wbinvd, other than some
> drivers may malfunction. If the user is certain they don't have
> malfunctioning drivers then no issue to go ahead.

A driver that knows how to use the device in a coherent way can
certainly proceed, but I suspect that's not something we can ask of
QEMU.  QEMU has no visibility to the in-use driver and sketchy ability
to virtualize the no-snoop enable bit to prevent non-coherent DMA from
the device.  There might be an experimental ("x-" prefixed) QEMU device
option to allow user override, but QEMU should disallow the possibility
of malfunctioning drivers by default.  If we have devices that probe as
supporting no-snoop, but actually can't generate such traffic, we might
need a quirk list somewhere.

> The current vfio arrangement (automatic) maximized compatibility. The
> enable/disable options provide for max performance and max security as
> alternative targets.
> 
> > > It is the strength of Paolo's model that KVM should be able to do
> > > optionally less, but not more, than the process itself can do.  
> > 
> > I think my previous reply was working towards those guidelines.  I feel
> > like we're mostly in agreement, but perhaps reading past each other.  
> 
> Yes, I think I said we were agreeing :)
> 
> > Nothing here convinced me against my previous proposal that the
> > ioasidfd bears responsibility for managing access to a wbinvd ioctl,
> > and therefore the equivalent KVM access.  Whether wbinvd is allowed or
> > no-op'd when the user has access to a non-coherent device in a
> > configuration where the IOMMU prevents non-coherent DMA is maybe still
> > a matter of personal preference.  
> 
> I think it makes the software design much simpler if the security
> check is very simple. Possessing a suitable device in an ioasid fd
> container is enough to flip on the feature and we don't need to track
> changes from that point on. We don't need to revoke wbinvd if the
> ioasid fd changes, for instance. Better to keep the kernel very simple
> in this regard.

You're suggesting that a user isn't forced to give up wbinvd emulation
if they lose access to their device?  I suspect that like we do today,
we'll want to re-evaluate the need for wbinvd on most device changes.
I think this is why the kvm-vfio device holds a vfio group reference;
to make sure a given group can't elevate privileges for multiple
processes.  Thanks,

Alex



Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Gunthorpe
On Mon, Jun 07, 2021 at 09:41:48AM -0600, Alex Williamson wrote:
> You're calling this an admin knob, which to me suggests a global module
> option, so are you trying to implement both an administrator and a user
> policy?  ie. the user can create scenarios where access to wbinvd might
> be justified by hardware/IOMMU configuration, but can be limited by the
> admin?

Could be a per-device sysfs too. I'm not really sure what is useful
here.

> For example I proposed that the ioasidfd would bear the responsibility
> of a wbinvd ioctl and therefore validate the user's access to enable
> wbinvd emulation w/ KVM, so I'm assuming this module option lives
> there.  

Right, this is what I was thinking

> What then is "automatic" mode?  The user cannot create a non-coherent
> IOASID with a non-coherent device if the IOMMU supports no-snoop
> blocking?  Do they get a failure?  Does it get silently promoted to
> coherent?

"automatic" was just a way to keep the API the same as today. Today if
the IOMMU can block no-snoop then vfio disables wbinvd. To get the
same level of security automatic mode would detect that vfio would
have blocked wbinvd because the IOMMU can do it, and then always block
it.

It makes sense if there is an admin knob, as the admin could then move
to an explicit enable/disable to get functionality they can't get
today.

> In "disable" mode, I think we're just narrowing the restriction
> further, a non-coherent capable device cannot be used except in a
> forced coherent IOASID.

I wouldn't say "cannot be used" - just you can't get access to
wbinvd. 

It is up to qemu if it wants to proceed or not. There is no issue with
allowing the use of no-snoop and blocking wbinvd, other than some
drivers may malfunction. If the user is certain they don't have
malfunctioning drivers then no issue to go ahead.

The current vfio arrangement (automatic) maximized compatibility. The
enable/disable options provide for max performance and max security as
alternative targets.

> > It is the strength of Paolo's model that KVM should be able to do
> > optionally less, but not more, than the process itself can do.
> 
> I think my previous reply was working towards those guidelines.  I feel
> like we're mostly in agreement, but perhaps reading past each other.

Yes, I think I said we were agreeing :)

> Nothing here convinced me against my previous proposal that the
> ioasidfd bears responsibility for managing access to a wbinvd ioctl,
> and therefore the equivalent KVM access.  Whether wbinvd is allowed or
> no-op'd when the user has access to a non-coherent device in a
> configuration where the IOMMU prevents non-coherent DMA is maybe still
> a matter of personal preference.

I think it makes the software design much simpler if the security
check is very simple. Possessing a suitable device in an ioasid fd
container is enough to flip on the feature and we don't need to track
changes from that point on. We don't need to revoke wbinvd if the
ioasid fd changes, for instance. Better to keep the kernel very simple
in this regard.

Seems agreeable enough that there is something here to explore in
patches when the time comes

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Gunthorpe
On Mon, Jun 07, 2021 at 03:30:21PM +0200, Enrico Weigelt, metux IT consult 
wrote:
> On 02.06.21 19:21, Jason Gunthorpe wrote:
> 
> Hi,
> 
> > Not really, once one thing in an application uses a large number of FDs the
> > entire application is affected. If any open() can return 'very big
> > number' then nothing in the process is allowed to ever use select.
> 
> isn't that a bug in select() ?

 it is what it is, select has a fixed size bitmap of FD #s and
a hard upper bound on that size as part of the glibc ABI - can't be
fixed.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Gunthorpe
On Mon, Jun 07, 2021 at 08:51:42AM +0200, Paolo Bonzini wrote:
> On 07/06/21 05:25, Tian, Kevin wrote:
> > Per Intel SDM wbinvd is a privileged instruction. A process on the
> > host has no privilege to execute it.
> 
> (Half of) the point of the kernel is to do privileged tasks on the
> processes' behalf.  There are good reasons why a process that uses VFIO
> (without KVM) could want to use wbinvd, so VFIO lets them do it with an ioctl
> and adequate checks around the operation.

Yes, exactly.

You cannot write a correct VFIO application for hardware that uses the
no-snoop bit without access to wbinvd.

KVM or not does not matter.

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Gunthorpe
On Sat, Jun 05, 2021 at 08:22:27AM +0200, Paolo Bonzini wrote:
> On 04/06/21 19:22, Jason Gunthorpe wrote:
> >   4) The KVM interface is the very simple enable/disable WBINVD.
> >  Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
> >  to enable WBINVD at KVM.
> 
> The KVM interface is the same kvm-vfio device that exists already.  The
> userspace API does not need to change at all: adding one VFIO file
> descriptor with WBINVD enabled to the kvm-vfio device lets the VM use WBINVD
> functionality (see kvm_vfio_update_coherency).

The problem is we are talking about adding a new /dev/ioasid FD and it
won't fit into the existing KVM VFIO FD interface. There are lots of
options here, one is to add new ioctls that specifically use the new
FD, the other is to somehow use VFIO as a proxy to carry things to the
/dev/ioasid FD code.

> Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls. But it
> seems useless complication compared to just using what we have now, at least
> while VMs only use IOASIDs via VFIO.

The simplest is KVM_ENABLE_WBINVD() and be done
with it.

I don't need to keep track of things in KVM, just flip one flag on/off
under user control.
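
(A hypothetical sketch of what "one flag under user control" could look
like -- neither the ioctl number nor the struct exists today:)

/* Hypothetical uAPI sketch, not an existing KVM ioctl. */
struct kvm_enable_wbinvd {
	__u32 enable;		/* 0 = emulate WBINVD as a no-op, 1 = execute it */
	__s32 ioasid_fd;	/* FD proving access to a non-coherent IOASID */
};

#define KVM_ENABLE_WBINVD	_IOW(KVMIO, 0xd2, struct kvm_enable_wbinvd)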

> Either way, there should be no policy attached to the add/delete operations.
> KVM users want to add the VFIO (or IOASID) file descriptors to the device
> independent of WBINVD.  If userspace wants/needs to apply its own policy on
> whether to enable WBINVD or not, they can do it on the VFIO/IOASID side:

Why does KVM need to know about IOASIDs? I don't think it can do
anything with this general information.

> >  1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> > it communicates its no-snoop configuration:
> >  - 0 enable, allow WBINVD
> >  - 1 automatic disable, block WBINVD if the platform
> >IOMMU can police it (what we do today)
> >  - 2 force disable, do not allow WBINVD ever
> 
> Though, like Alex, it's also not clear to me whether force-disable is
> useful.  Instead userspace can query the IOMMU or the device to ensure it's
> not enabled.

"force disable" would be a way for the device to signal to whatever
query you imagine that it is not enabled. Maybe I should have called
it "no-snoop is never used"

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Gunthorpe
On Fri, Jun 04, 2021 at 11:10:53PM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Friday, June 4, 2021 8:09 PM
> > 
> > On Fri, Jun 04, 2021 at 06:37:26AM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe
> > > > Sent: Thursday, June 3, 2021 9:05 PM
> > > >
> > > > > >
> > > > > > 3) Device accepts any PASIDs from the guest. No
> > > > > >vPASID/pPASID translation is possible. (classic vfio_pci)
> > > > > > 4) Device accepts any PASID from the guest and has an
> > > > > >internal vPASID/pPASID translation (enhanced vfio_pci)
> > > > >
> > > > > what is enhanced vfio_pci? In my writing this is for mdev
> > > > > which doesn't support ENQCMD
> > > >
> > > > This is a vfio_pci that mediates some element of the device interface
> > > > to communicate the vPASID/pPASID table to the device, using Max's
> > > > series for vfio_pci drivers to inject itself into VFIO.
> > > >
> > > > For instance a device might send a message through the PF that the VF
> > > > has a certain vPASID/pPASID translation table. This would be useful
> > > > for devices that cannot use ENQCMD but still want to support migration
> > > > and thus need vPASID.
> > >
> > > I still don't quite get. If it's a PCI device why is PASID translation 
> > > required?
> > > Just delegate the per-RID PASID space to user as type-3 then migrating the
> > > vPASID space is just straightforward.
> > 
> > This is only possible if we get rid of the global pPASID allocation
> > (honestly is my preference as it makes the HW a lot simpler)
> > 
> 
> In this proposal global vs. per-RID allocation is a per-device policy.
> For vfio-pci it can always use per-RID (regardless of whether the
> device is partially mediated or not) with no vPASID/pPASID conversion.
> Even for mdev, if there is no ENQCMD we can still do per-RID conversion.
> Only for mdev which has ENQCMD do we need global pPASID allocation.
> 
> I think this is the motivation you explained earlier that it's not good
> to have one global PASID allocator in the kernel. per-RID vs. global
> should be selected per device.

I thought we concluded this wasn't possible because the guest could
choose to bind the same vPASID to a RID and to an ENQCMD device and
then we run into trouble? Or are you saying that a RID device gets a
complete dedicated table and can always have a vPASID == pPASID?

In any event it needs clear explanation in the next RFC

Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Alex Williamson
On Fri, 4 Jun 2021 20:01:08 -0300
Jason Gunthorpe  wrote:

> On Fri, Jun 04, 2021 at 03:29:18PM -0600, Alex Williamson wrote:
> > On Fri, 4 Jun 2021 14:22:07 -0300
> > Jason Gunthorpe  wrote:
> >   
> > > On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:  
> > > > On 04/06/21 18:03, Jason Gunthorpe wrote:
> > > > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
> > > > > > I don't want a security proof myself; I want to trust VFIO to make 
> > > > > > the right
> > > > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > > > > 
> > > > > > Given how KVM is just a device driver inside Linux, VMs should be a 
> > > > > > slightly
> > > > > > more roundabout way to do stuff that is accessible to bare metal; 
> > > > > > not a way
> > > > > > to gain extra privilege.
> > > > > 
> > > > > Okay, fine, lets turn the question on its head then.
> > > > > 
> > > > > VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > > > > application can make use of no-snoop optimizations. The ability of KVM
> > > > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > > > in a normal process context.
> > > > > 
> > > > > So, under what conditions do we want to allow VFIO to give a process
> > > > > elevated access to the CPU:
> > > > 
> > > > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > > > #2+#3 would be worse than what we have today), but IIUC the proposal 
> > > > (was it
> > > > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable 
> > > > ioctl,
> > > > which then would be on VFIO and not on KVM.  
> > > 
> > > At the end of the day we need an ioctl with two arguments:
> > >  - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
> > >  - The KVM FD to control wbinvd support on
> > > 
> > > Philosophically it doesn't matter too much which subsystem that ioctl
> > > lives, but we have these obnoxious cross module dependencies to
> > > consider.. 
> > > 
> > > Framing the question, as you have, to be about the process, I think
> > > explains why KVM doesn't really care what is decided, so long as the
> > > process and the VM have equivalent rights.
> > > 
> > > Alex, how about a more fleshed out suggestion:
> > > 
> > >  1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> > > it communicates its no-snoop configuration:  
> > 
> > Communicates to whom?  
> 
> To the /dev/iommu FD which will have to maintain a list of devices
> attached to it internally.
> 
> > >  - 0 enable, allow WBINVD
> > >  - 1 automatic disable, block WBINVD if the platform
> > >IOMMU can police it (what we do today)
> > >  - 2 force disable, do not allow WBINVD ever  
> > 
> > The only thing we know about the device is whether or not Enable
> > No-snoop is hard wired to zero, ie. it either can't generate no-snoop
> > TLPs ("coherent-only") or it might ("assumed non-coherent").
> 
> Here I am outlining the choices and also imagining we might want an
> admin knob to select among the three.

You're calling this an admin knob, which to me suggests a global module
option, so are you trying to implement both an administrator and a user
policy?  ie. the user can create scenarios where access to wbinvd might
be justified by hardware/IOMMU configuration, but can be limited by the
admin?

For example I proposed that the ioasidfd would bear the responsibility
of a wbinvd ioctl and therefore validate the user's access to enable
wbinvd emulation w/ KVM, so I'm assuming this module option lives
there.  I essentially described the "enable" behavior in my previous
reply, user has access to wbinvd if owning a non-coherent capable
device managed in a non-coherent IOASID.  Yes, the user IOASID
configuration controls the latter half of this.

What then is "automatic" mode?  The user cannot create a non-coherent
IOASID with a non-coherent device if the IOMMU supports no-snoop
blocking?  Do they get a failure?  Does it get silently promoted to
coherent?

In "disable" mode, I think we're just narrowing the restriction
further, a non-coherent capable device cannot be used except in a
forced coherent IOASID.

> > If we're putting the policy decision in the hands of userspace they
> > should have access to wbinvd if they own a device that is assumed
> > non-coherent AND it's attached to an IOMMU (page table) that is not
> > blocking no-snoop (a "non-coherent IOASID").  
> 
> There are two parts here, like Paolo was leading to. If the process
> has access to WBINVD and then if such an allowed process tells KVM to
> turn on WBINVD in the guest.
> 
> If the process has a device and it has a way to create a non-coherent
> IOASID, then that process has access to WBINVD.
> 
> For security it doesn't matter if the process actually creates the
> non-coherent IOASID or not. An attacker will simply do the steps that
> give access to WBINVD.

Yes, at this point the user has 

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Jason Gunthorpe
On Mon, Jun 07, 2021 at 11:18:33AM +0800, Jason Wang wrote:

> Note that the fact that no drivers call these things doesn't mean it was not
> supported by the spec.

Of course it does. If the spec doesn't define exactly when the driver
should call the cache flushes for no-snoop transactions then the
protocol doesn't support no-snoop. 

no-snoop is only used in very specific sequences of operations, like
certain GPU usages, because regaining coherence on x86 is incredibly
expensive.

ie I wouldn't ever expect a NIC to use no-snoop because NIC's expect
packets to be processed by the CPU.

"non-coherent DMA" is some general euphemism that evokes images of
embedded platforms that don't have coherent DMA at all and have low
cost ways to regain coherence. This is not at all what we are talking
about here at all.
 
Jason


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Enrico Weigelt, metux IT consult

On 02.06.21 19:21, Jason Gunthorpe wrote:

Hi,


Not really, once one thing in an application uses a large number of FDs the
entire application is affected. If any open() can return 'very big
number' then nothing in the process is allowed to ever use select.


isn't that a bug in select() ?

--mtx

--
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
i...@metux.net -- +49-151-27565287

RE: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Liu, Yi L
> From: Shenming Lu 
> Sent: Friday, June 4, 2021 10:03 AM
> 
> On 2021/6/4 2:19, Jacob Pan wrote:
> > Hi Shenming,
> >
> > On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu
> 
> > wrote:
> >
> >> On 2021/6/2 1:33, Jason Gunthorpe wrote:
> >>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
> >>>
>  The drivers register per page table fault handlers to /dev/ioasid which
>  will then register itself to iommu core to listen and route the per-
>  device I/O page faults.
> >>>
> >>> I'm still confused why drivers need fault handlers at all?
> >>
> >> Essentially it is the userspace that needs the fault handlers,
> >> one case is to deliver the faults to the vIOMMU, and another
> >> case is to enable IOPF on the GPA address space for on-demand
> >> paging, it seems that both could be specified in/through the
> >> IOASID_ALLOC ioctl?
> >>
> > I would think IOASID_BIND_PGTABLE is where fault handler should be
> > registered. There wouldn't be any IO page fault without the binding
> anyway.
> 
> Yeah, I also proposed this before, registering the handler in the
> BIND_PGTABLE
> ioctl does make sense for the guest page faults. :-)
> 
> But how about the page faults from the GPA address space (its page table is
> mapped through the MAP_DMA ioctl)? From your point of view, it seems
> that we should register the handler for the GPA address space in the (first)
> MAP_DMA ioctl.

under new proposal, I think the page fault handler is also registered
per ioasid object. The difference compared with guest page table case
is there is no need to inject the fault to VM.
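
(A hypothetical sketch of what per-ioasid fault reporting could look like;
none of these structs exist in any uAPI today:)

/* Hypothetical: register fault reporting for one IOASID; the user is
 * notified via eventfd and reads fault records from the ioasid fd. */
struct ioasid_register_fault_handler {
	__u32 argsz;
	__u32 flags;
	__u32 ioasid;		/* which IOASID to report faults for */
	__s32 eventfd;		/* signalled when a record is queued */
};

/* Hypothetical record format; for the GPA address space case the handler
 * resolves the fault (e.g. pins the page) instead of injecting into the VM. */
struct ioasid_fault_record {
	__u32 ioasid;
	__u32 pasid;
	__u64 address;
	__u32 perm;		/* read/write permission that faulted */
	__u32 flags;		/* e.g. whether a page-response is expected */
};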
 
Regards,
Yi Liu


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-07 Thread Paolo Bonzini

On 07/06/21 05:25, Tian, Kevin wrote:

Per Intel SDM wbinvd is a privileged instruction. A process on the
host has no privilege to execute it.


(Half of) the point of the kernel is to do privileged tasks on the 
processes' behalf.  There are good reasons why a process that uses VFIO 
(without KVM) could want to use wbinvd, so VFIO lets them do it with an
ioctl and adequate checks around the operation.


Paolo



RE: [RFC] /dev/ioasid uAPI proposal

2021-06-06 Thread Tian, Kevin
> From: Paolo Bonzini 
> Sent: Saturday, June 5, 2021 2:22 PM
> 
> On 04/06/21 19:22, Jason Gunthorpe wrote:
> >   4) The KVM interface is the very simple enable/disable WBINVD.
> >  Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
> >  to enable WBINVD at KVM.
> 
> The KVM interface is the same kvm-vfio device that exists already.  The
> userspace API does not need to change at all: adding one VFIO file
> descriptor with WBINVD enabled to the kvm-vfio device lets the VM use
> WBINVD functionality (see kvm_vfio_update_coherency).
> 
> Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls.
> But it seems useless complication compared to just using what we have
> now, at least while VMs only use IOASIDs via VFIO.
> 

A new IOASID variation may make more sense in case non-vfio subsystems
want to handle a similar coherency problem. Per other discussions it looks
like it's still open whether vDPA wants it or not, and there could be other
passthrough frameworks in the future. Having them all use vfio naming
doesn't sound very clean. Anyway, the coherency attribute must be configured
on the IOASID in the end, so it looks reasonable for KVM to learn the info
from a unified place.

Just FYI, we are also planning new IOASID-specific ioctls in KVM for other
usages. Future Intel platforms support a new ENQCMD instruction for
scalable work submission to the device. This instruction includes a
64-byte payload plus a PASID retrieved from a CPU MSR register (covered
by xsave). When supporting this instruction in the guest, the value in
the MSR is a guest PASID which must be translated to a host PASID.
A new VMCS structure (PASID translation table) is introduced for this
purpose. In this /dev/ioasid proposal, we propose VFIO_{UN}MAP_IOASID
for the user to update the VMCS structure properly. The user is
expected to provide {ioasid_fd, ioasid, vPASID} to KVM, which then
calls an ioasid helper function to figure out the corresponding hPASID
to update the specified entry.
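
(A hypothetical sketch of such a KVM-side ioctl, purely for illustration --
the struct, names and ioctl numbers are made up:)

/* Hypothetical: user hands KVM {ioasid_fd, ioasid, vPASID}; KVM asks the
 * ioasid layer for the matching hPASID and updates the VMCS PASID
 * translation table entry. */
struct kvm_pasid_mapping {
	__s32 ioasid_fd;	/* /dev/ioasid (/dev/iommu) file descriptor */
	__u32 ioasid;
	__u32 vpasid;		/* guest PASID, as written to the guest MSR */
	__u32 flags;
};

#define KVM_MAP_PASID	_IOW(KVMIO, 0xd0, struct kvm_pasid_mapping)
#define KVM_UNMAP_PASID	_IOW(KVMIO, 0xd1, struct kvm_pasid_mapping)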

Thanks
Kevin


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-06 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Saturday, June 5, 2021 5:29 AM
> 
> On Fri, 4 Jun 2021 14:22:07 -0300
> Jason Gunthorpe  wrote:
> 
> > On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> > > On 04/06/21 18:03, Jason Gunthorpe wrote:
> > > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
> > > > > I don't want a security proof myself; I want to trust VFIO to make the
> right
> > > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > > >
> > > > > Given how KVM is just a device driver inside Linux, VMs should be a
> slightly
> > > > > more roundabout way to do stuff that is accessible to bare metal; not
> a way
> > > > > to gain extra privilege.
> > > >
> > > > Okay, fine, lets turn the question on its head then.
> > > >
> > > > VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace
> VFIO
> > > > application can make use of no-snoop optimizations. The ability of KVM
> > > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > > in a normal process context.
> > > >
> > > > So, under what conditions do we want to allow VFIO to give a process
> > > > elevated access to the CPU:
> > >
> > > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > > #2+#3 would be worse than what we have today), but IIUC the proposal
> (was it
> > > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> > > which then would be on VFIO and not on KVM.
> >
> > At the end of the day we need an ioctl with two arguments:
> >  - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
> >  - The KVM FD to control wbinvd support on
> >
> > Philosophically it doesn't matter too much which subsystem that ioctl
> > lives, but we have these obnoxious cross module dependencies to
> > consider..
> >
> > Framing the question, as you have, to be about the process, I think
> > explains why KVM doesn't really care what is decided, so long as the
> > process and the VM have equivalent rights.
> >
> > Alex, how about a more fleshed out suggestion:

Possibly just a naming thing, but I feel it's better to just talk about
no-snoop or non-coherent in the uAPI. Per Intel SDM wbinvd is a
privileged instruction. A process on the host has no privilege to 
execute it. Only when this process holds a VM does this instruction matter,
as there are guest privilege levels. But having VFIO uAPI (which is
userspace oriented) to explicitly deal with a CPU instruction which
makes sense only in a virtualization context sounds a bit weird...

> >
> >  1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> > it communicates its no-snoop configuration:
> 
> Communicates to whom?
> 
> >  - 0 enable, allow WBINVD
> >  - 1 automatic disable, block WBINVD if the platform
> >IOMMU can police it (what we do today)
> >  - 2 force disable, do not allow WBINVD ever
> 
> The only thing we know about the device is whether or not Enable
> No-snoop is hard wired to zero, ie. it either can't generate no-snoop
> TLPs ("coherent-only") or it might ("assumed non-coherent").  If
> we're putting the policy decision in the hands of userspace they should
> have access to wbinvd if they own a device that is assumed
> non-coherent AND it's attached to an IOMMU (page table) that is not
> blocking no-snoop (a "non-coherent IOASID").
> 
> I think that means that the IOASID needs to be created (IOASID_ALLOC)
> with a flag that specifies whether this address space is coherent
> (IOASID_GET_INFO probably needs a flag/cap to expose if the system
> supports this).  All mappings in this IOASID would use IOMMU_CACHE and

Yes, this sounds like a cleaner way than specifying this attribute late in
VFIO_ATTACH_IOASID. Following Jason's proposal, v2 will move to
a scheme requiring the user to specify format info when creating an
IOASID. Leaving coherency out of that box just adds some trickiness,
e.g. whether to allow the user to update the page table between ALLOC
and ATTACH.
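
(For illustration, a hypothetical allocation-time form of the attribute;
the struct, field and flag names are made up:)

/* Hypothetical: carry the coherency attribute at IOASID_ALLOC time rather
 * than deriving it late at VFIO_ATTACH_IOASID. */
struct ioasid_alloc_req {
	__u32 argsz;
	__u32 flags;
#define IOASID_ALLOC_FORCE_COHERENT	(1 << 0)	/* block no-snoop; map with IOMMU_CACHE */
	__u32 format;		/* page table format negotiated via GET_INFO */
	__u32 pad;
};

Attach could then fail for a device whose IOMMU cannot honor the flag
(i.e. lacks IOMMU_CAP_CACHE_COHERENCY), matching the semantics Alex
describes below.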

> and devices attached to it would be required to be backed by an IOMMU
> capable of IOMMU_CAP_CACHE_COHERENCY (attach fails otherwise).  If only
> these IOASIDs exist, access to wbinvd would not be provided.  (How does
> a user provided page table work? - reserved bit set, user error?)
> 
> Conversely, a user could create a non-coherent IOASID and attach any
> device to it, regardless of IOMMU backing capabilities.  Only if an
> assumed non-coherent device is attached would the wbinvd be allowed.
> 
> I think that means that an EXECUTE_WBINVD ioctl lives on the IOASIDFD
> and the IOASID world needs to understand the device's ability to
> generate non-coherent DMA.  This wbinvd ioctl would be a no-op (or
> some known errno) unless a non-coherent IOASID exists with a potentially
> non-coherent device attached.
> 
> > vfio_pci may want to take this from an admin configuration knob
> > someplace. It allows the admin to customize if they want.
> >
> > If we can figure 

Re: [RFC] /dev/ioasid uAPI proposal

2021-06-06 Thread Jason Wang


On 2021/6/4 7:58 PM, Jason Gunthorpe wrote:

On Fri, Jun 04, 2021 at 09:11:03AM +0800, Jason Wang wrote:

nor do any virtio drivers implement the required platform specific
cache flushing to make no-snoop TLPs work.

I don't get why virtio drivers need to do that. I think the DMA API should hide
that arch/platform specific stuff from us.

It is not arch/platform stuff. If the device uses no-snoop then a
very platform specific recovery is required in the device driver.

It is not part of the normal DMA API, it is side APIs like
flush_agp_cache() or wbinvd() that are used by GPU drivers only.



Yes and virtio doesn't support AGP.




If drivers/virtio doesn't explicitly call these things it doesn't
support no-snoop - hence no VDPA device can ever use no-snoop.



Note that the fact that no drivers call these things doesn't mean it was not
supported by the spec.


Actually, the spec doesn't forbid non-coherent DMA; anyway, we can raise
a new thread on the virtio mailing list to discuss that.


But consider that virtio already supports GPU, crypto and sound devices,
and devices like codec and video are being proposed. It doesn't help
if we mandate coherent DMA now.


Thanks




Since VIRTIO_F_ACCESS_PLATFORM doesn't trigger wbinvd on x86 it has
nothing to do with no-snoop.

Jason




Re: [RFC] /dev/ioasid uAPI proposal

2021-06-05 Thread Paolo Bonzini

On 04/06/21 19:22, Jason Gunthorpe wrote:

  4) The KVM interface is the very simple enable/disable WBINVD.
 Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
 to enable WBINVD at KVM.


The KVM interface is the same kvm-vfio device that exists already.  The 
userspace API does not need to change at all: adding one VFIO file 
descriptor with WBINVD enabled to the kvm-vfio device lets the VM use 
WBINVD functionality (see kvm_vfio_update_coherency).

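(For context, the existing logic is roughly the following -- heavily
simplified from virt/kvm/vfio.c; the group-coherency helper below is a
stand-in for the VFIO_DMA_CC_IOMMU check, and locking/error handling are
elided:)

/* Heavily simplified sketch: WBINVD emulation is enabled only while at
 * least one attached VFIO group cannot enforce DMA cache coherency,
 * i.e. cannot block no-snoop. */
static void kvm_vfio_update_coherency(struct kvm_device *dev)
{
	struct kvm_vfio *kv = dev->private;
	bool noncoherent = false;
	struct kvm_vfio_group *kvg;

	list_for_each_entry(kvg, &kv->group_list, node) {
		if (!group_enforces_cache_coherency(kvg)) {
			noncoherent = true;
			break;
		}
	}

	if (noncoherent != kv->noncoherent) {
		kv->noncoherent = noncoherent;
		if (noncoherent)
			kvm_arch_register_noncoherent_dma(dev->kvm);
		else
			kvm_arch_unregister_noncoherent_dma(dev->kvm);
	}
}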

Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls. 
But it seems useless complication compared to just using what we have 
now, at least while VMs only use IOASIDs via VFIO.


Either way, there should be no policy attached to the add/delete 
operations.  KVM users want to add the VFIO (or IOASID) file descriptors 
to the device independent of WBINVD.  If userspace wants/needs to apply 
its own policy on whether to enable WBINVD or not, they can do it on the 
VFIO/IOASID side:



 1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
it communicates its no-snoop configuration:
 - 0 enable, allow WBINVD
 - 1 automatic disable, block WBINVD if the platform
   IOMMU can police it (what we do today)
 - 2 force disable, do not allow WBINVD ever


Though, like Alex, it's also not clear to me whether force-disable is 
useful.  Instead userspace can query the IOMMU or the device to ensure 
it's not enabled.
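
(Restating the three proposed modes as an illustrative enum -- hypothetical
names only, not existing uAPI:)

/* Hypothetical illustration of the attach-time policy discussed above. */
enum ioasid_nosnoop_policy {
	IOASID_NOSNOOP_ALLOW	= 0,	/* no-snoop permitted, WBINVD allowed */
	IOASID_NOSNOOP_AUTO	= 1,	/* block WBINVD if the IOMMU polices no-snoop */
	IOASID_NOSNOOP_NEVER	= 2,	/* "no-snoop is never used": never allow WBINVD */
};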


Paolo



RE: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, June 4, 2021 8:34 PM
> 
> On Fri, Jun 04, 2021 at 06:08:28AM +, Tian, Kevin wrote:
> 
> > In Qemu case the problem is that it doesn't know the list of devices
> > that will be attached to an IOASID when it's created. This is a guest-
> > side knowledge which is conveyed one device at a time to Qemu
> > though vIOMMU.
> 
> At least for the guest side it is alot simpler because the vIOMMU
> being emulated will define nearly everything.
> 
> qemu will just have to ask the kernel for whatever it is the guest is
> doing. If the kernel can't do it then qemu has to SW emulate.
> 
> The no-snoop block may be the only thing that is under qemu's control
> because it is transparent to the guest.
> 
> This will probably become clearer as people start to define what the
> get_info should return.
> 

Sure. Just to clarify, my comment was about "Perhaps creating an
IOASID should pass in a list of the device labels that the IOASID will
be used with". My point is that Qemu doesn't know this fact before
the guest completes binding the page table to all relevant devices, while
the IOASID must be created when the table is bound to the first device. So
Qemu just needs to create the IOASID with the format required for the
current device. Incompatibility will be detected when attaching other
devices later.

Thanks
Kevin


RE: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, June 4, 2021 8:09 PM
> 
> On Fri, Jun 04, 2021 at 06:37:26AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Thursday, June 3, 2021 9:05 PM
> > >
> > > > >
> > > > > 3) Device accepts any PASIDs from the guest. No
> > > > >vPASID/pPASID translation is possible. (classic vfio_pci)
> > > > > 4) Device accepts any PASID from the guest and has an
> > > > >internal vPASID/pPASID translation (enhanced vfio_pci)
> > > >
> > > > what is enhanced vfio_pci? In my writing this is for mdev
> > > > which doesn't support ENQCMD
> > >
> > > This is a vfio_pci that mediates some element of the device interface
> > > to communicate the vPASID/pPASID table to the device, using Max's
> > > series for vfio_pci drivers to inject itself into VFIO.
> > >
> > > For instance a device might send a message through the PF that the VF
> > > has a certain vPASID/pPASID translation table. This would be useful
> > > for devices that cannot use ENQCMD but still want to support migration
> > > and thus need vPASID.
> >
> > I still don't quite get. If it's a PCI device why is PASID translation 
> > required?
> > Just delegate the per-RID PASID space to user as type-3 then migrating the
> > vPASID space is just straightforward.
> 
> This is only possible if we get rid of the global pPASID allocation
> (honestly is my preference as it makes the HW a lot simpler)
> 

In this proposal global vs. per-RID allocation is a per-device policy.
For vfio-pci it can always use per-RID (regardless of whether the
device is partially mediated or not) with no vPASID/pPASID conversion.
Even for mdev, if there is no ENQCMD we can still do per-RID conversion.
Only for mdev which has ENQCMD do we need global pPASID allocation.

I think this is the motivation you explained earlier that it's not good
to have one global PASID allocator in the kernel. per-RID vs. global
should be selected per device.
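
A rough sketch of how such a per-device policy could be expressed in the
kernel (the enum and helper below are illustrative only, not proposed code):

    /* Illustrative policy selection, chosen per device rather than globally. */
    enum pasid_alloc_policy {
        PASID_ALLOC_PER_RID,    /* vfio-pci, and mdev without ENQCMD */
        PASID_ALLOC_GLOBAL,     /* mdev with ENQCMD needs global pPASIDs */
    };

    static enum pasid_alloc_policy pasid_policy(bool is_mdev, bool has_enqcmd)
    {
        if (is_mdev && has_enqcmd)
            return PASID_ALLOC_GLOBAL;
        return PASID_ALLOC_PER_RID;
    }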

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC] /dev/ioasid uAPI proposal

2021-06-04 Thread Jason Gunthorpe
On Fri, Jun 04, 2021 at 03:29:18PM -0600, Alex Williamson wrote:
> On Fri, 4 Jun 2021 14:22:07 -0300
> Jason Gunthorpe  wrote:
> 
> > On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> > > On 04/06/21 18:03, Jason Gunthorpe wrote:  
> > > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:  
> > > > > I don't want a security proof myself; I want to trust VFIO to make 
> > > > > the right
> > > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > > > 
> > > > > Given how KVM is just a device driver inside Linux, VMs should be a 
> > > > > slightly
> > > > > more roundabout way to do stuff that is accessible to bare metal; not 
> > > > > a way
> > > > > to gain extra privilege.  
> > > > 
> > > > Okay, fine, lets turn the question on its head then.
> > > > 
> > > > VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > > > application can make use of no-snoop optimizations. The ability of KVM
> > > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > > in a normal process context.
> > > > 
> > > > So, under what conditions do we want to allow VFIO to give a process
> > > > elevated access to the CPU:  
> > > 
> > > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > > #2+#3 would be worse than what we have today), but IIUC the proposal (was 
> > > it
> > > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> > > which then would be on VFIO and not on KVM.
> > 
> > At the end of the day we need an ioctl with two arguments:
> >  - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
> >  - The KVM FD to control wbinvd support on
> > 
> > Philosophically it doesn't matter too much which subsystem that ioctl
> > lives, but we have these obnoxious cross module dependencies to
> > consider.. 
> > 
> > Framing the question, as you have, to be about the process, I think
> > explains why KVM doesn't really care what is decided, so long as the
> > process and the VM have equivalent rights.
> > 
> > Alex, how about a more fleshed out suggestion:
> > 
> >  1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> > it communicates its no-snoop configuration:
> 
> Communicates to whom?

To the /dev/iommu FD which will have to maintain a list of devices
attached to it internally.

> >  - 0 enable, allow WBINVD
> >  - 1 automatic disable, block WBINVD if the platform
> >IOMMU can police it (what we do today)
> >  - 2 force disable, do not allow WBINVD ever
> 
> The only thing we know about the device is whether or not Enable
> No-snoop is hard wired to zero, ie. it either can't generate no-snoop
> TLPs ("coherent-only") or it might ("assumed non-coherent").  

Here I am outlining the choices, and also imagining we might want an
admin knob to select among the three.
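
For illustration only (the names are hypothetical, not proposed uAPI),
the three options could look like:

    /* Hypothetical per-device no-snoop policy, told to /dev/iommu at attach. */
    enum no_snoop_policy {
        NO_SNOOP_ALLOW = 0,           /* enable no-snoop, allow WBINVD */
        NO_SNOOP_AUTO_DISABLE = 1,    /* block WBINVD if the platform IOMMU
                                         can police no-snoop (today's behaviour) */
        NO_SNOOP_FORCE_DISABLE = 2,   /* never allow WBINVD */
    };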

> If we're putting the policy decision in the hands of userspace they
> should have access to wbinvd if they own a device that is assumed
> non-coherent AND it's attached to an IOMMU (page table) that is not
> blocking no-snoop (a "non-coherent IOASID").

There are two parts here, as Paolo was leading to: whether the process
has access to WBINVD, and then whether such an allowed process tells KVM
to turn on WBINVD in the guest.

If the process has a device and it has a way to create a non-coherent
IOASID, then that process has access to WBINVD.

For security it doesn't matter if the process actually creates the
non-coherent IOASID or not. An attacker will simply do the steps that
give access to WBINVD.

The important detail is that access to WBINVD does not compel the
process to tell KVM to turn on WBINVD. So a qemu with access to WBINVD
can still choose to create a secure guest by always using IOMMU_CACHE
in its page tables and not asking KVM to enable WBINVD.

This proposal shifts the policy decision from the kernel to userspace.
qemu is responsible for determining whether KVM should enable wbinvd,
based on whether it was able to create IOASIDs with IOMMU_CACHE.
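
A minimal sketch of that userspace decision (ioasid_is_coherent() and
kvm_enable_wbinvd() are hypothetical helpers, not an existing API):

    /*
     * Illustration only: qemu asks KVM to emulate WBINVD for the guest
     * only if at least one of its IOASIDs was created without
     * IOMMU_CACHE, i.e. non-coherent DMA is actually possible.
     */
    bool need_wbinvd = false;

    for (int i = 0; i < num_ioasids; i++)
        if (!ioasid_is_coherent(ioasids[i]))
            need_wbinvd = true;

    if (need_wbinvd)
        kvm_enable_wbinvd(kvm_fd);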

> Conversely, a user could create a non-coherent IOASID and attach any
> device to it, regardless of IOMMU backing capabilities.  Only if an
> assumed non-coherent device is attached would the wbinvd be allowed.

Right, this is exactly the point. Since the user gets to pick if the
IOASID is coherent or not then an attacker can always reach WBINVD
using only the device FD. Additional checks don't add to the security
of the process.

The additional checks you are describing add to the security of the
guest; however, qemu is capable of doing them without more help from the
kernel.

It is the strength of Paolo's model that KVM may optionally do less,
but never more, than the process itself can do.

> > It is pretty simple from a /dev/ioasid perspective, covers today's
> > compat requirement, gives some future option to allow the no-snoop
> > optimization, and gives a new option for qemu to totally block wbinvd
> > no 
