Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-06-08 Thread Jason Gunthorpe
On Tue, Jun 08, 2021 at 10:44:31AM +1000, David Gibson wrote:

> When you say "not using a drivers/iommu IOMMU interface" do you
> basically mean the device doesn't do DMA? 

No, I mean the device doesn't use iommu_map() to manage the DMA
mappings.

vfio_iommu_type1 has a special code path that mdev triggers that
doesn't allocate an IOMMU domain and doesn't call iommu_map() or
anything related to that.

Instead a mdev driver calls vfio_pin_pages() which "reads" a fake page
table and returns back the CPU pages for the mdev to DMA map however
it likes.
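
(For illustration, roughly what that flow looks like with this era's
vfio_pin_pages() - a sketch only; the function and variable names below
are made up and error handling is trimmed:)

#include <linux/vfio.h>
#include <linux/iommu.h>
#include <linux/mm.h>

/* Translate one page of guest IOVA through the fake page table kept by
 * vfio_iommu_type1 (filled by userspace's MAP ioctls) and pin it. */
static int my_mdev_map_one(struct device *mdev_dev, dma_addr_t giova,
                           unsigned long *host_pfn)
{
        unsigned long user_pfn = giova >> PAGE_SHIFT;
        int ret;

        ret = vfio_pin_pages(mdev_dev, &user_pfn, 1,
                             IOMMU_READ | IOMMU_WRITE, host_pfn);
        if (ret != 1)
                return ret < 0 ? ret : -EFAULT;

        /* The mdev driver now programs *host_pfn into the device's own
         * MMU however it likes; no iommu_map() is involved. */
        return 0;
}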

> Now, we could represent those different sorts of isolation separately,
> but at the time our thinking was that we should group together devices
> that can't be safely isolated for *any* reason, since the practical
> upshot is the same: you can't safely split those devices between
> different owners.

It is fine, but the direction is going the other way: devices have
perfect isolation and rely on special interactions with the iommu to
get it.
 
> > What I don't like is forcing certain things depending on how the
> > vfio_device was created - for instance forcing an IOMMU group as part
> > and forcing an ugly "SW IOMMU" mode in the container only as part of
> > mdev_device.
> 
> I don't really see how this is depending on how the device is
> created.

static int vfio_iommu_type1_attach_group(void *iommu_data,
                                         struct iommu_group *iommu_group)
{
        if (vfio_bus_is_mdev(bus)) {

What the iommu code does depends on how the device was created. This
is really ugly.

This is happening because the three objects in the model:
driver/group/domain are not being linked together in a way that
reflects the modern world.

The group has no idea what the driver wants but is in charge of
creating the domain on behalf of the device.

And so people have created complicated hackery to pass
information from the driver to the group, through the device, so that
the group can create the right domain.

I want to see the driver simply create the right domain directly. It
is much simpler and scales to more domain complexity.
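
(Minimal sketch of that idea against the existing drivers/iommu API -
the names are made up and this is not a claim about what the final
code should look like:)

#include <linux/err.h>
#include <linux/iommu.h>

static struct iommu_domain *my_drv_setup_dma(struct device *dev,
                                             unsigned long iova,
                                             phys_addr_t paddr, size_t size)
{
        struct iommu_domain *domain;
        int ret;

        /* The driver, not the group, decides what domain it needs */
        domain = iommu_domain_alloc(dev->bus);
        if (!domain)
                return ERR_PTR(-ENOMEM);

        ret = iommu_attach_device(domain, dev);
        if (ret)
                goto err_free;

        /* ... and manages the mappings in it directly */
        ret = iommu_map(domain, iova, paddr, size, IOMMU_READ | IOMMU_WRITE);
        if (ret)
                goto err_detach;

        return domain;

err_detach:
        iommu_detach_device(domain, dev);
err_free:
        iommu_domain_free(domain);
        return ERR_PTR(ret);
}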

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-06-07 Thread David Gibson
On Tue, Jun 01, 2021 at 09:57:12AM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 02:03:33PM +1000, David Gibson wrote:
> > On Thu, May 27, 2021 at 03:48:47PM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 27, 2021 at 02:58:30PM +1000, David Gibson wrote:
> > > > On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote:
> > > > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > > > > 
> > > > > > 2. iommu backed mdev devices for SRIOV where mdev device is created 
> > > > > > per
> > > > > > VF (mdev device == VF device) then that mdev device has same iommu
> > > > > > protection scope as VF associated to it. 
> > > > > 
> > > > > This doesn't require, and certainly shouldn't create, a fake group.
> > > > 
> > > > It's only fake if you start with a narrow view of what a group is. 
> > > 
> > > A group is connected to drivers/iommu. A group object without *any*
> > > relation to drivers/iommu is just a complete fiction, IMHO.
> > 
> > That might be where we differ.  As I've said, by group I'm primarily
> > meaning the fundamental hardware unit of isolation.  *Usually* that's
> > determined by the capabilities of an IOMMU, but in some cases it might
> > not be.  In either case, the boundaries still matter.
> 
> As in my other email we absolutely need a group concept, it is just a
> question of how the user API is designed around it.
> 
> > > The group mdev implicitly creates is just a fake proxy that comes
> > > along with mdev API. It doesn't do anything and it doesn't mean
> > > anything.
> > 
> > But.. the case of multiple mdevs managed by a single PCI device with
> > an internal IOMMU also exists, and then the mdev groups are *not*
> > proxies but true groups independent of the parent device.  Which means
> > that the group structure of mdevs can vary, which is an argument *for*
> > keeping it, not against.
> 
> If VFIO becomes more "vfio_device" centric then the vfio_device itself
> has some properties. One of those can be "is it inside a drivers/iommu
> group, or not?".
> 
> If the vfio_device is not using a drivers/iommu IOMMU interface then
> it can just have no group at all - no reason to lie. This would mean
> that the device has perfect isolation.

When you say "not using a drivers/iommu IOMMU interface" do you
basically mean the device doesn't do DMA?  I can see some benefit to
that, but some drawbacks too.  The *main* form of isolation (or lack
thereof) that groups are about is the IOMMU, but groups can also represent
other forms of isolation failure: e.g. a multifunction device, where
function 0 has some debug registers which affect other functions.
That's relevant whether or not any of those functions use DMA.

Now, we could represent those different sorts of isolation separately,
but at the time our thinking was that we should group together devices
that can't be safely isolated for *any* reason, since the practical
upshot is the same: you can't safely split those devices between
different owners.

> What I don't like is forcing certain things depending on how the
> vfio_device was created - for instance forcing an IOMMU group as part
> and forcing an ugly "SW IOMMU" mode in the container only as part of
> mdev_device.

I don't really see how this is depending on how the device is created.
The current VFIO model is that every device always belongs to a group
- but that group might be a singleton.  That seems less complicated to
me than having some devices with a group and some without.

> These should all be properties of the vfio_device itself.
> 
> Again this is all about the group fd - and how to fit in with the
> /dev/ioasid proposal from Kevin:
> 
> https://lore.kernel.org/kvm/mwhpr11mb1886422d4839b372c6ab245f8c...@mwhpr11mb1886.namprd11.prod.outlook.com/
> 
> Focusing on vfio_device and skipping the group fd smooths out some
> rough edges.
> 
> Code wise we are not quite there, but I have mapped out eliminating
> the group from the vfio_device centric API and a few other places it
> has crept in.
> 
> The group can exist in the background to enforce security without
> being a cornerstone of the API design.
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-06-01 Thread Jason Gunthorpe
On Tue, Jun 01, 2021 at 02:03:33PM +1000, David Gibson wrote:
> On Thu, May 27, 2021 at 03:48:47PM -0300, Jason Gunthorpe wrote:
> > On Thu, May 27, 2021 at 02:58:30PM +1000, David Gibson wrote:
> > > On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > > > 
> > > > > 2. iommu backed mdev devices for SRIOV where mdev device is created 
> > > > > per
> > > > > VF (mdev device == VF device) then that mdev device has same iommu
> > > > > protection scope as VF associated to it. 
> > > > 
> > > > This doesn't require, and certainly shouldn't create, a fake group.
> > > 
> > > It's only fake if you start with a narrow view of what a group is. 
> > 
> > A group is connected to drivers/iommu. A group object without *any*
> > relation to drivers/iommu is just a complete fiction, IMHO.
> 
> That might be where we differ.  As I've said, by group I'm primarily
> meaning the fundamental hardware unit of isolation.  *Usually* that's
> determined by the capabilities of an IOMMU, but in some cases it might
> not be.  In either case, the boundaries still matter.

As in my other email we absolutely need a group concept, it is just a
question of how the user API is designed around it.

> > The group mdev implicitly creates is just a fake proxy that comes
> > along with mdev API. It doesn't do anything and it doesn't mean
> > anything.
> 
> But.. the case of multiple mdevs managed by a single PCI device with
> an internal IOMMU also exists, and then the mdev groups are *not*
> proxies but true groups independent of the parent device.  Which means
> that the group structure of mdevs can vary, which is an argument *for*
> keeping it, not against.

If VFIO becomes more "vfio_device" centric then the vfio_device itself
has some properties. One of those can be "is it inside a drivers/iommu
group, or not?".

If the vfio_device is not using a drivers/iommu IOMMU interface then
it can just have no group at all - no reason to lie. This would mean
that the device has perfect isolation.

What I don't like is forcing certain things depending on how the
vfio_device was created - for instance forcing an IOMMU group as part
and forcing an ugly "SW IOMMU" mode in the container only as part of
mdev_device.

These should all be properties of the vfio_device itself.

Again this is all about the group fd - and how to fit in with the
/dev/ioasid proposal from Kevin:

https://lore.kernel.org/kvm/mwhpr11mb1886422d4839b372c6ab245f8c...@mwhpr11mb1886.namprd11.prod.outlook.com/

Focusing on vfio_device and skipping the group fd smooths out some
rough edges.

Code wise we are not quite there, but I have mapped out eliminating
the group from the vfio_device centric API and a few other places it
has crept in.

The group can exist in the background to enforce security without
being a cornerstone of the API design.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-06-01 Thread David Gibson
On Thu, May 27, 2021 at 11:55:00PM +0530, Kirti Wankhede wrote:
> 
> 
> On 5/27/2021 10:30 AM, David Gibson wrote:
> > On Wed, May 26, 2021 at 02:48:03AM +0530, Kirti Wankhede wrote:
> > > 
> > > 
> > > On 5/26/2021 1:22 AM, Jason Gunthorpe wrote:
> > > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > > > 
> > > > > 2. iommu backed mdev devices for SRIOV where mdev device is created 
> > > > > per
> > > > > VF (mdev device == VF device) then that mdev device has same iommu
> > > > > protection scope as VF associated to it.
> > > > 
> > > > This doesn't require, and certainly shouldn't create, a fake group.
> > > > 
> > > > Only the VF's real IOMMU group should be used to model an iommu domain
> > > > linked to a VF. Injecting fake groups that are proxies for real groups
> > > > only opens the possibility of security problems like David is
> > > > concerned with.
> > > > 
> > > 
> > > I think this security issue should be addressed by letting mdev device
> > > inherit its parent's iommu_group, i.e. VF's iommu_group here.
> > 
> > No, that doesn't work.  AIUI part of the whole point of mdevs is to
> > allow chunks of a single PCI function to be handed out to different
> > places, because they're isolated from each other not by the system
> > IOMMU, but by a combination of MMU hardware in the device itself (e.g. in a
> > GPU card) and software in the mdev driver.
> 
> That's correct for non-iommu backed mdev devices.
> 
> > If mdevs inherited the
> > group of their parent device they wouldn't count as isolated from each
> > other, which they should.
> > 
> 
> For iommu backed mdev devices for SRIOV, there can be only a single mdev
> device for its parent (here the parent device is the VF); there can't be
> multiple mdev devices associated with that VF. In this case the mdev can
> inherit the group of the parent device.

Ah, yes, if there's just one mdev for the PCI function, and the
function doesn't have an internal memory protection unit then this
makes sense.

Which means we *do* have at least two meaningfully different group
configurations for mdev:
  * mdev is in a singleton group independent of the parent PCI device
  * mdev shares a group with its parent PCI device

Which means even in the case of mdevs, the group structure is *not* a
meaningless fiction.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-06-01 Thread David Gibson
On Thu, May 27, 2021 at 04:06:20PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 02:53:42PM +1000, David Gibson wrote:
> 
> > > > If the physical device had a bug which meant the mdevs *weren't*
> > > > properly isolated from each other, then those mdevs would share a
> > > > group, and you *would* care about it.  Depending on how the isolation
> > > > failed the mdevs might or might not also share a group with the parent
> > > > physical device.
> > > 
> > > That isn't a real scenario.. mdevs that can't be isolated just
> > > wouldn't be useful to exist
> > 
> > Really?  So what do you do when you discover some mdevs you thought
> > were isolated actually aren't due to a hardware bug?  Drop support
> > from the driver entirely?  In which case what do you say to the people
> > who understandably complain "but... we had all the mdevs in one guest
> > anyway, we don't care if they're not isolated"?
> 
> I've never said to eliminate groups entirely. 
> 
> What I'm saying is that all the cases we have for mdev today do not
> require groups, but are forced to create a fake group anyhow just to
> satisfy the odd VFIO requirement to have a group FD.
> 
> If some future mdev needs groups then sure, add the appropriate group
> stuff.
> 
> But that doesn't affect the decision to have a VFIO group FD, or not.
> 
> > > > It ensures that they're parked at the moment the group moves from
> > > > kernel to userspace ownership, but it can't prevent dpdk from
> > > > accessing and unparking those devices via peer to peer DMA.
> > > 
> > > Right, and adding all this group stuff did nothing to alert the poor
> > > admin that is running DPDK to this risk.
> > 
> > Didn't it?  Seems to me that in order to give the group to
> > DPDK, the admin had to find and unbind all the things in it... so is
> > therefore aware that they're giving everything in it to DPDK.
> 
> Again, I've never said the *group* should be removed. I'm only
> concerned about the *group FD*

Ok, that wasn't really clear to me.

I still wouldn't say the group for mdevs is a fiction though.. rather
that the group device used for (no internal IOMMU case) mdevs is just
plain wrong.

> When the admin found and unbound they didn't use the *group FD* in any
> way.

No, they are likely to have changed permissions on the group device
node as part of the process, though.

> > > You put the same security labels you'd put on the group to the devices
> > > that constitute the group. It is only more tricky in the sense that the
> > > script that would have to do this will need to do more than ID the
> > > group to label but also ID the device members of the group and label
> > > their char nodes.
> > 
> > Well, I guess, if you take the view that root is allowed to break the
> > kernel.  I tend to prefer that although root can obviously break the
> > kernel if they intend to, we should make it hard to do by accident -
> > which in this case would mean the kernel *enforcing* that the devices
> > in the group have the same security labels, which I can't really see
> > how to do without an exposed group.
> 
> How is this "break the kernel"? It has nothing to do with the
> kernel. Security labels are a user space concern.

*thinks*... yeah, ok, that was much too strong an assertion.  What I
was thinking of is the fact that this means that guarantees you'd
normally expect the kernel to enforce can be obviated by bad
configuration: chown-ing a device to root doesn't actually protect it
if there's another device in the same group exposed to other users.

But I guess you could say the same about, say, an unauthenticated nbd
export of a root-owned block device, so I guess that's not something
the kernel can reasonably enforce.


Ok.. you might be finally convincing me, somewhat.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-06-01 Thread David Gibson
On Thu, May 27, 2021 at 03:48:47PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 02:58:30PM +1000, David Gibson wrote:
> > On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote:
> > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > > 
> > > > 2. iommu backed mdev devices for SRIOV where mdev device is created per
> > > > VF (mdev device == VF device) then that mdev device has same iommu
> > > > protection scope as VF associated to it. 
> > > 
> > > This doesn't require, and certainly shouldn't create, a fake group.
> > 
> > It's only fake if you start with a narrow view of what a group is. 
> 
> A group is connected to drivers/iommu. A group object without *any*
> relation to drivers/iommu is just a complete fiction, IMHO.

That might be where we differ.  As I've said, by group I'm primarily
meaning the fundamental hardware unit of isolation.  *Usually* that's
determined by the capabilities of an IOMMU, but in some cases it might
not be.  In either case, the boundaries still matter.

> > > Only the VF's real IOMMU group should be used to model an iommu domain
> > > linked to a VF. Injecting fake groups that are proxies for real groups
> > > only opens the possibility of security problems like David is
> > > concerned with.
> > 
> > It's not a proxy for a real group, it's a group of its own.  If you
> > discover that (due to a hardware bug, for example) the mdev is *not*
> 
> What Kirti is talking about here is the case where a mdev is wrapped
> around a VF and the DMA isolation stems directly from the SRIOV VF's
> inherent DMA isolation, not anything the mdev wrapper did.
> 
> The group providing the isolation is the VF's group.

Yes, in that case the mdev absolutely should be in the VF's group -
having its own group is not just messy but incorrect.

> The group mdev implicitly creates is just a fake proxy that comes
> along with mdev API. It doesn't do anything and it doesn't mean
> anything.

But.. the case of multiple mdevs managed by a single PCI device with
an internal IOMMU also exists, and then the mdev groups are *not*
proxies but true groups independent of the parent device.  Which means
that the group structure of mdevs can vary, which is an argument *for*
keeping it, not against.

> > properly isolated from its parent PCI device, then both the mdev
> > virtual device *and* the physical PCI device are in the same group.
> > Groups including devices of different types and on different buses
> > were considered from the start, and are precedented, if rare.
> 
> This is far too theoretical for me. A security broken mdev is
> functionally useless.

Is it, though?  Again, I'm talking about the case of multiple mdevs
with a single parent device (because that's the only case I was aware
of until recently).  Isolation comes from a device-internal
IOMMU... that turns out to be broken.  But if your security domain
happens to include all the mdevs on the device anyway, then you don't
care.

Are you really going to say people can't use their fancy hardware in
this mode because it has a security flaw that's not relevant to their
usecase?


And then.. there's Kirti's case.  In that case the mdev should belong
to its parent PCI device's group since that's what's providing
isolation.  But in that case the parent device can be in a
multi-device group for any of the usual reasons (PCIe-to-PCI bridge,
PCIe switch with broken ACS, multifunction device with crosstalk).
Which means the mdev also shares a group with those other devices.  So
again, the group structure matters and is not a fiction.

> We don't need to support it, and we don't need complicated software to
> model it.
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-27 Thread Jason Gunthorpe
On Thu, May 27, 2021 at 02:53:42PM +1000, David Gibson wrote:

> > > If the physical device had a bug which meant the mdevs *weren't*
> > > properly isolated from each other, then those mdevs would share a
> > > group, and you *would* care about it.  Depending on how the isolation
> > > failed the mdevs might or might not also share a group with the parent
> > > physical device.
> > 
> > That isn't a real scenario.. mdevs that can't be isolated just
> > wouldn't be useful to exist
> 
> Really?  So what do you do when you discover some mdevs you thought
> were isolated actually aren't due to a hardware bug?  Drop support
> from the driver entirely?  In which case what do you say to the people
> who understandably complain "but... we had all the mdevs in one guest
> anyway, we don't care if they're not isolated"?

I've never said to eliminate groups entirely. 

What I'm saying is that all the cases we have for mdev today do not
require groups, but are forced to create a fake group anyhow just to
satisfy the odd VFIO requirement to have a group FD.

If some future mdev needs groups then sure, add the appropriate group
stuff.

But that doesn't affect the decision to have a VFIO group FD, or not.

> > > It ensures that they're parked at the moment the group moves from
> > > kernel to userspace ownership, but it can't prevent dpdk from
> > > accessing and unparking those devices via peer to peer DMA.
> > 
> > Right, and adding all this group stuff did nothing to alert the poor
> > admin that is running DPDK to this risk.
> 
> Didn't it?  Seems to me that in order to give the group to
> DPDK, the admin had to find and unbind all the things in it... so is
> therefore aware that they're giving everything in it to DPDK.

Again, I've never said the *group* should be removed. I'm only
concerned about the *group FD*

When the admin found and unbound they didn't use the *group FD* in any
way.

> > You put the same security labels you'd put on the group to the devices
> > that constitute the group. It is only more tricky in the sense that the
> > script that would have to do this will need to do more than ID the
> > group to label but also ID the device members of the group and label
> > their char nodes.
> 
> Well, I guess, if you take the view that root is allowed to break the
> kernel.  I tend to prefer that although root can obviously break the
> kernel if they intend to, we should make it hard to do by accident -
> which in this case would mean the kernel *enforcing* that the devices
> in the group have the same security labels, which I can't really see
> how to do without an exposed group.

How is this "break the kernel"? It has nothing to do with the
kernel. Security labels are a user space concern.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-27 Thread Jason Gunthorpe
On Thu, May 27, 2021 at 02:58:30PM +1000, David Gibson wrote:
> On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > 
> > > 2. iommu backed mdev devices for SRIOV where mdev device is created per
> > > VF (mdev device == VF device) then that mdev device has same iommu
> > > protection scope as VF associated to it. 
> > 
> > This doesn't require, and certainly shouldn't create, a fake group.
> 
> It's only fake if you start with a narrow view of what a group is. 

A group is connected to drivers/iommu. A group object without *any*
relation to drivers/iommu is just a complete fiction, IMHO.

> > Only the VF's real IOMMU group should be used to model an iommu domain
> > linked to a VF. Injecting fake groups that are proxies for real groups
> > only opens the possibility of security problems like David is
> > concerned with.
> 
> It's not a proxy for a real group, it's a group of its own.  If you
> discover that (due to a hardware bug, for example) the mdev is *not*

What Kirti is talking about here is the case where a mdev is wrapped
around a VF and the DMA isolation stems directly from the SRIOV VF's
inherent DMA isolation, not anything the mdev wrapper did.

The group providing the isolation is the VF's group.

The group mdev implicitly creates is just a fake proxy that comes
along with mdev API. It doesn't do anything and it doesn't mean
anything.
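
(For reference, that implicit proxy group comes about roughly like this -
paraphrasing the mdev core's mdev_attach_iommu() of this era, trimmed for
brevity:)

static int mdev_attach_iommu(struct mdev_device *mdev)
{
        struct iommu_group *group;
        int ret;

        /* A fresh singleton group, with no relation to any real IOMMU */
        group = iommu_group_alloc();
        if (IS_ERR(group))
                return PTR_ERR(group);

        ret = iommu_group_add_device(group, &mdev->dev);
        iommu_group_put(group);
        return ret;
}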

> properly isolated from its parent PCI device, then both the mdev
> virtual device *and* the physical PCI device are in the same group.
> Groups including devices of different types and on different buses
> were considered from the start, and are precedented, if rare.

This is far too theoretical for me. A security broken mdev is
functionally useless.

We don't need to support it, and we don't need complicated software to
model it.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-27 Thread Kirti Wankhede




On 5/27/2021 10:30 AM, David Gibson wrote:

On Wed, May 26, 2021 at 02:48:03AM +0530, Kirti Wankhede wrote:



On 5/26/2021 1:22 AM, Jason Gunthorpe wrote:

On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:


2. iommu backed mdev devices for SRIOV where mdev device is created per
VF (mdev device == VF device) then that mdev device has same iommu
protection scope as VF associated to it.


This doesn't require, and certainly shouldn't create, a fake group.

Only the VF's real IOMMU group should be used to model an iommu domain
linked to a VF. Injecting fake groups that are proxies for real groups
only opens the possibility of security problems like David is
concerned with.



I think this security issue should be addressed by letting mdev device
inherit its parent's iommu_group, i.e. VF's iommu_group here.


No, that doesn't work.  AIUI part of the whole point of mdevs is to
allow chunks of a single PCI function to be handed out to different
places, because they're isolated from each other not by the system
IOMMU, but by a combination of MMU hardware in the device itself (e.g. in a
GPU card) and software in the mdev driver. 


That's correct for non-iommu backed mdev devices.


If mdevs inherited the
group of their parent device they wouldn't count as isolated from each
other, which they should.



For iommu backed mdev devices for SRIOV, there can be only a single mdev
device for its parent (here the parent device is the VF); there can't be
multiple mdev devices associated with that VF. In this case the mdev can
inherit the group of the parent device.


Kirti


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-27 Thread David Gibson
On Mon, May 24, 2021 at 08:37:44PM -0300, Jason Gunthorpe wrote:
> On Mon, May 24, 2021 at 05:52:58PM +1000, David Gibson wrote:
> 
> > > > I don't really see a semantic distinction between "always one-device
> > > > groups" and "groups don't matter".  Really the only way you can afford
> > > > to not care about groups is if they're singletons.
> > > 
> > > The kernel driver under the mdev may not be in an "always one-device"
> > > group.
> > 
> > I don't really understand what you mean by that.
> 
> I mean the group of the mdev's actual DMA device may have multiple
> things in it.
>  
> > > It is a kernel driver so the only thing we know and care about is that
> > > all devices in the HW group are bound to kernel drivers.
> > > 
> > > The vfio device that spawns from this kernel driver is really a
> > > "groups don't matter" vfio device because at the IOMMU layer it should
> > > be riding on the physical group of the kernel driver.  At the VFIO
> > > layer we no longer care about the group abstraction because the system
> > > guarantees isolation in some other way.
> > 
> > Uh.. I don't really know how mdevs are isolated from each other.  I
> > thought it was because the physical device providing the mdevs
> > effectively had an internal IOMMU (or at least DMA permissioning) to
> > isolate the mdevs, even though the physical device may not be fully
> > isolated.
> > 
> > In that case the virtual mdev is effectively in a singleton group,
> > which is different from the group of its parent device.
> 
> That is one way to view it, but it means creating a whole group
> infrastructure and abusing the IOMMU stack just to create this
> nonsense fiction.

It's a nonsense fiction until it's not, at which point it will bite
you in the arse.

> We also abuse the VFIO container stuff to hackily
> create several different types of IOMMU uAPIs for the mdev - all of
> which are unrelated to drivers/iommu.
> 
> Basically, there is no drivers/iommu thing involved, thus there is no real
> iommu group, for mdev it is all a big hacky lie.

Well, "iommu" group might not be the best name, but hardware isolation
is still a real concern here, even if it's not entirely related to the
IOMMU.

> > If the physical device had a bug which meant the mdevs *weren't*
> > properly isolated from each other, then those mdevs would share a
> > group, and you *would* care about it.  Depending on how the isolation
> > failed the mdevs might or might not also share a group with the parent
> > physical device.
> 
> That isn't a real scenario.. mdevs that can't be isolated just
> wouldn't be useful to exist

Really?  So what do you do when you discover some mdevs you thought
were isolated actually aren't due to a hardware bug?  Drop support
from the driver entirely?  In which case what do you say to the people
who understandably complain "but... we had all the mdevs in one guest
anyway, we don't care if they're not isolated"?

> > > This is today's model, yes. When you run dpdk on a multi-group device
> > > vfio already ensures that all the device groups remain parked and
> > > inaccessible.
> > 
> > I'm not really following what you're saying there.
> > 
> > If you have a multi-device group, and dpdk is using one device in it,
> > VFIO *does not* (and cannot) ensure that other devices in the group
> > are parked and inaccessible.  
> 
> I mean in the sense that no other user space can open those devices
> and no kernel driver can later be attached to them.

Ok.

> > It ensures that they're parked at the moment the group moves from
> > kernel to userspace ownership, but it can't prevent dpdk from
> > accessing and unparking those devices via peer to peer DMA.
> 
> Right, and adding all this group stuff did nothing to alert the poor
> admin that is running DPDK to this risk.

Didn't it?  Seems to me that in order to give the group to
DPDK, the admin had to find and unbind all the things in it... so is
therefore aware that they're giving everything in it to DPDK.

> > > If the administrator configures the system with different security
> > > labels for different VFIO devices then yes removing groups makes this
> > > more tricky as all devices in the group should have the same label.
> > 
> > That seems a bigger problem than "more tricky".  How would you propose
> > addressing this with your device-first model?
> 
> You put the same security labels you'd put on the group to the devices
> that constitute the group. It is only more tricky in the sense that the
> script that would have to do this will need to do more than ID the
> group to label but also ID the device members of the group and label
> their char nodes.

Well, I guess, if you take the view that root is allowed to break the
kernel.  I tend to prefer that although root can obviously break the
kernel if they intend to, we should make it hard to do by accident -
which in this case would mean the kernel *enforcing* that the devices
in the group have the same security labels, which I can't really see
how to do without an exposed group.

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-27 Thread David Gibson
On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote:
> On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> 
> > 2. iommu backed mdev devices for SRIOV where mdev device is created per
> > VF (mdev device == VF device) then that mdev device has same iommu
> > protection scope as VF associated to it. 
> 
> This doesn't require, and certainly shouldn't create, a fake group.

It's only fake if you start with a narrow view of what a group is.  A
group is a set of devices (in the kernel sense of "device", not
necessarily the hardware sense) which can't be isolated from each
other.  The mdev device is a kernel device, and if working as intended
it can be isolated from everything else, and is therefore in an
absolute bona fide group of its own.

> Only the VF's real IOMMU group should be used to model an iommu domain
> linked to a VF. Injecting fake groups that are proxies for real groups
> only opens the possibility of security problems like David is
> concerned with.

It's not a proxy for a real group, it's a group of its own.  If you
discover that (due to a hardware bug, for example) the mdev is *not*
properly isolated from its parent PCI device, then both the mdev
virtual device *and* the physical PCI device are in the same group.
Groups including devices of different types and on different buses
were considered from the start, and are precedented, if rare.

> Max's series approaches this properly by fully linking the struct
> pci_device of the VF throughout the entire VFIO scheme, including the
> group and container, while still allowing override of various VFIO
> operations.
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-27 Thread David Gibson
On Wed, May 26, 2021 at 02:48:03AM +0530, Kirti Wankhede wrote:
> 
> 
> On 5/26/2021 1:22 AM, Jason Gunthorpe wrote:
> > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > 
> > > 2. iommu backed mdev devices for SRIOV where mdev device is created per
> > > VF (mdev device == VF device) then that mdev device has same iommu
> > > protection scope as VF associated to it.
> > 
> > This doesn't require, and certainly shouldn't create, a fake group.
> > 
> > Only the VF's real IOMMU group should be used to model an iommu domain
> > linked to a VF. Injecting fake groups that are proxies for real groups
> > only opens the possibility of security problems like David is
> > concerned with.
> > 
> 
> I think this security issue should be addressed by letting mdev device
> inherit its parent's iommu_group, i.e. VF's iommu_group here.

No, that doesn't work.  AIUI part of the whole point of mdevs is to
allow chunks of a single PCI function to be handed out to different
places, because they're isolated from each other not by the system
IOMMU, but by a combination of MMU hardware in the device itself (e.g. in a
GPU card) and software in the mdev driver.  If mdevs inherited the
group of their parent device they wouldn't count as isolated from each
other, which they should.

> 
> Kirti
> 
> > Max's series approaches this properly by fully linking the struct
> > pci_device of the VF throughout the entire VFIO scheme, including the
> > group and container, while still allowing override of various VFIO
> > operations.
> > 
> > Jason
> > 
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-26 Thread Jason Gunthorpe
On Wed, May 26, 2021 at 12:59:05PM -0600, Alex Williamson wrote:

> A driver provided sysfs attribute would obviously fill the short
> term gap, long term maybe this would be standardized via netlink.  It
> seems a bit analogous to setting the MAC address for a VF on an SR-IOV
> NIC or VF namespace configuration for an SR-IOV NVMe device.  Thanks,

We have been doing a lot of work in netlink/devlink to program the VF
settings before starting the VF driver.

I've long thought it would be good to standardize a VF lifecycle in
Linux, eg VFs start their life in some 'unconfigured' state and drivers
don't bind, then userspace can use the PF to perform configuration, and
finally the driver can start on the configured VF.

We are already doing this model for the mlx5 'sub function' interfaces,
which have a lot of analogs to both VFs and Intel's SIOV ADIs. It seems
to be working.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-26 Thread Alex Williamson
On Wed, 26 May 2021 23:40:02 +0530
Kirti Wankhede  wrote:

> On 5/26/2021 4:22 AM, Alex Williamson wrote:
> > On Wed, 26 May 2021 00:56:30 +0530
> > Kirti Wankhede  wrote:
> >   
> >> On 5/25/2021 5:07 AM, Jason Gunthorpe wrote:  
> >>> On Mon, May 24, 2021 at 05:52:58PM +1000, David Gibson wrote:
> >>>  
> >> I don't really see a semantic distinction between "always one-device
> >> groups" and "groups don't matter".  Really the only way you can afford
> >> to not care about groups is if they're singletons.  
> >
> > The kernel driver under the mdev may not be in an "always one-device"
> > group.  
> 
>  I don't really understand what you mean by that.  
> >>>
> >>> I mean the group of the mdev's actual DMA device may have multiple
> >>> things in it.
> >>>
> > It is a kernel driver so the only thing we know and care about is that
> > all devices in the HW group are bound to kernel drivers.
> >
> > The vfio device that spawns from this kernel driver is really a
> > "groups don't matter" vfio device because at the IOMMU layer it should
> > be riding on the physical group of the kernel driver.  At the VFIO
> > layer we no longer care about the group abstraction because the system
> > guarantees isolation in some other way.
> 
>  Uh.. I don't really know how mdevs are isolated from each other.  I
>  thought it was because the physical device providing the mdevs
>  effectively had an internal IOMMU (or at least DMA permissioning) to
>  isolate the mdevs, even though the physical device may not be fully
>  isolated.
> 
>  In that case the virtual mdev is effectively in a singleton group,
>  which is different from the group of its parent device.  
> >>>  
> >>
> >> That's correct.
> >>  
> >>> That is one way to view it, but it means creating a whole group
> >>> infrastructure and abusing the IOMMU stack just to create this
> >>> nonsense fiction.  
> >>
> >> I really didn't get how this abuses the IOMMU stack.
> >> mdev can be used in 3 different ways:
> >> 1. non-iommu backed mdev devices where mdev vendor driver takes care to
> >> DMA map (iommu_map) and isolation is through device hardware internal
> >> MMU. Here vfio_iommu_type1 module provides a way to validate and pin
> >> pages required by mdev device for DMA mapping. Then IOMMU mapping is
> >> done by mdev vendor driver which is owner driver of physical device.
> >>
> >> 2. iommu backed mdev devices for SRIOV where mdev device is created per
> >> VF (mdev device == VF device) then that mdev device has same iommu
> >> protection scope as VF associated to it. Here mdev device is virtual
> >> device which uses features of mdev and represents underlying VF device,
> >> same as vfio-pci but with additional mdev features.  
> > 
> > What features would those be?  There are no mdev specific parts of the
> > vfio uAPI.
> > 
> > The mdev device is a virtual device, but why is it virtual in this case?
> > Aren't we effectively assigning the VF itself (mdev device == VF device)
> > with a bunch of extra support code to fill in the gaps of the VF
> > implementing the complete device model in hardware?
> > 
> > We're effectively creating this virtual device, creating a fake IOMMU
> > group, and trying to create this association of this virtual device to
> > the real VF in order to shoehorn it into the mdev model.  What do we
> > get from that model other than lifecycle management (ie. type selection)
> > and re-use of a bunch of code from the driver supporting the 1) model
> > above?
> >   
> 
> Yes, the lifecycle management which is in mdev is not in vfio-pci variant.
> 
> > This specific model seems better served by a device specific peer
> > driver to vfio-pci (ie. a "vfio-pci variant").  You effectively already
> > have the code for this driver, it's just in the format of an mdev
> > driver rather than a vfio "bus driver".  The work Jason references
> > relative to Max aims to make these kinds of drivers easier to implement
> > through re-use of vfio-pci code.
> > 
> > There are certainly other solutions we could come up with for selecting
> > a specific device type for a vfio-pci variant driver to implement other
> > than pretending this model actually belongs in mdev, right?  Thanks,
> >   
> 
> Sure, and I would like to see the type selection mechanism implemented in
> the vfio-pci variant.


A driver provided sysfs attribute would obviously fill the short
term gap, long term maybe this would be standardized via netlink.  It
seems a bit analogous to setting the MAC address for a VF on an SR-IOV
NIC or VF namespace configuration for an SR-IOV NVMe device.  Thanks,

Alex
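
(For illustration, a minimal sketch of the kind of driver-provided sysfs
attribute suggested above - the "enabled_type" name and the surrounding
vfio-pci variant driver are purely hypothetical:)

#include <linux/device.h>
#include <linux/sysfs.h>

static ssize_t enabled_type_store(struct device *dev,
                                  struct device_attribute *attr,
                                  const char *buf, size_t count)
{
        /* parse and record the device type the variant driver should expose */
        return count;
}
static DEVICE_ATTR_WO(enabled_type);

static struct attribute *my_variant_attrs[] = {
        &dev_attr_enabled_type.attr,
        NULL,
};
ATTRIBUTE_GROUPS(my_variant);

/* The variant driver would publish my_variant_groups through the driver
 * core's dev_groups mechanism so the attribute appears in sysfs. */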



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-26 Thread Kirti Wankhede




On 5/26/2021 4:22 AM, Alex Williamson wrote:

On Wed, 26 May 2021 00:56:30 +0530
Kirti Wankhede  wrote:


On 5/25/2021 5:07 AM, Jason Gunthorpe wrote:

On Mon, May 24, 2021 at 05:52:58PM +1000, David Gibson wrote:
   

I don't really see a semantic distinction between "always one-device
groups" and "groups don't matter".  Really the only way you can afford
to not care about groups is if they're singletons.


The kernel driver under the mdev may not be in an "always one-device"
group.


I don't really understand what you mean by that.


I mean the group of the mdev's actual DMA device may have multiple
things in it.
 

It is a kernel driver so the only thing we know and care about is that
all devices in the HW group are bound to kernel drivers.

The vfio device that spawns from this kernel driver is really a
"groups don't matter" vfio device because at the IOMMU layer it should
be riding on the physical group of the kernel driver.  At the VFIO
layer we no longer care about the group abstraction because the system
guarantees isolation in some other way.


Uh.. I don't really know how mdevs are isolated from each other.  I
thought it was because the physical device providing the mdevs
effectively had an internal IOMMU (or at least DMA permissioning) to
isolate the mdevs, even though the physical device may not be fully
isolated.

In that case the virtual mdev is effectively in a singleton group,
which is different from the group of its parent device.
   


That's correct.


That is one way to view it, but it means creating a whole group
infrastructure and abusing the IOMMU stack just to create this
nonsense fiction.


I really didn't get how this abuses the IOMMU stack.
mdev can be used in 3 different ways:
1. non-iommu backed mdev devices where mdev vendor driver takes care to
DMA map (iommu_map) and isolation is through device hardware internal
MMU. Here vfio_iommu_type1 module provides a way to validate and pin
pages required by mdev device for DMA mapping. Then IOMMU mapping is
done by mdev vendor driver which is owner driver of physical device.

2. iommu backed mdev devices for SRIOV where mdev device is created per
VF (mdev device == VF device) then that mdev device has same iommu
protection scope as VF associated to it. Here mdev device is virtual
device which uses features of mdev and represents underlying VF device,
same as vfio-pci but with additional mdev features.


What features would those be?  There are no mdev specific parts of the
vfio uAPI.

The mdev device is a virtual device, but why is it virtual in this case?
Aren't we effectively assigning the VF itself (mdev device == VF device)
with a bunch of extra support code to fill in the gaps of the VF
implementing the complete device model in hardware?

We're effectively creating this virtual device, creating a fake IOMMU
group, and trying to create this association of this virtual device to
the real VF in order to shoehorn it into the mdev model.  What do we
get from that model other than lifecycle management (ie. type selection)
and re-use of a bunch of code from the driver supporting the 1) model
above?



Yes, the lifecycle management which is in mdev is not in vfio-pci variant.


This specific model seems better served by a device specific peer
driver to vfio-pci (ie. a "vfio-pci variant").  You effectively already
have the code for this driver, it's just in the format of an mdev
driver rather than a vfio "bus driver".  The work Jason references
relative to Max aims to make these kinds of drivers easier to implement
through re-use of vfio-pci code.

There are certainly other solutions we could come up with for selecting
a specific device type for a vfio-pci variant driver to implement other
than pretending this model actually belongs in mdev, right?  Thanks,



Sure, and I would like to see the type selection mechanism implemented in
the vfio-pci variant.


Thanks,
Kirti



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-25 Thread Alex Williamson
On Wed, 26 May 2021 00:56:30 +0530
Kirti Wankhede  wrote:

> On 5/25/2021 5:07 AM, Jason Gunthorpe wrote:
> > On Mon, May 24, 2021 at 05:52:58PM +1000, David Gibson wrote:
> >   
>  I don't really see a semantic distinction between "always one-device
>  groups" and "groups don't matter".  Really the only way you can afford
>  to not care about groups is if they're singletons.  
> >>>
> >>> The kernel driver under the mdev may not be in an "always one-device"
> >>> group.  
> >>
> >> I don't really understand what you mean by that.  
> > 
> > I mean the group of the mdev's actual DMA device may have multiple
> > things in it.
> > 
> >>> It is a kernel driver so the only thing we know and care about is that
> >>> all devices in the HW group are bound to kernel drivers.
> >>>
> >>> The vfio device that spawns from this kernel driver is really a
> >>> "groups don't matter" vfio device because at the IOMMU layer it should
> >>> be riding on the physical group of the kernel driver.  At the VFIO
> >>> layer we no longer care about the group abstraction because the system
> >>> guarantees isolation in some other way.
> >>
> >> Uh.. I don't really know how mdevs are isolated from each other.  I
> >> thought it was because the physical device providing the mdevs
> >> effectively had an internal IOMMU (or at least DMA permissioning) to
> >> isolate the mdevs, even though the physical device may not be fully
> >> isolated.
> >>
> >> In that case the virtual mdev is effectively in a singleton group,
> >> which is different from the group of its parent device.  
> >   
> 
> That's correct.
> 
> > That is one way to view it, but it means creating a whole group
> > infrastructure and abusing the IOMMU stack just to create this
> > nonsense fiction.  
> 
> I really didn't get how this abuses the IOMMU stack.
> mdev can be used in 3 different ways:
> 1. non-iommu backed mdev devices where mdev vendor driver takes care to
> DMA map (iommu_map) and isolation is through device hardware internal
> MMU. Here vfio_iommu_type1 module provides a way to validate and pin
> pages required by mdev device for DMA mapping. Then IOMMU mapping is
> done by mdev vendor driver which is owner driver of physical device.
> 
> 2. iommu backed mdev devices for SRIOV where mdev device is created per
> VF (mdev device == VF device) then that mdev device has same iommu
> protection scope as VF associated to it. Here mdev device is virtual
> device which uses features of mdev and represents underlying VF device,
> same as vfio-pci but with additional mdev features.

What features would those be?  There are no mdev specific parts of the
vfio uAPI.

The mdev device is a virtual device, but why is it virtual in this case?
Aren't we effectively assigning the VF itself (mdev device == VF device)
with a bunch of extra support code to fill in the gaps of the VF
implementing the complete device model in hardware?

We're effectively creating this virtual device, creating a fake IOMMU
group, and trying to create this association of this virtual device to
the real VF in order to shoehorn it into the mdev model.  What do we
get from that model other than lifecycle management (ie. type selection)
and re-use of a bunch of code from the driver supporting the 1) model
above?

This specific model seems better served by a device specific peer
driver to vfio-pci (ie. a "vfio-pci variant").  You effectively already
have the code for this driver, it's just in the format of an mdev
driver rather than a vfio "bus driver".  The work Jason references
relative to Max aims to make these kinds of drivers easier to implement
through re-use of vfio-pci code.

There are certainly other solutions we could come up with for selecting
a specific device type for a vfio-pci variant driver to implement other
than pretending this model actually belongs in mdev, right?  Thanks,

Alex



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-25 Thread Kirti Wankhede




On 5/26/2021 1:22 AM, Jason Gunthorpe wrote:

On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:


2. iommu backed mdev devices for SRIOV where mdev device is created per
VF (mdev device == VF device) then that mdev device has same iommu
protection scope as VF associated to it.


This doesn't require, and certainly shouldn't create, a fake group.

Only the VF's real IOMMU group should be used to model an iommu domain
linked to a VF. Injecting fake groups that are proxies for real groups
only opens the possibility of security problems like David is
concerned with.



I think this security issue should be addressed by letting mdev device 
inherit its parent's iommu_group, i.e. VF's iommu_group here.


Kirti


Max's series approaches this properly by fully linking the struct
pci_device of the VF throughout the entire VFIO scheme, including the
group and container, while still allowing override of various VFIO
operations.

Jason




Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-25 Thread Jason Gunthorpe
On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:

> 2. iommu backed mdev devices for SRIOV where mdev device is created per
> VF (mdev device == VF device) then that mdev device has same iommu
> protection scope as VF associated to it. 

This doesn't require, and certainly shouldn't create, a fake group.

Only the VF's real IOMMU group should be used to model an iommu domain
linked to a VF. Injecting fake groups that are proxies for real groups
only opens the possibility of security problems like David is
concerned with.

Max's series approaches this properly by fully linking the struct
pci_device of the VF throughout the entire VFIO scheme, including the
group and container, while still allowing override of various VFIO
operations.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-25 Thread Kirti Wankhede




On 5/25/2021 5:07 AM, Jason Gunthorpe wrote:

On Mon, May 24, 2021 at 05:52:58PM +1000, David Gibson wrote:


I don't really see a semantic distinction between "always one-device
groups" and "groups don't matter".  Really the only way you can afford
to not care about groups is if they're singletons.


The kernel driver under the mdev may not be in an "always one-device"
group.


I don't really understand what you mean by that.


I mean the group of the mdev's actual DMA device may have multiple
things in it.
  

It is a kernel driver so the only thing we know and care about is that
all devices in the HW group are bound to kernel drivers.

The vfio device that spawns from this kernel driver is really a
"groups don't matter" vfio device because at the IOMMU layer it should
be riding on the physical group of the kernel driver.  At the VFIO
layer we no longer care about the group abstraction because the system
guarantees isolation in some other way.


Uh.. I don't really know how mdevs are isolated from each other.  I
thought it was because the physical device providing the mdevs
effectively had an internal IOMMU (or at least DMA permissioning) to
isolate the mdevs, even though the physical device may not be fully
isolated.

In that case the virtual mdev is effectively in a singleton group,
which is different from the group of its parent device.




That's correct.


That is one way to view it, but it means creating a whole group
infrastructure and abusing the IOMMU stack just to create this
nonsense fiction.


I really didn't get how this abuses the IOMMU stack.
mdev can be used in 3 different ways:
1. non-iommu backed mdev devices where mdev vendor driver takes care to
DMA map (iommu_map) and isolation is through device hardware internal
MMU. Here vfio_iommu_type1 module provides a way to validate and pin
pages required by mdev device for DMA mapping. Then IOMMU mapping is
done by mdev vendor driver which is owner driver of physical device.

2. iommu backed mdev devices for SRIOV where mdev device is created per
VF (mdev device == VF device) then that mdev device has same iommu
protection scope as VF associated to it. Here mdev device is virtual
device which uses features of mdev and represents underlying VF device,
same as vfio-pci but with additional mdev features.

3. iommu backed mdev devices for PASID with aux feature. I would not
comment on this, there has been a long discussion on this.

I don't think this is abusing the IOMMU stack, at least for 1 and 2 above.

Thanks,
Kirti



We also abuse the VFIO container stuff to hackily
create several different types of IOMMU uAPIs for the mdev - all of
which are unrelated to drivers/iommu.

Basically, there is no drivers/iommu thing involved, thus there is no real
iommu group, for mdev it is all a big hacky lie.


If the physical device had a bug which meant the mdevs *weren't*
properly isolated from each other, then those mdevs would share a
group, and you *would* care about it.  Depending on how the isolation
failed the mdevs might or might not also share a group with the parent
physical device.


That isn't a real scenario.. mdevs that can't be isolated just
wouldn't be useful to exist


This is today's model, yes. When you run dpdk on a multi-group device
vfio already ensures that all the device groups remained parked and
inaccessible.


I'm not really following what you're saying there.

If you have a multi-device group, and dpdk is using one device in it,
VFIO *does not* (and cannot) ensure that other devices in the group
are parked and inaccessible.


I mean in the sense that no other user space can open those devices
and no kernel driver can later be attached to them.


It ensures that they're parked at the moment the group moves from
kernel to userspace ownership, but it can't prevent dpdk from
accessing and unparking those devices via peer to peer DMA.


Right, and adding all this group stuff did nothing to alert the poor
admin that is running DPDK to this risk.


If the administator configures the system with different security
labels for different VFIO devices then yes removing groups makes this
more tricky as all devices in the group should have the same label.


That seems a bigger problem than "more tricky".  How would you propose
addressing this with your device-first model?


You put the same security labels you'd put on the group to the devices
that consitute the group. It is only more tricky in the sense that the
script that would have to do this will need to do more than ID the
group to label but also ID the device members of the group and label
their char nodes.

Jason




Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-24 Thread Jason Gunthorpe
On Mon, May 24, 2021 at 05:52:58PM +1000, David Gibson wrote:

> > > I don't really see a semantic distinction between "always one-device
> > > groups" and "groups don't matter".  Really the only way you can afford
> > > to not care about groups is if they're singletons.
> > 
> > The kernel driver under the mdev may not be in an "always one-device"
> > group.
> 
> I don't really understand what you mean by that.

I mean the group of the mdev's actual DMA device may have multiple
things in it.
 
> > It is a kernel driver so the only thing we know and care about is that
> > all devices in the HW group are bound to kernel drivers.
> > 
> > The vfio device that spawns from this kernel driver is really a
> > "groups don't matter" vfio device because at the IOMMU layer it should
> > be riding on the physical group of the kernel driver.  At the VFIO
> > layer we no longer care about the group abstraction because the system
> > guarantees isolation in some other way.
> 
> Uh.. I don't really know how mdevs are isolated from each other.  I
> thought it was because the physical device providing the mdevs
> effectively had an internal IOMMU (or at least DMA permissioning) to
> isolate the mdevs, even though the physical device may not be fully
> isolated.
> 
> In that case the virtual mdev is effectively in a singleton group,
> which is different from the group of its parent device.

That is one way to view it, but it means creating a whole group
infrastructure and abusing the IOMMU stack just to create this
nonsense fiction. We also abuse the VFIO container stuff to hackily
create several different types of IOMMU uAPIs for the mdev - all of
which are unrelated to drivers/iommu.

Basically, there is no drivers/iommu thing involved, thus there is no real
iommu group; for mdev it is all a big hacky lie.

> If the physical device had a bug which meant the mdevs *weren't*
> properly isolated from each other, then those mdevs would share a
> group, and you *would* care about it.  Depending on how the isolation
> failed the mdevs might or might not also share a group with the parent
> physical device.

That isn't a real scenario.. mdevs that can't be isolated just
wouldn't be useful to exist

> > This is today's model, yes. When you run dpdk on a multi-group device
> > vfio already ensures that all the device groups remained parked and
> > inaccessible.
> 
> I'm not really following what you're saying there.
> 
> If you have a multi-device group, and dpdk is using one device in it,
> VFIO *does not* (and cannot) ensure that other devices in the group
> are parked and inaccessible.  

I mean in the sense that no other user space can open those devices
and no kernel driver can later be attached to them.

> It ensures that they're parked at the moment the group moves from
> kernel to userspace ownership, but it can't prevent dpdk from
> accessing and unparking those devices via peer to peer DMA.

Right, and adding all this group stuff did nothing to alert the poor
admin that is running DPDK to this risk.

> > If the administrator configures the system with different security
> > labels for different VFIO devices then yes removing groups makes this
> > more tricky as all devices in the group should have the same label.
> 
> That seems a bigger problem than "more tricky".  How would you propose
> addressing this with your device-first model?

You put the same security labels you'd put on the group to the devices
that constitute the group. It is only more tricky in the sense that the
script that would have to do this will need to do more than ID the
group to label but also ID the device members of the group and label
their char nodes.
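
To make that a bit more concrete, such a script could be as small as the
sketch below. The sysfs group layout is the existing
/sys/kernel/iommu_groups/<N>/devices/; the per-device char node path is
hypothetical, since in a device-first model those nodes don't exist yet:

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int label_group_devices(int group, uid_t uid, gid_t gid)
{
	char path[256];
	struct dirent *d;
	DIR *dir;

	snprintf(path, sizeof(path),
		 "/sys/kernel/iommu_groups/%d/devices", group);
	dir = opendir(path);
	if (!dir)
		return -1;

	while ((d = readdir(dir)) != NULL) {
		char node[256];

		if (d->d_name[0] == '.')
			continue;
		/* hypothetical per-device vfio char node */
		snprintf(node, sizeof(node), "/dev/vfio/devices/%s", d->d_name);
		if (chown(node, uid, gid))
			perror(node);
	}
	closedir(dir);
	return 0;
}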

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-24 Thread David Gibson
On Thu, May 13, 2021 at 10:59:38AM -0300, Jason Gunthorpe wrote:
> On Thu, May 13, 2021 at 03:48:19PM +1000, David Gibson wrote:
> > On Mon, May 03, 2021 at 01:15:18PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Apr 29, 2021 at 01:04:05PM +1000, David Gibson wrote:
> > > > Again, I don't know enough about VDPA to make sense of that.  Are we
> > > > essentially talking non-PCI virtual devices here?  In which case you
> > > > could define the VDPA "bus" to always have one-device groups.
> > > 
> > > It is much worse than that.
> > > 
> > > What these non-PCI devices need is for the kernel driver to be part of
> > > the IOMMU group of the underlying PCI device but tell VFIO land that
> > > "groups don't matter"
> > 
> > I don't really see a semantic distinction between "always one-device
> > groups" and "groups don't matter".  Really the only way you can afford
> > to not care about groups is if they're singletons.
> 
> The kernel driver under the mdev may not be in an "always one-device"
> group.

I don't really understand what you mean by that.

> It is a kernel driver so the only thing we know and care about is that
> all devices in the HW group are bound to kernel drivers.
> 
> The vfio device that spawns from this kernel driver is really a
> "groups don't matter" vfio device because at the IOMMU layer it should
> be riding on the physical group of the kernel driver.  At the VFIO
> layer we no longer care about the group abstraction because the system
> guarantees isolation in some other way.

Uh.. I don't really know how mdevs are isolated from each other.  I
thought it was because the physical device providing the mdevs
effectively had an internal IOMMU (or at least DMA permissioning) to
isolate the mdevs, even though the physical device may not be fully
isolated.

In that case the virtual mdev is effectively in a singleton group,
which is different from the group of its parent device.

If the physical device had a bug which meant the mdevs *weren't*
properly isolated from each other, then those mdevs would share a
group, and you *would* care about it.  Depending on how the isolation
failed the mdevs might or might not also share a group with the parent
physical device.

> The issue is a software one of tightly coupling IOMMU HW groups to
> VFIO's API and then introducing an entire class of VFIO mdev devices
> that no longer care about IOMMU HW groups at all.

The don't necessarily care about the IOMMU groups of the parent
physical hardware, but they have their own IOMMU groups as virtual
hardware devices.

> Currently mdev tries to trick this by creating singleton groups, but
> it is very ugly and very tightly coupled to a specific expectation of
> the few existing mdev drivers. Trying to add PASID made it a lot worse.
> 
> > Aside: I'm primarily using "group" to mean the underlying hardware
> > unit, not the vfio construct on top of it, I'm not sure that's been
> > clear throughout.
> 
> Sure, that is obviously fixed, but I'm not interested in that.
> 
> I'm interested in having a VFIO API that makes sense for vfio-pci
> which has a tight coupling to the HW notion of a IOMMU and also vfio
> mdev's that have no concept of a HW IOMMU group.
> 
> > So.. your model assumes that every device has a safe quiescent state
> > where it won't do any harm until poked, whether its group is
> > currently kernel owned, or owned by a userspace that doesn't know
> > anything about it.
> 
> This is today's model, yes. When you run dpdk on a multi-group device
> vfio already ensures that all the device groups remained parked and
> inaccessible.

I'm not really following what you're saying there.

If you have a multi-device group, and dpdk is using one device in it,
VFIO *does not* (and cannot) ensure that other devices in the group
are parked and inaccessible.  It ensures that they're parked at the
moment the group moves from kernel to userspace ownership, but it
can't prevent dpdk from accessing and unparking those devices via peer
to peer DMA.

> > At minimum this does mean that in order to use one device in the group
> > you must have permission to use *all* the devices in the group -
> > otherwise you may be able to operate a device you don't have
> > permission to by DMAing to its registers from a device you do have
> > permission to.
> 
> If the administrator configures the system with different security
> labels for different VFIO devices then yes removing groups makes this
> more tricky as all devices in the group should have the same label.

That seems a bigger problem than "more tricky".  How would you propose
addressing this with your device-first model?

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-24 Thread David Gibson
On Thu, May 13, 2021 at 10:50:30AM -0300, Jason Gunthorpe wrote:
> On Thu, May 13, 2021 at 04:07:07PM +1000, David Gibson wrote:
> > On Wed, May 05, 2021 at 01:39:02PM -0300, Jason Gunthorpe wrote:
> > > On Wed, May 05, 2021 at 02:28:53PM +1000, Alexey Kardashevskiy wrote:
> > > 
> > > > This is a good feature in general when let's say there is a linux supported
> > > > device which has a proprietary device firmware update tool which only exists
> > > > as an x86 binary and your hardware is not x86 - running qemu + vfio in full
> > > > emulation would provide a way to run the tool to update a physical device.
> > > 
> > > That specific use case doesn't really need a vIOMMU though, does it?
> > 
> > Possibly not, but the mechanics needed to do vIOMMU on different host
> > IOMMU aren't really different from what you need for a no-vIOMMU
> > guest.  
> 
> For very simple vIOMMUs this might be true, but these new features of nesting,
> PASID, migration, etc, etc all make the vIOMMU complicated and
> emulating it completely a lot harder.

Well, sure, emulating a complex vIOMMU is complex. But "very simple
vIOMMUs" covers the vast majority of currently deployed hardware, and
several are already emulated by qemu.

> Stuffing a vfio-pci into a guest and creating a physical map using a
> single IOASID is comparably trivial.

Note that for PAPR (POWER guest) systems this is not an option: the
PAPR platform *always* has a vIOMMU.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-13 Thread Jason Gunthorpe
On Thu, May 13, 2021 at 03:48:19PM +1000, David Gibson wrote:
> On Mon, May 03, 2021 at 01:15:18PM -0300, Jason Gunthorpe wrote:
> > On Thu, Apr 29, 2021 at 01:04:05PM +1000, David Gibson wrote:
> > > Again, I don't know enough about VDPA to make sense of that.  Are we
> > > essentially talking non-PCI virtual devices here?  In which case you
> > > could define the VDPA "bus" to always have one-device groups.
> > 
> > It is much worse than that.
> > 
> > What these non-PCI devices need is for the kernel driver to be part of
> > the IOMMU group of the underlying PCI device but tell VFIO land that
> > "groups don't matter"
> 
> I don't really see a semantic distinction between "always one-device
> groups" and "groups don't matter".  Really the only way you can afford
> to not care about groups is if they're singletons.

The kernel driver under the mdev may not be in an "always one-device"
group.

It is a kernel driver so the only thing we know and care about is that
all devices in the HW group are bound to kernel drivers.

The vfio device that spawns from this kernel driver is really a
"groups don't matter" vfio device because at the IOMMU layer it should
be riding on the physical group of the kernel driver.  At the VFIO
layer we no longer care about the group abstraction because the system
guarantees isolation in some other way.

The issue is a software one of tightly coupling IOMMU HW groups to
VFIO's API and then introducing an entire class of VFIO mdev devices
that no longer care about IOMMU HW groups at all.

Currently mdev tries to trick this by creating singleton groups, but
it is very ugly and very tightly coupled to a specific expectation of
the few existing mdev drivers. Trying to add PASID made it a lot worse.

> Aside: I'm primarily using "group" to mean the underlying hardware
> unit, not the vfio construct on top of it, I'm not sure that's been
> clear throughout.

Sure, that is obviously fixed, but I'm not interested in that.

I'm interested in having a VFIO API that makes sense for vfio-pci
which has a tight coupling to the HW notion of a IOMMU and also vfio
mdev's that have no concept of a HW IOMMU group.

> So.. your model assumes that every device has a safe quiescent state
> where it won't do any harm until poked, whether its group is
> currently kernel owned, or owned by a userspace that doesn't know
> anything about it.

This is today's model, yes. When you run dpdk on a multi-group device
vfio already ensures that all the device groups remained parked and
inaccessible.
 
> At minimum this does mean that in order to use one device in the group
> you must have permission to use *all* the devices in the group -
> otherwise you may be able to operate a device you don't have
> permission to by DMAing to its registers from a device you do have
> permission to.

If the administrator configures the system with different security
labels for different VFIO devices then yes removing groups makes this
more tricky as all devices in the group should have the same label.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-13 Thread Jason Gunthorpe
On Thu, May 13, 2021 at 04:07:07PM +1000, David Gibson wrote:
> On Wed, May 05, 2021 at 01:39:02PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 05, 2021 at 02:28:53PM +1000, Alexey Kardashevskiy wrote:
> > 
> > > This is a good feature in general when let's say there is a linux supported
> > > device which has a proprietary device firmware update tool which only exists
> > > as an x86 binary and your hardware is not x86 - running qemu + vfio in full
> > > emulation would provide a way to run the tool to update a physical device.
> > 
> > That specific use case doesn't really need a vIOMMU though, does it?
> 
> Possibly not, but the mechanics needed to do vIOMMU on different host
> IOMMU aren't really different from what you need for a no-vIOMMU
> guest.  

For very simple vIOMMUs this might be true, but these new features of nesting,
PASID, migration, etc, etc all make the vIOMMU complicated and
emulating it completely a lot harder.

Stuffing a vfio-pci into a guest and creating a physical map using a
single IOASID is comparably trivial.
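
For illustration, from userspace that whole setup could be as small as
the sketch below, reusing uAPI names floated elsewhere in this thread.
IOASID_MAP_DMA and the struct layout are placeholders I made up, none of
this is a defined interface, and error handling is omitted:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/types.h>

struct ioasid_dma_map {		/* hypothetical */
	__u64 iova;		/* GPA as the device/guest sees it */
	__u64 vaddr;		/* HVA backing that GPA */
	__u64 size;
};

static int setup_flat_ioasid(int device_fd, void *guest_ram, __u64 ram_size)
{
	int ioasid_fd = open("/dev/ioasid", O_RDWR);
	int ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
	struct ioasid_dma_map map = {
		.iova  = 0,
		.vaddr = (__u64)(unsigned long)guest_ram,
		.size  = ram_size,
	};

	ioctl(device_fd, VFIO_ATTACH_IOASID, ioasid);

	/* one flat GPA->HVA mapping covering all of guest RAM */
	ioctl(ioasid_fd, IOASID_MAP_DMA, &map);
	return ioasid;
}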

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-13 Thread Jason Gunthorpe
On Thu, May 13, 2021 at 04:01:20PM +1000, David Gibson wrote:

> But.. even if you're exposing page tables to userspace.. with hardware
> that has explicit support for nesting you can probably expose the hw
> tables directly which is great for the cases that works for.  But
> surely for older IOMMUs which don't do nesting you must have some way
> of shadowing guest IO page tables to host IO page tables to translate
> GPA to HPA at least?

I expect this would be in qemu and would be part of the expensive
emulation I suggested. Converting the guest's page table structure
into a sequence of map/unmaps to a non-nestable IOASID.
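
Roughly, every valid entry the emulation finds in the guest IO page table
turns into one MAP call. Against today's type1 uAPI that shadowing step
would look something like the sketch below (gpa_to_hva() is a
hypothetical stand-in for qemu's GPA lookup; a non-nestable IOASID would
presumably take an equivalent MAP ioctl):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

extern __u64 gpa_to_hva(__u64 gpa);	/* hypothetical GPA->HVA lookup */

static void shadow_one_entry(int container_fd, __u64 iova, __u64 gpa, __u64 size)
{
	struct vfio_iommu_type1_dma_map map;

	memset(&map, 0, sizeof(map));
	map.argsz = sizeof(map);
	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	map.iova  = iova;		/* guest IOVA from the guest IO PTE */
	map.vaddr = gpa_to_hva(gpa);	/* host virtual address backing it */
	map.size  = size;

	ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}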

> If you're doing that, I don't see that converting page table format
> is really any harder

It isn't, but it is a completely different flow and custom from the
normal HW accelerated nesting.

> It might not be a theoretically complete emulation of the vIOMMU, but
> it can support in-practice usage.  In particular it works pretty well
> if your backend has a nice big IOVA range (like x86 IOMMUS) but your
> guest platform typically uses relatively small IOVA windows.  PAPR on
> x86 is exactly that... well.. possibly not the 64-bit window, but
> because of old PAPR platforms that didn't support that, we can choose
> not to advertise that and guests will cope.

So maybe this multi-window thing is generic API somehow. You'll have
to check what Kevin comes up with to ensure it fits in

Jason


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-13 Thread Tian, Kevin
> From: David Gibson 
> Sent: Thursday, May 13, 2021 2:01 PM
> >
> > But this definitely all becomes HW specific.
> >
> > For instance I want to have an ARM vIOMMU driver it needs to do some
> >
> >  ret = ioctl(ioasid_fd, CREATE_NESTED_IOASID, [page table format is ARMvXXX])
> >  if (ret == -EOPNOTSUPP)
> >  ret = ioctl(ioasid_fd, CREATE_NORMAL_IOASID, ..)
> >  // and do completely different and more expensive emulation
> >
> > I can get a little bit of generality, but at the end of the day the
> > IOMMU must create a specific HW layout of the nested page table, if it
> > can't, it can't.
> 
> Erm.. I don't really know how your IOASID interface works here.  I'm
> thinking about the VFIO interface where maps and unmaps are via
> explicit ioctl()s, which provides an obvious point to do translation
> between page table formats.
> 

We are working on a draft uAPI proposal based on the discussions in this
thread. Expect to send it out next week.

Thanks
Kevin


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-13 Thread David Gibson
On Wed, May 05, 2021 at 01:39:02PM -0300, Jason Gunthorpe wrote:
> On Wed, May 05, 2021 at 02:28:53PM +1000, Alexey Kardashevskiy wrote:
> 
> > This is a good feature in general when let's say there is a linux supported
> > device which has a proprietary device firmware update tool which only exists
> > as an x86 binary and your hardware is not x86 - running qemu + vfio in full
> > emulation would provide a way to run the tool to update a physical device.
> 
> That specific use case doesn't really need a vIOMMU though, does it?

Possibly not, but the mechanics needed to do vIOMMU on different host
IOMMU aren't really different from what you need for a no-vIOMMU
guest.  With a vIOMMU you need to map guest IOVA space into the host
IOVA space.  With no vIOMMU you need to map guest physical
addresses into the host IOVA space.  In either case the GPA/gIOVA to
userspace and userspace to HPA mappings are basically arbitrary.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-13 Thread David Gibson
On Mon, May 03, 2021 at 01:15:18PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 29, 2021 at 01:04:05PM +1000, David Gibson wrote:
> > Again, I don't know enough about VDPA to make sense of that.  Are we
> > essentially talking non-PCI virtual devices here?  In which case you
> > could define the VDPA "bus" to always have one-device groups.
> 
> It is much worse than that.
> 
> What these non-PCI devices need is for the kernel driver to be part of
> the IOMMU group of the underlying PCI device but tell VFIO land that
> "groups don't matter"

I don't really see a semantic distinction between "always one-device
groups" and "groups don't matter".  Really the only way you can afford
to not care about groups is if they're singletons.

> Today mdev tries to fake this by using singleton iommu groups, but it
> is really horrible and directly hacks up the VFIO IOMMU code to
> understand these special cases. Intel was proposing more special
> hacking in the VFIO IOMMU code to extend this to PASID.

At this stage I don't really understand why that would end up so
horrible.

> When we get to a /dev/ioasid this is all nonsense. The kernel device
> driver is going to have to tell drivers/iommu exactly what kind of
> ioasid it can accept, be it a PASID inside a kernel owned group, a SW
> emulated 'mdev' ioasid, or whatever.
> 
> In these cases the "group" idea has become a fiction that just creates
> a pain.

I don't see how the group is a fiction in this instance.  You can
still have devices that can't be isolated, therefore you can have
non-singleton groups.

> "Just reorganize VDPA to do something insane with the driver
> core so we can create a dummy group to satisfy an unnecessary uAPI
> restriction" is not a very compelling argument.
> 
> So if the nonsensical groups goes away for PASID/mdev, where does it
> leave the uAPI in other cases?
> 
> > I don't think simplified-but-wrong is a good goal.  The thing about
> > groups is that if they're there, you can't just "not care" about them,
> > they affect you whether you like it or not.
> 
> You really can. If one thing claims the group then all the other group
> devices become locked out.

Aside: I'm primarily using "group" to mean the underlying hardware
unit, not the vfio construct on top of it, I'm not sure that's been
clear throughout.

So.. your model assumes that every device has a safe quiescent state
where it won't do any harm until poked, whether its group is
currently kernel owned, or owned by a userspace that doesn't know
anything about it.

At minimum this does mean that in order to use one device in the group
you must have permission to use *all* the devices in the group -
otherwise you may be able to operate a device you don't have
permission to by DMAing to its registers from a device you do have
permission to.

Whatever scripts are managing ownership of devices also need to know
about groups, because they need to put all the devices into that
quiescent state before the group can change ownership.

> The main point to understand is that groups are NOT an application
> restriction! It is a whole system restriction that the operator needs
> to understand and deal with. This is why things like dpdk don't care
> about the group at all - there is nothing they can do with the
> information.
> 
> If the operator says to run dpdk on a specific device then the
> operator is the one that has to deal with all the other devices in the
> group getting locked out.

Ok, I think I see your point there.

> At best the application can make it more obvious that the operator is
> doing something dangerous, but the current kernel API doesn't seem to
> really support that either.
> 
> Jason
> 

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-13 Thread David Gibson
On Tue, May 04, 2021 at 03:15:37PM -0300, Jason Gunthorpe wrote:
> On Tue, May 04, 2021 at 01:54:55PM +1000, David Gibson wrote:
> > On Mon, May 03, 2021 at 01:05:30PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Apr 29, 2021 at 01:20:22PM +1000, David Gibson wrote:
> > > > > There is a certain appeal to having some
> > > > > 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra
> > > > > information like windows that can be optionally called by the viommu
> > > > > driver and it remains well defined and described.
> > > > 
> > > > Windows really aren't ppc specific.  They're absolutely there on x86
> > > > and everything else as well - it's just that people are used to having
> > > > a window at 0.. that you can often get away with
> > > > treating it sloppily.
> > > 
> > > My point is this detailed control seems to go on to more than just
> > > windows. As you say the vIOMMU is emulating specific HW that needs to
> > > have kernel interfaces to match it exactly.
> > 
> > It's really not that bad.  The case of emulating the PAPR vIOMMU on
> > something else is relatively easy, because all updates to the IO page
> > tables go through hypercalls.  So, as long as the backend IOMMU can
> > map all the IOVAs that the guest IOMMU can, then qemu's implementation
> > of those hypercalls just needs to put an equivalent mapping in the
> > backend, which it can do with a generic VFIO_DMA_MAP.
> 
> So you also want the PAPR vIOMMU driver to run on, say, an ARM IOMMU?

Well, I don't want to preclude it in the API.  I'm not sure about that
specific example, but in most cases it should be possible to run the
PAPR vIOMMU on an x86 IOMMU backend.  Obviously only something you'd
want to do for testing and experimentation, but it could be quite
useful for that.

> > vIOMMUs with page tables in guest memory are harder, but only really
> > in the usual ways that a vIOMMU of that type is harder (needs cache
> > mode or whatever).  At whatever point you need to shadow from the
> > guest IO page tables to the host backend, you can again do that with
> > generic maps, as long as the backend supports the necessary IOVAs, and
> > has an IO page size that's equal to or a submultiple of the vIOMMU
> > page size.
> 
> But this definitely all becomes HW specific.
> 
> For instance I want to have an ARM vIOMMU driver it needs to do some
> 
>  ret = ioctl(ioasid_fd, CREATE_NESTED_IOASID, [page table format is ARMvXXX])
>  if (ret == -EOPNOTSUPP)
>  ret = ioctl(ioasid_fd, CREATE_NORMAL_IOASID, ..)
>  // and do completely different and more expensive emulation
> 
> I can get a little bit of generality, but at the end of the day the
> IOMMU must create a specific HW layout of the nested page table, if it
> can't, it can't.

Erm.. I don't really know how your IOASID interface works here.  I'm
thinking about the VFIO interface where maps and unmaps are via
explicit ioctl()s, which provides an obvious point to do translation
between page table formats.

But.. even if you're exposing page tables to userspace.. with hardware
that has explicit support for nesting you can probably expose the hw
tables directly which is great for the cases that works for.  But
surely for older IOMMUs which don't do nesting you must have some way
of shadowing guest IO page tables to host IO page tables to translate
GPA to HPA at least?  If you're doing that, I don't see that
converting page table format is really any harder

> > > I'm remarking that trying to unify every HW IOMMU implementation that
> > > ever has/will exist into a generic API complete enough to allow the
> > > vIOMMU to be created is likely to result in an API too complicated to
> > > understand..
> > 
> > Maybe not every one, but I think we can get a pretty wide range with a
> > reasonable interface.  
> 
> It sounds like a reasonable guideline is if the feature is actually
> general to all IOMMUs and can be used by qemu as part of a vIOMMU
> emulation when compatible vIOMMU HW is not available.
> 
> Having 'requested window' support that isn't actually implemented in
> every IOMMU is going to mean the PAPR vIOMMU emulation won't work,
> defeating the whole point of making things general?

The trick is that you don't necessarily need dynamic window support in
the backend to emulate it in the vIOMMU.  If your backend has fixed
windows, then you emulate request window as:
if (requested window is within backend windows)
no-op;
else
return ERROR;

It might not be a theoretically complete emulation of the vIOMMU, but
it can support in-practice usage.  In particular it works pretty well
if your backend has a nice big IOVA range (like x86 IOMMUS) but your
guest platform typically uses relatively small IOVA windows.  PAPR on
x86 is exactly that... well.. possibly not the 64-bit window, but
because of old PAPR platforms that didn't support that, we can choose
not to advertise that and guests will cope.
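
Spelled out (the structures and the error value here are illustrative
only):

struct iova_window {
	unsigned long long start;
	unsigned long long end;
};

/* backend_windows[] would be whatever the host IOMMU backend reports */
static int emulate_create_window(const struct iova_window *backend_windows,
				 int nr_backend, unsigned long long req_start,
				 unsigned long long req_end)
{
	int i;

	for (i = 0; i < nr_backend; i++)
		if (req_start >= backend_windows[i].start &&
		    req_end <= backend_windows[i].end)
			return 0;	/* fits inside a fixed window: no-op */

	return -1;			/* doesn't fit: fail the guest's request */
}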

-- 
David Gibson  

RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-11 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, May 12, 2021 8:25 AM
> 
> On Wed, May 12, 2021 at 12:21:24AM +, Tian, Kevin wrote:
> 
> > > Basically each RID knows based on its kernel drivers if it is a local
> > > or global RID and the ioasid knob can further fine tune this for any
> > > other specialty cases.
> >
> > It's fine if you insist on this way. Then we leave it to userspace to
> > ensure same split range is used across devices when vIOMMU is
> > concerned.
> 
> I'm still confused why there is a split range needed.

A device could support both ENQCMD and non-ENQCMD submissions.

For the ENQCMD path, the CPU provides a PASID translation mechanism
(from guest PASID to host PASID).

For the non-ENQCMD path, the guest driver directly programs the
untranslated guest PASID into the device MMIO register.

The host kernel only sets up the host PASID entry, which is the hwid
for a given ioasid page table.

If we don't split the range, we have to assume guest PASID == host PASID,
otherwise the non-ENQCMD path will fail. But exposing the host PASID to
the guest breaks migration.

If we want to allow migration, then we need to support guest PASID !=
host PASID and make sure both entries point to the same page table so
ENQCMD (host PASID) and non-ENQCMD (guest PASID) can both work.
That requires a range split to avoid conflicts between host and guest
PASIDs in the same space.
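
As a purely illustrative sketch of that aliasing (every helper name below
is made up; it only mirrors the text above):

#define LOCAL_PASID_MAX	1024	/* example split: local [0, 1024), global [1024, max) */

static int setup_hybrid_pasid(struct my_dev *dev, u32 gpasid, void *pgtbl)
{
	u32 hpasid;

	if (gpasid >= LOCAL_PASID_MAX)
		return -EINVAL;		/* guest must stay in the local range */

	hpasid = alloc_global_pasid();	/* host-owned, >= LOCAL_PASID_MAX */

	/* ENQCMD path: the CPU translates gpasid -> hpasid, so the host
	 * PASID entry must point at the I/O page table ... */
	set_pasid_entry(dev, hpasid, pgtbl);

	/* ... non-ENQCMD path: the guest programs gpasid directly into the
	 * device, so an aliased entry for gpasid points at the same table. */
	set_pasid_entry(dev, gpasid, pgtbl);

	cpu_pasid_translation_add(gpasid, hpasid);
	return hpasid;
}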

> 
> > Please note such range split has to be enforced through
> > vIOMMU which (e.g. on VT-d) includes a register to report available
> > PASID space size (applying to all devices behind this vIOMMU) to
> > the guest. The kernel just follows per-RID split info. If anything broken,
> > the userspace just shoots its own foot.
> 
> Is it because this specific vIOMMU protocol is limiting things?

When the range split is enabled, we need a way to tell the guest about the
local range size so that guest PASIDs are allocated only within this range.
The vIOMMU is then used to expose that information.

> 
> > > > > It does need some user visible difference because SIOV/mdev is not
> > > > > migratable. Only the kernel can select a PASID, userspace (and hence
> > > > > the guest) shouldn't have the option to force a specific PASID as the
> > > > > PASID space is shared across the entire RID to all VMs using the mdev.
> > > >
> > > > not migratable only when you choose exposing host-allocated PASID
> > > > into guest. However in the entire this proposal we actually virtualize
> > > > PASIDs, letting the guest manage its own PASID space in all
> > > > scenarios
> > >
> > > PASID cannot be virtualized without also using ENQCMD.
> > >
> > > A mdev that is using PASID without ENQCMD is non-migratable and this
> > > needs to be made visible in the uAPI.
> >
> > No. without ENQCMD the PASID must be programmed to a mdev MMIO
> > register. This operation is mediated then mdev driver can translate the
> > PASID from virtual to real.
> 
> That is probably unworkable with real devices, but if you do this you
> need to explicitly expose the vPASID to the mdev API somehow, and still
> the device needs to declare if it supports this, and devices that
> don't should still work in a non-migratable mode.
> 

It's not necessary. For real devices we use an alias mapping for both the
guest and host PASID as explained above. Then we can have the guest always
register its vPASID with the ioasid (just like map/unmap of GPA to HVA),
and let host drivers figure out whether that vPASID can be used as the
real hwid. When it is considered virtual and a real hwid is allocated by
the host, the mapping is saved under this ioasid to be queried by device
drivers if translation is required.

From this angle, the previous IOASID_SET_HWID possibly should be renamed
to IOASID_SET_USER_HWID.
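
From the userspace side that flow might look roughly like the sketch
below, reusing the proposal-stage names from this thread; nothing here is
an existing uAPI and the exact argument passing is a guess:

#include <sys/ioctl.h>
#include <linux/types.h>

static int register_guest_pasid(int ioasid_fd, int device_fd, __u32 vpasid)
{
	int pgtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);

	ioctl(device_fd, VFIO_ATTACH_IOASID, pgtbl_ioasid);

	/* Always hand the guest-chosen PASID to the kernel.  Whether it is
	 * used directly as the hw PASID or aliased to a host-allocated one
	 * is decided underneath and remembered for later translation. */
	ioctl(ioasid_fd, IOASID_SET_USER_HWID, pgtbl_ioasid, &vpasid);

	return pgtbl_ioasid;
}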

Thanks
Kevin


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-11 Thread Jason Gunthorpe
On Wed, May 12, 2021 at 12:21:24AM +, Tian, Kevin wrote:

> > Basically each RID knows based on its kernel drivers if it is a local
> > or global RID and the ioasid knob can further fine tune this for any
> > other specialty cases.
> 
> It's fine if you insist on this way. Then we leave it to userspace to
> ensure same split range is used across devices when vIOMMU is
> concerned. 

I'm still confused why there is a split range needed.

> Please note such range split has to be enforced through
> vIOMMU which (e.g. on VT-d) includes a register to report available
> PASID space size (applying to all devices behind this vIOMMU) to 
> the guest. The kernel just follows per-RID split info. If anything broken,
> the userspace just shoots its own foot.

Is it because this specific vIOMMU protocol is limiting things?

> > > > It does need some user visible difference because SIOV/mdev is not
> > > > migratable. Only the kernel can select a PASID, userspace (and hence
> > > > the guest) shouldn't have the option to force a specific PASID as the
> > > > PASID space is shared across the entire RID to all VMs using the mdev.
> > >
> > > not migratable only when you choose exposing host-allocated PASID
> > > into guest. However in the entire this proposal we actually virtualize
> > > PASIDs, letting the guest manage its own PASID space in all
> > > scenarios
> > 
> > PASID cannot be virtualized without also using ENQCMD.
> > 
> > A mdev that is using PASID without ENQCMD is non-migratable and this
> > needs to be made visible in the uAPI.
> 
> No. without ENQCMD the PASID must be programmed to a mdev MMIO
> register. This operation is mediated then mdev driver can translate the
> PASID from virtual to real.

That is probably unworkable with real devices, but if you do this you
need to explicitly expose the vPASID to the mdev API somehow, and still
the device needs to declare if it supports this, and devices that
don't should still work in a non-migratable mode.

Jason


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-11 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, May 12, 2021 7:40 AM
> 
> On Tue, May 11, 2021 at 10:51:40PM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Tuesday, May 11, 2021 10:39 PM
> > >
> > > On Tue, May 11, 2021 at 09:10:03AM +, Tian, Kevin wrote:
> > >
> > > > 3) SRIOV, ENQCMD (Intel):
> > > > - "PASID global" with host-allocated PASIDs;
> > > > - PASID table managed by host (in HPA space);
> > > > - all RIDs bound to this ioasid_fd use the global pool;
> > > > - however, exposing global PASID into guest breaks migration;
> > > > - hybrid scheme: split local PASID range and global PASID range;
> > > > - force guest to use only local PASID range (through vIOMMU);
> > > > - for ENQCMD, configure CPU to translate local->global;
> > > > - for non-ENQCMD, setup both local/global pasid entries;
> > > > - uAPI for range split and CPU pasid mapping:
> > > >
> > > > // set to "PASID global"
> > > > ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_GLOBAL);
> > > >
> > > > // split local/global range, applying to all RIDs in this fd
> > > > // Example: local [0, 1024), global [1024, max)
> > > > // local PASID range is managed by guest and migrated as VM state
> > > > // global PASIDs are re-allocated and mapped to local PASIDs post migration
> > > > ioctl(ioasid_fd, IOASID_HWID_SET_GLOBAL_MIN, 1024);
> > >
> > > I'm still not sold that ranges are the best idea here, it just adds
> > > more state that has to match during migration. Keeping the
> > > global/local split per RID seems much cleaner to me
> >
> > With ENQCMD the PASID is kept in CPU MSR, making it a process
> > context within the guest. When a guest process is bound to two
> > devices, the same local PASID must be usable on both devices.
> > Having per RID split cannot guarantee it.
> 
> That is only for ENQCMD. All drivers know if they are ENQCMD
> compatible drivers and can ensure they use the global allocator
> consistently for their RIDs.
> 
> Basically each RID knows based on its kernel drivers if it is a local
> or global RID and the ioasid knob can further fine tune this for any
> other specialty cases.

It's fine if you insist on this way. Then we leave it to userspace to
ensure same split range is used across devices when vIOMMU is
concerned. Please note such range split has to be enforced through
vIOMMU which (e.g. on VT-d) includes a register to report available
PASID space size (applying to all devices behind this vIOMMU) to 
the guest. The kernel just follows per-RID split info. If anything broken,
the userspace just shoots its own foot.

> 
> > > It does need some user visible difference because SIOV/mdev is not
> > > migratable. Only the kernel can select a PASID, userspace (and hence
> > > the guest) shouldn't have the option to force a specific PASID as the
> > > PASID space is shared across the entire RID to all VMs using the mdev.
> >
> > not migratable only when you choose exposing host-allocated PASID
> > into guest. However in the entire this proposal we actually virtualize
> > PASIDs, letting the guest manage its own PASID space in all
> > scenarios
> 
> PASID cannot be virtualized without also using ENQCMD.
> 
> A mdev that is using PASID without ENQCMD is non-migratable and this
> needs to be made visible in the uAPI.
> 

No. Without ENQCMD the PASID must be programmed into a mdev MMIO
register. This operation is mediated, so the mdev driver can translate
the PASID from virtual to real.
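
Illustratively, that mediated write path looks something like the sketch
below; all the names are made up, the only point being that the guest's
vPASID never reaches the hardware register, the translated hPASID does:

/* hypothetical per-WQ PASID register offset */
#define WQ_PASID_REG	0x10

static int my_mdev_pasid_reg_write(struct my_mdev *m, u32 vpasid)
{
	u32 hpasid;
	int ret;

	/* look up the host PASID the kernel allocated for this vPASID */
	ret = my_mdev_translate_pasid(m, vpasid, &hpasid);
	if (ret)
		return ret;

	/* program the real PASID into the physical WQ config register */
	writel(hpasid, m->wq_mmio_base + WQ_PASID_REG);
	return 0;
}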

Thanks
Kevin


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-11 Thread Jason Gunthorpe
On Tue, May 11, 2021 at 10:51:40PM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Tuesday, May 11, 2021 10:39 PM
> > 
> > On Tue, May 11, 2021 at 09:10:03AM +, Tian, Kevin wrote:
> > 
> > > 3) SRIOV, ENQCMD (Intel):
> > >   - "PASID global" with host-allocated PASIDs;
> > >   - PASID table managed by host (in HPA space);
> > >   - all RIDs bound to this ioasid_fd use the global pool;
> > >   - however, exposing global PASID into guest breaks migration;
> > >   - hybrid scheme: split local PASID range and global PASID range;
> > >   - force guest to use only local PASID range (through vIOMMU);
> > >   - for ENQCMD, configure CPU to translate local->global;
> > >   - for non-ENQCMD, setup both local/global pasid entries;
> > >   - uAPI for range split and CPU pasid mapping:
> > >
> > > // set to "PASID global"
> > > ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_GLOBAL);
> > >
> > > // split local/global range, applying to all RIDs in this fd
> > > // Example: local [0, 1024), global [1024, max)
> > > // local PASID range is managed by guest and migrated as VM state
> > > // global PASIDs are re-allocated and mapped to local PASIDs post migration
> > > ioctl(ioasid_fd, IOASID_HWID_SET_GLOBAL_MIN, 1024);
> > 
> > I'm still not sold that ranges are the best idea here, it just adds
> > more state that has to match during migration. Keeping the
> > global/local split per RID seems much cleaner to me
> 
> With ENQCMD the PASID is kept in CPU MSR, making it a process
> context within the guest. When a guest process is bound to two
> devices, the same local PASID must be usable on both devices.
> Having per RID split cannot guarantee it.

That is only for ENQCMD. All drivers know if they are ENQCMD
compatible drivers and can ensure they use the global allocator
consistently for their RIDs.

Basically each RID knows based on its kernel drivers if it is a local
or global RID and the ioasid knob can further fine tune this for any
other specialty cases.

> > It does need some user visible difference because SIOV/mdev is not
> > migratable. Only the kernel can select a PASID, userspace (and hence
> > the guest) shouldn't have the option to force a specific PASID as the
> > PASID space is shared across the entire RID to all VMs using the mdev.
> 
> not migratable only when you choose exposing host-allocated PASID
> into guest. However in the entire this proposal we actually virtualize
> PASIDs, letting the guest manage its own PASID space in all
> scenarios

PASID cannot be virtualized without also using ENQCMD.

A mdev that is using PASID without ENQCMD is non-migratable and this
needs to be made visible in the uAPI.

Jason


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-11 Thread Tian, Kevin
> From: Liu Yi L 
> Sent: Tuesday, May 11, 2021 9:25 PM
> 
> On Tue, 11 May 2021 09:10:03 +, Tian, Kevin wrote:
> 
> > > From: Jason Gunthorpe
> > > Sent: Monday, May 10, 2021 8:37 PM
> > >
> > [...]
> > > > gPASID!=hPASID has a problem when assigning a physical device which
> > > > supports both shared work queue (ENQCMD with PASID in MSR)
> > > > and dedicated work queue (PASID in device register) to a guest
> > > > process which is associated to a gPASID. Say the host kernel has setup
> > > > the hPASID entry with nested translation though /dev/ioasid. For
> > > > shared work queue the CPU is configured to translate gPASID in MSR
> > > > into **hPASID** before the payload goes out to the wire. However
> > > > for dedicated work queue the device MMIO register is directly mapped
> > > > to and programmed by the guest, thus containing a **gPASID** value
> > > > implying DMA requests through this interface will hit IOMMU faults
> > > > due to invalid gPASID entry. Having gPASID==hPASID is a simple
> > > > workaround here. mdev doesn't have this problem because the
> > > > PASID register is in emulated control-path thus can be translated
> > > > to hPASID manually by mdev driver.
> > >
> > > This all must be explicit too.
> > >
> > > If a PASID is allocated and it is going to be used with ENQCMD then
> > > everything needs to know it is actually quite different than a PASID
> > > that was allocated to be used with a normal SRIOV device, for
> > > instance.
> > >
> > > The former case can accept that the guest PASID is virtualized, while
> > > the latter can not.
> > >
> > > This is also why PASID per RID has to be an option. When I assign a
> > > full SRIOV function to the guest then that entire RID space needs to
> > > also be assigned to the guest. Upon migration I need to take all the
> > > physical PASIDs and rebuild them in another hypervisor exactly as is.
> > >
> > > If you force all RIDs into a global PASID pool then normal SRIOV
> > > migration w/PASID becomes impossible. ie ENQCMD breaks everything else
> > > that should work.
> > >
> > > This is why you need to sort all this out and why it feels like some
> > > of the specs here have been mis-designed.
> > >
> > > I'm not sure carving out ranges is really workable for migration.
> > >
> > > I think the real answer is to carve out entire RIDs as being in the
> > > global pool or not. Then the ENQCMD HW can be bundled together and
> > > everything else can live in the natural PASID per RID world.
> > >
> >
> > OK. Here is the revised scheme by making it explicitly.
> >
> > There are three scenarios to be considered:
> >
> > 1) SR-IOV (AMD/ARM):
> > - "PASID per RID" with guest-allocated PASIDs;
> > - PASID table managed by guest (in GPA space);
> > - the entire PASID space delegated to guest;
> > - no need to explicitly register guest-allocated PASIDs to host;
> > - uAPI for attaching PASID table:
> >
> > // set to "PASID per RID"
> > ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);
> >
> > // When Qemu captures a new PASID table through vIOMMU;
> > pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > ioctl(device_fd, VFIO_ATTACH_IOASID, pasidtbl_ioasid);
> >
> > // Set the PASID table to the RID associated with pasidtbl_ioasid;
> > ioctl(ioasid_fd, IOASID_SET_PASID_TABLE, pasidtbl_ioasid, gpa_addr);
> >
> > 2) SR-IOV, no ENQCMD (Intel):
> > - "PASID per RID" with guest-allocated PASIDs;
> > - PASID table managed by host (in HPA space);
> > - the entire PASID space delegated to guest too;
> > - host must be explicitly notified for guest-allocated PASIDs;
> > - uAPI for binding user-allocated PASIDs:
> >
> > // set to "PASID per RID"
> > ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);
> >
> > // When Qemu captures a new PASID allocated through vIOMMU;
> 
> Is this achieved by VCMD or by capturing guest's PASID cache invalidation?

The latter one

> 
> > pgtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > ioctl(device_fd, VFIO_ATTACH_IOASID, pgtbl_ioasid);
> >
> > // Tell the kernel to associate pasid to pgtbl_ioasid in internal 
> > structure;
> > //  being a pointer due to a requirement in scenario-3
> > ioctl(ioasid_fd, IOASID_SET_HWID, pgtbl_ioasid, );
> >
> > // Set guest page table to the RID+pasid associated to pgtbl_ioasid
> > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, pgtbl_ioasid, gpa_addr);
> >
> > 3) SRIOV, ENQCMD (Intel):
> > - "PASID global" with host-allocated PASIDs;
> > - PASID table managed by host (in HPA space);
> > - all RIDs bound to this ioasid_fd use the global pool;
> > - however, exposing global PASID into guest breaks migration;
> > - hybrid scheme: split local PASID range and global PASID range;
> > - force guest to use only local PASID range (through vIOMMU);
> > - for ENQCMD, configure CPU to translate local->global;
> > - for non-ENQCMD, setup both 

RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-11 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, May 11, 2021 10:39 PM
> 
> On Tue, May 11, 2021 at 09:10:03AM +, Tian, Kevin wrote:
> 
> > 3) SRIOV, ENQCMD (Intel):
> > - "PASID global" with host-allocated PASIDs;
> > - PASID table managed by host (in HPA space);
> > - all RIDs bound to this ioasid_fd use the global pool;
> > - however, exposing global PASID into guest breaks migration;
> > - hybrid scheme: split local PASID range and global PASID range;
> > - force guest to use only local PASID range (through vIOMMU);
> > - for ENQCMD, configure CPU to translate local->global;
> > - for non-ENQCMD, setup both local/global pasid entries;
> > - uAPI for range split and CPU pasid mapping:
> >
> > // set to "PASID global"
> > ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_GLOBAL);
> >
> > // split local/global range, applying to all RIDs in this fd
> > // Example: local [0, 1024), global [1024, max)
> > // local PASID range is managed by guest and migrated as VM state
> > // global PASIDs are re-allocated and mapped to local PASIDs post migration
> > ioctl(ioasid_fd, IOASID_HWID_SET_GLOBAL_MIN, 1024);
> 
> I'm still not sold that ranges are the best idea here, it just adds
> more state that has to match during migration. Keeping the
> global/local split per RID seems much cleaner to me

With ENQCMD the PASID is kept in CPU MSR, making it a process
context within the guest. When a guest process is bound to two
devices, the same local PASID must be usable on both devices.
Having per RID split cannot guarantee it.

> 
> This is also why I don't really like having the global/local be global
> to the ioasid either. It would be better to specify global/local as
> part of each VFIO_ATTACH_IOASID so each device is moved to the correct
> allocator.

This was my original thought. But for the above reason this has to be
a global enforcement in this ioasid fd.

> 
> > When considering SIOV/mdev there is no change to above uAPI sequence.
> > It's n/a for 1) as SIOV requires PASID table in HPA space, nor does it
> > cause any change to 3) regarding to the split range scheme. The only
> >  conceptual change is in 2), where although it's still "PASID per RID" the
> > PASIDs must be managed by host because the parent driver also allocates
> > PASIDs from per-RID space to mark mdev (RID+PASID). But this difference
> > doesn't change the uAPI flow - just treat user-provisioned PASID as 
> > 'virtual'
> > and then allocate a 'real' PASID at IOASID_SET_HWID. Later always use the
> > real one when programming PASID entry (IOASID_BIND_PGTABLE) or
> device
> > PASID register (converted in the mediation path).
> 
> It does need some user visible difference because SIOV/mdev is not
> migratable. Only the kernel can select a PASID, userspace (and hence
> the guest) shouldn't have the option to force a specific PASID as the
> PASID space is shared across the entire RID to all VMs using the mdev.

It's not migratable only when you choose to expose host-allocated PASIDs
into the guest. However in this entire proposal we actually virtualize
PASIDs, letting the guest manage its own PASID space in all scenarios
(be it SR-IOV or SIOV), though the size of the PASID space might differ.
The PASID chosen by the guest may be used as the hw PASID when the
PASID space is delegated to the guest (e.g. SR-IOV in scenario 1), or is
mapped to a different PASID allocated by the host (e.g. in this mdev
case or ENQCMD in scenario-3). From the uAPI p.o.v. userspace just
needs to attach its own pasid to the ioasid while the kernel decides
the real hwid underneath (being the same or a different one). Migration
only needs to cover guest-allocated PASIDs; all host-side PASIDs
are hidden from userspace and reconstructed on the new machine
post migration (following common virtualization practice).

The only exception where we return a host-allocated PASID to userspace
in scenario-3 is because Qemu needs that information to update the CPU
PASID translation table through KVM. Earlier you suggested that this
must be done explicitly through userspace instead of by implicit
notification between ioasid and kvm in the kernel.
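
As a sketch of that explicit userspace step (both ioctl names below are
invented placeholders; no such interfaces exist today):

#include <sys/ioctl.h>
#include <linux/types.h>

struct pasid_xlate {		/* hypothetical */
	__u32 vpasid;
	__u32 hpasid;
};

static int qemu_update_pasid_translation(int ioasid_fd, int kvm_vm_fd,
					 __u32 vpasid)
{
	struct pasid_xlate xlate = { .vpasid = vpasid };

	/* ask the ioasid layer which host PASID backs this guest PASID */
	xlate.hpasid = ioctl(ioasid_fd, IOASID_GET_HWID, vpasid);

	/* then explicitly program the CPU's ENQCMD translation via KVM */
	return ioctl(kvm_vm_fd, KVM_SET_PASID_TRANSLATION, &xlate);
}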

> 
> I don't see any alternative to telling every part if the PASID is
> going to be used by ENQCMD or not, too many important decisions rest
> on this detail.
> 
> Jason

Thanks
Kevin


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-11 Thread Jason Gunthorpe
On Tue, May 11, 2021 at 09:10:03AM +, Tian, Kevin wrote:

> 3) SRIOV, ENQCMD (Intel):
>   - "PASID global" with host-allocated PASIDs;
>   - PASID table managed by host (in HPA space);
>   - all RIDs bound to this ioasid_fd use the global pool;
>   - however, exposing global PASID into guest breaks migration;
>   - hybrid scheme: split local PASID range and global PASID range;
>   - force guest to use only local PASID range (through vIOMMU);
>   - for ENQCMD, configure CPU to translate local->global;
>   - for non-ENQCMD, setup both local/global pasid entries;
>   - uAPI for range split and CPU pasid mapping:
> 
> // set to "PASID global"
> ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_GLOBAL);
> 
> // split local/global range, applying to all RIDs in this fd
> // Example: local [0, 1024), global [1024, max)
> // local PASID range is managed by guest and migrated as VM state
> // global PASIDs are re-allocated and mapped to local PASIDs post 
> migration
> ioctl(ioasid_fd, IOASID_HWID_SET_GLOBAL_MIN, 1024);

I'm still not sold that ranges are the best idea here, it just adds
more state that has to match during migration. Keeping the
global/local split per RID seems much cleaner to me

This is also why I don't really like having the global/local be global
to the ioasid either. It would be better to specify global/local as
part of each VFIO_ATTACH_IOASID so each device is moved to the correct
allocator.
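
Something along these lines, where the flag and the struct layout are
purely hypothetical:

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/types.h>

struct vfio_attach_ioasid {			/* hypothetical */
	__u32 argsz;
	__u32 flags;
#define VFIO_ATTACH_PASID_GLOBAL	(1 << 0)	/* ENQCMD-capable RID */
	__s32 ioasid;
};

static int attach_dev(int device_fd, int ioasid, bool enqcmd_capable)
{
	struct vfio_attach_ioasid attach = {
		.argsz  = sizeof(attach),
		.flags  = enqcmd_capable ? VFIO_ATTACH_PASID_GLOBAL : 0,
		.ioasid = ioasid,
	};

	/* the RID is moved onto the matching (global or per-RID) allocator */
	return ioctl(device_fd, VFIO_ATTACH_IOASID, &attach);
}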

> When considering SIOV/mdev there is no change to above uAPI sequence. 
> It's n/a for 1) as SIOV requires PASID table in HPA space, nor does it
> cause any change to 3) regarding to the split range scheme. The only
>  conceptual change is in 2), where although it's still "PASID per RID" the 
> PASIDs must be managed by host because the parent driver also allocates 
> PASIDs from per-RID space to mark mdev (RID+PASID). But this difference 
> doesn't change the uAPI flow - just treat user-provisioned PASID as 'virtual' 
> and then allocate a 'real' PASID at IOASID_SET_HWID. Later always use the 
> real one when programming PASID entry (IOASID_BIND_PGTABLE) or device 
> PASID register (converted in the mediation path).

It does need some user visible difference because SIOV/mdev is not
migratable. Only the kernel can select a PASID, userspace (and hence
the guest) shouldn't have the option to force a specific PASID as the
PASID space is shared across the entire RID to all VMs using the mdev.

I don't see any alternative to telling every part if the PASID is
going to be used by ENQCMD or not, too many important decisions rest
on this detail.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-11 Thread Liu Yi L
On Tue, 11 May 2021 09:10:03 +, Tian, Kevin wrote:

> > From: Jason Gunthorpe
> > Sent: Monday, May 10, 2021 8:37 PM
> >   
> [...] 
> > > gPASID!=hPASID has a problem when assigning a physical device which
> > > supports both shared work queue (ENQCMD with PASID in MSR)
> > > and dedicated work queue (PASID in device register) to a guest
> > > process which is associated to a gPASID. Say the host kernel has setup
> > > the hPASID entry with nested translation though /dev/ioasid. For
> > > shared work queue the CPU is configured to translate gPASID in MSR
> > > into **hPASID** before the payload goes out to the wire. However
> > > for dedicated work queue the device MMIO register is directly mapped
> > > to and programmed by the guest, thus containing a **gPASID** value
> > > implying DMA requests through this interface will hit IOMMU faults
> > > due to invalid gPASID entry. Having gPASID==hPASID is a simple
> > > workaround here. mdev doesn't have this problem because the
> > > PASID register is in emulated control-path thus can be translated
> > > to hPASID manually by mdev driver.  
> > 
> > This all must be explicit too.
> > 
> > If a PASID is allocated and it is going to be used with ENQCMD then
> > everything needs to know it is actually quite different than a PASID
> > that was allocated to be used with a normal SRIOV device, for
> > instance.
> > 
> > The former case can accept that the guest PASID is virtualized, while
> > the latter cannot.
> > 
> > This is also why PASID per RID has to be an option. When I assign a
> > full SRIOV function to the guest then that entire RID space needs to
> > also be assigned to the guest. Upon migration I need to take all the
> > physical PASIDs and rebuild them in another hypervisor exactly as is.
> > 
> > If you force all RIDs into a global PASID pool then normal SRIOV
> > migration w/PASID becomes impossible. ie ENQCMD breaks everything else
> > that should work.
> > 
> > This is why you need to sort all this out and why it feels like some
> > of the specs here have been mis-designed.
> > 
> > I'm not sure carving out ranges is really workable for migration.
> > 
> > I think the real answer is to carve out entire RIDs as being in the
> > global pool or not. Then the ENQCMD HW can be bundled together and
> > everything else can live in the natural PASID per RID world.
> >   
> 
> OK. Here is the revised scheme, making it explicit.
> 
> There are three scenarios to be considered:
> 
> 1) SR-IOV (AMD/ARM):
>   - "PASID per RID" with guest-allocated PASIDs;
>   - PASID table managed by guest (in GPA space);
>   - the entire PASID space delegated to guest;
>   - no need to explicitly register guest-allocated PASIDs to host;
>   - uAPI for attaching PASID table:
> 
> // set to "PASID per RID"
> ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);
> 
> // When Qemu captures a new PASID table through vIOMMU;
> pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> ioctl(device_fd, VFIO_ATTACH_IOASID, pasidtbl_ioasid);
> 
> // Set the PASID table to the RID associated with pasidtbl_ioasid;
> ioctl(ioasid_fd, IOASID_SET_PASID_TABLE, pasidtbl_ioasid, gpa_addr);
> 
> 2) SR-IOV, no ENQCMD (Intel):
>   - "PASID per RID" with guest-allocated PASIDs;
>   - PASID table managed by host (in HPA space);
>   - the entire PASID space delegated to guest too;
>   - host must be explicitly notified for guest-allocated PASIDs;
>   - uAPI for binding user-allocated PASIDs:
> 
> // set to "PASID per RID"
> ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);
> 
> // When Qemu captures a new PASID allocated through vIOMMU;

Is this achieved by VCMD or by capturing guest's PASID cache invalidation?

> pgtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> ioctl(device_fd, VFIO_ATTACH_IOASID, pgtbl_ioasid);
> 
> // Tell the kernel to associate pasid to pgtbl_ioasid in internal structure;
> // &pasid being a pointer due to a requirement in scenario-3
> ioctl(ioasid_fd, IOASID_SET_HWID, pgtbl_ioasid, &pasid);
> 
> // Set guest page table to the RID+pasid associated to pgtbl_ioasid
> ioctl(ioasid_fd, IOASID_BIND_PGTABLE, pgtbl_ioasid, gpa_addr);
> 
> 3) SRIOV, ENQCMD (Intel):
>   - "PASID global" with host-allocated PASIDs;
>   - PASID table managed by host (in HPA space);
>   - all RIDs bound to this ioasid_fd use the global pool;
>   - however, exposing global PASID into guest breaks migration;
>   - hybrid scheme: split local PASID range and global PASID range;
>   - force guest to use only local PASID range (through vIOMMU);
>   - for ENQCMD, configure CPU to translate local->global;
>   - for non-ENQCMD, setup both local/global pasid entries;
>   - uAPI for range split and CPU pasid mapping:
> 
> // set to "PASID global"
> ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_GLOBAL);
> 
> // split local/global range, applying to 

RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-11 Thread Tian, Kevin
> From: Jason Gunthorpe
> Sent: Monday, May 10, 2021 8:37 PM
> 
[...] 
> > gPASID!=hPASID has a problem when assigning a physical device which
> > supports both shared work queue (ENQCMD with PASID in MSR)
> > and dedicated work queue (PASID in device register) to a guest
> > process which is associated to a gPASID. Say the host kernel has setup
> > the hPASID entry with nested translation through /dev/ioasid. For
> > shared work queue the CPU is configured to translate gPASID in MSR
> > into **hPASID** before the payload goes out to the wire. However
> > for dedicated work queue the device MMIO register is directly mapped
> > to and programmed by the guest, thus containing a **gPASID** value
> > implying DMA requests through this interface will hit IOMMU faults
> > due to invalid gPASID entry. Having gPASID==hPASID is a simple
> > workaround here. mdev doesn't have this problem because the
> > PASID register is in emulated control-path thus can be translated
> > to hPASID manually by mdev driver.
> 
> This all must be explicit too.
> 
> If a PASID is allocated and it is going to be used with ENQCMD then
> everything needs to know it is actually quite different than a PASID
> that was allocated to be used with a normal SRIOV device, for
> instance.
> 
> The former case can accept that the guest PASID is virtualized, while
> the latter cannot.
> 
> This is also why PASID per RID has to be an option. When I assign a
> full SRIOV function to the guest then that entire RID space needs to
> also be assigned to the guest. Upon migration I need to take all the
> physical PASIDs and rebuild them in another hypervisor exactly as is.
> 
> If you force all RIDs into a global PASID pool then normal SRIOV
> migration w/PASID becomes impossible. ie ENQCMD breaks everything else
> that should work.
> 
> This is why you need to sort all this out and why it feels like some
> of the specs here have been mis-designed.
> 
> I'm not sure carving out ranges is really workable for migration.
> 
> I think the real answer is to carve out entire RIDs as being in the
> global pool or not. Then the ENQCMD HW can be bundled together and
> everything else can live in the natural PASID per RID world.
> 

OK. Here is the revised scheme, making it explicit.

There are three scenarios to be considered:

1) SR-IOV (AMD/ARM):
- "PASID per RID" with guest-allocated PASIDs;
- PASID table managed by guest (in GPA space);
- the entire PASID space delegated to guest;
- no need to explicitly register guest-allocated PASIDs to host;
- uAPI for attaching PASID table:

// set to "PASID per RID"
ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);

// When Qemu captures a new PASID table through vIOMMU;
pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
ioctl(device_fd, VFIO_ATTACH_IOASID, pasidtbl_ioasid);

// Set the PASID table to the RID associated with pasidtbl_ioasid;
ioctl(ioasid_fd, IOASID_SET_PASID_TABLE, pasidtbl_ioasid, gpa_addr);

2) SR-IOV, no ENQCMD (Intel):
- "PASID per RID" with guest-allocated PASIDs;
- PASID table managed by host (in HPA space);
- the entire PASID space delegated to guest too;
- host must be explicitly notified for guest-allocated PASIDs;
- uAPI for binding user-allocated PASIDs:

// set to "PASID per RID"
ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);

// When Qemu captures a new PASID allocated through vIOMMU;
pgtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
ioctl(device_fd, VFIO_ATTACH_IOASID, pgtbl_ioasid);

// Tell the kernel to associate pasid to pgtbl_ioasid in internal structure;
// &pasid being a pointer due to a requirement in scenario-3
ioctl(ioasid_fd, IOASID_SET_HWID, pgtbl_ioasid, &pasid);

// Set guest page table to the RID+pasid associated to pgtbl_ioasid
ioctl(ioasid_fd, IOASID_BIND_PGTABLE, pgtbl_ioasid, gpa_addr);

3) SRIOV, ENQCMD (Intel):
- "PASID global" with host-allocated PASIDs;
- PASID table managed by host (in HPA space);
- all RIDs bound to this ioasid_fd use the global pool;
- however, exposing global PASID into guest breaks migration;
- hybrid scheme: split local PASID range and global PASID range;
- force guest to use only local PASID range (through vIOMMU);
- for ENQCMD, configure CPU to translate local->global;
- for non-ENQCMD, setup both local/global pasid entries;
- uAPI for range split and CPU pasid mapping:

// set to "PASID global"
ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_GLOBAL);

// split local/global range, applying to all RIDs in this fd
// Example: local [0, 1024), global [1024, max)
// local PASID range is managed by guest and migrated as VM state
// global PASIDs are re-allocated and mapped to local PASIDs post migration
ioctl(ioasid_fd, IOASID_HWID_SET_GLOBAL_MIN, 1024);

// When Qemu captures a new 

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-10 Thread Jacob Pan
Hi Jason,

On Mon, 10 May 2021 20:45:00 -0300, Jason Gunthorpe  wrote:

> On Mon, May 10, 2021 at 03:28:54PM -0700, Jacob Pan wrote:
> 
> > To satisfy your "give me a PASID for this RID" proposal, can we just use
> > the RID's struct device as the token? Also add a type field to
> > explicitly indicate global vs per-set(per-RID). i.e.  
> 
> You've got it backwards, the main behavior should be to allocate PASID
> per RID.
> 
Sure, we can make the local PASID the default. My point was that the
ioasid_set infrastructure's opaque token can support a RID-local allocation
scheme. Anyway, this is a small detail compared to the uAPI.

> The special behavior is to bundle a bunch of PASIDs into a grouping
> and then say the PASID number space is shared between all the group
> members. 
> 
> /dev/ioasid should create and own this grouping either implicitly or
> explicitly. Jumping ahead to in-kernel APIs has missed the critical
> step of defining the uAPI and all the behaviors together in a
> completed RFC proposal.
> 
Agreed, the requirements for kernel API should come from uAPI.

> Jason


Thanks,

Jacob
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-10 Thread Jason Gunthorpe
On Mon, May 10, 2021 at 03:28:54PM -0700, Jacob Pan wrote:

> To satisfy your "give me a PASID for this RID" proposal, can we just use
> the RID's struct device as the token? Also add a type field to explicitly
> indicate global vs per-set(per-RID). i.e.

You've got it backwards, the main behavior should be to allocate PASID
per RID.

The special behavior is to bundle a bunch of PASIDs into a grouping
and then say the PASID number space is shared between all the group
members. 

/dev/ioasid should create and own this grouping either implicitly or
explicitly. Jumping ahead to in-kernel APIs has missed the critical
step of defining the uAPI and all the behaviors together in a
completed RFC proposal.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-10 Thread Jacob Pan
Hi Jason,

On Mon, 10 May 2021 13:39:56 -0300, Jason Gunthorpe  wrote:

> I still think it is smarter to push a group of RID's into a global
> allocation group and accept there are potential downsides with that
> than to try to force a global allocation group on every RID and then
> try to fix the mess that makes for non-ENQCMD devices.
The proposed ioasid_set change in this set has a token for each set of
IOASIDs.

/**
 * struct ioasid_set - Meta data about ioasid_set
 * @nh: List of notifiers private to that set
 * @xa: XArray to store ioasid_set private IDs, can be used for
 *  guest-host IOASID mapping, or just a private IOASID namespace.
 * @token:  Unique to identify an IOASID set
 * @type:   Token types
 * @quota:  Max number of IOASIDs can be allocated within the set
 * @nr_ioasids: Number of IOASIDs currently allocated in the set
 * @id: ID of the set
 */
struct ioasid_set {
	struct atomic_notifier_head nh;
	struct xarray xa;
	void *token;
	int type;
	int quota;
	atomic_t nr_ioasids;
	int id;
	struct rcu_head rcu;
	struct misc_cg *misc_cg; /* For misc cgroup accounting */
};

To satisfy your "give me a PASID for this RID" proposal, can we just use
the RID's struct device as the token? Also add a type field to explicitly
indicate global vs per-set(per-RID). i.e.

ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
		      int type, void *private)

Where type can be:

enum ioasid_hwid_type {
	IOASID_HWID_GLOBAL,
	IOASID_HWID_PER_SET,
};
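
Purely for illustration, a caller might use the two types with the prototype
above like this (the set handles and the PASID_MIN/PASID_MAX bounds are
placeholders, not real symbols):

/* Sketch only, assuming the ioasid_alloc() prototype proposed above.
 * 'host_set', 'rid_set', PASID_MIN and PASID_MAX are placeholders. */

/* Global hw PASID: usable with ENQCMD, unique across every RID. */
ioasid_t enqcmd_pasid = ioasid_alloc(host_set, PASID_MIN, PASID_MAX,
				     IOASID_HWID_GLOBAL, NULL);

/* Per-set (per-RID) hw PASID: only unique within this RID's PASID table,
 * so it can be guest-chosen and rebuilt as-is on the migration target. */
ioasid_t local_pasid = ioasid_alloc(rid_set, PASID_MIN, PASID_MAX,
				    IOASID_HWID_PER_SET, NULL);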

We are really talking about the HW IOASID, just a reminder.

Thanks,

Jacob
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-10 Thread Jason Gunthorpe
On Mon, May 10, 2021 at 09:22:12AM -0700, Raj, Ashok wrote:

> Those vIOMMU's will have a capability that it supports PASID allocation
> interface. When you allocate you can say what type of PASID you need
> (shared vs local) and Qemu will obtain either within the local range, or in
> the shared range.

Isn't this what I've been saying? This all has to be explicit, and it
is all local to the iommu driver. At worst we'd have some single
global API 'get me a global PASID' which co-ordinates with all the
iommu drivers to actually implement it.

> > > When we have both SRIOV and shared WQ exposed to the same guest, we 
> > > do have an issue. The simplest way that I thought was to have a guest 
> > > and host PASID separation.  Where the guest has its own PASID  space 
> > > and host has its own carved out.  Guest can do whatever it wants within 
> > > that allocated space without fear of any collision with any other device. 
> > 
> > And how do you reliably migrate if the target kernel has a PASID
> > already allocated in that range?
> 
> For any shared range, remember there is a mapping table. And since those
> ranges are always reserved in the new host it isn't a problem.

It is a smaller problem - all the ranges still need to match between
hosts and so forth. It is tractable but this all needs to be API'd
properly and nothing can be implicit, including the global/local range
split.

Basically all this needs to come through in your /dev/ioasid API RFC
proposal that I hope is being worked on.

I still think it is smarter to push a group of RID's into a global
allocation group and accept there are potential downsides with that
than to try to force a global allocation group on every RID and then
try to fix the mess that makes for non-ENQCMD devices.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-10 Thread Raj, Ashok
On Mon, May 10, 2021 at 12:31:11PM -0300, Jason Gunthorpe wrote:
> On Mon, May 10, 2021 at 08:25:02AM -0700, Raj, Ashok wrote:
> 
> > Global PASID doesn't break anything, but giving that control to vIOMMU
> > doesn't seem right. When we have mixed uses cases like hardware that
> > supports shared wq and SRIOV devices that need PASIDs we need to 
> > comprehend how they will work without having a backend to migrate PASIDs 
> > to new destination.
> 
> Why wouldn't there be a backend? SRIOV live migration is a real thing
> now (see Max's VFIO patches). The PASID space of the entire dedicated
> RID needs to be migratable, which means the destination vIOMMU must be
> able to program its local hardware with the same PASID numbers and any
> kind of global PASID scheme at all will interfere with it.

The way I'm imagining it works is as follows. We have 2 types of platforms.
Let me know if I missed something.

- no shared wq, meaning RID local PASID allocation is all that's needed.
  Say platform type p1
- Shared WQ configurations that require global PASIDs.
  Say platform type p2

The vIOMMU might have a capability bit to indicate whether it supports PASID
allocation. If there is none, the guest is free to obtain its own PASIDs per
RID since they are fully isolated. For p1 type platforms this should work:
no Qemu interface or /dev/iommu is even required for that instance, and the
guest kernel can manage it fully per-RID.

Platform type p2 has both SIOV (with ENQCMD) and SRIOV that requires
PASIDs. My thought was to reserve some number of PASIDs for per-RID
use. When you request a RID-local PASID you get one in that range; when you
ask for a global one, you get one in the upper PASID range.

Say 0-4k is reserved for any RID-local allocation. This is also the guest
PASID range. 4k->2^19 are for shared WQ. (I'm not implying the size, but
just saying for discussion's sake that we need a separation.)

Those vIOMMUs will have a capability indicating that they support a PASID
allocation interface. When you allocate you can say what type of PASID you
need (shared vs local) and Qemu will obtain one either within the local range
or in the shared range. When a PASID is allocated in the shared range, you
still end up using one in the guest PASID range, but mapped to a different
host PASID using the VMCS-like PASID redirection, which is per-guest (not
per-RID).

Only the shared allocation requires the /dev/iommu interface. All allocation
in the guest range is fully under Qemu's control.
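
As a rough sketch of the split allocator described above (the 0-4k / 4k-2^19
boundaries are just the example numbers from this mail, and the IDA names are
invented):

/* Sketch only: boundaries follow the example above, and the IDA names
 * are made up. Local PASIDs come from a per-RID pool, shared-WQ PASIDs
 * from one system-wide pool above the local range. */
#include <linux/idr.h>

#define PASID_LOCAL_MAX		4096		/* [0, 4k): guest/RID-local */
#define PASID_GLOBAL_MAX	(1 << 19)	/* [4k, 2^19): shared-WQ pool */

static DEFINE_IDA(global_pasid_ida);

static int pasid_alloc_split(struct ida *rid_local_ida, bool global)
{
	if (global)
		return ida_alloc_range(&global_pasid_ida, PASID_LOCAL_MAX,
				       PASID_GLOBAL_MAX - 1, GFP_KERNEL);
	return ida_alloc_range(rid_local_ida, 0, PASID_LOCAL_MAX - 1,
			       GFP_KERNEL);
}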

To support migration, the target also needs to have this capability.
> 
> > When we have both SRIOV and shared WQ exposed to the same guest, we 
> > do have an issue. The simplest way that I thought was to have a guest 
> > and host PASID separation.  Where the guest has its own PASID  space 
> > and host has its own carved out.  Guest can do whatever it wants within 
> > that allocated space without fear of any collision with any other device. 
> 
> And how do you reliably migrate if the target kernel has a PASID
> already allocated in that range?

For any shared range, remember there is a mapping table. And since those
ranges are always reserved in the new host it isn't a problem. For a shared
WQ that requires a PASID remapping to a new host PASID, the per-guest VMCS
remapping that does gPASID->hPASID does this job. So the guest PASID
remains unchanged.

Does this make sense? 

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-10 Thread Jason Gunthorpe
On Mon, May 10, 2021 at 08:25:02AM -0700, Raj, Ashok wrote:

> Global PASID doesn't break anything, but giving that control to vIOMMU
> doesn't seem right. When we have mixed uses cases like hardware that
> supports shared wq and SRIOV devices that need PASIDs we need to 
> comprehend how they will work without having a backend to migrate PASIDs 
> to new destination.

Why wouldn't there be a backend? SRIOV live migration is a real thing
now (see Max's VFIO patches). The PASID space of the entire dedicated
RID needs to be migratable, which means the destination vIOMMU must be
able to program its local hardware with the same PASID numbers and any
kind of global PASID scheme at all will interfere with it.

> When we have both SRIOV and shared WQ exposed to the same guest, we 
> do have an issue. The simplest way that I thought was to have a guest 
> and host PASID separation.  Where the guest has its own PASID  space 
> and host has its own carved out.  Guest can do whatever it wants within 
> that allocated space without fear of any collision with any other device. 

And how do you reliably migrate if the target kernel has a PASID
already allocated in that range?

ENQCMD must not assume it is the only thing in the platform. It needs
to be compartmentalized to specific participating RIDs and made
explicit because it has a bad special requirement for cross-device
PASIDs.

Jason 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-10 Thread Raj, Ashok
On Mon, May 10, 2021 at 09:37:29AM -0300, Jason Gunthorpe wrote:
> On Sat, May 08, 2021 at 09:56:59AM +, Tian, Kevin wrote:
> > > From: Raj, Ashok 
> > > Sent: Friday, May 7, 2021 12:33 AM
> > > 
> > > > Basically it means when the guest's top level IOASID is created for
> > > > nesting that IOASID claims all PASID's on the RID and excludes any
> > > > PASID IOASIDs from existing on the RID now or in future.
> > > 
> > > The way to look at it this is as follows:
> > > 
> > > For platforms that do not have a need to support shared work queue model
> > > support for ENQCMD or similar, PASID space is naturally per RID. There is 
> > > no
> > > complication with this. Every RID has the full range of PASID's and no 
> > > need
> > > for host to track which PASIDs are allocated now or in future in the 
> > > guest.
> > > 
> > > For platforms that support ENQCMD, it is required to mandate PASIDs are
> > > global across the entire system. Maybe its better to call them gPASID for
> > > guest and hPASID for host. Short reason being gPASID->hPASID is a guest
> > > wide mapping for ENQCMD and not a per-RID based mapping. (We covered
> > > that
> > > in earlier responses)
> > > 
> > > In our current implementation we actually don't separate this space, and
> > > gPASID == hPASID. The iommu driver enforces that by using the custom
> > > allocator and the architected interface that allows all guest vIOMMU
> > > allocations to be proxied to host. Nothing but a glorified hypercall like
> > > interface. In fact some OS's do use hypercall to get a hPASID vs using
> > > the vCMD style interface.
> > > 
> > 
> > After more thinking about the new interface, I feel gPASID==hPASID 
> > actually causes some confusion in uAPI design. In concept an ioasid
> > is not active until it's attached to a device, because it's just an ID
> > if w/o a device. So supposedly an ioasid should reject all user commands
> > before attach. However a guest likely asks for a new gPASID before
> > attaching it to devices and vIOMMU. if gPASID==hPASID then Qemu 
> > must request /dev/ioasid to allocate a hw_id for an ioasid which hasn't 
> > been attached to any device, with the assumption on kernel knowledge 
> > that this hw_id is from an global allocator w/o dependency on any 
> > device. This doesn't sound a clean design, not to say it also conflicts 
> > with live migration.
> 
> Everything must be explicit. The situation David pointed to of
> qemu emulating a vIOMMU while running on a host with a different
> platform/physical IOMMU must be considered.
> 
> If the vIOMMU needs specific behavior it must use /dev/iommu to ask
> for it specifically and not just make wild assumptions about how the
> platform works.

I think the right way is for the pIOMMU to enforce the right behavior. The
vIOMMU can ask for a PASID and the physical IOMMU driver would give what is
optimal for the platform. If the vIOMMU asks for a per-device PASID but that
can lead to conflicts in the PASID name space, it's best to avoid it.

Global PASID doesn't break anything, but giving that control to the vIOMMU
doesn't seem right. When we have mixed use cases like hardware that
supports shared WQ and SRIOV devices that need PASIDs, we need to
comprehend how they will work without having a backend to migrate PASIDs
to a new destination.

For ENQCMD we have the gPASID->hPASID translation in the VMCS control.
For devices that support SIOV, programming a PASID to a device is also
mediated, so it's possible for something like the mediated interface to
assist with that migration for the dedicated WQ.


When we have both SRIOV and shared WQ exposed to the same guest, we
do have an issue. The simplest way I thought of was to have a guest
and host PASID separation, where the guest has its own PASID space
and the host has its own carved out. The guest can do whatever it wants
within that allocated space without fear of any collision with any other
device.

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-10 Thread Jason Gunthorpe
On Sat, May 08, 2021 at 09:56:59AM +, Tian, Kevin wrote:
> > From: Raj, Ashok 
> > Sent: Friday, May 7, 2021 12:33 AM
> > 
> > > Basically it means when the guest's top level IOASID is created for
> > > nesting that IOASID claims all PASID's on the RID and excludes any
> > > PASID IOASIDs from existing on the RID now or in future.
> > 
> > The way to look at it this is as follows:
> > 
> > For platforms that do not have a need to support shared work queue model
> > support for ENQCMD or similar, PASID space is naturally per RID. There is no
> > complication with this. Every RID has the full range of PASID's and no need
> > for host to track which PASIDs are allocated now or in future in the guest.
> > 
> > For platforms that support ENQCMD, it is required to mandate PASIDs are
> > global across the entire system. Maybe its better to call them gPASID for
> > guest and hPASID for host. Short reason being gPASID->hPASID is a guest
> > wide mapping for ENQCMD and not a per-RID based mapping. (We covered
> > that
> > in earlier responses)
> > 
> > In our current implementation we actually don't separate this space, and
> > gPASID == hPASID. The iommu driver enforces that by using the custom
> > allocator and the architected interface that allows all guest vIOMMU
> > allocations to be proxied to host. Nothing but a glorified hypercall like
> > interface. In fact some OS's do use hypercall to get a hPASID vs using
> > the vCMD style interface.
> > 
> 
> After more thinking about the new interface, I feel gPASID==hPASID 
> actually causes some confusion in uAPI design. In concept an ioasid
> is not active until it's attached to a device, because it's just an ID
> if w/o a device. So supposedly an ioasid should reject all user commands
> before attach. However a guest likely asks for a new gPASID before
> attaching it to devices and vIOMMU. if gPASID==hPASID then Qemu 
> must request /dev/ioasid to allocate a hw_id for an ioasid which hasn't 
> been attached to any device, with the assumption on kernel knowledge 
> that this hw_id is from an global allocator w/o dependency on any 
> device. This doesn't sound a clean design, not to say it also conflicts 
> with live migration.

Everything must be explicit. The situation David pointed to of
qemu emulating a vIOMMU while running on a host with a different
platform/physical IOMMU must be considered.

If the vIOMMU needs specific behavior it must use /dev/iommu to ask
for it specifically and not just make wild assumptions about how the
platform works.

> gPASID!=hPASID has a problem when assigning a physical device which 
> supports both shared work queue (ENQCMD with PASID in MSR) 
> and dedicated work queue (PASID in device register) to a guest
> process which is associated to a gPASID. Say the host kernel has setup
> the hPASID entry with nested translation through /dev/ioasid. For 
> shared work queue the CPU is configured to translate gPASID in MSR 
> into **hPASID** before the payload goes out to the wire. However 
> for dedicated work queue the device MMIO register is directly mapped 
> to and programmed by the guest, thus containing a **gPASID** value
> implying DMA requests through this interface will hit IOMMU faults
> due to invalid gPASID entry. Having gPASID==hPASID is a simple 
> workaround here. mdev doesn't have this problem because the
> PASID register is in emulated control-path thus can be translated
> to hPASID manually by mdev driver.

This all must be explicit too.

If a PASID is allocated and it is going to be used with ENQCMD then
everything needs to know it is actually quite different than a PASID
that was allocated to be used with a normal SRIOV device, for
instance.

The former case can accept that the guest PASID is virtualized, while
the latter cannot.

This is also why PASID per RID has to be an option. When I assign a
full SRIOV function to the guest then that entire RID space needs to
also be assigned to the guest. Upon migration I need to take all the
physical PASIDs and rebuild them in another hypervisor exactly as is.

If you force all RIDs into a global PASID pool then normal SRIOV
migration w/PASID becomes impossible. ie ENQCMD breaks everything else
that should work.

This is why you need to sort all this out and why it feels like some
of the specs here have been mis-designed.

I'm not sure carving out ranges is really workable for migration.

I think the real answer is to carve out entire RIDs as being in the
global pool or not. Then the ENQCMD HW can be bundled together and
everything else can live in the natural PASID per RID world.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-09 Thread Lu Baolu

On 5/8/21 3:31 PM, Tian, Kevin wrote:

From: Alex Williamson
Sent: Saturday, May 8, 2021 1:06 AM


Those are the main ones I can think of.  It is nice to have a simple
map/unmap interface, I'd hope that a new /dev/ioasid interface wouldn't
raise the barrier to entry too high, but the user needs to have the
ability to have more control of their mappings and locked page
accounting should probably be offloaded somewhere.  Thanks,


Based on your feedbacks I feel it's probably reasonable to start with
a type1v2 semantics for the new interface. Locked accounting could
also start with the same VFIO restriction and then improve it
incrementally, if a cleaner way is intrusive (if not affecting uAPI).
But I didn't get the suggestion on "more control of their mappings".
Can you elaborate?

Things like I note above, userspace cannot currently specify mapping
granularity nor has any visibility to the granularity they get from the
IOMMU.  What actually happens in the IOMMU is pretty opaque to the user
currently.  Thanks,


It's much clearer. Based on all the discussions so far I'm thinking about
a staging approach when building the new interface, basically following
the model that Jason pointed out - generic stuff first, then platform
specific extension:

Phase 1: /dev/ioasid with core ingredients and vfio type1v2 semantics
 - ioasid is the software handle representing an I/O page table


A trivial proposal: is it possible to use /dev/ioas? Conceptually, it's
an IO address space representation and has nothing to do with any ID.

Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-08 Thread Tian, Kevin
> From: Raj, Ashok 
> Sent: Friday, May 7, 2021 12:33 AM
> 
> > Basically it means when the guest's top level IOASID is created for
> > nesting that IOASID claims all PASID's on the RID and excludes any
> > PASID IOASIDs from existing on the RID now or in future.
> 
> The way to look at it this is as follows:
> 
> For platforms that do not have a need to support shared work queue model
> support for ENQCMD or similar, PASID space is naturally per RID. There is no
> complication with this. Every RID has the full range of PASID's and no need
> for host to track which PASIDs are allocated now or in future in the guest.
> 
> For platforms that support ENQCMD, it is required to mandate PASIDs are
> global across the entire system. Maybe its better to call them gPASID for
> guest and hPASID for host. Short reason being gPASID->hPASID is a guest
> wide mapping for ENQCMD and not a per-RID based mapping. (We covered
> that
> in earlier responses)
> 
> In our current implementation we actually don't separate this space, and
> gPASID == hPASID. The iommu driver enforces that by using the custom
> allocator and the architected interface that allows all guest vIOMMU
> allocations to be proxied to host. Nothing but a glorified hypercall like
> interface. In fact some OS's do use hypercall to get a hPASID vs using
> the vCMD style interface.
> 

After thinking more about the new interface, I feel gPASID==hPASID
actually causes some confusion in the uAPI design. In concept an ioasid
is not active until it's attached to a device, because it's just an ID
if w/o a device. So supposedly an ioasid should reject all user commands
before attach. However a guest likely asks for a new gPASID before
attaching it to devices and the vIOMMU. If gPASID==hPASID then Qemu
must request /dev/ioasid to allocate a hw_id for an ioasid which hasn't
been attached to any device, with the assumption (based on kernel knowledge)
that this hw_id comes from a global allocator w/o dependency on any
device. This doesn't sound like a clean design, not to mention that it also
conflicts with live migration.

I want to hear your and Jason's opinion about an alternative option to
remove such a restriction, thus allowing gPASID!=hPASID.

gPASID!=hPASID has a problem when assigning a physical device which 
supports both shared work queue (ENQCMD with PASID in MSR) 
and dedicated work queue (PASID in device register) to a guest
process which is associated with a gPASID. Say the host kernel has set up
the hPASID entry with nested translation through /dev/ioasid. For 
shared work queue the CPU is configured to translate gPASID in MSR 
into **hPASID** before the payload goes out to the wire. However 
for dedicated work queue the device MMIO register is directly mapped 
to and programmed by the guest, thus containing a **gPASID** value
implying DMA requests through this interface will hit IOMMU faults
due to invalid gPASID entry. Having gPASID==hPASID is a simple 
workaround here. mdev doesn't have this problem because the
PASID register is in emulated control-path thus can be translated
to hPASID manually by mdev driver.

Along these lines, one possible option is having both the gPASID and hPASID
entries point to the same paging structure, sort of making gPASID
an aliasing hw_id for hPASID. Then we also need to make sure the gPASID
range does not collide with the hPASID range for this RID. Something like
below:

In the beginning Qemu specifies a minimal ID (say 1024) that hPASIDs 
must be allocated beyond (sort of delegating [0, 1023] of this RID to
userspace):
ioctl(ioasid_fd, SET_IOASID_MIN_HWID, 1024);

The guest still uses the vIOMMU interface or a hypercall to allocate gPASIDs.
Upon such a request, Qemu returns a gPASID from [0, 1023] to the guest
and also allocates a new ioasid from /dev/ioasid. There is no hw_id
allocated at this step:
ioasid = ioctl(ioasid_fd, ALLOC_IOASID);

hw_id (hPASID) is allocated when attaching ioasid to the said device:
ioctl(device_fd, VFIO_ATTACH_IOASID, ioasid);

Then gPASID is provided as an aliasing hwid to this ioasid:
ioctl(device_fd, VFIO_ALIASING_IOASID, ioasid, gPASID);

Starting from this point the kernel should make sure that any ioasid
operation is applied to both gPASID and hPASID for this RID
(entry setup, tear down, etc.) and that both PASID entries point to the same
paging structures. When a page fault happens, the IOMMU driver
should also link a fault from either PASID back to the associated ioasid.
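
A rough sketch of the kernel-side bookkeeping this implies; every name below
is hypothetical and only illustrates the "mirror every update to both entries"
rule:

/* Hypothetical sketch - none of these structures or helpers exist. */
struct ioasid_alias {
	u32	hpasid;		/* host-allocated at VFIO_ATTACH_IOASID */
	u32	gpasid;		/* guest-chosen, set via VFIO_ALIASING_IOASID */
	bool	has_gpasid;
};

/* Any PASID-table update for this ioasid is mirrored to both entries so
 * the ENQCMD path (hPASID) and the guest-programmed MMIO path (gPASID)
 * reach the same paging structures. pasid_entry_install() is invented. */
static int ioasid_install_entry(struct device *dev, struct ioasid_alias *a,
				void *pgtable)
{
	int ret = pasid_entry_install(dev, a->hpasid, pgtable);

	if (!ret && a->has_gpasid)
		ret = pasid_entry_install(dev, a->gpasid, pgtable);
	return ret;
}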

As explained earlier this aliasing requirement only applies to physical
devices on the Intel platform. We may either have mdev ignore such an
aliasing request, or have the vfio device report whether aliasing is allowed.

This is sort of a hybrid model: the gPASID range is reserved locally
in the per-RID PASID space and delegated to userspace, while the hPASIDs
are still managed globally and not exposed to userspace.

Does it sound like a cleaner approach (still w/ some complexity) compared
to the restrictions of having gPASID==hPASID?

Thanks
Kevin

RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-08 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Saturday, May 8, 2021 1:06 AM
> 
> > > Those are the main ones I can think of.  It is nice to have a simple
> > > map/unmap interface, I'd hope that a new /dev/ioasid interface wouldn't
> > > raise the barrier to entry too high, but the user needs to have the
> > > ability to have more control of their mappings and locked page
> > > accounting should probably be offloaded somewhere.  Thanks,
> > >
> >
> > Based on your feedbacks I feel it's probably reasonable to start with
> > a type1v2 semantics for the new interface. Locked accounting could
> > also start with the same VFIO restriction and then improve it
> > incrementally, if a cleaner way is intrusive (if not affecting uAPI).
> > But I didn't get the suggestion on "more control of their mappings".
> > Can you elaborate?
> 
> Things like I note above, userspace cannot currently specify mapping
> granularity nor has any visibility to the granularity they get from the
> IOMMU.  What actually happens in the IOMMU is pretty opaque to the user
> currently.  Thanks,
> 

It's much clearer. Based on all the discussions so far I'm thinking about
a staging approach when building the new interface, basically following
the model that Jason pointed out - generic stuff first, then
platform-specific extensions:

Phase 1: /dev/ioasid with core ingredients and vfio type1v2 semantics
- ioasid is the software handle representing an I/O page table
- uAPI accepts type1v2 map/unmap semantics per ioasid (strawman sketch after the phase list below)
- helpers for VFIO/VDPA to bind ioasid_fd and attach ioasids
- multiple ioasids are allowed without nesting (vIOMMU, or devices
w/ incompatible iommu attributes)
- an ioasid disallows any operation before it's attached to a device
- an ioasid inherits iommu attributes from the 1st device attached
to it
- userspace is expected to manage hardware restrictions and the
kernel only returns error when restrictions are broken
* map/unmap on an ioasid will fail before every device in a group 
is attached to it
* ioasid attach will fail if the new device has iommu attributes
incompatible with those of this ioasid
- thus no group semantics in uAPI
- no change to vfio container/group/type1 logic, for running existing
vfio applications
* imply some duplication between vfio type1 and ioasid for some time
- new uAPI in vfio to allow explicit opening of a device and then binding
it to the ioasid_fd
* possibly require each device exposed in /dev/vfio/
- support both pdev and mdev

Phase 2: ioasid nesting
- Allow bind/unbind_pgtable semantics per ioasid
- Allow ioasid nesting 
* HW ioasid nesting if supported by platform
* otherwise fall back to SW ioasid nesting (in-kernel shadowing)
- iotlb invalidation per ioasid
- I/O page fault handling per ioasid
- hw_id is not exposed in uAPI. Vendor IOMMU driver decides
when/how hw_id is allocated and programmed properly

Phase3: optimizations and vendor extensions (order undefined, up to
the specific feature owner):
- (Intel) ENQCMD support with hw_id exposure in uAPI
- (ARM/AMD) RID-based pasid table assignment
- (PPC) window-based iova management
- Optimizations:
* replace vfio type1 with a shim driver to use ioasid backend
* mapping granularity
* HW dirty page tracking
* ...
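
As a strawman for the Phase 1 map/unmap item above (struct layout and ioctl
numbers are invented here, not taken from any posted patch), the uAPI could be
as small as:

/* Strawman only - invented for discussion, not from any posted patch.
 * Type1v2-style map/unmap, but keyed by the target ioasid. */
#include <linux/types.h>
#include <linux/ioctl.h>

struct ioasid_dma_map {
	__u32	argsz;
	__u32	flags;		/* e.g. read/write permission bits */
	__u32	ioasid;		/* which I/O page table to populate */
	__u32	__reserved;
	__u64	iova;
	__u64	vaddr;
	__u64	size;
};

#define IOASID_MAP_DMA		_IOW('i', 0x40, struct ioasid_dma_map)
#define IOASID_UNMAP_DMA	_IOW('i', 0x41, struct ioasid_dma_map)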

Does the above sound like a sensible plan? If yes we'll start working on
phase 1 then...

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-08 Thread Liu Yi L
Hi Jason,

On Wed, 5 May 2021 19:21:20 -0300, Jason Gunthorpe wrote:

> On Wed, May 05, 2021 at 01:04:46PM -0700, Jacob Pan wrote:
> > Hi Jason,
> > 
> > On Wed, 5 May 2021 15:00:23 -0300, Jason Gunthorpe  wrote:
> >   
> > > On Wed, May 05, 2021 at 10:22:59AM -0700, Jacob Pan wrote:
> > >   
> > > > Global and pluggable are for slightly separate reasons.
> > > > - We need global PASID on VT-d in that we need to support shared
> > > > workqueues (SWQ). E.g. One SWQ can be wrapped into two mdevs then
> > > > assigned to two VMs. Each VM uses its private guest PASID to submit
> > > > work but each guest PASID must be translated to a global (system-wide)
> > > > host PASID to avoid conflict. Also, since PASID table storage is per
> > > > PF, if two mdevs of the same PF are assigned to different VMs, the
> > > > PASIDs must be unique.
> > > 
> > > From a protocol perspective each RID has a unique PASID table, and
> > > RIDs can have overlapping PASIDs.
> > >   
> > True, per RID or per PF as I was referring to.
> >   
> > > Since your SWQ is connected to a single RID the requirement that
> > > PASIDs are unique to the RID ensures they are sufficiently unique.
> > >   
> > True, but one process can submit work to multiple mdevs from different
> > RIDs/PFs. One process uses one PASID and PASID translation table is per VM.
> > The same PASID is used for all the PASID tables of each RID.  
> 
> If the model is "assign this PASID to this RID" then yes, there is a
> big problem keeping everything straight that can only be solved with a
> global table.
> 
> But if the model is "give me a PASID for this RID" then it isn't such
> a problem.

Let me double-check that I'm understanding you correctly. Your suggestion
is to have a per-RID PASID namespace, which can be maintained by the IOMMU
driver, right? Take native SVM usage as an example: every time a process is
bound to a device, a PASID within this RID will be allocated. Am I correct
so far?

If yes, then there is a case in which IOTLB efficiency is really low. Let's say
there is a process bound to multiple devices (RIDs) with a different PASID
allocated for each RID. In such a case, the PASID values are different for each
RID. As most vendors do, the PASID will be used to tag IOTLB entries. So in such
a case, there will be multiple IOTLB entries for a single VA->PA mapping, and
the number of such duplicate IOTLB entries increases linearly with the number
of devices. That seems not good from a performance perspective.

> 
> Basically trying to enforce a uniform PASID for an IOASID across all
> RIDs attached to it is not such a nice choice.
> 
> > > That is fine, but all this stuff should be inside the Intel vIOMMU
> > > driver not made into a global resource of the entire iommu subsystem.
> > >   
> > Intel vIOMMU has to use a generic uAPI to allocate PASID so the generic
> > code need to have this option. I guess you are saying we should also have a
> > per RID allocation option in addition to global?  
> 
> There always has to be a RID involvement for the PASID, for security,
> this issue really boils down to where the PASID lives.
> 
> If you need the PASID attached to the IOASID then it has to be global
> because the IOASID can be attached to any RID and must keep the same
> PASID.
> 
> If the PASID is learned when the IOASID is attached to a RID then the
> PASID is more flexible and isn't attached to the IOASID.
> 
> Honestly I'm a little leary to bake into a UAPI a specific HW choice
> that Intel made here.
> 
> I would advise making the "attach a global PASID to this IOASID"
> operation explicit and opt into for case that actually need it.
> 
> Which implies the API to the iommu driver should be more like:
> 
>   'assign an IOASID to this RID and return the PASID'
>   'reserve a PASID from every RID'
>   'assign an IOASID to this RID and use this specific PASID'
> 
> In all cases the scope of those operations are completely local to a
> certain IOMMU driver - 'reserve a PASID from every RID' is really
> every RID that driver can operate on.

Also, this reservation will fail if the PASID happens to be occupied
by previous usage. As the PASID translation table is per-VM, ENQCMD in a VM
will be a problem under such a PASID management model.
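
For reference, the three operations quoted above could be expressed as
driver-facing hooks roughly like this; the signatures are invented only to
make the discussion concrete:

/* Invented signatures - only to make the three quoted operations concrete. */
struct iommu_pasid_ops_sketch {
	/* 'assign an IOASID to this RID and return the PASID' */
	int (*attach_ioasid)(struct device *dev, u32 ioasid, u32 *out_pasid);

	/* 'reserve a PASID from every RID' (every RID this driver owns) */
	int (*reserve_pasid_all_rids)(u32 pasid);

	/* 'assign an IOASID to this RID and use this specific PASID' */
	int (*attach_ioasid_pasid)(struct device *dev, u32 ioasid, u32 pasid);
};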

> 
> So it is hard to see why the allocator should be a global resource and
> not something that is part of the iommu driver exclusively.
> 
> Jason

-- 
Regards,
Yi Liu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-08 Thread Tian, Kevin
> From: Jason Gunthorpe
> Sent: Saturday, May 8, 2021 1:11 AM
> 
> On Fri, May 07, 2021 at 11:06:14AM -0600, Alex Williamson wrote:
> 
> > We had tossed around an idea of a super-container with vfio, it's maybe
> > something we'd want to incorporate into this design.  For instance, if
> > memory could be pre-registered with a super container, which would
> > handle the locked memory accounting for that memory, then
> > sub-containers could all handle the IOMMU context of their sets of
> > devices relative to that common memory pool.
> 
> This is where I suggested to David to use nesting of IOASIDs.
> 
> Without HW support for nesting a SW nest is really just re-using the
> memory registration information stored in the parent when constructing
> the children
> 

Yes, this sounds like a sensible thing to do. It also unifies the user experience
regardless of whether the underlying hw supports nesting, e.g. when a
vIOMMU is present Qemu can always use the IOASID nesting uAPI. In the case
of SW nesting the kernel will merge the two-level translations from the two
IOASIDs into a one-level shadow page table (unlike today's VFIO, which has
userspace manage the shadow-based mapping).

But I want to remark that nesting IOASIDs alone cannot solve this accounting
problem completely, as long as a process is allowed to have multiple ioasid
FDs (unless there is a mechanism to allow nesting IOASIDs across FDs). But
this is probably not a big issue. With all the intended usages around the
new interface, I think for most applications one ioasid FD should be sufficient
to meet their requirements (multiple gpa_ioasids, ioasid nesting, etc.).

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-07 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, April 29, 2021 4:46 AM
> 
> > > I think the name IOASID is fine for the uAPI, the kernel version can
> > > be called ioasid_id or something.
> >
> > ioasid is already an id and then ioasid_id just adds confusion. Another
> > point is that ioasid is currently used to represent both PCI PASID and
> > ARM substream ID in the kernel. It implies that if we want to separate
> > ioasid and pasid in the uAPI the 'pasid' also needs to be replaced with
> > another general term usable for substream ID. Are we making the
> > terms too confusing here?
> 
> This is why I also am not so sure about exposing the PASID in the API
> because it is ultimately a HW specific item.
> 
> As I said to David, one avenue is to have some generic uAPI that is
> very general and keep all this deeply detailed stuff, that really only
> matters for qemu, as part of a more HW specific vIOMMU driver
> interface.
> 

OK, so the general uAPI will not expose hw_id and will just provide everything
generic for managing the I/O page table (map/unmap, nesting, etc.) through
the IOASID, and then a specific uAPI is provided to handle platform-specific
requirements (hw_id, iova windows, etc.).

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-07 Thread Jacob Pan
Hi Jason,

On Fri, 7 May 2021 16:28:10 -0300, Jason Gunthorpe  wrote:

> > The unanswered question is how do we plumb from vIOMMU without a custom
> > allocator to get a system wide PASID?   
> 
> PASID allocation is part of the iommu driver, it really shouldn't be
> global.
> 
In the current code, the pluggable custom allocator *is* part of the iommu
vendor driver. If it decides the allocation is global then it should be
suitable for the platform since there will never be a VT-d IOMMU on another
vendor's platform.

It is true that the default allocator is global, which suits the current
needs. I am just wondering if we are solving a problem that does not exist yet.

> When the architecture code goes to allocate a single PASID for the
> mm_struct it should flag that allocation request with a 'must work for
> all RIDs flag' and the iommu driver should take care of it. That might
> mean the iommu driver consults a global static xarray, or maybe it
> does a hypercall, but it should be done through that API, not a
> side-car global singleton.
Why do we need to flag the allocation every time if on a platform *every*
PASID can potentially be global? At the time of allocation, we don't know
if the PASID will be used for a shared (ENQCMD) or a dedicated workqueue.

Thanks,

Jacob
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-07 Thread Jason Gunthorpe
On Fri, May 07, 2021 at 12:23:25PM -0700, Raj, Ashok wrote:
> Hi Jason
> 
> - Removed lizefan's email due to bounces... 
> 
> On Fri, May 07, 2021 at 03:20:50PM -0300, Jason Gunthorpe wrote:
> > On Fri, May 07, 2021 at 11:14:58AM -0700, Raj, Ashok wrote:
> > > On Fri, May 07, 2021 at 02:20:51PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, May 06, 2021 at 09:32:40AM -0700, Raj, Ashok wrote:
> > > > 
> > > > > For platforms that support ENQCMD, it is required to mandate PASIDs 
> > > > > are
> > > > > global across the entire system. Maybe its better to call them gPASID 
> > > > > for
> > > > > guest and hPASID for host. Short reason being gPASID->hPASID is a 
> > > > > guest
> > > > > wide mapping for ENQCMD and not a per-RID based mapping. (We covered 
> > > > > that
> > > > > in earlier responses)
> > > > 
> > > > I don't think it is actually ENQCMD that forces this, ENQCMD can use a
> > > > per-RID PASID in the translation table as well.
> > > 
> > > When using ENQCMD the PASID that needs to be sent on the wire is picked
> > > from an MSR setup by kernel. This is context switched along with the
> > > process. So each process has only 1 PASID that can go out when using
> > > ENQCMD. ENQCMD takes one mmio address specific to the accelerator and a
> > > source for the descriptor.
> > 
> > Oh. I forgot this also globally locked the PASID to a single
> > MSR. Sigh. That makes the whole mechanism useless for anything except
> > whole process SVA.
> 
> Is there another kind of SVA? Our mapping from that each process requires a
> single mm, and PASID for SVM was a direct map from that. 

There are lots of potential applications for something like ENQCMD
that are not whole process SVA. Linking it to a single PASID basically
nukes any other use of it unfortunately.

> > I would have to ask for a PASID that has the property it needs. You
> > are saying the property is even bigger than "usable on a group of
> > RIDs" but is actually "global for every RID and IOMMU in the system so
> > it can go into a MSR". Gross, but fine, ask for that explicitly when
> > allocating the PASID.
> 
> If one process has a single mm, is that also gross? :-) So a single process
> having a PASID is just an identifier for IOMMU. It just seems like what a
> mm is for a process == PASID for SVM-IOMMU support.
> 
> The unanswered question is how do we plumb from vIOMMU without a custom
> allocator to get a system wide PASID? 

PASID allocation is part of the iommu driver, it really shouldn't be
global.

When the architecture code goes to allocate a single PASID for the
mm_struct it should flag that allocation request with a 'must work for
all RIDs' flag and the iommu driver should take care of it. That might
mean the iommu driver consults a global static xarray, or maybe it
does a hypercall, but it should be done through that API, not a
side-car global singleton.
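
A sketch of that call path; iommu_alloc_pasid() and the flag are made-up
names, the point is only that the "must work on every RID" requirement travels
as an explicit argument into the iommu driver:

/* Made-up names - only illustrates "flag the request, let the iommu
 * driver decide how to satisfy it" (static xarray, hypercall, ...). */
#define IOMMU_PASID_ALL_RIDS	(1u << 0)	/* must be usable on every RID */

static int arch_alloc_mm_pasid(struct mm_struct *mm)
{
	return iommu_alloc_pasid(mm, IOMMU_PASID_ALL_RIDS);
}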

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-07 Thread Raj, Ashok
Hi Jason

- Removed lizefan's email due to bounces... 

On Fri, May 07, 2021 at 03:20:50PM -0300, Jason Gunthorpe wrote:
> On Fri, May 07, 2021 at 11:14:58AM -0700, Raj, Ashok wrote:
> > On Fri, May 07, 2021 at 02:20:51PM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 06, 2021 at 09:32:40AM -0700, Raj, Ashok wrote:
> > > 
> > > > For platforms that support ENQCMD, it is required to mandate PASIDs are
> > > > global across the entire system. Maybe its better to call them gPASID 
> > > > for
> > > > guest and hPASID for host. Short reason being gPASID->hPASID is a guest
> > > > wide mapping for ENQCMD and not a per-RID based mapping. (We covered 
> > > > that
> > > > in earlier responses)
> > > 
> > > I don't think it is actually ENQCMD that forces this, ENQCMD can use a
> > > per-RID PASID in the translation table as well.
> > 
> > When using ENQCMD the PASID that needs to be sent on the wire is picked
> > from an MSR setup by kernel. This is context switched along with the
> > process. So each process has only 1 PASID that can go out when using
> > ENQCMD. ENQCMD takes one mmio address specific to the accelerator and a
> > source for the descriptor.
> 
> Oh. I forgot this also globally locked the PASID to a single
> MSR. Sigh. That makes the whole mechanism useless for anything except
> whole process SVA.

Is there another kind of SVA? Our mapping was that each process requires a
single mm, and the PASID for SVM was a direct map from that.

> 
> It also make it a general kernel problem and not just related to the
> vIOMMU scenario.
> 
> > > I think at the uAPI level the callpaths that require allocating a
> > > PASID from a group of RIDs should be explicit in their intention and
> > > not implicitly rely on a certain allocator behavior.
> > 
> > The difficult part I see is, when one application establishes a path
> > to one acclerator, we have no knowledge if its going to connect to a
> > second, third or such. I don't see how this can work reasonably
> > well. What if PASIDx is allocated for one, but the second RID its
> > trying to attach already has this PASID allocated?
> 
> You mean like some kind of vIOMMU hot plug?

Not vIOMMU hot plug, but an application opens accel1 and does a bind to
allocate a PASID. What I meant was the kernel has no information on whether
this needs to be a per-RID PASID or a global PASID. Keeping this global avoids
the other problems, or the more complex mechanisms to say "reserve this PASID
on all accelerators", which seem pretty complicated to implement.

Now, are we losing anything by keeping the PASIDs global?

As we discussed there is no security issue since the PASID table that hosts
these PASIDs for SVM is still per-RID. For example:

app establishes connection to accel1, allocates PASID-X
   RID for accel1 now has PASID-X and the process mm plumbed
later app also connects with accel2, now PASID-X is plumbed in for the RID
of accel2.


> 
> > > If you want to get a PASID that can be used with every RID on in your
> > > /dev/ioasid then ask for that exactly.
> > 
> > Correct, but how does guest through vIOMMU driver communicate that intent 
> > so uAPI
> > plumbing can do this? I mean architecturally via IOMMU interfaces? 
> 
> I would have to ask for a PASID that has the property it needs. You
> are saying the property is even bigger than "usable on a group of
> RIDs" but is actually "global for every RID and IOMMU in the system so
> it can go into a MSR". Gross, but fine, ask for that explicitly when
> allocating the PASID.

If one process has a single mm, is that also gross? :-) So a single process
having a PASID is just an identifier for IOMMU. It just seems like what a
mm is for a process == PASID for SVM-IOMMU support.

The unanswered question is how do we plumb from vIOMMU without a custom
allocator to get a system wide PASID? 

The way it works today is that if we have a custom allocator registered, that's
the mechanism to get PASIDs allocated. For the Intel vIOMMU it happens to be a
globally unique allocation. If a particular vIOMMU doesn't require it, it does
not have the vIOMMU interface, and those naturally get a guest-local PASID name
space. (I'm not sure if that's how the allocator works today, but I guess it's
extensible to accomplish a RID-local PASID if that's exactly what is
required.)

Cheers,
Ashok
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-07 Thread Jason Gunthorpe
On Fri, May 07, 2021 at 11:14:58AM -0700, Raj, Ashok wrote:
> On Fri, May 07, 2021 at 02:20:51PM -0300, Jason Gunthorpe wrote:
> > On Thu, May 06, 2021 at 09:32:40AM -0700, Raj, Ashok wrote:
> > 
> > > For platforms that support ENQCMD, it is required to mandate PASIDs are
> > > global across the entire system. Maybe its better to call them gPASID for
> > > guest and hPASID for host. Short reason being gPASID->hPASID is a guest
> > > wide mapping for ENQCMD and not a per-RID based mapping. (We covered that
> > > in earlier responses)
> > 
> > I don't think it is actually ENQCMD that forces this, ENQCMD can use a
> > per-RID PASID in the translation table as well.
> 
> When using ENQCMD the PASID that needs to be sent on the wire is picked
> from an MSR set up by the kernel. This is context switched along with the
> process. So each process has only 1 PASID that can go out when using
> ENQCMD. ENQCMD takes one MMIO address specific to the accelerator and a
> source for the descriptor.

Oh. I forgot this also globally locked the PASID to a single
MSR. Sigh. That makes the whole mechanism useless for anything except
whole process SVA.

It also makes it a general kernel problem and not just related to the
vIOMMU scenario.

> > I think at the uAPI level the callpaths that require allocating a
> > PASID from a group of RIDs should be explicit in their intention and
> > not implicitly rely on a certain allocator behavior.
> 
> The difficult part I see is, when one application establishes a path
> to one accelerator, we have no knowledge if it's going to connect to a
> second, third or such. I don't see how this can work reasonably
> well. What if PASIDx is allocated for one, but the second RID it's
> trying to attach already has this PASID allocated?

You mean like some kind of vIOMMU hot plug?

> > If you want to get a PASID that can be used with every RID on in your
> > /dev/ioasid then ask for that exactly.
> 
> Correct, but how does guest through vIOMMU driver communicate that intent so 
> uAPI
> plumbing can do this? I mean architecturally via IOMMU interfaces? 

I would have to ask for a PASID that has the property it needs. You
are saying the property is even bigger than "usable on a group of
RIDs" but is actually "global for every RID and IOMMU in the system so
it can go into a MSR". Gross, but fine, ask for that explicitly when
allocating the PASID.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-07 Thread Raj, Ashok
On Fri, May 07, 2021 at 02:20:51PM -0300, Jason Gunthorpe wrote:
> On Thu, May 06, 2021 at 09:32:40AM -0700, Raj, Ashok wrote:
> 
> > For platforms that support ENQCMD, it is required to mandate PASIDs are
> > global across the entire system. Maybe its better to call them gPASID for
> > guest and hPASID for host. Short reason being gPASID->hPASID is a guest
> > wide mapping for ENQCMD and not a per-RID based mapping. (We covered that
> > in earlier responses)
> 
> I don't think it is actually ENQCMD that forces this, ENQCMD can use a
> per-RID PASID in the translation table as well.

When using ENQCMD the PASID that needs to be sent on the wire is picked
from an MSR set up by the kernel. This is context switched along with the
process. So each process has only 1 PASID that can go out when using
ENQCMD. ENQCMD takes one MMIO address specific to the accelerator and a
source for the descriptor.

When one application connects to more than one accelerator, then, since the
PASID is MSR-based and filled in automatically by the CPU instruction, both
accelerators are required to use the same PASID.
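
A hedged sketch of why this ties one process to one PASID (the struct and
the enqcmd() wrapper below are illustrative stand-ins, not a real API; the
per-task MSR is IA32_PASID on Intel):

/* Illustrative only: the descriptor layout and the enqcmd() wrapper are
 * stand-ins, not a real API. */
struct work_desc {
	unsigned long long raw[8];	/* 64-byte descriptor: operation, addresses, ...
					 * no PASID filled by software -- ENQCMD overwrites
					 * the PASID in the store payload with IA32_PASID */
};

static int submit(void __iomem *portal, struct work_desc *desc)
{
	/* ENQCMD <portal>, <desc>: the CPU stamps the payload with the PASID
	 * from the per-task MSR; a non-zero return means "device said retry". */
	return enqcmd(portal, desc);
}

/* Submissions from the same task to accel1's portal and accel2's portal
 * therefore carry the same PASID, which is why that one PASID must be
 * present in the PASID tables of both RIDs. */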

Were you referring to this implementation, or to something else?
> 
> You get forced here only based on the design of the vIOMMU
> communication channel. If the guest can demand any RID is attached to
> a specific guest determined PASID then the hypervisor must accommodate
> that.

True, but when we have a guest using vSVM and enabling vENQCMD, the
requirement is the same inside the guest.
> 
> > > Which would be a different behavior than something like Intel's top
> > > level IOASID that doesn't claim all the PASIDs.
> > 
> > isn't this simple, if we can say ioasid allocator can provide 
> > 
> > - system wide PASID
> > - RID local PASID
> > 
> > Based on platform capabilities that require such differentiation?
> 
> I think at the uAPI level the callpaths that require allocating a
> PASID from a group of RIDs should be explicit in their intention and
> not implicitly rely on a certain allocator behavior.

The difficult part I see is, when one application establishes a path to one
accelerator, we have no knowledge if it's going to connect to a second,
third or such. I don't see how this can work reasonably well. What if
PASIDx is allocated for one, but the second RID it's trying to attach
already has this PASID allocated?

> 
> If you want to get a PASID that can be used with every RID on in your
> /dev/ioasid then ask for that exactly.

Correct, but how does the guest, through the vIOMMU driver, communicate that
intent so the uAPI plumbing can do this? I mean architecturally, via IOMMU
interfaces?


Cheers,
Ashok


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-07 Thread Jason Gunthorpe
On Thu, May 06, 2021 at 09:32:40AM -0700, Raj, Ashok wrote:

> For platforms that support ENQCMD, it is required to mandate PASIDs are
> global across the entire system. Maybe its better to call them gPASID for
> guest and hPASID for host. Short reason being gPASID->hPASID is a guest
> wide mapping for ENQCMD and not a per-RID based mapping. (We covered that
> in earlier responses)

I don't think it is actually ENQCMD that forces this, ENQCMD can use a
per-RID PASID in the translation table as well.

You get forced here only based on the design of the vIOMMU
communication channel. If the guest can demand any RID is attached to
a specific guest determined PASID then the hypervisor must accommodate
that.

> > Which would be a different behavior than something like Intel's top
> > level IOASID that doesn't claim all the PASIDs.
> 
> isn't this simple, if we can say ioasid allocator can provide 
> 
> - system wide PASID
> - RID local PASID
> 
> Based on platform capabilities that require such differentiation?

I think at the uAPI level the callpaths that require allocating a
PASID from a group of RIDs should be explicit in their intention and
not implicitly rely on a certain allocator behavior.

If you want to get a PASID that can be used with every RID on in your
/dev/ioasid then ask for that exactly.

It makes the uAPI much more understandable to be explicit.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-07 Thread Jason Gunthorpe
On Fri, May 07, 2021 at 11:06:14AM -0600, Alex Williamson wrote:

> We had tossed around an idea of a super-container with vfio, it's maybe
> something we'd want to incorporate into this design.  For instance, if
> memory could be pre-registered with a super container, which would
> handle the locked memory accounting for that memory, then
> sub-containers could all handle the IOMMU context of their sets of
> devices relative to that common memory pool.

This is where I suggested to David to use nesting of IOASIDs.

Without HW support for nesting a SW nest is really just re-using the
memory registration information stored in the parent when constructing
the children
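
In the placeholder uAPI style used elsewhere in this thread, the SW-nested
version of the super-container idea might look roughly like this
(REGISTER_MEMORY is a made-up name; the others follow the earlier
pseudo-flow, none of it is a settled interface):

// Pseudo-flow only, placeholder ioctl names.
parent = ioctl(ioasid_fd, CREATE_IOASID, ..);
ioctl(ioasid_fd, REGISTER_MEMORY, parent, ..);            // pin + account once

childA = ioctl(ioasid_fd, CREATE_NESTED_IOASID, parent);  // devices of owner A
childB = ioctl(ioasid_fd, CREATE_NESTED_IOASID, parent);  // devices of owner B
// Without HW nesting support, the kernel constructs the childA/childB
// mappings by re-using the parent's registration info rather than re-pinning.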

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-07 Thread Alex Williamson
On Fri, 7 May 2021 07:36:49 +
"Tian, Kevin"  wrote:

> > From: Alex Williamson 
> > Sent: Wednesday, April 28, 2021 11:06 PM
> > 
> > On Wed, 28 Apr 2021 06:34:11 +
> > "Tian, Kevin"  wrote:
> > >
> > > Can you or Alex elaborate where the complexity and performance problem
> > > locate in VFIO map/umap? We'd like to understand more detail and see  
> > how  
> > > to avoid it in the new interface.  
> > 
> > 
> > The map/unmap interface is really only good for long lived mappings,
> > the overhead is too high for things like vIOMMU use cases or any case
> > where the mapping is intended to be dynamic.  Userspace drivers must
> > make use of a long lived buffer mapping in order to achieve performance.  
> 
> This is not a limitation of VFIO map/unmap. It's the limitation of any
> map/unmap semantics since the fact of long-lived vs. short-lived is 
> imposed by userspace. Nested translation is the only viable optimization
> allowing 2nd-level to be a long-lived mapping even w/ vIOMMU. From 
> this angle I'm not sure how a new map/unmap implementation could 
> address this perf limitation alone.

Sure, we don't need to try to tackle every problem at once, a map/unmap
interface compatible with what we have is a good place to start and
nested translation may provide the high performance option.  That's not
to say that we couldn't, in the future, extend the map/unmap with memory
pre-registration like done in the spapr IOMMU to see how that could
reduce latency.

> > The mapping and unmapping granularity has been a problem as well,
> > type1v1 allowed arbitrary unmaps to bisect the original mapping, with
> > the massive caveat that the caller relies on the return value of the
> > unmap to determine what was actually unmapped because the IOMMU use
> > of
> > superpages is transparent to the caller.  This led to type1v2 that
> > simply restricts the user to avoid ever bisecting mappings.  That still
> > leaves us with problems for things like virtio-mem support where we
> > need to create initial mappings with a granularity that allows us to
> > later remove entries, which can prevent effective use of IOMMU
> > superpages.  
> 
> We could start with a semantics similar to type1v2. 
> 
> btw why does virtio-mem require a smaller granularity? Can we split
> superpages in-the-fly when removal actually happens (just similar
> to page split in VM live migration for efficient dirty page tracking)?

The IOMMU API doesn't currently support those semantics.  If the IOMMU
used a superpage, then a superpage gets unmapped, it doesn't get
atomically broken down into smaller pages.  Therefore virtio-mem
proposes a fixed mapping granularity to allow for that same unmapping
granularity.

> and isn't it another problem imposed by userspace? How could a new
> map/unmap implementation mitigate this problem if the userspace 
> insists on a smaller granularity for initial mappings?

Currently if userspace wants to guarantee unmap granularity, they need
to make the same restriction themselves on the mapping granularity.
For instance, userspace cannot currently map a 1GB IOVA range while
guaranteeing 2MB unmap granularity of that range with a single ioctl.
Instead userspace would need to make 512 separate 2MB mapping calls.
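
Concretely, with today's type1 uAPI that ends up as a loop of ioctls; a
rough sketch (error handling omitted):

/* Sketch: map a 1GB IOVA range in 2MB chunks so it can later be unmapped
 * at 2MB granularity. */
#include <linux/vfio.h>
#include <sys/ioctl.h>

void map_1g_in_2m_chunks(int container_fd, __u64 iova, __u64 vaddr)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.size  = 2 * 1024 * 1024,
	};

	for (int i = 0; i < 512; i++) {		/* 512 * 2MB = 1GB */
		map.iova  = iova  + i * map.size;
		map.vaddr = vaddr + i * map.size;
		ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
	}
}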

> > Locked page accounting has been another constant issue.  We perform
> > locked page accounting at the container level, where each container
> > accounts independently.  A user may require multiple containers, the
> > containers may pin the same physical memory, but be accounted against
> > the user once per container.  
> 
> for /dev/ioasid there is still an open whether an process is allowed to
> open /dev/ioasid once or multiple times. If there is only one ioasid_fd
> per process, the accounting can be made accurately. otherwise the
> same problem still exists as each ioasid_fd is akin to the container, then
> we need find a better solution.

We had tossed around an idea of a super-container with vfio, it's maybe
something we'd want to incorporate into this design.  For instance, if
memory could be pre-registered with a super container, which would
handle the locked memory accounting for that memory, then
sub-containers could all handle the IOMMU context of their sets of
devices relative to that common memory pool.
 
> > Those are the main ones I can think of.  It is nice to have a simple
> > map/unmap interface, I'd hope that a new /dev/ioasid interface wouldn't
> > raise the barrier to entry too high, but the user needs to have the
> > ability to have more control of their mappings and locked page
> > accounting should probably be offloaded somewhere.  Thanks,
> >   
> 
> Based on your feedbacks I feel it's probably reasonable to start with
> a type1v2 semantics for the new interface. Locked accounting could
> also start with the same VFIO restriction and then improve it
> incrementally, if a cleaner way is intrusive (if not affecting uAPI).
> But I didn't get the suggestion on "more control of their mappings".

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-07 Thread Jason Gunthorpe
On Fri, May 07, 2021 at 07:36:49AM +, Tian, Kevin wrote:

> for /dev/ioasid there is still an open whether an process is allowed to
> open /dev/ioasid once or multiple times. If there is only one ioasid_fd
> per process, the accounting can be made accurately. otherwise the
> same problem still exists as each ioasid_fd is akin to the container, then
> we need find a better solution.

You can't really do tricks like 'FD once per process' in Linux.

The locked page accounting problem is much bigger than vfio and I
don't really know of any solution..

Jason


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-07 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, May 5, 2021 1:13 AM
> 
> On Wed, Apr 28, 2021 at 06:58:19AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Wednesday, April 28, 2021 1:12 AM
> > >
> > [...]
> > > > As Alex says, if this line fails because of the group restrictions,
> > > > that's not great because it's not very obvious what's gone wrong.
> > >
> > > Okay, that is fair, but let's solve that problem directly. For
> > > instance netlink has been going in the direction of adding a "extack"
> > > from the kernel which is a descriptive error string. If the failing
> > > ioctl returned the string:
> > >
> > >   "cannot join this device to the IOASID because device XXX in the
> > >same group #10 is in use"
> > >
> > > Would you agree it is now obvious what has gone wrong? In fact would
> > > you agree this is a lot better user experience than what applications
> > > do today even though they have the group FD?
> > >
> >
> > Currently all the discussions are around implicit vs. explicit uAPI 
> > semantics
> > on the group restriction. However if we look beyond group the implicit
> > semantics might be inevitable when dealing with incompatible iommu
> > domains. An existing example of iommu incompatibility is IOMMU_
> > CACHE.
> 
> I still think we need to get rid of these incompatibilities
> somehow. Having multiple HW incompatible IOASID in the same platform
> is just bad all around.
> 
> When modeling in userspace IOMMU_CACHE sounds like it is a property of
> each individual IOASID, not an attribute that requires a new domain.

Sure. The iommu domain is a kernel-internal concept. Userspace should focus
everything on the IOASID.

> 
> People that want to create cache-bypass IOASIDs should just ask for
> that directly.
> 

Yes, in earlier discussion we agreed on a scheme where the ioasid module
will return an error to userspace, indicating that an incompatibility was
detected when binding a device to an ioasid; userspace should then create
a new IOASID for this device. This has to be done 'explicitly'.
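
In the same placeholder pseudo-uAPI style used in this thread, that explicit
fallback is roughly (ioctl names and the error check are examples only):

// Placeholder names, matching the pseudo-flow used elsewhere in the thread.
rc = ioctl(vfio_device, ATTACH_IOASID, gpa_ioasid);
if (rc < 0) {	// e.g. an IOMMU_CACHE mismatch reported as incompatible
	new_ioasid = ioctl(ioasid_fd, CREATE_IOASID, ..);
	ioctl(ioasid_fd, SET_IOASID_PAGE_TABLES, new_ioasid, ..);
	ioctl(vfio_device, ATTACH_IOASID, new_ioasid);
}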

The point of using it as the example for 'implicit semantics' is that the
kernel won't create another group-like object to contain devices with
compatible attributes and 'explicitly' manage it in the uAPI like a
group_fd. If we rely on userspace anyway to have more intelligence about
those hardware restrictions, there is little sense in only explicitly
handling group_fd in the uAPI.

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-07 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Wednesday, April 28, 2021 11:06 PM
> 
> On Wed, 28 Apr 2021 06:34:11 +
> "Tian, Kevin"  wrote:
> 
> > > From: Jason Gunthorpe 
> > > Sent: Monday, April 26, 2021 8:38 PM
> > >
> > [...]
> > > > Want to hear your opinion for one open here. There is no doubt that
> > > > an ioasid represents a HW page table when the table is constructed by
> > > > userspace and then linked to the IOMMU through the bind/unbind
> > > > API. But I'm not very sure about whether an ioasid should represent
> > > > the exact pgtable or the mapping metadata when the underlying
> > > > pgtable is indirectly constructed through map/unmap API. VFIO does
> > > > the latter way, which is why it allows multiple incompatible domains
> > > > in a single container which all share the same mapping metadata.
> > >
> > > I think VFIO's map/unmap is way too complex and we know it has bad
> > > performance problems.
> >
> > Can you or Alex elaborate where the complexity and performance problem
> > locate in VFIO map/umap? We'd like to understand more detail and see
> how
> > to avoid it in the new interface.
> 
> 
> The map/unmap interface is really only good for long lived mappings,
> the overhead is too high for things like vIOMMU use cases or any case
> where the mapping is intended to be dynamic.  Userspace drivers must
> make use of a long lived buffer mapping in order to achieve performance.

This is not a limitation of VFIO map/unmap. It's a limitation of any
map/unmap semantics, since whether a mapping is long-lived or short-lived
is determined by userspace. Nested translation is the only viable
optimization allowing the 2nd level to be a long-lived mapping even with a
vIOMMU. From this angle I'm not sure how a new map/unmap implementation
could address this perf limitation alone.

> 
> The mapping and unmapping granularity has been a problem as well,
> type1v1 allowed arbitrary unmaps to bisect the original mapping, with
> the massive caveat that the caller relies on the return value of the
> unmap to determine what was actually unmapped because the IOMMU use
> of
> superpages is transparent to the caller.  This led to type1v2 that
> simply restricts the user to avoid ever bisecting mappings.  That still
> leaves us with problems for things like virtio-mem support where we
> need to create initial mappings with a granularity that allows us to
> later remove entries, which can prevent effective use of IOMMU
> superpages.

We could start with semantics similar to type1v2.

btw, why does virtio-mem require a smaller granularity? Can we split
superpages on the fly when removal actually happens (similar to page
splitting in VM live migration for efficient dirty page tracking)?

And isn't this another problem imposed by userspace? How could a new
map/unmap implementation mitigate this problem if userspace insists on a
smaller granularity for initial mappings?

> 
> Locked page accounting has been another constant issue.  We perform
> locked page accounting at the container level, where each container
> accounts independently.  A user may require multiple containers, the
> containers may pin the same physical memory, but be accounted against
> the user once per container.

For /dev/ioasid there is still an open question whether a process is allowed
to open /dev/ioasid once or multiple times. If there is only one ioasid_fd
per process, the accounting can be made accurate. Otherwise the same
problem still exists, as each ioasid_fd is akin to the container, and then
we need to find a better solution.

> 
> Those are the main ones I can think of.  It is nice to have a simple
> map/unmap interface, I'd hope that a new /dev/ioasid interface wouldn't
> raise the barrier to entry too high, but the user needs to have the
> ability to have more control of their mappings and locked page
> accounting should probably be offloaded somewhere.  Thanks,
> 

Based on your feedback I feel it's probably reasonable to start with
type1v2 semantics for the new interface. Locked accounting could also start
with the same VFIO restriction and then be improved incrementally, if a
cleaner way is too intrusive (as long as it doesn't affect the uAPI).
But I didn't get the suggestion on "more control of their mappings".
Can you elaborate?

Thanks
Kevin


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-06 Thread Raj, Ashok
Hi Jason

On Thu, May 06, 2021 at 09:27:30AM -0300, Jason Gunthorpe wrote:
> On Thu, May 06, 2021 at 09:23:48AM +0200, Jean-Philippe Brucker wrote:
> > On Wed, May 05, 2021 at 01:04:46PM -0700, Jacob Pan wrote:
> > > > > For ARM, since the guest owns the per device PASID table. There is no
> > > > > need to allocate PASIDs from the host nor the hypervisor. Without SWQ,
> > > > > there is no need for global PASID/SSID either. So PASID being global
> > > > > for ARM is for simplicity in case of host PASID/SSID.  
> > > > 
> > > > It isn't clear how ARM can support PASID and mdev but that is an
> > > > unrelated issue..
> > > > 
> > > AFAIK, the current SMMU device assignment is per RID, since only one 
> > > stage2
> > > page tables per RID, not per PASID. This is equivalent to the older VT-d
> > > spec. prior to scalable mode.
> > 
> > Yes that's right. Since SMMUv3 has a single level-2 page table per RID, it
> > doesn't support assigning level-1 page tables to guests for mdevs (sub-VF
> > devices). So no PASIDs for mdevs, which also means each guest has its own
> > PASID space and the host doesn't track guest PASIDs.
> 
> Basically it means when the guest's top level IOASID is created for
> nesting that IOASID claims all PASID's on the RID and excludes any
> PASID IOASIDs from existing on the RID now or in future.

The way to look at it this is as follows:

For platforms that do not need to support the shared work queue model
(ENQCMD or similar), the PASID space is naturally per RID. There is no
complication with this. Every RID has the full range of PASIDs and there is
no need for the host to track which PASIDs are allocated now or in the
future in the guest.

For platforms that support ENQCMD, it is required to mandate that PASIDs
are global across the entire system. Maybe it's better to call them gPASID
for guest and hPASID for host. The short reason is that gPASID->hPASID is a
guest-wide mapping for ENQCMD and not a per-RID mapping. (We covered that
in earlier responses.)

In our current implementation we actually don't separate these spaces, and
gPASID == hPASID. The iommu driver enforces that by using the custom
allocator and the architected interface that allows all guest vIOMMU
allocations to be proxied to the host. It is nothing but a glorified
hypercall-like interface. In fact some OSes do use a hypercall to get an
hPASID instead of using the vCMD-style interface.

For cases where the full PASID range exists for every RID and is completely
managed by the guest, requiring no assist from the host to ensure
uniqueness, there is no need for a custom allocator. Maybe the general
allocator can have ways to ensure global uniqueness vs. RID-wide
uniqueness. This is still managed by the iommu driver (vIOMMU) plus the
backend for vCMD in the host IOMMU driver.

> 
> Which would be a different behavior than something like Intel's top
> level IOASID that doesn't claim all the PASIDs.

Isn't this simple if we can say the ioasid allocator can provide:

- a system-wide PASID
- a RID-local PASID

based on platform capabilities that require such differentiation?

And based on the other threads, if an ioasid is just a pgtable
representation, it doesn't need a PASID per se. But when you want to use
SVM or such, you can associate a PASID with it for the IOMMU to plumb
things into hardware.

Cheers,
Ashok


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-06 Thread Jason Gunthorpe
On Thu, May 06, 2021 at 09:23:48AM +0200, Jean-Philippe Brucker wrote:
> On Wed, May 05, 2021 at 01:04:46PM -0700, Jacob Pan wrote:
> > > > For ARM, since the guest owns the per device PASID table. There is no
> > > > need to allocate PASIDs from the host nor the hypervisor. Without SWQ,
> > > > there is no need for global PASID/SSID either. So PASID being global
> > > > for ARM is for simplicity in case of host PASID/SSID.  
> > > 
> > > It isn't clear how ARM can support PASID and mdev but that is an
> > > unrelated issue..
> > > 
> > AFAIK, the current SMMU device assignment is per RID, since only one stage2
> > page tables per RID, not per PASID. This is equivalent to the older VT-d
> > spec. prior to scalable mode.
> 
> Yes that's right. Since SMMUv3 has a single level-2 page table per RID, it
> doesn't support assigning level-1 page tables to guests for mdevs (sub-VF
> devices). So no PASIDs for mdevs, which also means each guest has its own
> PASID space and the host doesn't track guest PASIDs.

Basically it means when the guest's top level IOASID is created for
nesting that IOASID claims all PASID's on the RID and excludes any
PASID IOASIDs from existing on the RID now or in future.

Which would be a different behavior than something like Intel's top
level IOASID that doesn't claim all the PASIDs.

Lots of little special flags in here :|

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-06 Thread Jason Gunthorpe
On Wed, May 05, 2021 at 04:23:19PM -0700, Raj, Ashok wrote:
> > Which implies the API to the iommu driver should be more like:
> > 
> >   'assign an IOASID to this RID and return the PASID'
> >   'reserve a PASID from every RID'
> 
> I don't think this has any decent chance of success. It's a rather
> roundabout way to get a global PASID namespace.
> 
> >   'assign an IOASID to this RID and use this specific PASID'
> 
> This seems a bit complicated. Another way to specify this.

Maybe, but I don't like that the driver-based iommu API has been
corrupted by injecting a global 'first driver to claim it'
resource. It is not properly layered anymore.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-06 Thread Jean-Philippe Brucker
On Wed, May 05, 2021 at 01:04:46PM -0700, Jacob Pan wrote:
> > > For ARM, since the guest owns the per device PASID table. There is no
> > > need to allocate PASIDs from the host nor the hypervisor. Without SWQ,
> > > there is no need for global PASID/SSID either. So PASID being global
> > > for ARM is for simplicity in case of host PASID/SSID.  
> > 
> > It isn't clear how ARM can support PASID and mdev but that is an
> > unrelated issue..
> > 
> AFAIK, the current SMMU device assignment is per RID, since only one stage2
> page tables per RID, not per PASID. This is equivalent to the older VT-d
> spec. prior to scalable mode.

Yes that's right. Since SMMUv3 has a single level-2 page table per RID, it
doesn't support assigning level-1 page tables to guests for mdevs (sub-VF
devices). So no PASIDs for mdevs, which also means each guest has its own
PASID space and the host doesn't track guest PASIDs.

Thanks,
Jean


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-05 Thread Raj, Ashok
On Wed, May 05, 2021 at 07:21:20PM -0300, Jason Gunthorpe wrote:
> On Wed, May 05, 2021 at 01:04:46PM -0700, Jacob Pan wrote:
> > Hi Jason,
> > 
> > On Wed, 5 May 2021 15:00:23 -0300, Jason Gunthorpe  wrote:
> > 
> > > On Wed, May 05, 2021 at 10:22:59AM -0700, Jacob Pan wrote:
> > > 
> > > > Global and pluggable are for slightly separate reasons.
> > > > - We need global PASID on VT-d in that we need to support shared
> > > > workqueues (SWQ). E.g. One SWQ can be wrapped into two mdevs then
> > > > assigned to two VMs. Each VM uses its private guest PASID to submit
> > > > work but each guest PASID must be translated to a global (system-wide)
> > > > host PASID to avoid conflict. Also, since PASID table storage is per
> > > > PF, if two mdevs of the same PF are assigned to different VMs, the
> > > > PASIDs must be unique.  
> > > 
> > > From a protocol perspective each RID has a unique PASID table, and
> > > RIDs can have overlapping PASIDs.
> > > 
> > True, per RID or per PF as I was referring to.
> > 
> > > Since your SWQ is connected to a single RID the requirement that
> > > PASIDs are unique to the RID ensures they are sufficiently unique.
> > > 
> > True, but one process can submit work to multiple mdevs from different
> > RIDs/PFs. One process uses one PASID and PASID translation table is per VM.
> > The same PASID is used for all the PASID tables of each RID.
> 
> If the model is "assign this PASID to this RID" then yes, there is a
> big problem keeping everything straight that can only be solved with a
> global table.
> 
> But if the model is "give me a PASID for this RID" then it isn't such
> a problem.

Correct. Since we have usage with ENQCMD, it's more like:

- Give me PASID1 (not attached to any RID)
- Bind/attach PASID1 to RID1
- Bind/attach PASID1 to RID2

And ENQCMD isn't just for Intel: with the DMWr spec in PCIe, it extends to
all devices as long as routing is supported by the intermediate switches and
such.

> 
> Basically trying to enforce a uniform PASID for an IOASID across all
> RIDs attached to it is not such a nice choice.
> 
> > > That is fine, but all this stuff should be inside the Intel vIOMMU
> > > driver not made into a global resource of the entire iommu subsystem.
> > > 
> > Intel vIOMMU has to use a generic uAPI to allocate PASID so the generic
> > code need to have this option. I guess you are saying we should also have a
> > per RID allocation option in addition to global?
> 
> There always has to be a RID involvement for the PASID, for security,
> this issue really boils down to where the PASID lives.

We do always have RID involvement with the PASID for security. Every RID has
its own PASID table, but the PASID name space is global.

So if you have RID1 associated with PASID1, another RID2 doesn't have
PASID1 in its PASID table until the app binds PASID1 with RID2 as well;
then you have PASID1 plumbed into the PASID table for RID2.

Is this what you are referring to for security?


> 
> If you need the PASID attached to the IOASID then it has to be global
> because the IOASID can be attached to any RID and must keep the same
> PASID.
> 
> If the PASID is learned when the IOASID is attached to a RID then the
> PASID is more flexible and isn't attached to the IOASID.
> 
> Honestly I'm a little leary to bake into a UAPI a specific HW choice
> that Intel made here.

Like I mentioned, this isn't just Intel going forward. The specs are public
in PCIe; I just can't comment on which other vendors are adopting it.

> 
> I would advise making the "attach a global PASID to this IOASID"
> operation explicit and opt into for case that actually need it.
> 
> Which implies the API to the iommu driver should be more like:
> 
>   'assign an IOASID to this RID and return the PASID'
>   'reserve a PASID from every RID'

I don't think this has any decent chance of success. It's a rather
roundabout way to get a global PASID namespace.

>   'assign an IOASID to this RID and use this specific PASID'

This seems a bit complicated. Another way to specify this:

- An IOASID is a logical construct to specify a page table.
- You can bind a global PASID to an IOASID.

We aren't losing any security by using a global PASID name space. Until the
application asks for it, the PASID is not bound to any other RID without an
explicit request.


-- 
Cheers,
Ashok


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-05 Thread Jason Gunthorpe
On Wed, May 05, 2021 at 01:04:46PM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Wed, 5 May 2021 15:00:23 -0300, Jason Gunthorpe  wrote:
> 
> > On Wed, May 05, 2021 at 10:22:59AM -0700, Jacob Pan wrote:
> > 
> > > Global and pluggable are for slightly separate reasons.
> > > - We need global PASID on VT-d in that we need to support shared
> > > workqueues (SWQ). E.g. One SWQ can be wrapped into two mdevs then
> > > assigned to two VMs. Each VM uses its private guest PASID to submit
> > > work but each guest PASID must be translated to a global (system-wide)
> > > host PASID to avoid conflict. Also, since PASID table storage is per
> > > PF, if two mdevs of the same PF are assigned to different VMs, the
> > > PASIDs must be unique.  
> > 
> > From a protocol perspective each RID has a unique PASID table, and
> > RIDs can have overlapping PASIDs.
> > 
> True, per RID or per PF as I was referring to.
> 
> > Since your SWQ is connected to a single RID the requirement that
> > PASIDs are unique to the RID ensures they are sufficiently unique.
> > 
> True, but one process can submit work to multiple mdevs from different
> RIDs/PFs. One process uses one PASID and PASID translation table is per VM.
> The same PASID is used for all the PASID tables of each RID.

If the model is "assign this PASID to this RID" then yes, there is a
big problem keeping everything straight that can only be solved with a
global table.

But if the model is "give me a PASID for this RID" then it isn't such
a problem.

Basically trying to enforce a uniform PASID for an IOASID across all
RIDs attached to it is not such a nice choice.

> > That is fine, but all this stuff should be inside the Intel vIOMMU
> > driver not made into a global resource of the entire iommu subsystem.
> > 
> Intel vIOMMU has to use a generic uAPI to allocate PASID so the generic
> code need to have this option. I guess you are saying we should also have a
> per RID allocation option in addition to global?

There always has to be a RID involvement for the PASID, for security,
this issue really boils down to where the PASID lives.

If you need the PASID attached to the IOASID then it has to be global
because the IOASID can be attached to any RID and must keep the same
PASID.

If the PASID is learned when the IOASID is attached to a RID then the
PASID is more flexible and isn't attached to the IOASID.

Honestly I'm a little leary to bake into a UAPI a specific HW choice
that Intel made here.

I would advise making the "attach a global PASID to this IOASID"
operation explicit and opt into for case that actually need it.

Which implies the API to the iommu driver should be more like:

  'assign an IOASID to this RID and return the PASID'
  'reserve a PASID from every RID'
  'assign an IOASID to this RID and use this specific PASID'

In all cases the scope of those operations are completely local to a
certain IOMMU driver - 'reserver a PASID from every RID' is really
every RID that driver can operate on.

So it is hard to see why the allocator should be a global resource and
not something that is part of the iommu driver exclusively.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-05 Thread Jacob Pan
Hi Jason,

On Wed, 5 May 2021 15:00:23 -0300, Jason Gunthorpe  wrote:

> On Wed, May 05, 2021 at 10:22:59AM -0700, Jacob Pan wrote:
> 
> > Global and pluggable are for slightly separate reasons.
> > - We need global PASID on VT-d in that we need to support shared
> > workqueues (SWQ). E.g. One SWQ can be wrapped into two mdevs then
> > assigned to two VMs. Each VM uses its private guest PASID to submit
> > work but each guest PASID must be translated to a global (system-wide)
> > host PASID to avoid conflict. Also, since PASID table storage is per
> > PF, if two mdevs of the same PF are assigned to different VMs, the
> > PASIDs must be unique.  
> 
> From a protocol perspective each RID has a unique PASID table, and
> RIDs can have overlapping PASIDs.
> 
True, per RID or per PF as I was referring to.

> Since your SWQ is connected to a single RID the requirement that
> PASIDs are unique to the RID ensures they are sufficiently unique.
> 
True, but one process can submit work to multiple mdevs from different
RIDs/PFs. One process uses one PASID and PASID translation table is per VM.
The same PASID is used for all the PASID tables of each RID.

For example:
VM1 has two mdevs: mdev1 and mdev2. mdev1's parent is RID1, mdev2's parent
is RID2. The guest process A allocates PASID_A and binds it to both mdev1 and
mdev2. PASID_A must be present in the PASID tables for both RID1 and RID2.

If the allocator is per RID, it is not possible to ensure PASID_A is
available for both RIDs. Right?

Sorry I missed this point in my earlier explanation.

> If the IOMMU driver has additional restrictions then it should raise
> the PASID table up higher in the hierarchy than at the RID.
> 
That higher level in the hierarchy is global, right? I am a little
concerned about expanding PASID table sharing from a security perspective,
even though VMs already share the PASID table for mdevs.

> I think what you are trying to explain is that the Intel vIOMMU has a
> single PASID address space shared globally by the vCPU because ENQCMD
> uses the global vGPU translation table.
> 
Yes, the PASID translation table is per VM, i.e. global in terms of the
guest. That, combined with the case where two mdevs from different RIDs can
be used by the same guest process/PASID, requires a global PASID.

> That is fine, but all this stuff should be inside the Intel vIOMMU
> driver not made into a global resource of the entire iommu subsystem.
> 
The Intel vIOMMU has to use a generic uAPI to allocate PASIDs, so the
generic code needs to have this option. I guess you are saying we should
also have a per-RID allocation option in addition to the global one?

> Systems that work this way just cannot have multiple iommu drivers
> competing for PASID.
> 
Sorry, I am not following. There would not be mixed iommu drivers on one
platform, so I must have missed your point. Could you explain a little?

> > - The pluggable allocator is to support the option where the guest
> > PASIDs are allocated by the hypervisor.   
> 
> And if the hypervisor allocates the PASID then again the specific
> vIOMMU itself is concerned with this and it has nothing to do with
> global behavior of the iommu subsystem.
> 
> > For ARM, since the guest owns the per device PASID table. There is no
> > need to allocate PASIDs from the host nor the hypervisor. Without SWQ,
> > there is no need for global PASID/SSID either. So PASID being global
> > for ARM is for simplicity in case of host PASID/SSID.  
> 
> It isn't clear how ARM can support PASID and mdev but that is an
> unrelated issue..
> 
AFAIK, the current SMMU device assignment is per RID, since only one stage2
page tables per RID, not per PASID. This is equivalent to the older VT-d
spec. prior to scalable mode.

Eric/Jean, can you help?

> Jason


Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-05 Thread Jason Gunthorpe
On Wed, May 05, 2021 at 10:22:59AM -0700, Jacob Pan wrote:

> Global and pluggable are for slightly separate reasons.
> - We need global PASID on VT-d in that we need to support shared
> workqueues (SWQ). E.g. One SWQ can be wrapped into two mdevs then assigned
> to two VMs. Each VM uses its private guest PASID to submit work but
> each guest PASID must be translated to a global (system-wide) host PASID to
> avoid conflict. Also, since PASID table storage is per PF, if two mdevs of
> the same PF are assigned to different VMs, the PASIDs must be unique.

From a protocol perspective each RID has a unique PASID table, and
RIDs can have overlapping PASIDs.

Since your SWQ is connected to a single RID the requirement that
PASIDs are unique to the RID ensures they are sufficiently unique.

If the IOMMU driver has additional restrictions then it should raise
the PASID table up higher in the hierarchy than at the RID.

I think what you are trying to explain is that the Intel vIOMMU has a
single PASID address space shared globally by the vCPU because ENQCMD
uses the global vGPU translation table.

That is fine, but all this stuff should be inside the Intel vIOMMU
driver not made into a global resource of the entire iommu subsystem.

Systems that work this way just cannot have multiple iommu drivers
competing for PASID.

> - The pluggable allocator is to support the option where the guest PASIDs
> are allocated by the hypervisor. 

And if the hypervisor allocates the PASID then again the specific
vIOMMU itself is concerned with this and it has nothing to do with
global behavior of the iommu subsystem.

> For ARM, since the guest owns the per device PASID table. There is no need
> to allocate PASIDs from the host nor the hypervisor. Without SWQ, there is
> no need for global PASID/SSID either. So PASID being global for ARM is for
> simplicity in case of host PASID/SSID.

It isn't clear how ARM can support PASID and mdev but that is an
unrelated issue..

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-05 Thread Jacob Pan
Hi Jason,

On Tue, 4 May 2021 20:15:30 -0300, Jason Gunthorpe  wrote:

> On Tue, May 04, 2021 at 03:11:54PM -0700, Jacob Pan wrote:
> 
> > > It is a weird way to use xarray to have a structure which
> > > itself is just a wrapper around another RCU protected structure.
> > > 
> > > Make the caller supply the ioasid_data memory, embedded in its own
> > > element, get rid of the void * and rely on XA_ZERO_ENTRY to hold
> > > allocated but not active entries.
> > >   
> > Let me try to paraphrase to make sure I understand. Currently
> > struct ioasid_data is private to the ioasid core, its memory is
> > allocated by the ioasid core.
> > 
> > You are suggesting the following:
> > 1. make struct ioasid_data public
> > 2. caller allocates memory for ioasid_data, initialize it then pass it
> > to ioasid_alloc to store in the xarray
> > 3. caller will be responsible for setting private data inside
> > ioasid_data and do call_rcu after update if needed.  
> 
> Basically, but you probably won't need a "private data" once the
> caller has this struct as it can just embed it in whatever larger
> struct makes sense for it and use container_of/etc
> 
that makes sense. thanks!

> I didn't look too closely at the whole thing though. Honestly I'm a
> bit puzzled why we need a pluggable global allocator framework.. The
> whole framework went to some trouble to isolate everything into iommu
> drivers then that whole design is disturbed by this global thing.
> 
Global and pluggable are for slightly separate reasons.
- We need global PASID on VT-d in that we need to support shared
workqueues (SWQ). E.g. One SWQ can be wrapped into two mdevs then assigned
to two VMs. Each VM uses its private guest PASID to submit work but
each guest PASID must be translated to a global (system-wide) host PASID to
avoid conflict. Also, since PASID table storage is per PF, if two mdevs of
the same PF are assigned to different VMs, the PASIDs must be unique.

- The pluggable allocator is to support the option where the guest PASIDs
are allocated by the hypervisor. Let it be the same as the host PASID or
some arbitrary number cooked up by the hypervisor, but backed by a host HW
PASID. The VT-d spec has a virtual command interface that requires the
guest to use it instead of allocating from the guest ioasid xarray. This is
the reason why it has to go down to the iommu vendor driver. I guess that
is what you meant by "went to some trouble to isolate everything into
iommu"?
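
For reference, the hook being discussed is the pluggable allocator
registration; a rough sketch of how a vendor vIOMMU driver might wire its
vCMD/hypercall path in (the my_vcmd_*() helpers and init function are
illustrative stand-ins, not existing driver code):

/* Sketch only: my_vcmd_*() stand in for the driver's virtual-command or
 * hypercall path that obtains a host-backed PASID. */
#include <linux/ioasid.h>

extern ioasid_t my_vcmd_alloc_pasid(void *iommu, ioasid_t min, ioasid_t max);
extern void my_vcmd_free_pasid(void *iommu, ioasid_t pasid);

static ioasid_t my_alloc(ioasid_t min, ioasid_t max, void *data)
{
	return my_vcmd_alloc_pasid(data, min, max);	/* proxied to the host */
}

static void my_free(ioasid_t ioasid, void *data)
{
	my_vcmd_free_pasid(data, ioasid);
}

static struct ioasid_allocator_ops my_ops = {
	.alloc = my_alloc,
	.free  = my_free,
};

static int my_viommu_init(void *iommu)
{
	my_ops.pdata = iommu;
	/* from here on, ioasid_alloc() goes through the custom allocator,
	 * which is what makes the guest-side allocations globally unique */
	return ioasid_register_allocator(&my_ops);
}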

For ARM, since the guest owns the per device PASID table. There is no need
to allocate PASIDs from the host nor the hypervisor. Without SWQ, there is
no need for global PASID/SSID either. So PASID being global for ARM is for
simplicity in case of host PASID/SSID.



> Jason


Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-05 Thread Jason Gunthorpe
On Wed, May 05, 2021 at 02:28:53PM +1000, Alexey Kardashevskiy wrote:

> This is a good feature in general when let's say there is a linux supported
> device which has a proprietary device firmware update tool which only exists
> as an x86 binary and your hardware is not x86 - running qemu + vfio in full
> emulation would provide a way to run the tool to update a physical device.

That specific use case doesn't really need a vIOMMU though, does it?

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-05 Thread Auger Eric
Hi Jason,

On 4/29/21 10:04 PM, Jason Gunthorpe wrote:
> On Thu, Apr 29, 2021 at 03:26:55PM +0200, Auger Eric wrote:
>> From the pseudo code,
>>
>>   gpa_ioasid_id = ioctl(ioasid_fd, CREATE_IOASID, ..)
>>   ioctl(ioasid_fd, SET_IOASID_PAGE_TABLES, ..)
>>
>> I fail to understand whether the SET_IOASID_PAGE_TABLES would apply to
>> the whole IOASIDs within /dev/ioasid or to a specific one.
> 
> Sorry, nearly every IOCTL would be scoped to a specific IOASID as one
> of the arguments.

OK thank you for the clarification.
> 
>> Also in subsequent emails when you talk about IOASID, is it the
>> ioasid_id, just to double check the terminology.
> 
> I am referring to IOASID as 'handle of the page table object inside the
> /dev/ioasid fd'. If that is equal to some HW value or not I think
> remains as decision point.
OK
> 
> Basically the fd has an xarray of 'struct [something] *' and the
> IOASID is index to that FD's private xarray. This is necessary to
> create proper security as even if we have global PASID numbers or
> something they still need to be isolated to only the FD that has
> been authorized access.
> 
>>>   nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID,  gpa_ioasid_id);
>>>   ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_ioasid, ..)
>> is the nested_ioasid the allocated PASID id or is it a complete
>> different object id.
> 
> It is the IOASID handle above.
OK, as per the following emails and the comment below, IOASID and PASID are
different. The first would be a logical ID while the second is the HW ID.

Thanks

Eric
> 
>>>
>>>// IOMMU will match on the device RID, no PASID:
>>>   ioctl(vfio_device, ATTACH_IOASID, nested_ioasid);
>>>
>>>// IOMMU will match on the device RID and PASID:
>>>   ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid);
>> here I see you pass a different pasid, so I guess they are different, in
>> which case you would need to have an allocator function for this pasid,
>> right?
> 
> Yes, the underlying HW ID (PASID or substream id or whatver) is
> something slightly different
> 
> Jason
> 



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-04 Thread Alexey Kardashevskiy




On 05/05/2021 04:15, Jason Gunthorpe wrote:

On Tue, May 04, 2021 at 01:54:55PM +1000, David Gibson wrote:

On Mon, May 03, 2021 at 01:05:30PM -0300, Jason Gunthorpe wrote:

On Thu, Apr 29, 2021 at 01:20:22PM +1000, David Gibson wrote:

There is a certain appeal to having some
'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra
information like windows that can be optionally called by the viommu
driver and it remains well defined and described.


Windows really aren't ppc specific.  They're absolutely there on x86
and everything else as well - it's just that people are used to having
a window at 0.. that you can often get away with
treating it sloppily.


My point is this detailed control seems to go on to more than just
windows. As you say the vIOMMU is emulating specific HW that needs to
have kernel interfaces to match it exactly.


It's really not that bad.  The case of emulating the PAPR vIOMMU on
something else is relatively easy, because all updates to the IO page
tables go through hypercalls.  So, as long as the backend IOMMU can
map all the IOVAs that the guest IOMMU can, then qemu's implementation
of those hypercalls just needs to put an equivalent mapping in the
backend, which it can do with a generic VFIO_DMA_MAP.


So you also want the PAPR vIOMMU driver to run on, say, an ARM IOMMU?



This is a good feature in general: say there is a Linux-supported device
which has a proprietary firmware update tool that only exists as an x86
binary, and your hardware is not x86 - running qemu + vfio in full
emulation would provide a way to run the tool to update a physical device.



--
Alexey


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-04 Thread Jason Gunthorpe
On Tue, May 04, 2021 at 03:11:54PM -0700, Jacob Pan wrote:

> > It is a weird way to use xarray to have a structure which
> > itself is just a wrapper around another RCU protected structure.
> > 
> > Make the caller supply the ioasid_data memory, embedded in its own
> > element, get rid of the void * and rely on XA_ZERO_ENTRY to hold
> > allocated but not active entries.
> > 
> Let me try to paraphrase to make sure I understand. Currently
> struct ioasid_data is private to the ioasid core, its memory is allocated by
> the ioasid core.
> 
> You are suggesting the following:
> 1. make struct ioasid_data public
> 2. caller allocates memory for ioasid_data, initialize it then pass it to
> ioasid_alloc to store in the xarray
> 3. caller will be responsible for setting private data inside ioasid_data
> and do call_rcu after update if needed.

Basically, but you probably won't need a "private data" once the
caller has this struct as it can just embed it in whatever larger
struct makes sense for it and use container_of/etc
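
A hedged sketch of that embedding pattern, assuming struct ioasid_data were
made public as suggested (all names besides container_of are illustrative):

/* Sketch only; assumes a public struct ioasid_data carrying whatever the
 * core needs (id, etc.). */
struct my_pasid {
	struct ioasid_data base;	/* caller-owned, handed to ioasid_alloc() */
	struct mm_struct *mm;		/* whatever else the driver wants */
	struct rcu_head rcu;
};

static inline struct my_pasid *to_my_pasid(struct ioasid_data *d)
{
	return container_of(d, struct my_pasid, base);
}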

I didn't look too closely at the whole thing though. Honestly I'm a
bit puzzled why we need a pluggable global allocator framework.. The
whole framework went to some trouble to isolate everything into iommu
drivers then that whole design is disturbed by this global thing.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-04 Thread Jacob Pan
Hi Jason,

On Tue, 4 May 2021 15:00:50 -0300, Jason Gunthorpe  wrote:

> On Tue, May 04, 2021 at 08:41:48AM -0700, Jacob Pan wrote:
> > > > 
> > > > (also looking at ioasid.c, why do we need such a thin and odd
> > > > wrapper around xarray?)
> > > > 
> > > 
> > > I'll leave it to Jean and Jacob.  
> 
> > Could you elaborate?  
> 
> I mean stuff like this:
> 
> int ioasid_set_data(ioasid_t ioasid, void *data)
> {
> 	struct ioasid_data *ioasid_data;
> 	int ret = 0;
> 
> 	spin_lock(&ioasid_allocator_lock);
> 	ioasid_data = xa_load(&active_allocator->xa, ioasid);
> 	if (ioasid_data)
> 		rcu_assign_pointer(ioasid_data->private, data);
> 	else
> 		ret = -ENOENT;
> 	spin_unlock(&ioasid_allocator_lock);
> 
> 	/*
> 	 * Wait for readers to stop accessing the old private data, so the
> 	 * caller can free it.
> 	 */
> 	if (!ret)
> 		synchronize_rcu();
> 
> 	return ret;
> }
> EXPORT_SYMBOL_GPL(ioasid_set_data);
> 
> It is a weird way to use xarray to have a structure which
> itself is just a wrapper around another RCU protected structure.
> 
> Make the caller supply the ioasid_data memory, embedded in its own
> element, get rid of the void * and rely on XA_ZERO_ENTRY to hold
> allocated but not active entries.
> 
Let me try to paraphrase to make sure I understand. Currently
struct ioasid_data is private to the ioasid core, its memory is allocated by
the ioasid core.

You are suggesting the following:
1. make struct ioasid_data public
2. caller allocates memory for ioasid_data, initialize it then pass it to
ioasid_alloc to store in the xarray
3. caller will be responsible for setting private data inside ioasid_data
and do call_rcu after update if needed.

Correct?

> Make the synchronize_rcu() the caller responsiblity, and callers
> should really be able to use call_rcu()
> 
> Jason


Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-04 Thread Jason Gunthorpe
On Tue, May 04, 2021 at 01:54:55PM +1000, David Gibson wrote:
> On Mon, May 03, 2021 at 01:05:30PM -0300, Jason Gunthorpe wrote:
> > On Thu, Apr 29, 2021 at 01:20:22PM +1000, David Gibson wrote:
> > > > There is a certain appeal to having some
> > > > 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra
> > > > information like windows that can be optionally called by the viommu
> > > > driver and it remains well defined and described.
> > > 
> > > Windows really aren't ppc specific.  They're absolutely there on x86
> > > and everything else as well - it's just that people are used to having
> > > a window at 0.. that you can often get away with
> > > treating it sloppily.
> > 
> > My point is this detailed control seems to go on to more than just
> > windows. As you say the vIOMMU is emulating specific HW that needs to
> > have kernel interfaces to match it exactly.
> 
> It's really not that bad.  The case of emulating the PAPR vIOMMU on
> something else is relatively easy, because all updates to the IO page
> tables go through hypercalls.  So, as long as the backend IOMMU can
> map all the IOVAs that the guest IOMMU can, then qemu's implementation
> of those hypercalls just needs to put an equivalent mapping in the
> backend, which it can do with a generic VFIO_DMA_MAP.

So you also want the PAPR vIOMMU driver to run on, say, an ARM IOMMU?

> vIOMMUs with page tables in guest memory are harder, but only really
> in the usual ways that a vIOMMU of that type is harder (needs cache
> mode or whatever).  At whatever point you need to shadow from the
> guest IO page tables to the host backend, you can again do that with
> generic maps, as long as the backend supports the necessary IOVAs, and
> has an IO page size that's equal to or a submultiple of the vIOMMU
> page size.

But this definitely all becomes HW specific.

For instance I want to have an ARM vIOMMU driver it needs to do some

 ret = ioctl(ioasid_fd, CREATE_NESTED_IOASID, [page table format is ARMvXXX])
 if (ret == -EOPNOTSUPP)
 ret = ioctl(ioasid_fd, CREATE_NORMAL_IOASID, ..)
 // and do completely different and more expensive emulation

I can get a little bit of generality, but at the end of the day the
IOMMU must create a specific HW layout of the nested page table, if it
can't, it can't.

> > I'm remarking that trying to unify every HW IOMMU implementation that
> > ever has/will exist into a generic API complete enough to allow the
> > vIOMMU to be created is likely to result in an API too complicated to
> > understand..
> 
> Maybe not every one, but I think we can get a pretty wide range with a
> reasonable interface.  

It sounds like a reasonable guideline is if the feature is actually
general to all IOMMUs and can be used by qemu as part of a vIOMMU
emulation when compatible vIOMMU HW is not available.

Having 'requested window' support that isn't actually implemented in
every IOMMU is going to mean the PAPR vIOMMU emulation won't work,
defeating the whole point of making things general?

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-04 Thread Jason Gunthorpe
On Tue, May 04, 2021 at 08:41:48AM -0700, Jacob Pan wrote:
> > > 
> > > (also looking at ioasid.c, why do we need such a thin and odd wrapper
> > > around xarray?)
> > >   
> > 
> > I'll leave it to Jean and Jacob.

> Could you elaborate?

I mean stuff like this:

int ioasid_set_data(ioasid_t ioasid, void *data)
{
	struct ioasid_data *ioasid_data;
	int ret = 0;

	spin_lock(&ioasid_allocator_lock);
	ioasid_data = xa_load(&active_allocator->xa, ioasid);
	if (ioasid_data)
		rcu_assign_pointer(ioasid_data->private, data);
	else
		ret = -ENOENT;
	spin_unlock(&ioasid_allocator_lock);

	/*
	 * Wait for readers to stop accessing the old private data, so the
	 * caller can free it.
	 */
	if (!ret)
		synchronize_rcu();

	return ret;
}
EXPORT_SYMBOL_GPL(ioasid_set_data);

It is a weird way to use xarray to have a structure which
itself is just a wrapper around another RCU protected structure.

Make the caller supply the ioasid_data memory, embedded in its own
element, get rid of the void * and rely on XA_ZERO_ENTRY to hold
allocated but not active entries.

Make the synchronize_rcu() the caller's responsibility, and callers
should really be able to use call_rcu()
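
To make that concrete, a minimal sketch of the shape being suggested
(every name below is invented for illustration, not existing kernel
code; xa_reserve() is the usual way to get the allocated-but-inactive
XA_ZERO_ENTRY slot):

/* Hypothetical rework: the caller embeds its per-IOASID state. */
struct my_ioasid_ctx {
        ioasid_t id;
        struct rcu_head rcu;
        /* ... driver private data ... */
};

/* Reserve the slot; it reads back as NULL until activated. */
static int my_ioasid_reserve(struct xarray *xa, ioasid_t id)
{
        return xa_reserve(xa, id, GFP_KERNEL);
}

/* Publish the caller-owned entry; no separate wrapper allocation. */
static int my_ioasid_activate(struct xarray *xa, struct my_ioasid_ctx *ctx)
{
        return xa_err(xa_store(xa, ctx->id, ctx, GFP_KERNEL));
}

/* Teardown: the caller chooses call_rcu() instead of a blocking sync. */
static void my_ioasid_free_cb(struct rcu_head *head)
{
        kfree(container_of(head, struct my_ioasid_ctx, rcu));
}

static void my_ioasid_destroy(struct xarray *xa, struct my_ioasid_ctx *ctx)
{
        xa_erase(xa, ctx->id);
        call_rcu(&ctx->rcu, my_ioasid_free_cb);
}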

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-04 Thread Jason Gunthorpe
On Wed, Apr 28, 2021 at 06:58:19AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, April 28, 2021 1:12 AM
> > 
> [...] 
> > > As Alex says, if this line fails because of the group restrictions,
> > > that's not great because it's not very obvious what's gone wrong.
> > 
> > Okay, that is fair, but let's solve that problem directly. For
> > instance netlink has been going in the direction of adding a "extack"
> > from the kernel which is a descriptive error string. If the failing
> > ioctl returned the string:
> > 
> >   "cannot join this device to the IOASID because device XXX in the
> >same group #10 is in use"
> > 
> > Would you agree it is now obvious what has gone wrong? In fact would
> > you agree this is a lot better user experience than what applications
> > do today even though they have the group FD?
> > 
> 
> Currently all the discussions are around implicit vs. explicit uAPI semantics
> on the group restriction. However if we look beyond groups, the implicit
> semantics might be inevitable when dealing with incompatible iommu
> domains. An existing example of iommu incompatibility is IOMMU_CACHE.

I still think we need to get rid of these incompatibilities
somehow. Having multiple HW incompatible IOASID in the same platform
is just bad all around.

When modeling in userspace IOMMU_CACHE sounds like it is a property of
each individual IOASID, not an attribute that requires a new domain.

People that want to create cache bypass IOASID's should just ask for
that directly.
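
As a rough illustration of what asking for it directly could look like
(the flag and the ioctl below are invented for the sake of the example,
nothing like this exists today):

 // Coherent IOASID, the default:
 gpa_ioasid = ioctl(ioasid_fd, CREATE_IOASID, { .flags = 0 });

 // Cache-bypass (no-snoop) IOASID, requested explicitly by the driver
 // that actually wants it:
 nosnoop_ioasid = ioctl(ioasid_fd, CREATE_IOASID,
                        { .flags = IOASID_CREATE_NO_SNOOP });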

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-04 Thread Jason Gunthorpe
On Tue, May 04, 2021 at 09:22:55AM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Wed, 28 Apr 2021 17:46:06 -0300, Jason Gunthorpe  wrote:
> 
> > > > I think the name IOASID is fine for the uAPI, the kernel version can
> > > > be called ioasid_id or something.  
> > > 
> > > ioasid is already an id and then ioasid_id just adds confusion. Another
> > > point is that ioasid is currently used to represent both PCI PASID and
> > > ARM substream ID in the kernel. It implies that if we want to separate
> > > ioasid and pasid in the uAPI the 'pasid' also needs to be replaced with
> > > another general term usable for substream ID. Are we making the
> > > terms too confusing here?  
> > 
> > This is why I also am not so sure about exposing the PASID in the API
> > because it is ultimately a HW specific item.
> > 
> > As I said to David, one avenue is to have some generic uAPI that is
> > very general and keep all this deeply detailed stuff, that really only
> > matters for qemu, as part of a more HW specific vIOMMU driver
> > interface.
> I think it is not just for QEMU. I am assuming you meant PASID is
> needed for guest driver to program assigned but not mediated devices.

Anything that directly operates the device and tries to instantiate
PASIDs for vfio-pci devices will need to understand the PASID.

> User space drivers may also need to get the real HW PASID to program it on
> to the HW. So this uAPI needs to provide some lookup functionality. Perhaps
> the kernel generic version can be called ioasid_hw_id?
> 
> So we have the following per my understanding:
> - IOASID: a userspace logical number which identifies a page table, this can
> be a first level (GVA->GPA), or a second level (GPA->HPA) page table.
> - PASID: strictly defined in PCIe term
> - Substream ID: strictly defined in ARM SMMUv3 spec.
> - IOASID_HW_ID: a generic ID backed by PASID, Substream ID, or any other
>HW IDs used to tag DMA
> 
> Is that right?

It is reasonable. If an IOASID_HW_ID IOCTL can come back with an enum
that qualifies its exact nature it might be perfectly fine.
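
Purely as a sketch of that idea (none of these names exist today):

 enum ioasid_hw_id_type {
         IOASID_HW_ID_NONE,       /* no wire-visible ID for this attachment */
         IOASID_HW_ID_PCI_PASID,  /* value is a PCIe PASID */
         IOASID_HW_ID_ARM_SSID,   /* value is an SMMUv3 substream ID */
 };

 struct ioasid_hw_id_info {
         __u32 ioasid;   /* in: the page table handle */
         __u32 type;     /* out: enum ioasid_hw_id_type */
         __u32 hw_id;    /* out: raw value to program into the HW */
 };

 ioctl(ioasid_fd, IOASID_GET_HW_ID, &info);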

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-04 Thread Jacob Pan
Hi Jason,

On Wed, 28 Apr 2021 17:46:06 -0300, Jason Gunthorpe  wrote:

> > > I think the name IOASID is fine for the uAPI, the kernel version can
> > > be called ioasid_id or something.  
> > 
> > ioasid is already an id and then ioasid_id just adds confusion. Another
> > point is that ioasid is currently used to represent both PCI PASID and
> > ARM substream ID in the kernel. It implies that if we want to separate
> > ioasid and pasid in the uAPI the 'pasid' also needs to be replaced with
> > another general term usable for substream ID. Are we making the
> > terms too confusing here?  
> 
> This is why I also am not so sure about exposing the PASID in the API
> because it is ultimately a HW specific item.
> 
> As I said to David, one avenue is to have some generic uAPI that is
> very general and keep all this deeply detailed stuff, that really only
> matters for qemu, as part of a more HW specific vIOMMU driver
> interface.
I think it is not just for QEMU. I am assuming you meant PASID is
needed for guest driver to program assigned but not mediated devices.

User space drivers may also need to get the real HW PASID to program it on
to the HW. So this uAPI needs to provide some lookup functionality. Perhaps
the kernel generic version can be called ioasid_hw_id?

So we have the following per my understanding:
- IOASID: a userspace logical number which identifies a page table, this can
be a first level (GVA->GPA), or a second level (GPA->HPA) page table.
- PASID: strictly defined in PCIe term
- Substream ID: strictly defined in ARM SMMUv3 spec.
- IOASID_HW_ID: a generic ID backed by PASID, Substream ID, or any other
 HW IDs used to tag DMA

Is that right?

Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-04 Thread Jacob Pan
Hi Kevin,

On Wed, 28 Apr 2021 06:34:11 +, "Tian, Kevin" 
wrote:

> > 
> > (also looking at ioasid.c, why do we need such a thin and odd wrapper
> > around xarray?)
> >   
> 
> I'll leave it to Jean and Jacob.
I am not sure whether you are referring to the current ioasid.c or the
changes proposed in this patchset. I added a per VM/ioasid_set
(also per /dev/ioasid fd) xarray to store guest-host PASID mapping.

The current code has a xarray for the allocators.

struct ioasid_allocator_data {
        struct ioasid_allocator_ops *ops;
        struct list_head list;
        struct list_head slist;
#define IOASID_ALLOCATOR_CUSTOM BIT(0) /* Needs framework to track results */
        unsigned long flags;
        struct xarray xa;
        struct rcu_head rcu;
};

Could you elaborate?

Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-03 Thread David Gibson
On Mon, May 03, 2021 at 01:05:30PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 29, 2021 at 01:20:22PM +1000, David Gibson wrote:
> > > There is a certain appeal to having some
> > > 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra
> > > information like windows that can be optionally called by the viommu
> > > driver and it remains well defined and described.
> > 
> > Windows really aren't ppc specific.  They're absolutely there on x86
> > and everything else as well - it's just that people are used to having
> > a window at 0.. that you can often get away with
> > treating it sloppily.
> 
> My point is this detailed control seems to go on to more than just
> windows. As you say the vIOMMU is emulating specific HW that needs to
> have kernel interfaces to match it exactly.

It's really not that bad.  The case of emulating the PAPR vIOMMU on
something else is relatively easy, because all updates to the IO page
tables go through hypercalls.  So, as long as the backend IOMMU can
map all the IOVAs that the guest IOMMU can, then qemu's implementation
of those hypercalls just needs to put an equivalent mapping in the
backend, which it can do with a generic VFIO_DMA_MAP.
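
To illustrate, the hypercall handler can be little more than this (a
simplified sketch, not qemu's actual code: gpa_to_hva() is a stand-in
helper and the TCE permission bits are glossed over):

#include <sys/ioctl.h>
#include <linux/vfio.h>
#include <stdint.h>

static int shadow_put_tce(int container_fd, uint64_t ioba, uint64_t tce,
                          uint64_t tce_page_size)
{
        struct vfio_iommu_type1_dma_map map = {
                .argsz = sizeof(map),
                .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                .iova  = ioba,                  /* guest IOVA used as-is */
                .vaddr = gpa_to_hva(tce & ~(tce_page_size - 1)),
                .size  = tce_page_size,
        };

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}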

vIOMMUs with page tables in guest memory are harder, but only really
in the usual ways that a vIOMMU of that type is harder (needs cache
mode or whatever).  At whatever point you need to shadow from the
guest IO page tables to the host backend, you can again do that with
generic maps, as long as the backend supports the necessary IOVAs, and
has an IO page size that's equal to or a submultiple of the vIOMMU
page size.

> I'm remarking that trying to unify every HW IOMMU implementation that
> ever has/will exist into a generic API complete enough to allow the
> vIOMMU to be created is likely to result in an API too complicated to
> understand..

Maybe not every one, but I think we can get a pretty wide range with a
reasonable interface.  Explicitly handling IOVA windows does most of
it.  And we kind of need to handle that anyway to expose what ranges
the IOMMU is capable of translating.  I think making valid IOVA windows
explicit makes things simpler than having per-backend-family interfaces
to expose the limits of their translation ranges, which is what's likely
to happen without it.
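
For example, a query-then-create flow along these lines (the ioctl
names and the PAPR-ish window addresses are only illustrative):

 // Ask what the backend can translate at all: number of windows,
 // base/size constraints, supported IO page sizes.
 ioctl(ioasid_fd, IOASID_QUERY_IOVA_RANGES, &caps);

 // Instantiate the windows the guest platform expects; a failure here
 // is the early "this host IOMMU cannot emulate this vIOMMU" signal.
 ioctl(ioasid_fd, IOASID_CREATE_WINDOW, { .base = 0, .size = 2ULL << 30 });
 ioctl(ioasid_fd, IOASID_CREATE_WINDOW, { .base = 1ULL << 59, .size = 1ULL << 40 });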

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-03 Thread Jason Gunthorpe
On Thu, Apr 29, 2021 at 01:04:05PM +1000, David Gibson wrote:
> Again, I don't know enough about VDPA to make sense of that.  Are we
> essentially talking non-PCI virtual devices here?  In which case you
> could define the VDPA "bus" to always have one-device groups.

It is much worse than that.

What these non-PCI devices need is for the kernel driver to be part of
the IOMMU group of the underlying PCI device but tell VFIO land that
"groups don't matter"

Today mdev tries to fake this by using singleton iommu groups, but it
is really horrible and directly hacks up the VFIO IOMMU code to
understand these special cases. Intel was proposing more special
hacking in the VFIO IOMMU code to extend this to PASID.

When we get to a /dev/ioasid this is all nonsense. The kernel device
driver is going to have to tell drivers/iommu exactly what kind of
ioasid it can accept, be it a PASID inside a kernel owned group, a SW
emulated 'mdev' ioasid, or whatever.

In these cases the "group" idea has become a fiction that just creates
a pain. "Just reorganize VDPA to do something insane with the driver
core so we can create a dummy group to satisfy an unnecessary uAPI
restriction" is not a very compelling argument.

So if the nonsensical groups goes away for PASID/mdev, where does it
leave the uAPI in other cases?

> I don't think simplified-but-wrong is a good goal.  The thing about
> groups is that if they're there, you can't just "not care" about them,
> they affect you whether you like it or not.

You really can. If one thing claims the group then all the other group
devices become locked out.

The main point to understand is that groups are NOT an application
restriction! It is a whole system restriction that the operator needs
to understand and deal with. This is why things like dpdk don't care
about the group at all - there is nothing they can do with the
information.

If the operator says to run dpdk on a specific device then the
operator is the one that has to deal with all the other devices in the
group getting locked out.

At best the application can make it more obvious that the operator is
doing something dangerous, but the current kernel API doesn't seem to
really support that either.

Jason



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-03 Thread Jason Gunthorpe
On Thu, Apr 29, 2021 at 01:20:22PM +1000, David Gibson wrote:
> > There is a certain appeal to having some
> > 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra
> > information like windows that can be optionally called by the viommu
> > driver and it remains well defined and described.
> 
> Windows really aren't ppc specific.  They're absolutely there on x86
> and everything else as well - it's just that people are used to having
> a window at 0.. that you can often get away with
> treating it sloppily.

My point is this detailed control seems to go on to more than just
windows. As you say the vIOMMU is emulating specific HW that needs to
have kernel interfaces to match it exactly.

I'm remarking that trying to unify every HW IOMMU implementation that
ever has/will exist into a generic API complete enough to allow the
vIOMMU to be created is likely to result in an API too complicated to
understand..

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-29 Thread Jason Gunthorpe
On Thu, Apr 29, 2021 at 03:26:55PM +0200, Auger Eric wrote:
> From the pseudo code,
> 
>   gpa_ioasid_id = ioctl(ioasid_fd, CREATE_IOASID, ..)
>   ioctl(ioasid_fd, SET_IOASID_PAGE_TABLES, ..)
> 
> I fail to understand whether the SET_IOASID_PAGE_TABLES would apply to
> the whole IOASIDs within /dev/ioasid or to a specific one.

Sorry, nearly every IOCTL would be scoped to a specific IOASID as one
of the arguments.

> Also in subsequent emails when you talk about IOASID, is it the
> ioasid_id, just to double check the terminology.

I am referring to IOASID as 'handle of the page table object inside the
/dev/ioasid fd'. Whether that is equal to some HW value or not, I think,
remains a decision point.

Basically the fd has an xarray of 'struct [something] *' and the
IOASID is an index into that FD's private xarray. This is necessary for
proper security: even if we have global PASID numbers or something they
still need to be isolated to only the FD that has been authorized
access.
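
In other words, something shaped like this (names invented for
illustration; the xarray would be created with XA_FLAGS_ALLOC when the
fd is opened):

struct ioasid_fd_ctx {
        struct xarray ioasids;  /* index = IOASID handle, value = object */
};

struct ioasid_object {
        struct iommu_domain *domain;    /* the page table this handle names */
        /* ... */
};

/* Allocate a handle scoped to this fd; the number means nothing globally. */
static int ioasid_obj_alloc(struct ioasid_fd_ctx *ictx,
                            struct ioasid_object *obj, u32 *out_id)
{
        return xa_alloc(&ictx->ioasids, out_id, obj, xa_limit_32b, GFP_KERNEL);
}

/* Every ioctl validates the handle against the caller's own fd. */
static struct ioasid_object *ioasid_obj_get(struct ioasid_fd_ctx *ictx, u32 id)
{
        return xa_load(&ictx->ioasids, id);
}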

> >   nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID,  gpa_ioasid_id);
> >   ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_ioasid, ..)
> is the nested_ioasid the allocated PASID id or is it a complete
> different object id.

It is the IOASID handle above.

> >
> >// IOMMU will match on the device RID, no PASID:
> >   ioctl(vfio_device, ATTACH_IOASID, nested_ioasid);
> > 
> >// IOMMU will match on the device RID and PASID:
> >   ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid);
> here I see you pass a different pasid, so I guess they are different, in
> which case you would need to have an allocator function for this pasid,
> right?

Yes, the underlying HW ID (PASID or substream id or whatever) is
something slightly different

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-29 Thread Auger Eric
Hi,

On 4/22/21 2:10 PM, Jason Gunthorpe wrote:
> On Thu, Apr 22, 2021 at 08:34:32AM +, Tian, Kevin wrote:
> 
>> The shim layer could be considered as a new iommu backend in VFIO,
>> which connects VFIO iommu ops to the internal helpers in
>> drivers/ioasid.
> 
> It may be the best we can do because of SPAPR, but the ideal outcome
> should be to remove the entire pluggable IOMMU stuff from vfio
> entirely and have it only use /dev/ioasid
> 
> We should never add another pluggable IOMMU type to vfio - everything
> should be done through drives/iommu now that it is much more capable.
> 
>> Another tricky thing is that a container may be linked to multiple iommu
>> domains in VFIO, as devices in the container may locate behind different
>> IOMMUs with inconsistent capability (commit 1ef3e2bc). 
> 
> Frankly this sounds over complicated. I would think /dev/ioasid should
> select the IOMMU when the first device is joined, and all future joins
> must be compatible with the original IOMMU - ie there is only one set
> of IOMMU capabilities in a /dev/ioasid.
> 
> This means qemu might have multiple /dev/ioasid's if the system has
> multiple incompatible IOMMUs (is this actually a thing?) The platform
> should design its IOMMU domains to minimize the number of
> /dev/ioasid's required.
> 
> Is there a reason we need to share IOASID's between completely
> divergent IOMMU implementations? I don't expect the HW should be able
> to physically share page tables??
> 
> That decision point alone might be the thing that just says we can't
> ever have /dev/vfio/vfio == /dev/ioasid
> 
>> Just to confirm. Above flow is for current map/unmap flavor as what
>> VFIO/vDPA do today. Later when nested translation is supported,
>> there is no need to detach gpa_ioasid_fd. Instead, a new cmd will
>> be introduced to nest rid_ioasid_fd on top of gpa_ioasid_fd:
> 
> Sure.. The tricky bit will be to define both of the common nested
> operating modes.
>

From the pseudo code,

  gpa_ioasid_id = ioctl(ioasid_fd, CREATE_IOASID, ..)
  ioctl(ioasid_fd, SET_IOASID_PAGE_TABLES, ..)

I fail to understand whether the SET_IOASID_PAGE_TABLES would apply to
the whole IOASIDs within /dev/ioasid or to a specific one.

Also in subsequent emails when you talk about IOASID, is it the
ioasid_id, just to double check the terminology.


>   nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID,  gpa_ioasid_id);
>   ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_ioasid, ..)
is the nested_ioasid the allocated PASID id or is it a complete
different object id.
> 
>// IOMMU will match on the device RID, no PASID:
>   ioctl(vfio_device, ATTACH_IOASID, nested_ioasid);
> 
>// IOMMU will match on the device RID and PASID:
>   ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid);
here I see you pass a different pasid, so I guess they are different, in
which case you would need to have an allocator function for this pasid,
right?

Thanks

Eric
> 
> Notice that ATTACH (or bind, whatever) is always done on the
> vfio_device FD. ATTACH tells the IOMMU HW to link the PCI BDF to
> a specific page table defined by an IOASID.
> 
> I expect we have many flavours of IOASID tables, eg we have normal,
> and 'nested with table controlled by hypervisor'. ARM has 'nested with
> table controlled by guest' right? So like this?
> 
>   nested_ioasid = ioctl(ioasid_fd, CREATE_DELEGATED_IOASID,
>gpa_ioasid_id, )
>   // PASID now goes to 
>   ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid);

> 
> Where  is some internal to the guest handle of the viommu
> page table scoped within gpa_ioasid_id? Like maybe it is GPA of the
> base of the page table?
> 
> The guest can't select its own PASIDs without telling the hypervisor,
> right?
> 
>> I also feel hiding group from uAPI is a good thing and is interested in
>> the rationale behind for explicitly managing group in vfio (which is
>> essentially the same boundary as provided by iommu group), e.g. for 
>> better user experience when group security is broken? 
> 
> Indeed, I can see how things might have just evolved into this, but if
> it has a purpose it seems pretty hidden.
> Whether we need it or not seems pretty hidden.
> 
> Jason
> 



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-29 Thread Auger Eric
Hi,

On 4/22/21 2:10 PM, Jason Gunthorpe wrote:
> On Thu, Apr 22, 2021 at 08:34:32AM +, Tian, Kevin wrote:
> 
>> The shim layer could be considered as a new iommu backend in VFIO,
>> which connects VFIO iommu ops to the internal helpers in
>> drivers/ioasid.
> 
> It may be the best we can do because of SPAPR, but the ideal outcome
> should be to remove the entire pluggable IOMMU stuff from vfio
> entirely and have it only use /dev/ioasid
> 
> We should never add another pluggable IOMMU type to vfio - everything
> should be done through drives/iommu now that it is much more capable.
> 
>> Another tricky thing is that a container may be linked to multiple iommu
>> domains in VFIO, as devices in the container may locate behind different
>> IOMMUs with inconsistent capability (commit 1ef3e2bc). 
> 
> Frankly this sounds over complicated. I would think /dev/ioasid should
> select the IOMMU when the first device is joined, and all future joins
> must be compatible with the original IOMMU - ie there is only one set
> of IOMMU capabilities in a /dev/ioasid.
> 
> This means qemu might have multiple /dev/ioasid's if the system has
> multiple incompatible IOMMUs (is this actually a thing?) The platform
> should design its IOMMU domains to minimize the number of
> /dev/ioasid's required.
> 
> Is there a reason we need to share IOASID's between completely
> divergent IOMMU implementations? I don't expect the HW should be able
> to physically share page tables??
> 
> That decision point alone might be the thing that just says we can't
> ever have /dev/vfio/vfio == /dev/ioasid
> 
>> Just to confirm. Above flow is for current map/unmap flavor as what
>> VFIO/vDPA do today. Later when nested translation is supported,
>> there is no need to detach gpa_ioasid_fd. Instead, a new cmd will
>> be introduced to nest rid_ioasid_fd on top of gpa_ioasid_fd:
> 
> Sure.. The tricky bit will be to define both of the common nested
> operating modes.
> 
>   nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID,  gpa_ioasid_id);
>   ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_ioasid, ..)
> 
>// IOMMU will match on the device RID, no PASID:
>   ioctl(vfio_device, ATTACH_IOASID, nested_ioasid);
> 
>// IOMMU will match on the device RID and PASID:
>   ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid);
> 
> Notice that ATTACH (or bind, whatever) is always done on the
> vfio_device FD. ATTACH tells the IOMMU HW to link the PCI BDF to
> a specific page table defined by an IOASID.
> 
> I expect we have many flavours of IOASID tables, eg we have normal,
> and 'nested with table controlled by hypervisor'. ARM has 'nested with
> table controlled by guest' right? So like this?
Yes, the PASID table is fully controlled by the guest. Same for the
stage 1 table.
> 
>   nested_ioasid = ioctl(ioasid_fd, CREATE_DELEGATED_IOASID,
>gpa_ioasid_id, )
>   // PASID now goes to 
>   ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid);
> 
> Where  is some internal to the guest handle of the viommu
> page table scoped within gpa_ioasid_id? Like maybe it is GPA of the
> base of the page table?
Yes the GPA of the first level page table + some misc info like the max
number of IOASIDs.
> 
> The guest can't select its own PASIDs without telling the hypervisor,
> right?
On ARM there is no system-wide IOASID allocator as there is for x86, so
the guest can select its own PASID without telling the hyp.

Thanks

Eric
> 
>> I also feel hiding group from uAPI is a good thing and is interested in
>> the rationale behind for explicitly managing group in vfio (which is
>> essentially the same boundary as provided by iommu group), e.g. for 
>> better user experience when group security is broken? 
> 
> Indeed, I can see how things might have just evolved into this, but if
> it has a purpose it seems pretty hidden.
> Whether we need it or not seems pretty hidden.
> 
> Jason
> 



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-29 Thread Auger Eric
Hi,

On 4/23/21 1:49 PM, Jason Gunthorpe wrote:
> On Fri, Apr 23, 2021 at 09:06:44AM +, Tian, Kevin wrote:
> 
>> Or could we still have just one /dev/ioasid but allow userspace to create
>> multiple gpa_ioasid_id's each associated to a different iommu domain? 
>> Then the compatibility check will be done at ATTACH_IOASID instead of 
>> JOIN_IOASID_FD.
> 
> To my mind what makes sense that that /dev/ioasid presents a single
> IOMMU behavior that is basically the same. This may ultimately not be
> what we call a domain today.
> 
> We may end up with a middle object which is a group of domains that
> all have the same capabilities, and we define capabilities in a way
> that most platforms have a single group of domains.
> 
> The key capability of a group of domains is they can all share the HW
> page table representation, so if an IOASID instantiates a page table
> it can be assigned to any device on any domain in the group of domains.
> 
> If you try to say that /dev/ioasid has many domains and they can't
> have their HW page tables shared then I think the implementation
> complexity will explode.
> 
>> This does impose one burden to userspace though, to understand the 
>> IOMMU compatibilities and figure out which incompatible features may
>> affect the page table management (while such knowledge is IOMMU
>> vendor specific) and then explicitly manage multiple /dev/ioasid's or 
>> multiple gpa_ioasid_id's.
> 
> Right, this seems very hard in the general case..
>  
>> Alternatively is it a good design by having the kernel return error at
>> attach/join time to indicate that incompatibility is detected then the 
>> userspace should open a new /dev/ioasid or creates a new gpa_ioasid_id
>> for the failing device upon such failure, w/o constructing its own 
>> compatibility knowledge?
> 
> Yes, this feels workable too
> 
>>> This means qemu might have multiple /dev/ioasid's if the system has
>>> multiple incompatible IOMMUs (is this actually a thing?) The platform
>>
>> One example is Intel platform with igd. Typically there is one IOMMU
>> dedicated for igd and the other IOMMU serving all the remaining devices.
>> The igd IOMMU may not support IOMMU_CACHE while the other one
>> does.
> 
> If we can do as above the two domains may be in the same group of
> domains and the IOMMU_CACHE is not exposed at the /dev/ioasid level.
> 
> For instance the API could specify IOMMU_CACHE during attach, not
> during IOASID creation.
> 
> Getting all the data model right in the API is going to be trickiest
> part of this.
> 
>> yes, e.g. in vSVA both devices (behind divergence IOMMUs) are bound
>> to a single guest process which has an unique PASID and 1st-level page
>> table. Earlier incompatibility example is only for 2nd-level.
> 
> Because when we get to here, things become inscrutable as an API if
> you are trying to say two different IOMMU presentations can actually
> be nested.
> 
>>> Sure.. The tricky bit will be to define both of the common nested
>>> operating modes.
>>>
>>>   nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID,  gpa_ioasid_id);
>>>   ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_ioasid, ..)
>>>
>>>// IOMMU will match on the device RID, no PASID:
>>>   ioctl(vfio_device, ATTACH_IOASID, nested_ioasid);
>>>
>>>// IOMMU will match on the device RID and PASID:
>>>   ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid);
>>
>> I'm a bit confused here why we have both pasid and ioasid notations together.
>> Why not use nested_ioasid as pasid directly (i.e. every pasid in nested mode
>> is created by CREATE_NESTED_IOASID)?
> 
> The IOASID is not a PASID, it is just a page table.
> 
> A generic IOMMU matches on either RID or (RID,PASID), so you should
> specify the PASID when establishing the match.
> 
> IOASID only specifies the page table.
> 
> So you read the above as configuring the path
> 
>   PCI_DEVICE -> (RID,PASID) -> nested_ioasid -> gpa_ioasid_id -> physical
> 
> Where (RID,PASID) indicate values taken from the PCI packet.
> 
> In principle the IOMMU could also be commanded to reuse the same
> ioasid page table with a different PASID:
> 
>   PCI_DEVICE_B -> (RID_B,PASID_B) -> nested_ioasid -> gpa_ioasid_id -> 
> physical
> 
> This is impossible if the ioasid == PASID in the API.
> 
>> Below I list different scenarios for ATTACH_IOASID in my view. Here 
>> vfio_device could be a real PCI function (RID), or a subfunction device 
>> (RID+def_ioasid). 
> 
> What is RID+def_ioasid? The IOMMU does not match on IOASID's.
> 
> A subfunction device always need to use PASID, or an internal IOMMU,
> confused what you are trying to explain?
> 
>> If the whole PASID table is delegated to the guest in ARM case, the guest
>> can select its own PASIDs w/o telling the hypervisor. 
> 
> The hypervisor has to route the PASID's to the guest at some point - a
> guest can't just claim a PASID unilaterally, that would not be secure.
AFAIU On ARM the stage 2 table is uniquely defined per 

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-28 Thread David Gibson
On Wed, Apr 28, 2021 at 11:56:22AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 28, 2021 at 10:58:29AM +1000, David Gibson wrote:
> > On Tue, Apr 27, 2021 at 02:12:12PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Apr 27, 2021 at 03:08:46PM +1000, David Gibson wrote:
> > > > > Starting from a BDF the general pseudo code is:
> > > > >  device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/")
> > > > >  device_fd = open("/dev/vfio/"+device_name)
> > > > >  ioasidfd = open("/dev/ioasid")
> > > > >  ioctl(device_fd, JOIN_IOASID_FD, ioasidfd)
> > > > 
> > > > This line is the problem.
> > > > 
> > > > [Historical aside: Alex's early drafts for the VFIO interface looked
> > > > quite similar to this.  Ben Herrenschmidt and myself persuaded him it
> > > > was a bad idea, and groups were developed instead.  I still think it's
> > > > a bad idea, and not just for POWER]
> > > 
> > > Spawning the VFIO device FD from the group FD is incredibly gross from
> > > a kernel design perspective. Since that was done the struct
> > > vfio_device missed out on a sysfs presence and doesn't have the
> > > typical 'struct device' member or dedicated char device you'd expect a
> > > FD based subsystem to have.
> > > 
> > > This basically traded normal usage of the driver core for something
> > > that doesn't serve a technical usage. Given we are now nearly 10 years
> > > later and see that real widely deployed applications are not doing
> > > anything special with the group FD it makes me question the wisdom of
> > > this choice.
> > 
> > I'm really not sure what "anything special" would constitute here.
> 
> Well, really anything actually. All I see in, say, dpdk, is open the
> group fd, get a device fd, do the container dance and never touch the
> group fd again or care about groups in any way. It seems typical of
> this class of application.

Well, sure, the only operation you do on the group itself is attach it
to the container (and then every container operation can be thought of
as applying to all its attached groups).  But that attach operation
really is fundamentally about the group.  It always, unavoidably,
fundamentally affects every device in the group - including devices
you may not typically think about, like bridges and switches.

That is *not* true of the other device operations, like poking IO.

> If dpdk is exposing other devices to a risk it certainly hasn't done
> anything to make that obvious.

And in practice I suspect it will just break if you give it a >1
device group.

> > > Okay, that is fair, but let's solve that problem directly. For
> > > instance netlink has been going in the direction of adding a "extack"
> > > from the kernel which is a descriptive error string. If the failing
> > > ioctl returned the string:
> > > 
> > >   "cannot join this device to the IOASID because device XXX in the
> > >same group #10 is in use"
> > 
> > Um.. is there a sane way to return strings from an ioctl()?
> 
> Yes, it can be done, a string buffer pointer and length in the input
> for instance.

I suppose.  Rare enough that I expect everyone will ignore it, alas :/.

> > Getting the device fds from the group fd kind of follows, because it's
> > unsafe to do basically anything on the device unless you already
> > control the group (which in this case means attaching it to a
> > container/ioasid).  I'm entirely open to ways of doing that that are
> > less inelegant from a sysfs integration point of view, but the point
> > is you must manage the group before you can do anything at all with
> > individual devices.
> 
> I think most likely VFIO is going to be the only thing to manage a
> multi-device group.

You don't get to choose that.  You could explicitly limit other things
to only one-device groups, but that would be an explicit limitation.
Essentially any device can end up in a multi-device group, if you put
it behind a PCIe to PCI bridge, or a PCIe switch which doesn't support
access controls.

The groups are still there, whether or not other things want to deal
with them.

> I see things like VDPA being primarily about PASID, and an IOASID that
> is matched to a PASID is inherently a single device IOMMU group.

I don't know enough about PASID to make sense of that.

> > I don't see why.  I mean, sure, you don't want explicitly the *vfio*
> > group as such.  But IOMMU group is already a cross-subsystem concept
> > and you can explicitly expose that in a different way.
> 
> Yes, and no, the kernel drivers in something like VDPA have decided
> what device and group they are in before we get to IOASID. It is
> illogical to try to retro-actively bolt in a group concept to their
> APIs.

Again, I don't know enough about VDPA to make sense of that.  Are we
essentially talking non-PCI virtual devices here?  In which case you
could define the VDPA "bus" to always have one-device groups.

> > Again, I really think this is necessary complexity.  You're right that
> > far too little of the userspace properly 

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-28 Thread David Gibson
On Wed, Apr 28, 2021 at 09:21:49PM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 28, 2021 at 11:23:39AM +1000, David Gibson wrote:
> 
> > Yes.  My proposed model for a unified interface would be that when you
> > create a new container/IOASID, *no* IOVAs are valid.
> 
> Hurm, it is quite tricky. All IOMMUs seem to have a dead zone around
> the MSI window, so negotiating this all in a general way is not going
> to be a very simple API.
> 
> To be general it would be nicer to say something like 'I need XXGB of
> IOVA space' 'I need 32 bit IOVA space' etc and have the kernel return
> ranges that sum up to at least that big. Then the kernel can do all
> its optimizations.

Ah, yes, sorry.  We do need an API that lets the kernel make more of
the decisions too.  For userspace drivers it would generally be
sufficient to just ask for XXX size of IOVA space wherever you can get
it.  Handling guests requires more precision.  So, maybe a request
interface with a bunch of hint variables and a matching set of
MAP_FIXED-like flags to assert which ones aren't negotiable.
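
A sketch of what such a request could look like (all of the names and
flags here are made up to illustrate the idea):

 struct iova_window_request {
         __u32 flags;
 #define IOVA_REQ_FIXED_BASE   (1 << 0)  /* base is a requirement, not a hint */
 #define IOVA_REQ_FIXED_SIZE   (1 << 1)  /* size is a requirement, not a hint */
 #define IOVA_REQ_BELOW_4G     (1 << 2)  /* must fit in 32-bit IOVA space */
         __u64 base;       /* hinted or required start */
         __u64 size;       /* hinted or required length */
         __u64 out_base;   /* what the kernel actually granted */
         __u64 out_size;
 };

 // A dpdk-style user just asks for "16GB of IOVA space wherever it fits";
 // a qemu vIOMMU sets FIXED_BASE|FIXED_SIZE so a mismatch fails early.
 ioctl(ioasid_fd, IOASID_REQUEST_IOVA_WINDOW, &req);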

> I guess you are going to say that the qemu PPC vIOMMU driver needs
> more exact control..

*Every* vIOMMU driver needs more exact control.  The guest drivers
will expect to program the guest devices with IOVAs matching the guest
platform's IOMMU model.  Therefore the backing host IOMMU has to be
programmed to respond to those IOVAs.  If it can't be, there's no way
around it, and you want to fail out early.  With this model that will
happen when qemu (say) requests the host IOMMU window(s) to match the
guest's expected IOVA ranges.

Actually, come to that even guests without a vIOMMU need more exact
control: they'll expect IOVA to match GPA, so if your host IOMMU can't
be set up translate the full range of GPAs, again, you're out of luck.

The only reason x86 has been able to ignore this is that the
assumption has been that all IOMMUs can translate IOVAs from 0...  Once you really start to
look at what the limits are, you need the exact window control I'm
describing.

> > I expect we'd need some kind of query operation to expose limitations
> > on the number of windows, addresses for them, available pagesizes etc.
> 
> Is page size an assumption that hugetlbfs will always be used for backing
> memory or something?

So for TCEs (and maybe other IOMMUs out there), the IO page tables are
independent of the CPU page tables.  They don't have the same format,
and they don't necessarily have the same page size.  In the case of a
bare metal kernel working in physical addresses they can use that TCE
page size however they like.  For userspace you get another layer of
complexity.  Essentially to implement things correctly the backing
IOMMU needs to have a page size granularity that's the minimum of
whatever granularity the userspace or guest driver expects and the
host page size backing the memory.
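
As a toy illustration of that constraint (plain userspace C; the bitmap
is in the style of IOMMU page-size bitmaps where each set bit is one
supported page size):

#include <stdbool.h>
#include <stdint.h>

/*
 * The host IOMMU can only mirror the guest's IO mappings if it supports
 * a page size no larger than both the guest IO page size and the host
 * page size backing guest memory (power-of-two sizes assumed).
 */
static bool io_pgsize_ok(uint64_t host_iommu_pgsize_bitmap,
                         uint64_t guest_io_pgsize, uint64_t backing_pgsize)
{
        uint64_t limit = guest_io_pgsize < backing_pgsize ?
                         guest_io_pgsize : backing_pgsize;
        /* smallest page size the host IOMMU supports */
        uint64_t smallest = host_iommu_pgsize_bitmap &
                            (~host_iommu_pgsize_bitmap + 1);

        return smallest != 0 && smallest <= limit;
}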

> > > As an ideal, only things like the HW specific qemu vIOMMU driver
> > > should be reaching for all the special stuff.
> > 
> > I'm hoping we can even avoid that, usually.  With the explicitly
> > created windows model I propose above, it should be able to: qemu will
> > create the windows according to the IOVA windows the guest platform
> > expects to see and they either will or won't work on the host platform
> > IOMMU.  If they do, generic maps/unmaps should be sufficient.  If they
> > don't well, the host IOMMU simply cannot emulate the vIOMMU so you're
> > out of luck anyway.
> 
> It is not just P9 that has special stuff, and this whole area of PASID
> seems to be quite different on every platform
> 
> If things fit very naturally and generally then maybe, but I've been
> down this road before of trying to make a general description of a
> group of very special HW. It ended in tears after 10 years when nobody
> could understand the "general" API after it was Frankenstein'd up with
> special cases for everything. Cautionary tale
> 
> There is a certain appeal to having some
> 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra
> information like windows that can be optionally called by the viommu
> driver and it remains well defined and described.

Windows really aren't ppc specific.  They're absolutely there on x86
and everything else as well - it's just that people are used to having
a window at 0.. that you can often get away with
treating it sloppily.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-28 Thread Jason Gunthorpe
On Wed, Apr 28, 2021 at 11:23:39AM +1000, David Gibson wrote:

> Yes.  My proposed model for a unified interface would be that when you
> create a new container/IOASID, *no* IOVAs are valid.

Hurm, it is quite tricky. All IOMMUs seem to have a dead zone around
the MSI window, so negotiating this all in a general way is not going
to be a very simple API.

To be general it would be nicer to say something like 'I need XXGB of
IOVA space' 'I need 32 bit IOVA space' etc and have the kernel return
ranges that sum up to at least that big. Then the kernel can do all
its optimizations.

I guess you are going to say that the qemu PPC vIOMMU driver needs
more exact control..

> I expect we'd need some kind of query operation to expose limitations
> on the number of windows, addresses for them, available pagesizes etc.

Is page size an assumption that hugetlbfs will always be used for backing
memory or something?

> > As an ideal, only things like the HW specific qemu vIOMMU driver
> > should be reaching for all the special stuff.
> 
> I'm hoping we can even avoid that, usually.  With the explicitly
> created windows model I propose above, it should be able to: qemu will
> create the windows according to the IOVA windows the guest platform
> expects to see and they either will or won't work on the host platform
> IOMMU.  If they do, generic maps/unmaps should be sufficient.  If they
> don't well, the host IOMMU simply cannot emulate the vIOMMU so you're
> out of luck anyway.

It is not just P9 that has special stuff, and this whole area of PASID
seems to be quite different on every platform

If things fit very naturally and generally then maybe, but I've been
down this road before of trying to make a general description of a
group of very special HW. It ended in tears after 10 years when nobody
could understand the "general" API after it was Frankenstein'd up with
special cases for everything. Cautionary tale

There is a certain appeal to having some
'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra
information like windows that can be optionally called by the viommu
driver and it remains well defined and described.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-28 Thread Jason Gunthorpe
On Wed, Apr 28, 2021 at 06:34:11AM +, Tian, Kevin wrote:

> > If /dev/ioasid is single HW page table only then I would focus on that
> > implementation and leave it to userspace to span different
> > /dev/ioasids if needed.
> > 
> > > OK, now I see where the disconnection comes from. In my context ioasid
> > > is the identifier that is actually used in the wire, but seems you treat 
> > > it as
> > > a sw-defined namespace purely for representing page tables. We should
> > > clear this concept first before further discussing other details. 
> > 
> > There is no general HW requirement that every IO page table be
> > referred to by the same PASID and this API would have to support
> 
> Yes, but what is the value of allowing multiple PASIDs referring to
> the same I/O page table (except the nesting pgtable case)? Doesn't it
> lead to poor iotlb efficiency issue similar to multiple iommu domains 
> referring to the same page table?

I think iotlb efficiency is up to the platform.

The general use case is to make an IOASID for something like the GPA
and use it concurrently with, say, three devices:
 - VFIO (not PASID)
 - VDPA (PASID capable HW)
 - 'Future VDPA storage' (PASID capable HW)

The uAPI for this should be very general and the kernel should decide
the optimal way to configure the HW. Maybe it is one page table and
one PASID, or maybe it is something else.

Allowing the kernel to choose the PASID once it knows the RID is the
highest generality.
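
In the pseudocode style used elsewhere in this thread, that could look
something like this (purely illustrative, none of these ioctls exist):

 gpa_ioasid = ioctl(ioasid_fd, CREATE_IOASID, ..);

 // Same IOASID attached to several devices; the kernel decides per
 // attachment, once it knows the RID, whether a PASID is needed and
 // which one to use:
 ioctl(vfio_device_fd,  ATTACH_IOASID, gpa_ioasid);   // plain RID match
 ioctl(vdpa_device_fd,  ATTACH_IOASID, gpa_ioasid);   // kernel may pick a PASID
 ioctl(vdpa_storage_fd, ATTACH_IOASID, gpa_ioasid);   // ditto, possibly shared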

> > non-PASID IO page tables as well. So I'd keep the two things
> > separated in the uAPI - even though the kernel today has a global
> > PASID pool.
> 
> for non-PASID usages the allocated PASID will be wasted if we don't
> separate ioasid from pasid. But it may be worthwhile given 1m available 
> pasids and the simplification in the uAPI which only needs to care about 
> one id space then.

I'd prefer this be a platform choice and not forced in the uAPI,
because we can never go back on it if we see that yes we need to
optimize here. I understand many platforms have different available
PASID spaces already.

> > Simple things like DPDK can use #2 and potentially have better PASID
> > limits. hypervisors will most likely have to use #1, but it depends on
> > how their vIOMMU interface works.
> 
> Can you elaborate why DPDK wants to use #2 i.e. not using a global
> PASID?

It gives the kernel an option to make the decision about the PASID
when it has the full information, including the RID.

> > I think the name IOASID is fine for the uAPI, the kernel version can
> > be called ioasid_id or something.
> 
> ioasid is already an id and then ioasid_id just adds confusion. Another
> point is that ioasid is currently used to represent both PCI PASID and
> ARM substream ID in the kernel. It implies that if we want to separate
> ioasid and pasid in the uAPI the 'pasid' also needs to be replaced with
> another general term usable for substream ID. Are we making the
> terms too confusing here?

This is why I also am not so sure about exposing the PASID in the API
because it is ultimately a HW specific item.

As I said to David, one avenue is to have some generic uAPI that is
very general and keep all this deeply detailed stuff, that really only
matters for qemu, as part of a more HW specific vIOMMU driver
interface.

Jason

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-28 Thread Jason Gunthorpe
On Wed, Apr 28, 2021 at 07:47:56AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, April 28, 2021 1:12 AM
> > 
> [...]
> > One option is VFIO can keep its group FD but nothing else will have
> > anything like it. However I don't much like the idea that VFIO will
> > have a special and unique programming model to do the same things
> > other subsystems will do. That will make it harder for userspace to
> > implement.
> 
> Hi, Jason,
> 
> I have a question here. Based on discussions so far, it's clear that the
> new ioasid uAPI will differ from existing VFIO uAPI a lot, e.g. ioasid-
> centric operations, no group fd, no incompatible domains, etc. Then 
> I wonder how we plan to support legacy VFIO applications in this 
> transition phase. 

I suspect the VFIO group fd will have to be registered with
/dev/ioasid in addition to each device if we are to retain the same
model.

> Earlier you ever mentioned the desire of directly replacing
> /dev/vfio/vfio with /dev/ioasid and having ioasid to present both
> VFIO and new uAPI. Doesn't it imply that we have to copy the VFIO
> container/group semantics into /dev/ioasid although it's a special
> programming model only for VFIO?

I gave that as a something to think about, if it doesn't work out then
it is just a bad idea to discard.

> Alternatively we could keep all the container/group legacy within VFIO
> and having /dev/ioasid support only the new uAPI semantics. In this case
> VFIO will include a shim iommu backend to connect its legacy uAPI into 
> drivers/ioasid backend functions for backward compatibility. Then VFIO
> will also support a new model which only uses its device uAPI to bind
> to new ioasid fd w/o using any legacy container/group/iommu uAPI.
> Does this sound like a plan?

It may be where we end up.. Though I fear it will make it overly
complex inside VFIO to access the new stuff. It would be very nice if
we could see a path where VFIO insides could only deal with the
in-kernel ioasid handles, whatever they are.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-28 Thread Alex Williamson
On Wed, 28 Apr 2021 06:34:11 +
"Tian, Kevin"  wrote:

> > From: Jason Gunthorpe 
> > Sent: Monday, April 26, 2021 8:38 PM
> >   
> [...]
> > > Want to hear your opinion for one open here. There is no doubt that
> > > an ioasid represents a HW page table when the table is constructed by
> > > userspace and then linked to the IOMMU through the bind/unbind
> > > API. But I'm not very sure about whether an ioasid should represent
> > > the exact pgtable or the mapping metadata when the underlying
> > > pgtable is indirectly constructed through map/unmap API. VFIO does
> > > the latter way, which is why it allows multiple incompatible domains
> > > in a single container which all share the same mapping metadata.  
> > 
> > I think VFIO's map/unmap is way too complex and we know it has bad
> > performance problems.  
> 
> Can you or Alex elaborate on where the complexity and performance problems
> are in VFIO map/unmap? We'd like to understand the details and see how
> to avoid them in the new interface.


The map/unmap interface is really only good for long lived mappings,
the overhead is too high for things like vIOMMU use cases or any case
where the mapping is intended to be dynamic.  Userspace drivers must
make use of a long lived buffer mapping in order to achieve performance.

The mapping and unmapping granularity has been a problem as well,
type1v1 allowed arbitrary unmaps to bisect the original mapping, with
the massive caveat that the caller relies on the return value of the
unmap to determine what was actually unmapped because the IOMMU use of
superpages is transparent to the caller.  This led to type1v2 that
simply restricts the user to avoid ever bisecting mappings.  That still
leaves us with problems for things like virtio-mem support where we
need to create initial mappings with a granularity that allows us to
later remove entries, which can prevent effective use of IOMMU
superpages.

Locked page accounting has been another constant issue.  We perform
locked page accounting at the container level, where each container
accounts independently.  A user may require multiple containers, the
containers may pin the same physical memory, but be accounted against
the user once per container.

Those are the main ones I can think of.  It is nice to have a simple
map/unmap interface, I'd hope that a new /dev/ioasid interface wouldn't
raise the barrier to entry too high, but the user needs to have the
ability to have more control of their mappings and locked page
accounting should probably be offloaded somewhere.  Thanks,

Alex



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-28 Thread Jason Gunthorpe
On Wed, Apr 28, 2021 at 10:58:29AM +1000, David Gibson wrote:
> On Tue, Apr 27, 2021 at 02:12:12PM -0300, Jason Gunthorpe wrote:
> > On Tue, Apr 27, 2021 at 03:08:46PM +1000, David Gibson wrote:
> > > > Starting from a BDF the general pseudo code is:
> > > >  device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/")
> > > >  device_fd = open("/dev/vfio/"+device_name)
> > > >  ioasidfd = open("/dev/ioasid")
> > > >  ioctl(device_fd, JOIN_IOASID_FD, ioasidfd)
> > > 
> > > This line is the problem.
> > > 
> > > [Historical aside: Alex's early drafts for the VFIO interface looked
> > > quite similar to this.  Ben Herrenschmidt and myself persuaded him it
> > > was a bad idea, and groups were developed instead.  I still think it's
> > > a bad idea, and not just for POWER]
> > 
> > Spawning the VFIO device FD from the group FD is incredibly gross from
> > a kernel design perspective. Since that was done the struct
> > vfio_device missed out on a sysfs presence and doesn't have the
> > typical 'struct device' member or dedicated char device you'd expect a
> > FD based subsystem to have.
> > 
> > This basically traded normal usage of the driver core for something
> > that doesn't serve a technical usage. Given we are now nearly 10 years
> > later and see that real widely deployed applications are not doing
> > anything special with the group FD it makes me question the wisdom of
> > this choice.
> 
> I'm really not sure what "anything special" would constitute here.

Well, really anything actually. All I see in, say, dpdk, is open the
group fd, get a device fd, do the container dance and never touch the
group fd again or care about groups in any way. It seems typical of
this class of application.

If dpdk is exposing other devices to a risk it certainly hasn't done
anything to make that obvious.

> > Okay, that is fair, but let's solve that problem directly. For
> > instance netlink has been going in the direction of adding a "extack"
> > from the kernel which is a descriptive error string. If the failing
> > ioctl returned the string:
> > 
> >   "cannot join this device to the IOASID because device XXX in the
> >same group #10 is in use"
> 
> Um.. is there a sane way to return strings from an ioctl()?

Yes, it can be done, a string buffer pointer and length in the input
for instance.

> Getting the device fds from the group fd kind of follows, because it's
> unsafe to do basically anything on the device unless you already
> control the group (which in this case means attaching it to a
> container/ioasid).  I'm entirely open to ways of doing that that are
> less inelegant from a sysfs integration point of view, but the point
> is you must manage the group before you can do anything at all with
> individual devices.

I think most likely VFIO is going to be the only thing to manage a
multi-device group.

I see things like VDPA being primarily about PASID, and an IOASID that
is matched to a PASID is inherently a single device IOMMU group.

> I don't see why.  I mean, sure, you don't want explicitly the *vfio*
> group as such.  But IOMMU group is already a cross-subsystem concept
> and you can explicitly expose that in a different way.

Yes, and no, the kernel drivers in something like VDPA have decided
what device and group they are in before we get to IOASID. It is
illogical to try to retro-actively bolt in a group concept to their
APIs.
 
> Again, I really think this is necessary complexity.  You're right that
> far too little of the userspace properly understands group
> restrictions.. but these come from real hardware limitations, and I
> don't feel like making it *less* obvious in the interface is going to
> help that.

The appeal of making it less obvious is we can have a single
simplified API flow so that an application that doesn't understand or
care about groups can have uniformity.

Jason


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-28 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, April 28, 2021 1:12 AM
> 
[...]
> One option is VFIO can keep its group FD but nothing else will have
> anything like it. However I don't much like the idea that VFIO will
> have a special and unique programming model to do the same things
> other subsystems will do. That will make it harder for userspace to
> implement.

Hi, Jason,

I have a question here. Based on discussions so far, it's clear that the
new ioasid uAPI will differ from existing VFIO uAPI a lot, e.g. ioasid-
centric operations, no group fd, no incompatible domains, etc. Then 
I wonder how we plan to support legacy VFIO applications in this 
transition phase. Earlier you ever mentioned the desire of directly
replacing /dev/vfio/vfio with /dev/ioasid and having ioasid to present
both VFIO and new uAPI. Doesn't it imply that we have to copy the 
VFIO container/group semantics into /dev/ioasid although it's a special 
programming model only for VFIO?

Alternatively we could keep all the container/group legacy within VFIO
and having /dev/ioasid support only the new uAPI semantics. In this case
VFIO will include a shim iommu backend to connect its legacy uAPI into 
drivers/ioasid backend functions for backward compatibility. Then VFIO
will also support a new model which only uses its device uAPI to bind
to new ioasid fd w/o using any legacy container/group/iommu uAPI.
Does this sound like a plan?

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-28 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, April 28, 2021 1:12 AM
> 
[...] 
> > As Alex says, if this line fails because of the group restrictions,
> > that's not great because it's not very obvious what's gone wrong.
> 
> Okay, that is fair, but let's solve that problem directly. For
> instance netlink has been going in the direction of adding a "extack"
> from the kernel which is a descriptive error string. If the failing
> ioctl returned the string:
> 
>   "cannot join this device to the IOASID because device XXX in the
>same group #10 is in use"
> 
> Would you agree it is now obvious what has gone wrong? In fact would
> you agree this is a lot better user experience than what applications
> do today even though they have the group FD?
> 
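
Purely to illustrate the suggestion, here is one hypothetical shape such an
extack-like error report could take from userspace's point of view; neither
the ioctl nor the struct below exists today:

/*
 * Hypothetical sketch only: after a failed attach, userspace asks the fd
 * for a human-readable explanation of the failure.
 */
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>

struct ioasid_last_error {                 /* invented for this sketch */
	__u32 argsz;
	char  msg[256];
};
#define IOASID_GET_LAST_ERROR _IOR('x', 99, struct ioasid_last_error) /* placeholder */

static void attach_or_explain(int dev_fd, unsigned long attach_req, void *arg)
{
	if (ioctl(dev_fd, attach_req, arg) == 0)
		return;

	struct ioasid_last_error err = { .argsz = sizeof(err) };

	/* e.g. "cannot join ... device XXX in the same group #10 is in use" */
	if (ioctl(dev_fd, IOASID_GET_LAST_ERROR, &err) == 0)
		fprintf(stderr, "attach failed: %s\n", err.msg);
}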

Currently all the discussion is around implicit vs. explicit uAPI semantics
for the group restriction. However, if we look beyond groups, implicit
semantics may be inevitable when dealing with incompatible iommu domains.
An existing example of iommu incompatibility is IOMMU_CACHE. In the future
there could be other incompatibilities, such as whether nested translation
is supported. In the end userspace has to do some due diligence,
understanding the iommu topology and attributes, to decide how many VFIO
containers or ioasid fds should be created. It does push some burden onto
userspace, but it's difficult to define a group-like kernel object to
enforce such a restriction for iommu compatibility. The best the kernel can
then do is return an informative error message when an incompatible device
is attached to an existing domain. If that is the accepted way to move
forward anyway, I feel that removing the explicit group FD from the uAPI
doesn't make userspace any worse off...
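
That due diligence would look much like what QEMU already does with
containers: try to attach to an existing context, and fall back to creating
a new one when the kernel reports incompatibility. A rough sketch of the
same pattern against ioasid fds, where attach_device() stands in for
whatever bind/attach ioctl the final uAPI provides:

#include <fcntl.h>

/* Hypothetical wrapper around the eventual bind/attach ioctl */
int attach_device(int ioasid_fd, int dev_fd);

/*
 * Sketch only: attach the device to the first compatible ioasid fd,
 * otherwise open another /dev/ioasid for it. This is the burden that
 * moves to userspace when the kernel merely reports incompatibility.
 */
static int place_device(int *ioasid_fds, int *nr_fds, int dev_fd)
{
	for (int i = 0; i < *nr_fds; i++) {
		if (attach_device(ioasid_fds[i], dev_fd) == 0)
			return ioasid_fds[i];  /* compatible, share this address space */
		/* failure: incompatible domain (e.g. IOMMU_CACHE, nesting support) */
	}

	int fd = open("/dev/ioasid", O_RDWR);
	if (fd < 0 || attach_device(fd, dev_fd) < 0)
		return -1;
	ioasid_fds[(*nr_fds)++] = fd;
	return fd;
}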

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-28 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Monday, April 26, 2021 8:38 PM
> 
[...]
> > I'd like to hear your opinion on one open issue here. There is no doubt
> > that an ioasid represents a HW page table when the table is constructed by
> > userspace and then linked to the IOMMU through the bind/unbind
> > API. But I'm not very sure whether an ioasid should represent
> > the exact pgtable or the mapping metadata when the underlying
> > pgtable is indirectly constructed through the map/unmap API. VFIO does
> > it the latter way, which is why it allows multiple incompatible domains
> > in a single container, all sharing the same mapping metadata.
> 
> I think VFIO's map/unmap is way too complex and we know it has bad
> performance problems.

Can you or Alex elaborate on where the complexity and performance problems
lie in VFIO map/unmap? We'd like to understand the details and see how
to avoid them in the new interface.
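
For reference, this is the interface in question: each IOVA range is a
separate VFIO_IOMMU_MAP_DMA ioctl, which pins the user pages and installs
the IOMMU mappings, e.g.:

#include <linux/vfio.h>
#include <sys/ioctl.h>

/* The existing type1 map path: one ioctl per IOVA range, each of which
 * pins the backing pages and maps them into the IOMMU. */
static int type1_map(int container_fd, void *vaddr, __u64 iova, __u64 size)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (__u64)(unsigned long)vaddr,
		.iova  = iova,
		.size  = size,
	};

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}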

> 
> If /dev/ioasid is single HW page table only then I would focus on that
> implementation and leave it to userspace to span different
> /dev/ioasids if needed.
> 
> > OK, now I see where the disconnect comes from. In my context ioasid
> > is the identifier that is actually used on the wire, but it seems you
> > treat it as a sw-defined namespace purely for representing page tables.
> > We should clear up this concept first before further discussing other
> > details.
> 
> There is no general HW requirement that every IO page table be
> referred to by the same PASID and this API would have to support

Yes, but what is the value of allowing multiple PASIDs to refer to the
same I/O page table (except in the nesting pgtable case)? Doesn't it
lead to a poor IOTLB efficiency issue, similar to multiple iommu domains
referring to the same page table?

> non-PASID IO page tables as well. So I'd keep the two things
> separated in the uAPI - even though the kernel today has a global
> PASID pool.

For non-PASID usages the allocated PASID will be wasted if we don't
separate ioasid from pasid. But it may be worthwhile given the 1M available
PASIDs and the simplification in the uAPI, which then only needs to care
about a single ID space.

> 
> > Then, following your proposal, does it mean that we need another
> > interface for allocating PASIDs? And since ioasid means different
> > things in the uAPI and the in-kernel API, possibly a new name is
> > required to avoid confusion?
> 
> I would suggest having two ways to control the PASID:
> 
>  1) Over /dev/ioasid allocate a PASID for an IOASID. All future PASID
> based usages of the IOASID will use that global PASID
> 
>  2) Over the device FD, when the IOASID is bound return the PASID that
> was selected. If the IOASID does not have a global PASID then the
> kernel is free to make something up. In this mode a single IOASID
> can have multiple PASIDs.
> 
> Simple things like DPDK can use #2 and potentially have better PASID
> limits. hypervisors will most likely have to use #1, but it depends on
> how their vIOMMU interface works.

Can you elaborate on why DPDK would want to use #2, i.e. not use a global
PASID?
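
To make sure we're reading the two flows the same way, here is how I
picture them from the application side (struct layout, ioctl names and
numbers are invented just for this sketch):

#include <sys/ioctl.h>
#include <linux/ioctl.h>

struct bind_ioasid {               /* hypothetical */
	unsigned int ioasid;       /* in: which IOASID to bind the device to */
	unsigned int out_pasid;    /* out: the PASID the kernel selected */
};
#define IOASID_ALLOC_PASID _IOWR('x', 10, unsigned int)       /* placeholder */
#define DEVICE_BIND_IOASID _IOWR('x', 11, struct bind_ioasid) /* placeholder */

/* Flow #1: allocate one global PASID for the IOASID up front; every
 * later PASID-based use of this IOASID sees the same value. */
static int bind_with_global_pasid(int ioasid_fd, int dev_fd, unsigned int ioasid)
{
	unsigned int pasid = ioasid;   /* in: ioasid, out: its global PASID */

	if (ioctl(ioasid_fd, IOASID_ALLOC_PASID, &pasid) < 0)
		return -1;
	if (ioctl(dev_fd, DEVICE_BIND_IOASID,
		  &(struct bind_ioasid){ .ioasid = ioasid }) < 0)
		return -1;
	return pasid;   /* every binding of this IOASID uses this one PASID */
}

/* Flow #2: no global PASID; the application just reads back whatever
 * PASID the kernel picked for this particular binding. */
static unsigned int bind_with_local_pasid(int dev_fd, unsigned int ioasid)
{
	struct bind_ioasid args = { .ioasid = ioasid };

	ioctl(dev_fd, DEVICE_BIND_IOASID, &args);
	return args.out_pasid;
}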

> 
> I think the name IOASID is fine for the uAPI, the kernel version can
> be called ioasid_id or something.

ioasid is already an ID, so ioasid_id just adds confusion. Another
point is that ioasid is currently used to represent both the PCI PASID and
the ARM substream ID in the kernel. That implies that if we want to separate
ioasid and pasid in the uAPI, 'pasid' also needs to be replaced with
another general term usable for substream IDs. Are we making the
terms too confusing here?

> 
> (also looking at ioasid.c, why do we need such a thin and odd wrapper
> around xarray?)
> 
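
For context, I read the question as: PASID allocation could sit almost
directly on the xarray allocation API, roughly like the kernel-side sketch
below (not the actual ioasid.c code):

#include <linux/xarray.h>

/* Sketch only: what direct xarray use for PASID allocation looks like */
static DEFINE_XARRAY_ALLOC(pasid_xa);

static int pasid_alloc_range(void *priv, u32 min, u32 max, u32 *pasid)
{
	return xa_alloc(&pasid_xa, pasid, priv, XA_LIMIT(min, max), GFP_KERNEL);
}

static void *pasid_find(u32 pasid)
{
	return xa_load(&pasid_xa, pasid);
}

static void pasid_free(u32 pasid)
{
	xa_erase(&pasid_xa, pasid);
}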

I'll leave it to Jean and Jacob.

Thanks
Kevin
