Re: [RFC v2] /dev/iommu uAPI proposal

2021-08-10 Thread David Gibson
On Fri, Aug 06, 2021 at 09:32:11AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 06, 2021 at 02:45:26PM +1000, David Gibson wrote:
> 
> > Well, that's kind of what I'm doing.  PCI currently has the notion of
> > "default" address space for a RID, but there's no guarantee that other
> > buses (or even future PCI extensions) will.  The idea is that
> > "endpoint" means exactly the (RID, PASID) or (SID, SSID) or whatever
> > future variations are on that.
> 
> This is already happening in this proposal, it is why I insisted that
> the driver facing API has to be very explicit. That API specifies
> exactly what the device silicon is doing.
> 
> However, that is placed at the IOASID level. There is no reason to
> create endpoint objects that are 1:1 with IOASID objects, eg for
> PASID.

They're not 1:1 though.  You can have multiple endpoints in the same
IOAS, that's the whole point.

> We need to have clear software layers and responsibilities, I think
> this is where the VFIO container design has fallen behind.
> 
> The device driver is responsible to delcare what TLPs the device it
> controls will issue

Right.. and I'm envisaging an endpoint as a abstraction to represent a
single TLP.

> The system layer is responsible to determine how those TLPs can be
> matched to IO page tables, if at all
> 
> The IO page table layer is responsible to map the TLPs to physical
> memory.
> 
> Each must stay in its box and we should not create objects that smush
> together, say, the device and system layers because it will only make
> a mess of the software design.

I agree... and endpoints are explicitly an attempt to do that.  I
don't see how you think they're smushing things together.

> Since the system layer doesn't have any concrete objects in our
> environment (which is based on devices and IO page tables) it has to
> exist as metadata attached to the other two objects.

Whereas I'm suggesting clarifying this by *creating* concrete objects
to represent the concept we need.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

RE: [RFC v2] /dev/iommu uAPI proposal

2021-08-10 Thread Tian, Kevin
> From: Eric Auger 
> Sent: Tuesday, August 10, 2021 3:17 PM
> 
> Hi Kevin,
> 
> On 8/5/21 2:36 AM, Tian, Kevin wrote:
> >> From: Eric Auger 
> >> Sent: Wednesday, August 4, 2021 11:59 PM
> >>
> > [...]
> >>> 1.2. Attach Device to I/O address space
> >>> +++
> >>>
> >>> Device attach/bind is initiated through passthrough framework uAPI.
> >>>
> >>> Device attaching is allowed only after a device is successfully bound to
> >>> the IOMMU fd. User should provide a device cookie when binding the
> >>> device through VFIO uAPI. This cookie is used when the user queries
> >>> device capability/format, issues per-device iotlb invalidation and
> >>> receives per-device I/O page fault data via IOMMU fd.
> >>>
> >>> Successful binding puts the device into a security context which isolates
> >>> its DMA from the rest system. VFIO should not allow user to access the
> >> s/from the rest system/from the rest of the system
> >>> device before binding is completed. Similarly, VFIO should prevent the
> >>> user from unbinding the device before user access is withdrawn.
> >> With Intel scalable IOV, I understand you could assign an RID/PASID to
> >> one VM and another one to another VM (which is not the case for ARM).
> Is
> >> it a targetted use case?How would it be handled? Is it related to the
> >> sub-groups evoked hereafter?
> > Not related to sub-group. Each mdev is bound to the IOMMU fd
> respectively
> > with the defPASID which represents the mdev.
> But how does it work in term of security. The device (RID) is bound to
> an IOMMU fd. But then each SID/PASID may be working for a different VM.
> How do you detect this is safe as each SID can work safely for a
> different VM versus the ARM case where it is not possible.

PASID is managed by the parent driver, which knows which PASID to be 
used given a mdev when later attaching it to an IOASID. 

> 
> 1.3 says
> "
> 
> 1)  A successful binding call for the first device in the group creates
> the security context for the entire group, by:
> "
> What does it mean for above scalable IOV use case?
> 

This is a good question (as Alex raised) which needs more explanation 
in next version:

https://lore.kernel.org/linux-iommu/20210712124150.2bf421d1.alex.william...@redhat.com/

In general we need provide different helpers for binding pdev/mdev/
sw mdev. 1.3 in v2 describes the behavior for pdev via iommu_register_
device(). for mdev a new helper (e.g. iommu_register_device_pasid()) 
is required and then the IOMMU-API will also provide a pasid variation 
for creating security context per pasid. sw mdev will also have its binding 
helper to indicate no routing info required in ioasid attaching.

Thanks
Kevin 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-08-10 Thread Eric Auger
Hi Kevin,

On 8/5/21 2:36 AM, Tian, Kevin wrote:
>> From: Eric Auger 
>> Sent: Wednesday, August 4, 2021 11:59 PM
>>
> [...] 
>>> 1.2. Attach Device to I/O address space
>>> +++
>>>
>>> Device attach/bind is initiated through passthrough framework uAPI.
>>>
>>> Device attaching is allowed only after a device is successfully bound to
>>> the IOMMU fd. User should provide a device cookie when binding the
>>> device through VFIO uAPI. This cookie is used when the user queries
>>> device capability/format, issues per-device iotlb invalidation and
>>> receives per-device I/O page fault data via IOMMU fd.
>>>
>>> Successful binding puts the device into a security context which isolates
>>> its DMA from the rest system. VFIO should not allow user to access the
>> s/from the rest system/from the rest of the system
>>> device before binding is completed. Similarly, VFIO should prevent the
>>> user from unbinding the device before user access is withdrawn.
>> With Intel scalable IOV, I understand you could assign an RID/PASID to
>> one VM and another one to another VM (which is not the case for ARM). Is
>> it a targetted use case?How would it be handled? Is it related to the
>> sub-groups evoked hereafter?
> Not related to sub-group. Each mdev is bound to the IOMMU fd respectively
> with the defPASID which represents the mdev.
But how does it work in term of security. The device (RID) is bound to
an IOMMU fd. But then each SID/PASID may be working for a different VM.
How do you detect this is safe as each SID can work safely for a
different VM versus the ARM case where it is not possible.

1.3 says
"

1)  A successful binding call for the first device in the group creates
the security context for the entire group, by:
"
What does it mean for above scalable IOV use case?

>
>> Actually all devices bound to an IOMMU fd should have the same parent
>> I/O address space or root address space, am I correct? If so, maybe add
>> this comment explicitly?
> in most cases yes but it's not mandatory. multiple roots are allowed
> (e.g. with vIOMMU but no nesting).
OK, right, this corresponds to example 4.2 for example. I misinterpreted
the notion of security context. The security context does not match the
IOMMU fd but is something implicit created on 1st device binding.
>
> [...]
>>> The device in the /dev/iommu context always refers to a physical one
>>> (pdev) which is identifiable via RID. Physically each pdev can support
>>> one default I/O address space (routed via RID) and optionally multiple
>>> non-default I/O address spaces (via RID+PASID).
>>>
>>> The device in VFIO context is a logic concept, being either a physical
>>> device (pdev) or mediated device (mdev or subdev). Each vfio device
>>> is represented by RID+cookie in IOMMU fd. User is allowed to create
>>> one default I/O address space (routed by vRID from user p.o.v) per
>>> each vfio_device.
>> The concept of default address space is not fully clear for me. I
>> currently understand this is a
>> root address space (not nesting). Is that coorect.This may need
>> clarification.
> w/o PASID there is only one address space (either GPA or GIOVA)
> per device. This one is called default. whether it's root is orthogonal
> (e.g. GIOVA could be also nested) to the device view of this space.
>
> w/ PASID additional address spaces can be targeted by the device.
> those are called non-default.
>
> I could also rename default to RID address space and non-default to 
> RID+PASID address space if doing so makes it clearer.
Yes I think it is worth having a kind of glossary and defining root as,
default as as you clearly defined child/parent.
>
>>> VFIO decides the routing information for this default
>>> space based on device type:
>>>
>>> 1)  pdev, routed via RID;
>>>
>>> 2)  mdev/subdev with IOMMU-enforced DMA isolation, routed via
>>> the parent's RID plus the PASID marking this mdev;
>>>
>>> 3)  a purely sw-mediated device (sw mdev), no routing required i.e. no
>>> need to install the I/O page table in the IOMMU. sw mdev just uses
>>> the metadata to assist its internal DMA isolation logic on top of
>>> the parent's IOMMU page table;
>> Maybe you should introduce this concept of SW mediated device earlier
>> because it seems to special case the way the attach behaves. I am
>> especially refering to
>>
>> "Successful attaching activates an I/O address space in the IOMMU, if the
>> device is not purely software mediated"
> makes sense.
>
>>> In addition, VFIO may allow user to create additional I/O address spaces
>>> on a vfio_device based on the hardware capability. In such case the user
>>> has its own view of the virtual routing information (vPASID) when marking
>>> these non-default address spaces.
>> I do not catch what does mean "marking these non default address space".
> as explained above, those non-default address spaces are identified/routed
> via PASID. 
>
>>> 1.3. Group isolation
>>> 
>

RE: [RFC v2] /dev/iommu uAPI proposal

2021-08-09 Thread Tian, Kevin
> From: David Gibson 
> Sent: Tuesday, August 10, 2021 12:48 PM
> 
> On Mon, Aug 09, 2021 at 08:34:06AM +, Tian, Kevin wrote:
> > > From: David Gibson 
> > > Sent: Friday, August 6, 2021 12:45 PM
> > > > > > In concept I feel the purpose of DMA endpoint is equivalent to the
> > > routing
> > > > > > info in this proposal.
> > > > >
> > > > > Maybe?  I'm afraid I never quite managed to understand the role of
> the
> > > > > routing info in your proposal.
> > > > >
> > > >
> > > > the IOMMU routes incoming DMA packets to a specific I/O page table,
> > > > according to RID or RID+PASID carried in the packet. RID or RID+PASID
> > > > is the routing information (represented by device cookie +PASID in
> > > > proposed uAPI) and what the iommu driver really cares when activating
> > > > the I/O page table in the iommu.
> > >
> > > Ok, so yes, endpoint is roughly equivalent to that.  But my point is
> > > that the IOMMU layer really only cares about that (device+routing)
> > > combination, not other aspects of what the device is.  So that's the
> > > concept we should give a name and put front and center in the API.
> > >
> >
> > This is how this proposal works, centered around the routing info. the
> > uAPI doesn't care what the device is. It just requires the user to specify
> > the user view of routing info (device fd + optional pasid) to tag an IOAS.
> 
> Which works as long as (just device fd) and (device fd + PASID) covers
> all the options.  If we have new possibilities we need new interfaces.
> And, that can't even handle the case of one endpoint for multiple
> devices (e.g. ACS-incapable bridge).

why not? We have went through a long debate in v1 to reach conclusion
that a device-centric uAPI can cover above group scenario (see section 1.3)

> 
> > the user view is then converted to the kernel view of routing (rid or
> > rid+pasid) by vfio driver and then passed to iommu fd in the attaching
> > operation. and GET_INFO interface is provided for the user to check
> > whether a device supports multiple IOASes and whether pasid space is
> > delegated to the user. We just need a better name if pasid is considered
> > too pci specific...
> >
> > But creating an endpoint per ioasid and making it centered in uAPI is not
> > what the IOMMU layer cares about.
> 
> It's not an endpoint per ioasid.  You can have multiple endpoints per
> ioasid, just not the other way around.  As it is multiple IOASes per

you need create one endpoint per device fd to attach to gpa_ioasid.

then one endpoint per device fd to attach to pasidtbl_ioasid on arm/amd.

then one endpoint per pasid to attach to gva_ioasid on intel.

In the end you just create one endpoint per each attached ioasid given 
a device.

> device means *some* sort of disambiguation (generally by PASID) which
> is hard to describe generall.  Having endpoints as a first-class
> concept makes that simpler.
> 

I don't think pasid causes any disambiguation (except the name itself
being pci specific). with multiple IOASes you always need an id to tag it. 
This id is what iommu layer cares about. which endpoint on the device 
uses the id is not a business to iommu.

Thanks
Kevin

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-08-09 Thread David Gibson
On Mon, Aug 09, 2021 at 08:34:06AM +, Tian, Kevin wrote:
> > From: David Gibson 
> > Sent: Friday, August 6, 2021 12:45 PM
> > > > > In concept I feel the purpose of DMA endpoint is equivalent to the
> > routing
> > > > > info in this proposal.
> > > >
> > > > Maybe?  I'm afraid I never quite managed to understand the role of the
> > > > routing info in your proposal.
> > > >
> > >
> > > the IOMMU routes incoming DMA packets to a specific I/O page table,
> > > according to RID or RID+PASID carried in the packet. RID or RID+PASID
> > > is the routing information (represented by device cookie +PASID in
> > > proposed uAPI) and what the iommu driver really cares when activating
> > > the I/O page table in the iommu.
> > 
> > Ok, so yes, endpoint is roughly equivalent to that.  But my point is
> > that the IOMMU layer really only cares about that (device+routing)
> > combination, not other aspects of what the device is.  So that's the
> > concept we should give a name and put front and center in the API.
> > 
> 
> This is how this proposal works, centered around the routing info. the 
> uAPI doesn't care what the device is. It just requires the user to specify 
> the user view of routing info (device fd + optional pasid) to tag an IOAS. 

Which works as long as (just device fd) and (device fd + PASID) covers
all the options.  If we have new possibilities we need new interfaces.
And, that can't even handle the case of one endpoint for multiple
devices (e.g. ACS-incapable bridge).

> the user view is then converted to the kernel view of routing (rid or 
> rid+pasid) by vfio driver and then passed to iommu fd in the attaching 
> operation. and GET_INFO interface is provided for the user to check 
> whether a device supports multiple IOASes and whether pasid space is 
> delegated to the user. We just need a better name if pasid is considered 
> too pci specific...
> 
> But creating an endpoint per ioasid and making it centered in uAPI is not 
> what the IOMMU layer cares about.

It's not an endpoint per ioasid.  You can have multiple endpoints per
ioasid, just not the other way around.  As it is multiple IOASes per
device means *some* sort of disambiguation (generally by PASID) which
is hard to describe generall.  Having endpoints as a first-class
concept makes that simpler.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

RE: [RFC v2] /dev/iommu uAPI proposal

2021-08-09 Thread Tian, Kevin
> From: David Gibson 
> Sent: Friday, August 6, 2021 12:45 PM
> > > > In concept I feel the purpose of DMA endpoint is equivalent to the
> routing
> > > > info in this proposal.
> > >
> > > Maybe?  I'm afraid I never quite managed to understand the role of the
> > > routing info in your proposal.
> > >
> >
> > the IOMMU routes incoming DMA packets to a specific I/O page table,
> > according to RID or RID+PASID carried in the packet. RID or RID+PASID
> > is the routing information (represented by device cookie +PASID in
> > proposed uAPI) and what the iommu driver really cares when activating
> > the I/O page table in the iommu.
> 
> Ok, so yes, endpoint is roughly equivalent to that.  But my point is
> that the IOMMU layer really only cares about that (device+routing)
> combination, not other aspects of what the device is.  So that's the
> concept we should give a name and put front and center in the API.
> 

This is how this proposal works, centered around the routing info. the 
uAPI doesn't care what the device is. It just requires the user to specify 
the user view of routing info (device fd + optional pasid) to tag an IOAS. 
the user view is then converted to the kernel view of routing (rid or 
rid+pasid) by vfio driver and then passed to iommu fd in the attaching 
operation. and GET_INFO interface is provided for the user to check 
whether a device supports multiple IOASes and whether pasid space is 
delegated to the user. We just need a better name if pasid is considered 
too pci specific...

But creating an endpoint per ioasid and making it centered in uAPI is not 
what the IOMMU layer cares about.

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-08-06 Thread Jason Gunthorpe via iommu
On Fri, Aug 06, 2021 at 02:45:26PM +1000, David Gibson wrote:

> Well, that's kind of what I'm doing.  PCI currently has the notion of
> "default" address space for a RID, but there's no guarantee that other
> buses (or even future PCI extensions) will.  The idea is that
> "endpoint" means exactly the (RID, PASID) or (SID, SSID) or whatever
> future variations are on that.

This is already happening in this proposal, it is why I insisted that
the driver facing API has to be very explicit. That API specifies
exactly what the device silicon is doing.

However, that is placed at the IOASID level. There is no reason to
create endpoint objects that are 1:1 with IOASID objects, eg for
PASID.

We need to have clear software layers and responsibilities, I think
this is where the VFIO container design has fallen behind.

The device driver is responsible to delcare what TLPs the device it
controls will issue

The system layer is responsible to determine how those TLPs can be
matched to IO page tables, if at all

The IO page table layer is responsible to map the TLPs to physical
memory.

Each must stay in its box and we should not create objects that smush
together, say, the device and system layers because it will only make
a mess of the software design.

Since the system layer doesn't have any concrete objects in our
environment (which is based on devices and IO page tables) it has to
exist as metadata attached to the other two objects.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-08-05 Thread David Gibson
On Tue, Aug 03, 2021 at 03:19:26AM +, Tian, Kevin wrote:
> > From: David Gibson 
> > Sent: Tuesday, August 3, 2021 9:51 AM
> > 
> > On Wed, Jul 28, 2021 at 04:04:24AM +, Tian, Kevin wrote:
> > > Hi, David,
> > >
> > > > From: David Gibson 
> > > > Sent: Monday, July 26, 2021 12:51 PM
> > > >
> > > > On Fri, Jul 09, 2021 at 07:48:44AM +, Tian, Kevin wrote:
> > > > > /dev/iommu provides an unified interface for managing I/O page tables
> > for
> > > > > devices assigned to userspace. Device passthrough frameworks (VFIO,
> > > > vDPA,
> > > > > etc.) are expected to use this interface instead of creating their own
> > logic to
> > > > > isolate untrusted device DMAs initiated by userspace.
> > > > >
> > > > > This proposal describes the uAPI of /dev/iommu and also sample
> > > > sequences
> > > > > with VFIO as example in typical usages. The driver-facing kernel API
> > > > provided
> > > > > by the iommu layer is still TBD, which can be discussed after 
> > > > > consensus
> > is
> > > > > made on this uAPI.
> > > > >
> > > > > It's based on a lengthy discussion starting from here:
> > > > >   https://lore.kernel.org/linux-
> > > > iommu/20210330132830.go2356...@nvidia.com/
> > > > >
> > > > > v1 can be found here:
> > > > >   https://lore.kernel.org/linux-
> > > >
> > iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.n
> > > > amprd12.prod.outlook.com/T/
> > > > >
> > > > > This doc is also tracked on github, though it's not very useful for 
> > > > > v1->v2
> > > > > given dramatic refactoring:
> > > > >   https://github.com/luxis1999/dev_iommu_uapi
> > > >
> > > > Thanks for all your work on this, Kevin.  Apart from the actual
> > > > semantic improvements, I'm finding v2 significantly easier to read and
> > > > understand than v1.
> > > >
> > > > [snip]
> > > > > 1.2. Attach Device to I/O address space
> > > > > +++
> > > > >
> > > > > Device attach/bind is initiated through passthrough framework uAPI.
> > > > >
> > > > > Device attaching is allowed only after a device is successfully bound 
> > > > > to
> > > > > the IOMMU fd. User should provide a device cookie when binding the
> > > > > device through VFIO uAPI. This cookie is used when the user queries
> > > > > device capability/format, issues per-device iotlb invalidation and
> > > > > receives per-device I/O page fault data via IOMMU fd.
> > > > >
> > > > > Successful binding puts the device into a security context which 
> > > > > isolates
> > > > > its DMA from the rest system. VFIO should not allow user to access the
> > > > > device before binding is completed. Similarly, VFIO should prevent the
> > > > > user from unbinding the device before user access is withdrawn.
> > > > >
> > > > > When a device is in an iommu group which contains multiple devices,
> > > > > all devices within the group must enter/exit the security context
> > > > > together. Please check {1.3} for more info about group isolation via
> > > > > this device-centric design.
> > > > >
> > > > > Successful attaching activates an I/O address space in the IOMMU,
> > > > > if the device is not purely software mediated. VFIO must provide 
> > > > > device
> > > > > specific routing information for where to install the I/O page table 
> > > > > in
> > > > > the IOMMU for this device. VFIO must also guarantee that the attached
> > > > > device is configured to compose DMAs with the routing information
> > that
> > > > > is provided in the attaching call. When handling DMA requests, IOMMU
> > > > > identifies the target I/O address space according to the routing
> > > > > information carried in the request. Misconfiguration breaks DMA
> > > > > isolation thus could lead to severe security vulnerability.
> > > > >
> > > > > Routing information is per-device and bus specific. For PCI, it is
> > > > > Requester ID (RID) identifying the device plus optional Process 
> > > > > Address
> > > > > Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub-
> > Stream
> > > > > ID (SSID). PASID or SSID is used when multiple I/O address spaces are
> > > > > enabled on a single device. For simplicity and continuity reason the
> > > > > following context uses RID+PASID though SID+SSID may sound a clearer
> > > > > naming from device p.o.v. We can decide the actual naming when
> > coding.
> > > > >
> > > > > Because one I/O address space can be attached by multiple devices,
> > > > > per-device routing information (plus device cookie) is tracked under
> > > > > each IOASID and is used respectively when activating the I/O address
> > > > > space in the IOMMU for each attached device.
> > > > >
> > > > > The device in the /dev/iommu context always refers to a physical one
> > > > > (pdev) which is identifiable via RID. Physically each pdev can support
> > > > > one default I/O address space (routed via RID) and optionally multiple
> > > > > non-default I/O address spaces (via RID+PASID).
> > > > >
> > > > > The device in 

Re: [RFC v2] /dev/iommu uAPI proposal

2021-08-05 Thread David Gibson
On Wed, Aug 04, 2021 at 11:04:47AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 02, 2021 at 02:49:44AM +, Tian, Kevin wrote:
> 
> > Can you elaborate? IMO the user only cares about the label (device cookie 
> > plus optional vPASID) which is generated by itself when doing the attaching
> > call, and expects this virtual label being used in various spots 
> > (invalidation,
> > page fault, etc.). How the system labels the traffic (the physical RID or 
> > RID+
> > PASID) should be completely invisible to userspace.
> 
> I don't think that is true if the vIOMMU driver is also emulating
> PASID. Presumably the same is true for other PASID-like schemes.

Right.  The idea for an SVA capable vIOMMU in my scheme is that the
hypervisor would set up an IOAS of address type "PASID+address" with
the mappings made by the guest according to its vIOMMU semantics.
Then SVA capable devices would be plugged into that IOAS by using
"PASID+address" type endpoints from those devices.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC v2] /dev/iommu uAPI proposal

2021-08-05 Thread David Gibson
On Wed, Aug 04, 2021 at 11:07:42AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 03, 2021 at 11:58:54AM +1000, David Gibson wrote:
> > > I'd rather deduce the endpoint from a collection of devices than the
> > > other way around...
> > 
> > Which I think is confusing, and in any case doesn't cover the case of
> > one "device" with multiple endpoints.
> 
> Well they are both confusing, and I'd prefer to focus on the common
> case without extra mandatory steps. Exposing optional endpoint sharing
> information seems more in line with where everything is going than
> making endpoint sharing a first class object.
> 
> AFAIK a device with multiple endpoints where those endpoints are
> shared with other devices doesn't really exist/or is useful? Eg PASID
> has multiple RIDs by they are not shared.

No, I can't think of a (non-contrived) example where a device would
have *both* multiple endpoints and those endpoints are shared amongst
multiple devices.  I can easily think of examples where a device has
multiple (non shared) endpoints and where multiple devices share a
single endpoint.

The point is that making endpoints explicit separates the various
options here from the logic of the IOMMU layer itself.  New device
types with new possibilities here means new interfaces *on those
devices*, but not new interfaces on /dev/iommu.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

RE: [RFC v2] /dev/iommu uAPI proposal

2021-08-05 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, August 5, 2021 7:27 PM
> 
> On Wed, Aug 04, 2021 at 10:59:21PM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Wednesday, August 4, 2021 10:05 PM
> > >
> > > On Mon, Aug 02, 2021 at 02:49:44AM +, Tian, Kevin wrote:
> > >
> > > > Can you elaborate? IMO the user only cares about the label (device
> cookie
> > > > plus optional vPASID) which is generated by itself when doing the
> attaching
> > > > call, and expects this virtual label being used in various spots
> (invalidation,
> > > > page fault, etc.). How the system labels the traffic (the physical RID 
> > > > or
> RID+
> > > > PASID) should be completely invisible to userspace.
> > >
> > > I don't think that is true if the vIOMMU driver is also emulating
> > > PASID. Presumably the same is true for other PASID-like schemes.
> > >
> >
> > I'm getting even more confused with this comment. Isn't it the
> > consensus from day one that physical PASID should not be exposed
> > to userspace as doing so breaks live migration?
> 
> Uh, no?
> 
> > with PASID emulation vIOMMU only cares about vPASID instead of
> > pPASID, and the uAPI only requires user to register vPASID instead
> > of reporting pPASID back to userspace...
> 
> vPASID is only a feature of one device in existance, so we can't make
> vPASID mandatory.
> 

sure. my point is just that if vPASID is being emulated there is no need
of exposing pPASID to user space. Can you give a concrete example
where pPASID must be exposed and how the user wants to use this
information? 

Thanks
Kevin

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-08-05 Thread Jason Gunthorpe via iommu
On Wed, Aug 04, 2021 at 10:59:21PM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, August 4, 2021 10:05 PM
> > 
> > On Mon, Aug 02, 2021 at 02:49:44AM +, Tian, Kevin wrote:
> > 
> > > Can you elaborate? IMO the user only cares about the label (device cookie
> > > plus optional vPASID) which is generated by itself when doing the 
> > > attaching
> > > call, and expects this virtual label being used in various spots 
> > > (invalidation,
> > > page fault, etc.). How the system labels the traffic (the physical RID or 
> > > RID+
> > > PASID) should be completely invisible to userspace.
> > 
> > I don't think that is true if the vIOMMU driver is also emulating
> > PASID. Presumably the same is true for other PASID-like schemes.
> > 
> 
> I'm getting even more confused with this comment. Isn't it the
> consensus from day one that physical PASID should not be exposed
> to userspace as doing so breaks live migration? 

Uh, no?

> with PASID emulation vIOMMU only cares about vPASID instead of
> pPASID, and the uAPI only requires user to register vPASID instead
> of reporting pPASID back to userspace...

vPASID is only a feature of one device in existance, so we can't make
vPASID mandatory.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC v2] /dev/iommu uAPI proposal

2021-08-04 Thread Tian, Kevin
> From: Eric Auger 
> Sent: Wednesday, August 4, 2021 11:59 PM
>
[...] 
> > 1.2. Attach Device to I/O address space
> > +++
> >
> > Device attach/bind is initiated through passthrough framework uAPI.
> >
> > Device attaching is allowed only after a device is successfully bound to
> > the IOMMU fd. User should provide a device cookie when binding the
> > device through VFIO uAPI. This cookie is used when the user queries
> > device capability/format, issues per-device iotlb invalidation and
> > receives per-device I/O page fault data via IOMMU fd.
> >
> > Successful binding puts the device into a security context which isolates
> > its DMA from the rest system. VFIO should not allow user to access the
> s/from the rest system/from the rest of the system
> > device before binding is completed. Similarly, VFIO should prevent the
> > user from unbinding the device before user access is withdrawn.
> With Intel scalable IOV, I understand you could assign an RID/PASID to
> one VM and another one to another VM (which is not the case for ARM). Is
> it a targetted use case?How would it be handled? Is it related to the
> sub-groups evoked hereafter?

Not related to sub-group. Each mdev is bound to the IOMMU fd respectively
with the defPASID which represents the mdev.

> 
> Actually all devices bound to an IOMMU fd should have the same parent
> I/O address space or root address space, am I correct? If so, maybe add
> this comment explicitly?

in most cases yes but it's not mandatory. multiple roots are allowed
(e.g. with vIOMMU but no nesting).

[...]
> > The device in the /dev/iommu context always refers to a physical one
> > (pdev) which is identifiable via RID. Physically each pdev can support
> > one default I/O address space (routed via RID) and optionally multiple
> > non-default I/O address spaces (via RID+PASID).
> >
> > The device in VFIO context is a logic concept, being either a physical
> > device (pdev) or mediated device (mdev or subdev). Each vfio device
> > is represented by RID+cookie in IOMMU fd. User is allowed to create
> > one default I/O address space (routed by vRID from user p.o.v) per
> > each vfio_device.
> The concept of default address space is not fully clear for me. I
> currently understand this is a
> root address space (not nesting). Is that coorect.This may need
> clarification.

w/o PASID there is only one address space (either GPA or GIOVA)
per device. This one is called default. whether it's root is orthogonal
(e.g. GIOVA could be also nested) to the device view of this space.

w/ PASID additional address spaces can be targeted by the device.
those are called non-default.

I could also rename default to RID address space and non-default to 
RID+PASID address space if doing so makes it clearer.

> > VFIO decides the routing information for this default
> > space based on device type:
> >
> > 1)  pdev, routed via RID;
> >
> > 2)  mdev/subdev with IOMMU-enforced DMA isolation, routed via
> > the parent's RID plus the PASID marking this mdev;
> >
> > 3)  a purely sw-mediated device (sw mdev), no routing required i.e. no
> > need to install the I/O page table in the IOMMU. sw mdev just uses
> > the metadata to assist its internal DMA isolation logic on top of
> > the parent's IOMMU page table;
> Maybe you should introduce this concept of SW mediated device earlier
> because it seems to special case the way the attach behaves. I am
> especially refering to
> 
> "Successful attaching activates an I/O address space in the IOMMU, if the
> device is not purely software mediated"

makes sense.

> 
> >
> > In addition, VFIO may allow user to create additional I/O address spaces
> > on a vfio_device based on the hardware capability. In such case the user
> > has its own view of the virtual routing information (vPASID) when marking
> > these non-default address spaces.
> I do not catch what does mean "marking these non default address space".

as explained above, those non-default address spaces are identified/routed
via PASID. 

> >
> > 1.3. Group isolation
> > 
[...]
> >
> > 1)  A successful binding call for the first device in the group creates
> > the security context for the entire group, by:
> >
> > * Verifying group viability in a similar way as VFIO does;
> >
> > * Calling IOMMU-API to move the group into a block-dma state,
> >   which makes all devices in the group attached to an block-dma
> >   domain with an empty I/O page table;
> this block-dma state/domain would deserve to be better defined (I know
> you already evoked it in 1.1 with the dma mapping protocol though)
> activates an empty I/O page table in the IOMMU (if the device is not
> purely SW mediated)?

sure. some explanations are scattered in following paragraph, but I
can consider to further clarify it.

> How does that relate to the default address space? Is it the same?

different. this block-dma domain doesn't hold any valid mappin

RE: [RFC v2] /dev/iommu uAPI proposal

2021-08-04 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, August 4, 2021 10:05 PM
> 
> On Mon, Aug 02, 2021 at 02:49:44AM +, Tian, Kevin wrote:
> 
> > Can you elaborate? IMO the user only cares about the label (device cookie
> > plus optional vPASID) which is generated by itself when doing the attaching
> > call, and expects this virtual label being used in various spots 
> > (invalidation,
> > page fault, etc.). How the system labels the traffic (the physical RID or 
> > RID+
> > PASID) should be completely invisible to userspace.
> 
> I don't think that is true if the vIOMMU driver is also emulating
> PASID. Presumably the same is true for other PASID-like schemes.
> 

I'm getting even more confused with this comment. Isn't it the
consensus from day one that physical PASID should not be exposed
to userspace as doing so breaks live migration? with PASID emulation
vIOMMU only cares about vPASID instead of pPASID, and the uAPI
only requires user to register vPASID instead of reporting pPASID
back to userspace...

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-08-04 Thread Eric Auger
Hi Kevin,

Few comments/questions below.

On 7/9/21 9:48 AM, Tian, Kevin wrote:
> /dev/iommu provides an unified interface for managing I/O page tables for 
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
> etc.) are expected to use this interface instead of creating their own logic 
> to 
> isolate untrusted device DMAs initiated by userspace. 
>
> This proposal describes the uAPI of /dev/iommu and also sample sequences 
> with VFIO as example in typical usages. The driver-facing kernel API provided 
> by the iommu layer is still TBD, which can be discussed after consensus is 
> made on this uAPI.
>
> It's based on a lengthy discussion starting from here:
>   
> https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/ 
>
> v1 can be found here:
>   
> https://lore.kernel.org/linux-iommu/ph0pr12mb54811863b392c644e5365446dc...@ph0pr12mb5481.namprd12.prod.outlook.com/T/
>
> This doc is also tracked on github, though it's not very useful for v1->v2 
> given dramatic refactoring:
>   https://github.com/luxis1999/dev_iommu_uapi 
>
> Changelog (v1->v2):
> - Rename /dev/ioasid to /dev/iommu (Jason);
> - Add a section for device-centric vs. group-centric design (many);
> - Add a section for handling no-snoop DMA (Jason/Alex/Paolo);
> - Add definition of user/kernel/shared I/O page tables (Baolu/Jason);
> - Allow one device bound to multiple iommu fd's (Jason);
> - No need to track user I/O page tables in kernel on ARM/AMD (Jean/Jason);
> - Add a device cookie for iotlb invalidation and fault handling (Jean/Jason);
> - Add capability/format query interface per device cookie (Jason);
> - Specify format/attribute when creating an IOASID, leading to several v1
>   uAPI commands removed (Jason);
> - Explain the value of software nesting (Jean);
> - Replace IOASID_REGISTER_VIRTUAL_MEMORY with software nesting (David/Jason);
> - Cover software mdev usage (Jason);
> - No restriction on map/unmap vs. bind/invalidate (Jason/David);
> - Report permitted IOVA range instead of reserved range (David);
> - Refine the sample structures and helper functions (Jason);
> - Add definition of default and non-default I/O address spaces;
> - Expand and clarify the design for PASID virtualization;
> - and lots of subtle refinement according to above changes;
>
> TOC
> 
> 1. Terminologies and Concepts
> 1.1. Manage I/O address space
> 1.2. Attach device to I/O address space
> 1.3. Group isolation
> 1.4. PASID virtualization
> 1.4.1. Devices which don't support DMWr
> 1.4.2. Devices which support DMWr
> 1.4.3. Mix different types together
> 1.4.4. User sequence
> 1.5. No-snoop DMA
> 2. uAPI Proposal
> 2.1. /dev/iommu uAPI
> 2.2. /dev/vfio device uAPI
> 2.3. /dev/kvm uAPI
> 3. Sample Structures and Helper Functions
> 4. Use Cases and Flows
> 4.1. A simple example
> 4.2. Multiple IOASIDs (no nesting)
> 4.3. IOASID nesting (software)
> 4.4. IOASID nesting (hardware)
> 4.5. Guest SVA (vSVA)
> 4.6. I/O page fault
> 
>
> 1. Terminologies and Concepts
> -
>
> IOMMU fd is the container holding multiple I/O address spaces. User 
> manages those address spaces through fd operations. Multiple fd's are 
> allowed per process, but with this proposal one fd should be sufficient for 
> all intended usages.
>
> IOASID is the fd-local software handle representing an I/O address space. 
> Each IOASID is associated with a single I/O page table. IOASIDs can be 
> nested together, implying the output address from one I/O page table 
> (represented by child IOASID) must be further translated by another I/O 
> page table (represented by parent IOASID).
>
> An I/O address space takes effect only after it is attached by a device. 
> One device is allowed to attach to multiple I/O address spaces. One I/O 
> address space can be attached by multiple devices.
>
> Device must be bound to an IOMMU fd before attach operation can be
> conducted. Though not necessary, user could bind one device to multiple
> IOMMU FD's. But no cross-FD IOASID nesting is allowed.
>
> The format of an I/O page table must be compatible to the attached 
> devices (or more specifically to the IOMMU which serves the DMA from
> the attached devices). User is responsible for specifying the format
> when allocating an IOASID, according to one or multiple devices which
> will be attached right after. Attaching a device to an IOASID with 
> incompatible format is simply rejected.
>
> Relationship between IOMMU fd, VFIO fd and KVM fd:
>
> -   IOMMU fd provides uAPI for managing IOASIDs and I/O page tables. 
> It also provides an unified capability/format reporting interface for
> each bound device. 
>
> -   VFIO fd provides uAPI for device binding and attaching. In this proposal 
> VFIO is used as the example of device passthrough frameworks. The
> routing information that identifies an I/O addre

Re: [RFC v2] /dev/iommu uAPI proposal

2021-08-04 Thread Jason Gunthorpe via iommu
On Tue, Aug 03, 2021 at 11:58:54AM +1000, David Gibson wrote:
> > I'd rather deduce the endpoint from a collection of devices than the
> > other way around...
> 
> Which I think is confusing, and in any case doesn't cover the case of
> one "device" with multiple endpoints.

Well they are both confusing, and I'd prefer to focus on the common
case without extra mandatory steps. Exposing optional endpoint sharing
information seems more in line with where everything is going than
making endpoint sharing a first class object.

AFAIK a device with multiple endpoints where those endpoints are
shared with other devices doesn't really exist/or is useful? Eg PASID
has multiple RIDs by they are not shared.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-08-04 Thread Jason Gunthorpe via iommu
On Mon, Aug 02, 2021 at 02:49:44AM +, Tian, Kevin wrote:

> Can you elaborate? IMO the user only cares about the label (device cookie 
> plus optional vPASID) which is generated by itself when doing the attaching
> call, and expects this virtual label being used in various spots 
> (invalidation,
> page fault, etc.). How the system labels the traffic (the physical RID or RID+
> PASID) should be completely invisible to userspace.

I don't think that is true if the vIOMMU driver is also emulating
PASID. Presumably the same is true for other PASID-like schemes.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC v2] /dev/iommu uAPI proposal

2021-08-02 Thread Tian, Kevin
> From: David Gibson 
> Sent: Tuesday, August 3, 2021 9:51 AM
> 
> On Wed, Jul 28, 2021 at 04:04:24AM +, Tian, Kevin wrote:
> > Hi, David,
> >
> > > From: David Gibson 
> > > Sent: Monday, July 26, 2021 12:51 PM
> > >
> > > On Fri, Jul 09, 2021 at 07:48:44AM +, Tian, Kevin wrote:
> > > > /dev/iommu provides an unified interface for managing I/O page tables
> for
> > > > devices assigned to userspace. Device passthrough frameworks (VFIO,
> > > vDPA,
> > > > etc.) are expected to use this interface instead of creating their own
> logic to
> > > > isolate untrusted device DMAs initiated by userspace.
> > > >
> > > > This proposal describes the uAPI of /dev/iommu and also sample
> > > sequences
> > > > with VFIO as example in typical usages. The driver-facing kernel API
> > > provided
> > > > by the iommu layer is still TBD, which can be discussed after consensus
> is
> > > > made on this uAPI.
> > > >
> > > > It's based on a lengthy discussion starting from here:
> > > > https://lore.kernel.org/linux-
> > > iommu/20210330132830.go2356...@nvidia.com/
> > > >
> > > > v1 can be found here:
> > > > https://lore.kernel.org/linux-
> > >
> iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.n
> > > amprd12.prod.outlook.com/T/
> > > >
> > > > This doc is also tracked on github, though it's not very useful for 
> > > > v1->v2
> > > > given dramatic refactoring:
> > > > https://github.com/luxis1999/dev_iommu_uapi
> > >
> > > Thanks for all your work on this, Kevin.  Apart from the actual
> > > semantic improvements, I'm finding v2 significantly easier to read and
> > > understand than v1.
> > >
> > > [snip]
> > > > 1.2. Attach Device to I/O address space
> > > > +++
> > > >
> > > > Device attach/bind is initiated through passthrough framework uAPI.
> > > >
> > > > Device attaching is allowed only after a device is successfully bound to
> > > > the IOMMU fd. User should provide a device cookie when binding the
> > > > device through VFIO uAPI. This cookie is used when the user queries
> > > > device capability/format, issues per-device iotlb invalidation and
> > > > receives per-device I/O page fault data via IOMMU fd.
> > > >
> > > > Successful binding puts the device into a security context which 
> > > > isolates
> > > > its DMA from the rest system. VFIO should not allow user to access the
> > > > device before binding is completed. Similarly, VFIO should prevent the
> > > > user from unbinding the device before user access is withdrawn.
> > > >
> > > > When a device is in an iommu group which contains multiple devices,
> > > > all devices within the group must enter/exit the security context
> > > > together. Please check {1.3} for more info about group isolation via
> > > > this device-centric design.
> > > >
> > > > Successful attaching activates an I/O address space in the IOMMU,
> > > > if the device is not purely software mediated. VFIO must provide device
> > > > specific routing information for where to install the I/O page table in
> > > > the IOMMU for this device. VFIO must also guarantee that the attached
> > > > device is configured to compose DMAs with the routing information
> that
> > > > is provided in the attaching call. When handling DMA requests, IOMMU
> > > > identifies the target I/O address space according to the routing
> > > > information carried in the request. Misconfiguration breaks DMA
> > > > isolation thus could lead to severe security vulnerability.
> > > >
> > > > Routing information is per-device and bus specific. For PCI, it is
> > > > Requester ID (RID) identifying the device plus optional Process Address
> > > > Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub-
> Stream
> > > > ID (SSID). PASID or SSID is used when multiple I/O address spaces are
> > > > enabled on a single device. For simplicity and continuity reason the
> > > > following context uses RID+PASID though SID+SSID may sound a clearer
> > > > naming from device p.o.v. We can decide the actual naming when
> coding.
> > > >
> > > > Because one I/O address space can be attached by multiple devices,
> > > > per-device routing information (plus device cookie) is tracked under
> > > > each IOASID and is used respectively when activating the I/O address
> > > > space in the IOMMU for each attached device.
> > > >
> > > > The device in the /dev/iommu context always refers to a physical one
> > > > (pdev) which is identifiable via RID. Physically each pdev can support
> > > > one default I/O address space (routed via RID) and optionally multiple
> > > > non-default I/O address spaces (via RID+PASID).
> > > >
> > > > The device in VFIO context is a logic concept, being either a physical
> > > > device (pdev) or mediated device (mdev or subdev). Each vfio device
> > > > is represented by RID+cookie in IOMMU fd. User is allowed to create
> > > > one default I/O address space (routed by vRID from user p.o.v) per
> > > > each vfio_device. 

Re: [RFC v2] /dev/iommu uAPI proposal

2021-08-02 Thread David Gibson
On Fri, Jul 30, 2021 at 11:51:23AM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 26, 2021 at 02:50:48PM +1000, David Gibson wrote:
> 
> > That said, I'm still finding the various ways a device can attach to
> > an ioasid pretty confusing.  Here are some thoughts on some extra
> > concepts that might make it easier to handle [note, I haven't thought
> > this all the way through so far, so there might be fatal problems with
> > this approach].
> 
> I think you've summarized how I've been viewing this problem. All the
> concepts you pointed to should show through in the various APIs at the
> end, one way or another.
> 
> How much we need to expose to userspace, I don't know.
> 
> Does userspace need to care how the system labels traffic between DMA
> endpoint and the IOASID? At some point maybe yes since stuff like
> PASID does leak out in various spots

Yeah, I'm not sure.  I think it probably doesn't for the "main path"
of the API, though we might want to expose that for debugging and some
edge cases.

We *should* however be exposing the address type for each IOAS, since
that affects how your MAP operations will work, as well as what
endpoints are compatible with the IOAS.

> > /dev/iommu would work entirely (or nearly so) in terms of endpoint
> > handles, not device handles.  Endpoints are what get bound to an IOAS,
> > and endpoints are what get the user chosen endpoint cookie.
> 
> While an accurate modeling of groups, it feels like an
> overcomplication at this point in history where new HW largely doesn't
> need it.

So.. first, is that really true across the board?  I expect it's true
of high end server hardware, but for consumer level and embedded
hardware as well?  Then there's virtual hardware - I could point to
several things still routinely using emulated PCIe to PCI bridges in
qemu.

Second, we can't just ignore older hardware.

> The user interface VFIO and others presents is device
> centric, inserting a new endpoint object is going going back to some
> kind of group centric view of the world.

Well, kind of, yeah, because I still think the concept has value.
Part of the trouble is that "device" is pretty ambiguous.  "Device" in
the sense of PCI address for register interface may not be the same as
"device" in terms of DMA RID may not be the same as as "device" in
terms of Linux struct device 


terms of PCI register interface is not the same as "device"
in terms of RID / DMA identifiability is not the same "device" in
terms of what.

> I'd rather deduce the endpoint from a collection of devices than the
> other way around...

Which I think is confusing, and in any case doesn't cover the case of
one "device" with multiple endpoints.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC v2] /dev/iommu uAPI proposal

2021-08-02 Thread David Gibson
On Wed, Jul 28, 2021 at 04:04:24AM +, Tian, Kevin wrote:
> Hi, David,
> 
> > From: David Gibson 
> > Sent: Monday, July 26, 2021 12:51 PM
> > 
> > On Fri, Jul 09, 2021 at 07:48:44AM +, Tian, Kevin wrote:
> > > /dev/iommu provides an unified interface for managing I/O page tables for
> > > devices assigned to userspace. Device passthrough frameworks (VFIO,
> > vDPA,
> > > etc.) are expected to use this interface instead of creating their own 
> > > logic to
> > > isolate untrusted device DMAs initiated by userspace.
> > >
> > > This proposal describes the uAPI of /dev/iommu and also sample
> > sequences
> > > with VFIO as example in typical usages. The driver-facing kernel API
> > provided
> > > by the iommu layer is still TBD, which can be discussed after consensus is
> > > made on this uAPI.
> > >
> > > It's based on a lengthy discussion starting from here:
> > >   https://lore.kernel.org/linux-
> > iommu/20210330132830.go2356...@nvidia.com/
> > >
> > > v1 can be found here:
> > >   https://lore.kernel.org/linux-
> > iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.n
> > amprd12.prod.outlook.com/T/
> > >
> > > This doc is also tracked on github, though it's not very useful for v1->v2
> > > given dramatic refactoring:
> > >   https://github.com/luxis1999/dev_iommu_uapi
> > 
> > Thanks for all your work on this, Kevin.  Apart from the actual
> > semantic improvements, I'm finding v2 significantly easier to read and
> > understand than v1.
> > 
> > [snip]
> > > 1.2. Attach Device to I/O address space
> > > +++
> > >
> > > Device attach/bind is initiated through passthrough framework uAPI.
> > >
> > > Device attaching is allowed only after a device is successfully bound to
> > > the IOMMU fd. User should provide a device cookie when binding the
> > > device through VFIO uAPI. This cookie is used when the user queries
> > > device capability/format, issues per-device iotlb invalidation and
> > > receives per-device I/O page fault data via IOMMU fd.
> > >
> > > Successful binding puts the device into a security context which isolates
> > > its DMA from the rest system. VFIO should not allow user to access the
> > > device before binding is completed. Similarly, VFIO should prevent the
> > > user from unbinding the device before user access is withdrawn.
> > >
> > > When a device is in an iommu group which contains multiple devices,
> > > all devices within the group must enter/exit the security context
> > > together. Please check {1.3} for more info about group isolation via
> > > this device-centric design.
> > >
> > > Successful attaching activates an I/O address space in the IOMMU,
> > > if the device is not purely software mediated. VFIO must provide device
> > > specific routing information for where to install the I/O page table in
> > > the IOMMU for this device. VFIO must also guarantee that the attached
> > > device is configured to compose DMAs with the routing information that
> > > is provided in the attaching call. When handling DMA requests, IOMMU
> > > identifies the target I/O address space according to the routing
> > > information carried in the request. Misconfiguration breaks DMA
> > > isolation thus could lead to severe security vulnerability.
> > >
> > > Routing information is per-device and bus specific. For PCI, it is
> > > Requester ID (RID) identifying the device plus optional Process Address
> > > Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub-Stream
> > > ID (SSID). PASID or SSID is used when multiple I/O address spaces are
> > > enabled on a single device. For simplicity and continuity reason the
> > > following context uses RID+PASID though SID+SSID may sound a clearer
> > > naming from device p.o.v. We can decide the actual naming when coding.
> > >
> > > Because one I/O address space can be attached by multiple devices,
> > > per-device routing information (plus device cookie) is tracked under
> > > each IOASID and is used respectively when activating the I/O address
> > > space in the IOMMU for each attached device.
> > >
> > > The device in the /dev/iommu context always refers to a physical one
> > > (pdev) which is identifiable via RID. Physically each pdev can support
> > > one default I/O address space (routed via RID) and optionally multiple
> > > non-default I/O address spaces (via RID+PASID).
> > >
> > > The device in VFIO context is a logic concept, being either a physical
> > > device (pdev) or mediated device (mdev or subdev). Each vfio device
> > > is represented by RID+cookie in IOMMU fd. User is allowed to create
> > > one default I/O address space (routed by vRID from user p.o.v) per
> > > each vfio_device. VFIO decides the routing information for this default
> > > space based on device type:
> > >
> > > 1)  pdev, routed via RID;
> > >
> > > 2)  mdev/subdev with IOMMU-enforced DMA isolation, routed via
> > > the parent's RID plus the PASID marking this mdev;
> > >
> > > 3)  a purely sw-

RE: [RFC v2] /dev/iommu uAPI proposal

2021-08-01 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, July 30, 2021 10:51 PM
> 
> On Mon, Jul 26, 2021 at 02:50:48PM +1000, David Gibson wrote:
> 
> > That said, I'm still finding the various ways a device can attach to
> > an ioasid pretty confusing.  Here are some thoughts on some extra
> > concepts that might make it easier to handle [note, I haven't thought
> > this all the way through so far, so there might be fatal problems with
> > this approach].
> 
> I think you've summarized how I've been viewing this problem. All the
> concepts you pointed to should show through in the various APIs at the
> end, one way or another.

I still didn't get the value of making endpoint explicit in /dev/iommu uAPI. 
>From IOMMU p.o.v it only cares how to route incoming DMA traffic to a
specific I/O page table, according to RID or RID+PASID info carried in DMA 
packets. This has been covered by this proposal. Which DMA endpoint in 
the source device actually triggers the traffic is not a matter for 
/dev/iommu...

> 
> How much we need to expose to userspace, I don't know.
> 
> Does userspace need to care how the system labels traffic between DMA
> endpoint and the IOASID? At some point maybe yes since stuff like
> PASID does leak out in various spots
> 

Can you elaborate? IMO the user only cares about the label (device cookie 
plus optional vPASID) which is generated by itself when doing the attaching
call, and expects this virtual label being used in various spots (invalidation,
page fault, etc.). How the system labels the traffic (the physical RID or RID+
PASID) should be completely invisible to userspace.

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-30 Thread Jason Gunthorpe via iommu
On Mon, Jul 26, 2021 at 02:50:48PM +1000, David Gibson wrote:

> That said, I'm still finding the various ways a device can attach to
> an ioasid pretty confusing.  Here are some thoughts on some extra
> concepts that might make it easier to handle [note, I haven't thought
> this all the way through so far, so there might be fatal problems with
> this approach].

I think you've summarized how I've been viewing this problem. All the
concepts you pointed to should show through in the various APIs at the
end, one way or another.

How much we need to expose to userspace, I don't know.

Does userspace need to care how the system labels traffic between DMA
endpoint and the IOASID? At some point maybe yes since stuff like
PASID does leak out in various spots

> /dev/iommu would work entirely (or nearly so) in terms of endpoint
> handles, not device handles.  Endpoints are what get bound to an IOAS,
> and endpoints are what get the user chosen endpoint cookie.

While an accurate modeling of groups, it feels like an
overcomplication at this point in history where new HW largely doesn't
need it. The user interface VFIO and others presents is device
centric, inserting a new endpoint object is going going back to some
kind of group centric view of the world.

I'd rather deduce the endpoint from a collection of devices than the
other way around...

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC v2] /dev/iommu uAPI proposal

2021-07-27 Thread Tian, Kevin
> From: Jean-Philippe Brucker 
> Sent: Monday, July 26, 2021 4:15 PM
> 
> Hi Kevin,
> 
> On Fri, Jul 09, 2021 at 07:48:44AM +, Tian, Kevin wrote:
> > /dev/iommu provides an unified interface for managing I/O page tables for
> > devices assigned to userspace. Device passthrough frameworks (VFIO,
> vDPA,
> > etc.) are expected to use this interface instead of creating their own 
> > logic to
> > isolate untrusted device DMAs initiated by userspace.
> >
> > This proposal describes the uAPI of /dev/iommu and also sample
> sequences
> > with VFIO as example in typical usages. The driver-facing kernel API
> provided
> > by the iommu layer is still TBD, which can be discussed after consensus is
> > made on this uAPI.
> 
> The document looks good to me, I don't have other concerns at the moment
> 

Thanks for your review.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC v2] /dev/iommu uAPI proposal

2021-07-27 Thread Tian, Kevin
Hi, David,

> From: David Gibson 
> Sent: Monday, July 26, 2021 12:51 PM
> 
> On Fri, Jul 09, 2021 at 07:48:44AM +, Tian, Kevin wrote:
> > /dev/iommu provides an unified interface for managing I/O page tables for
> > devices assigned to userspace. Device passthrough frameworks (VFIO,
> vDPA,
> > etc.) are expected to use this interface instead of creating their own 
> > logic to
> > isolate untrusted device DMAs initiated by userspace.
> >
> > This proposal describes the uAPI of /dev/iommu and also sample
> sequences
> > with VFIO as example in typical usages. The driver-facing kernel API
> provided
> > by the iommu layer is still TBD, which can be discussed after consensus is
> > made on this uAPI.
> >
> > It's based on a lengthy discussion starting from here:
> > https://lore.kernel.org/linux-
> iommu/20210330132830.go2356...@nvidia.com/
> >
> > v1 can be found here:
> > https://lore.kernel.org/linux-
> iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.n
> amprd12.prod.outlook.com/T/
> >
> > This doc is also tracked on github, though it's not very useful for v1->v2
> > given dramatic refactoring:
> > https://github.com/luxis1999/dev_iommu_uapi
> 
> Thanks for all your work on this, Kevin.  Apart from the actual
> semantic improvements, I'm finding v2 significantly easier to read and
> understand than v1.
> 
> [snip]
> > 1.2. Attach Device to I/O address space
> > +++
> >
> > Device attach/bind is initiated through passthrough framework uAPI.
> >
> > Device attaching is allowed only after a device is successfully bound to
> > the IOMMU fd. User should provide a device cookie when binding the
> > device through VFIO uAPI. This cookie is used when the user queries
> > device capability/format, issues per-device iotlb invalidation and
> > receives per-device I/O page fault data via IOMMU fd.
> >
> > Successful binding puts the device into a security context which isolates
> > its DMA from the rest system. VFIO should not allow user to access the
> > device before binding is completed. Similarly, VFIO should prevent the
> > user from unbinding the device before user access is withdrawn.
> >
> > When a device is in an iommu group which contains multiple devices,
> > all devices within the group must enter/exit the security context
> > together. Please check {1.3} for more info about group isolation via
> > this device-centric design.
> >
> > Successful attaching activates an I/O address space in the IOMMU,
> > if the device is not purely software mediated. VFIO must provide device
> > specific routing information for where to install the I/O page table in
> > the IOMMU for this device. VFIO must also guarantee that the attached
> > device is configured to compose DMAs with the routing information that
> > is provided in the attaching call. When handling DMA requests, IOMMU
> > identifies the target I/O address space according to the routing
> > information carried in the request. Misconfiguration breaks DMA
> > isolation thus could lead to severe security vulnerability.
> >
> > Routing information is per-device and bus specific. For PCI, it is
> > Requester ID (RID) identifying the device plus optional Process Address
> > Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub-Stream
> > ID (SSID). PASID or SSID is used when multiple I/O address spaces are
> > enabled on a single device. For simplicity and continuity reason the
> > following context uses RID+PASID though SID+SSID may sound a clearer
> > naming from device p.o.v. We can decide the actual naming when coding.
> >
> > Because one I/O address space can be attached by multiple devices,
> > per-device routing information (plus device cookie) is tracked under
> > each IOASID and is used respectively when activating the I/O address
> > space in the IOMMU for each attached device.
> >
> > The device in the /dev/iommu context always refers to a physical one
> > (pdev) which is identifiable via RID. Physically each pdev can support
> > one default I/O address space (routed via RID) and optionally multiple
> > non-default I/O address spaces (via RID+PASID).
> >
> > The device in VFIO context is a logic concept, being either a physical
> > device (pdev) or mediated device (mdev or subdev). Each vfio device
> > is represented by RID+cookie in IOMMU fd. User is allowed to create
> > one default I/O address space (routed by vRID from user p.o.v) per
> > each vfio_device. VFIO decides the routing information for this default
> > space based on device type:
> >
> > 1)  pdev, routed via RID;
> >
> > 2)  mdev/subdev with IOMMU-enforced DMA isolation, routed via
> > the parent's RID plus the PASID marking this mdev;
> >
> > 3)  a purely sw-mediated device (sw mdev), no routing required i.e. no
> > need to install the I/O page table in the IOMMU. sw mdev just uses
> > the metadata to assist its internal DMA isolation logic on top of
> > the parent's IOMMU page table;
> >
> > In a

Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-26 Thread Jean-Philippe Brucker
Hi Kevin,

On Fri, Jul 09, 2021 at 07:48:44AM +, Tian, Kevin wrote:
> /dev/iommu provides an unified interface for managing I/O page tables for 
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
> etc.) are expected to use this interface instead of creating their own logic 
> to 
> isolate untrusted device DMAs initiated by userspace. 
> 
> This proposal describes the uAPI of /dev/iommu and also sample sequences 
> with VFIO as example in typical usages. The driver-facing kernel API provided 
> by the iommu layer is still TBD, which can be discussed after consensus is 
> made on this uAPI.

The document looks good to me, I don't have other concerns at the moment

Thanks,
Jean

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-25 Thread David Gibson
On Fri, Jul 09, 2021 at 07:48:44AM +, Tian, Kevin wrote:
> /dev/iommu provides an unified interface for managing I/O page tables for 
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
> etc.) are expected to use this interface instead of creating their own logic 
> to 
> isolate untrusted device DMAs initiated by userspace. 
> 
> This proposal describes the uAPI of /dev/iommu and also sample sequences 
> with VFIO as example in typical usages. The driver-facing kernel API provided 
> by the iommu layer is still TBD, which can be discussed after consensus is 
> made on this uAPI.
> 
> It's based on a lengthy discussion starting from here:
>   
> https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/ 
> 
> v1 can be found here:
>   
> https://lore.kernel.org/linux-iommu/ph0pr12mb54811863b392c644e5365446dc...@ph0pr12mb5481.namprd12.prod.outlook.com/T/
> 
> This doc is also tracked on github, though it's not very useful for v1->v2 
> given dramatic refactoring:
>   https://github.com/luxis1999/dev_iommu_uapi

Thanks for all your work on this, Kevin.  Apart from the actual
semantic improvements, I'm finding v2 significantly easier to read and
understand than v1.

[snip]
> 1.2. Attach Device to I/O address space
> +++
> 
> Device attach/bind is initiated through passthrough framework uAPI.
> 
> Device attaching is allowed only after a device is successfully bound to
> the IOMMU fd. User should provide a device cookie when binding the 
> device through VFIO uAPI. This cookie is used when the user queries 
> device capability/format, issues per-device iotlb invalidation and 
> receives per-device I/O page fault data via IOMMU fd.
> 
> Successful binding puts the device into a security context which isolates 
> its DMA from the rest system. VFIO should not allow user to access the 
> device before binding is completed. Similarly, VFIO should prevent the 
> user from unbinding the device before user access is withdrawn.
> 
> When a device is in an iommu group which contains multiple devices,
> all devices within the group must enter/exit the security context
> together. Please check {1.3} for more info about group isolation via
> this device-centric design.
> 
> Successful attaching activates an I/O address space in the IOMMU,
> if the device is not purely software mediated. VFIO must provide device
> specific routing information for where to install the I/O page table in 
> the IOMMU for this device. VFIO must also guarantee that the attached 
> device is configured to compose DMAs with the routing information that 
> is provided in the attaching call. When handling DMA requests, IOMMU 
> identifies the target I/O address space according to the routing 
> information carried in the request. Misconfiguration breaks DMA
> isolation thus could lead to severe security vulnerability.
> 
> Routing information is per-device and bus specific. For PCI, it is 
> Requester ID (RID) identifying the device plus optional Process Address 
> Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub-Stream 
> ID (SSID). PASID or SSID is used when multiple I/O address spaces are 
> enabled on a single device. For simplicity and continuity reason the 
> following context uses RID+PASID though SID+SSID may sound a clearer 
> naming from device p.o.v. We can decide the actual naming when coding.
> 
> Because one I/O address space can be attached by multiple devices, 
> per-device routing information (plus device cookie) is tracked under 
> each IOASID and is used respectively when activating the I/O address 
> space in the IOMMU for each attached device.
> 
> The device in the /dev/iommu context always refers to a physical one 
> (pdev) which is identifiable via RID. Physically each pdev can support 
> one default I/O address space (routed via RID) and optionally multiple 
> non-default I/O address spaces (via RID+PASID).
> 
> The device in VFIO context is a logic concept, being either a physical
> device (pdev) or mediated device (mdev or subdev). Each vfio device
> is represented by RID+cookie in IOMMU fd. User is allowed to create 
> one default I/O address space (routed by vRID from user p.o.v) per 
> each vfio_device. VFIO decides the routing information for this default
> space based on device type:
> 
> 1)  pdev, routed via RID;
> 
> 2)  mdev/subdev with IOMMU-enforced DMA isolation, routed via 
> the parent's RID plus the PASID marking this mdev;
> 
> 3)  a purely sw-mediated device (sw mdev), no routing required i.e. no
> need to install the I/O page table in the IOMMU. sw mdev just uses 
> the metadata to assist its internal DMA isolation logic on top of 
> the parent's IOMMU page table;
> 
> In addition, VFIO may allow user to create additional I/O address spaces
> on a vfio_device based on the hardware capability. In such case the user 
> has its own view of the virtual routing information (vPASID) when 

Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-22 Thread Jason Gunthorpe via iommu
On Wed, Jul 21, 2021 at 02:13:23AM +, Tian, Kevin wrote:
> > From: Shenming Lu
> > Sent: Friday, July 16, 2021 8:20 PM
> > 
> > On 2021/7/16 9:20, Tian, Kevin wrote:
> >  > To summarize, for vIOMMU we can work with the spec owner to
> > > define a proper interface to feedback such restriction into the guest
> > > if necessary. For the kernel part, it's clear that IOMMU fd should
> > > disallow two devices attached to a single [RID] or [RID, PASID] slot
> > > in the first place.
> > >
> > > Then the next question is how to communicate such restriction
> > > to the userspace. It sounds like a group, but different in concept.
> > > An iommu group describes the minimal isolation boundary thus all
> > > devices in the group can be only assigned to a single user. But this
> > > case is opposite - the two mdevs (both support ENQCMD submission)
> > > with the same parent have problem when assigned to a single VM
> > > (in this case vPASID is vm-wide translated thus a same pPASID will be
> > > used cross both mdevs) while they instead work pretty well when
> > > assigned to different VMs (completely different vPASID spaces thus
> > > different pPASIDs).
> > >
> > > One thought is to have vfio device driver deal with it. In this proposal
> > > it is the vfio device driver to define the PASID virtualization policy and
> > > report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands
> > > the restriction thus could just hide the vPASID capability when the user
> > > calls GET_INFO on the 2nd mdev in above scenario. In this way the
> > > user even doesn't need to know such restriction at all and both mdevs
> > > can be assigned to a single VM w/o any problem.
> > >
> > 
> > The restriction only probably happens when two mdevs are assigned to one
> > VM,
> > how could the vfio device driver get to know this info to accurately hide
> > the vPASID capability for the 2nd mdev when VFIO_DEVICE_GET_INFO?
> > There is no
> > need to do this in other cases.
> > 
> 
> I suppose the driver can detect it via whether two mdevs are opened by a
> single process.

Just have the kernel some ID for the PASID numberspace - devices with
the same ID have to be represented as a single RID.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC v2] /dev/iommu uAPI proposal

2021-07-20 Thread Tian, Kevin
> From: Shenming Lu
> Sent: Friday, July 16, 2021 8:20 PM
> 
> On 2021/7/16 9:20, Tian, Kevin wrote:
>  > To summarize, for vIOMMU we can work with the spec owner to
> > define a proper interface to feedback such restriction into the guest
> > if necessary. For the kernel part, it's clear that IOMMU fd should
> > disallow two devices attached to a single [RID] or [RID, PASID] slot
> > in the first place.
> >
> > Then the next question is how to communicate such restriction
> > to the userspace. It sounds like a group, but different in concept.
> > An iommu group describes the minimal isolation boundary thus all
> > devices in the group can be only assigned to a single user. But this
> > case is opposite - the two mdevs (both support ENQCMD submission)
> > with the same parent have problem when assigned to a single VM
> > (in this case vPASID is vm-wide translated thus a same pPASID will be
> > used cross both mdevs) while they instead work pretty well when
> > assigned to different VMs (completely different vPASID spaces thus
> > different pPASIDs).
> >
> > One thought is to have vfio device driver deal with it. In this proposal
> > it is the vfio device driver to define the PASID virtualization policy and
> > report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands
> > the restriction thus could just hide the vPASID capability when the user
> > calls GET_INFO on the 2nd mdev in above scenario. In this way the
> > user even doesn't need to know such restriction at all and both mdevs
> > can be assigned to a single VM w/o any problem.
> >
> 
> The restriction only probably happens when two mdevs are assigned to one
> VM,
> how could the vfio device driver get to know this info to accurately hide
> the vPASID capability for the 2nd mdev when VFIO_DEVICE_GET_INFO?
> There is no
> need to do this in other cases.
> 

I suppose the driver can detect it via whether two mdevs are opened by a
single process.

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC v2] /dev/iommu uAPI proposal

2021-07-20 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Saturday, July 17, 2021 2:30 AM
> 
> On Fri, Jul 16, 2021 at 01:20:15AM +, Tian, Kevin wrote:
> 
> > One thought is to have vfio device driver deal with it. In this proposal
> > it is the vfio device driver to define the PASID virtualization policy and
> > report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands
> > the restriction thus could just hide the vPASID capability when the user
> > calls GET_INFO on the 2nd mdev in above scenario. In this way the
> > user even doesn't need to know such restriction at all and both mdevs
> > can be assigned to a single VM w/o any problem.
> 
> I think it makes more sense to expose some kind of "pasid group" to
> qemu that identifies that each PASID must be unique across the
> group. For vIOMMUs that are doing funky things with the RID This means
> a single PASID group must not be exposed as two RIDs to the guest.
> 

It's an interesting idea. Some aspects are still unclear to me now 
e.g. how to describe such restriction in a way that it's applied only
to a single user owning the group (not the case when assigned to
different users), whether it can be generalized cross subsystems
(vPASID being a vfio-managed resource), etc. Let's refine it when 
working on the actual implementation. 

> If the kernel blocks it then it can never be fixed by updating the
> vIOMMU design.
> 

But the mdev driver can choose to do so. Should we prevent it?

btw just be curious whether you have got a chance to have a full 
review on v2. I wonder when might be a good time to discuss 
the execution plan following this proposal, if no major open remains...

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-16 Thread Jason Gunthorpe via iommu
On Fri, Jul 16, 2021 at 01:20:15AM +, Tian, Kevin wrote:

> One thought is to have vfio device driver deal with it. In this proposal
> it is the vfio device driver to define the PASID virtualization policy and
> report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands
> the restriction thus could just hide the vPASID capability when the user 
> calls GET_INFO on the 2nd mdev in above scenario. In this way the 
> user even doesn't need to know such restriction at all and both mdevs
> can be assigned to a single VM w/o any problem.

I think it makes more sense to expose some kind of "pasid group" to
qemu that identifies that each PASID must be unique across the
group. For vIOMMUs that are doing funky things with the RID This means
a single PASID group must not be exposed as two RIDs to the guest.

If the kernel blocks it then it can never be fixed by updating the
vIOMMU design.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-16 Thread Shenming Lu
On 2021/7/16 9:20, Tian, Kevin wrote:
 > To summarize, for vIOMMU we can work with the spec owner to
> define a proper interface to feedback such restriction into the guest 
> if necessary. For the kernel part, it's clear that IOMMU fd should 
> disallow two devices attached to a single [RID] or [RID, PASID] slot 
> in the first place.
> 
> Then the next question is how to communicate such restriction
> to the userspace. It sounds like a group, but different in concept.
> An iommu group describes the minimal isolation boundary thus all
> devices in the group can be only assigned to a single user. But this
> case is opposite - the two mdevs (both support ENQCMD submission)
> with the same parent have problem when assigned to a single VM 
> (in this case vPASID is vm-wide translated thus a same pPASID will be 
> used cross both mdevs) while they instead work pretty well when 
> assigned to different VMs (completely different vPASID spaces thus 
> different pPASIDs).
> 
> One thought is to have vfio device driver deal with it. In this proposal
> it is the vfio device driver to define the PASID virtualization policy and
> report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands
> the restriction thus could just hide the vPASID capability when the user 
> calls GET_INFO on the 2nd mdev in above scenario. In this way the 
> user even doesn't need to know such restriction at all and both mdevs
> can be assigned to a single VM w/o any problem.
> 

The restriction only probably happens when two mdevs are assigned to one VM,
how could the vfio device driver get to know this info to accurately hide
the vPASID capability for the 2nd mdev when VFIO_DEVICE_GET_INFO? There is no
need to do this in other cases.

Thanks,
Shenming
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC v2] /dev/iommu uAPI proposal

2021-07-15 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, July 16, 2021 2:13 AM
> 
> On Thu, Jul 15, 2021 at 11:05:45AM -0700, Raj, Ashok wrote:
> > On Thu, Jul 15, 2021 at 02:53:36PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Jul 15, 2021 at 10:48:36AM -0700, Raj, Ashok wrote:
> > >
> > > > > > Do we have any isolation requirements here? its the same process.
> So if the
> > > > > > page-request it sent to guest and even if you report it for mdev1,
> after
> > > > > > the PRQ is resolved by guest, the request from mdev2 from the
> same guest
> > > > > > should simply work?
> > > > >
> > > > > I think we already talked about this and said it should not be done.
> > > >
> > > > I get the should not be done, I'm wondering where should that be
> > > > implemented?
> > >
> > > The iommu layer cannot have ambiguity. Every RID or RID,PASID slot
> > > must have only one device attached to it. Attempting to connect two
> > > devices to the same slot fails on the iommu layer.
> >
> > I guess we are talking about two different things. I was referring to SVM
> > side of things. Maybe you are referring to the mdev.
> 
> I'm talking about in the hypervisor.
> 
> As I've said already, the vIOMMU interface is the problem here. The
> guest VM should be able to know that it cannot use PASID 1 with two
> devices, like the hypervisor knows. At the very least it should be
> able to know that the PASID binding has failed and relay that failure
> back to the process.
> 
> Ideally the guest would know it should allocate another PASID for
> these cases.
> 
> But yes, if mdevs are going to be modeled with RIDs in the guest then
> with the current vIOMMU we cannot cause a single hypervisor RID to
> show up as two RIDs in the guest without breaking the vIOMMU model.
> 

To summarize, for vIOMMU we can work with the spec owner to 
define a proper interface to feedback such restriction into the guest 
if necessary. For the kernel part, it's clear that IOMMU fd should 
disallow two devices attached to a single [RID] or [RID, PASID] slot 
in the first place.

Then the next question is how to communicate such restriction
to the userspace. It sounds like a group, but different in concept.
An iommu group describes the minimal isolation boundary thus all
devices in the group can be only assigned to a single user. But this
case is opposite - the two mdevs (both support ENQCMD submission)
with the same parent have problem when assigned to a single VM 
(in this case vPASID is vm-wide translated thus a same pPASID will be 
used cross both mdevs) while they instead work pretty well when 
assigned to different VMs (completely different vPASID spaces thus 
different pPASIDs).

One thought is to have vfio device driver deal with it. In this proposal
it is the vfio device driver to define the PASID virtualization policy and
report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands
the restriction thus could just hide the vPASID capability when the user 
calls GET_INFO on the 2nd mdev in above scenario. In this way the 
user even doesn't need to know such restriction at all and both mdevs
can be assigned to a single VM w/o any problem.

Does it sound a right approach?

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-15 Thread Jason Gunthorpe via iommu
On Thu, Jul 15, 2021 at 11:05:45AM -0700, Raj, Ashok wrote:
> On Thu, Jul 15, 2021 at 02:53:36PM -0300, Jason Gunthorpe wrote:
> > On Thu, Jul 15, 2021 at 10:48:36AM -0700, Raj, Ashok wrote:
> > 
> > > > > Do we have any isolation requirements here? its the same process. So 
> > > > > if the
> > > > > page-request it sent to guest and even if you report it for mdev1, 
> > > > > after
> > > > > the PRQ is resolved by guest, the request from mdev2 from the same 
> > > > > guest
> > > > > should simply work?
> > > > 
> > > > I think we already talked about this and said it should not be done.
> > > 
> > > I get the should not be done, I'm wondering where should that be
> > > implemented?
> > 
> > The iommu layer cannot have ambiguity. Every RID or RID,PASID slot
> > must have only one device attached to it. Attempting to connect two
> > devices to the same slot fails on the iommu layer.
> 
> I guess we are talking about two different things. I was referring to SVM
> side of things. Maybe you are referring to the mdev.

I'm talking about in the hypervisor.

As I've said already, the vIOMMU interface is the problem here. The
guest VM should be able to know that it cannot use PASID 1 with two
devices, like the hypervisor knows. At the very least it should be
able to know that the PASID binding has failed and relay that failure
back to the process.

Ideally the guest would know it should allocate another PASID for
these cases.

But yes, if mdevs are going to be modeled with RIDs in the guest then
with the current vIOMMU we cannot cause a single hypervisor RID to
show up as two RIDs in the guest without breaking the vIOMMU model.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-15 Thread Raj, Ashok
On Thu, Jul 15, 2021 at 02:53:36PM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 15, 2021 at 10:48:36AM -0700, Raj, Ashok wrote:
> 
> > > > Do we have any isolation requirements here? its the same process. So if 
> > > > the
> > > > page-request it sent to guest and even if you report it for mdev1, after
> > > > the PRQ is resolved by guest, the request from mdev2 from the same guest
> > > > should simply work?
> > > 
> > > I think we already talked about this and said it should not be done.
> > 
> > I get the should not be done, I'm wondering where should that be
> > implemented?
> 
> The iommu layer cannot have ambiguity. Every RID or RID,PASID slot
> must have only one device attached to it. Attempting to connect two
> devices to the same slot fails on the iommu layer.

I guess we are talking about two different things. I was referring to SVM
side of things. Maybe you are referring to the mdev.

A single guest process should be allowed to work with 2 different
accelerators. The PASID for the process is just 1. Limiting that to just
one accelerator per process seems wrong.

Unless there is something else to prevent this, the best way seems never
expose more than 1 mdev from same pdev to the same guest. I think this is a
reasonable restriction compared to limiting a process to bind to no more
than 1 accelerator.


> 
> So the 2nd mdev will fail during IOASID binding when it tries to bind
> to the same PASID that the first mdev is already bound to.
> 
> Jason

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-15 Thread Jason Gunthorpe via iommu
On Thu, Jul 15, 2021 at 10:48:36AM -0700, Raj, Ashok wrote:

> > > Do we have any isolation requirements here? its the same process. So if 
> > > the
> > > page-request it sent to guest and even if you report it for mdev1, after
> > > the PRQ is resolved by guest, the request from mdev2 from the same guest
> > > should simply work?
> > 
> > I think we already talked about this and said it should not be done.
> 
> I get the should not be done, I'm wondering where should that be
> implemented?

The iommu layer cannot have ambiguity. Every RID or RID,PASID slot
must have only one device attached to it. Attempting to connect two
devices to the same slot fails on the iommu layer.

So the 2nd mdev will fail during IOASID binding when it tries to bind
to the same PASID that the first mdev is already bound to.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-15 Thread Raj, Ashok
On Thu, Jul 15, 2021 at 02:18:26PM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 15, 2021 at 09:21:41AM -0700, Raj, Ashok wrote:
> > On Thu, Jul 15, 2021 at 12:23:25PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Jul 15, 2021 at 06:57:57AM -0700, Raj, Ashok wrote:
> > > > On Thu, Jul 15, 2021 at 09:48:13AM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Jul 15, 2021 at 06:49:54AM +, Tian, Kevin wrote:
> > > > > 
> > > > > > No. You are right on this case. I don't think there is a way to 
> > > > > > differentiate one mdev from the other if they come from the
> > > > > > same parent and attached by the same guest process. In this
> > > > > > case the fault could be reported on either mdev (e.g. the first
> > > > > > matching one) to get it fixed in the guest.
> > > > > 
> > > > > If the IOMMU can't distinguish the two mdevs they are not isolated
> > > > > and would have to share a group. Since group sharing is not supported
> > > > > today this seems like a non-issue
> > > > 
> > > > Does this mean we have to prevent 2 mdev's from same pdev being 
> > > > assigned to
> > > > the same guest? 
> > > 
> > > No, it means that the IOMMU layer has to be able to distinguish them.
> > 
> > Ok, the guest has no control over it, as it see 2 separate pci devices and
> > thinks they are all different.
> > 
> > Only time when it can fail is during the bind operation. From guest
> > perspective a bind in vIOMMU just turns into a write to local table and a
> > invalidate will cause the host to update the real copy from the shadow.
> > 
> > There is no way to fail the bind? and Allocation of the PASID is also a
> > separate operation and has no clue how its going to be used in the guest.
> 
> You can't attach the same RID to the same PASID twice. The IOMMU code
> should prevent this.
> 
> As we've talked about several times, it seems to me the vIOMMU
> interface is misdesigned for the requirements you have. The hypervisor
> should have a role in allocating the PASID since there are invisible
> hypervisor restrictions. This is one of them.

Allocating a PASID is a separate step from binding, isn't it? In vt-d we
have a virtual command interface that can fail an allocation of PASID. But
which device its bound to is a dynamic thing that only gets at bind_mm()
right?

> 
> > Do we have any isolation requirements here? its the same process. So if the
> > page-request it sent to guest and even if you report it for mdev1, after
> > the PRQ is resolved by guest, the request from mdev2 from the same guest
> > should simply work?
> 
> I think we already talked about this and said it should not be done.

I get the should not be done, I'm wondering where should that be
implemented?
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-15 Thread Jason Gunthorpe via iommu
On Thu, Jul 15, 2021 at 09:21:41AM -0700, Raj, Ashok wrote:
> On Thu, Jul 15, 2021 at 12:23:25PM -0300, Jason Gunthorpe wrote:
> > On Thu, Jul 15, 2021 at 06:57:57AM -0700, Raj, Ashok wrote:
> > > On Thu, Jul 15, 2021 at 09:48:13AM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Jul 15, 2021 at 06:49:54AM +, Tian, Kevin wrote:
> > > > 
> > > > > No. You are right on this case. I don't think there is a way to 
> > > > > differentiate one mdev from the other if they come from the
> > > > > same parent and attached by the same guest process. In this
> > > > > case the fault could be reported on either mdev (e.g. the first
> > > > > matching one) to get it fixed in the guest.
> > > > 
> > > > If the IOMMU can't distinguish the two mdevs they are not isolated
> > > > and would have to share a group. Since group sharing is not supported
> > > > today this seems like a non-issue
> > > 
> > > Does this mean we have to prevent 2 mdev's from same pdev being assigned 
> > > to
> > > the same guest? 
> > 
> > No, it means that the IOMMU layer has to be able to distinguish them.
> 
> Ok, the guest has no control over it, as it see 2 separate pci devices and
> thinks they are all different.
> 
> Only time when it can fail is during the bind operation. From guest
> perspective a bind in vIOMMU just turns into a write to local table and a
> invalidate will cause the host to update the real copy from the shadow.
> 
> There is no way to fail the bind? and Allocation of the PASID is also a
> separate operation and has no clue how its going to be used in the guest.

You can't attach the same RID to the same PASID twice. The IOMMU code
should prevent this.

As we've talked about several times, it seems to me the vIOMMU
interface is misdesigned for the requirements you have. The hypervisor
should have a role in allocating the PASID since there are invisible
hypervisor restrictions. This is one of them.

> Do we have any isolation requirements here? its the same process. So if the
> page-request it sent to guest and even if you report it for mdev1, after
> the PRQ is resolved by guest, the request from mdev2 from the same guest
> should simply work?

I think we already talked about this and said it should not be done.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-15 Thread Raj, Ashok
On Thu, Jul 15, 2021 at 12:23:25PM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 15, 2021 at 06:57:57AM -0700, Raj, Ashok wrote:
> > On Thu, Jul 15, 2021 at 09:48:13AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Jul 15, 2021 at 06:49:54AM +, Tian, Kevin wrote:
> > > 
> > > > No. You are right on this case. I don't think there is a way to 
> > > > differentiate one mdev from the other if they come from the
> > > > same parent and attached by the same guest process. In this
> > > > case the fault could be reported on either mdev (e.g. the first
> > > > matching one) to get it fixed in the guest.
> > > 
> > > If the IOMMU can't distinguish the two mdevs they are not isolated
> > > and would have to share a group. Since group sharing is not supported
> > > today this seems like a non-issue
> > 
> > Does this mean we have to prevent 2 mdev's from same pdev being assigned to
> > the same guest? 
> 
> No, it means that the IOMMU layer has to be able to distinguish them.

Ok, the guest has no control over it, as it see 2 separate pci devices and
thinks they are all different.

Only time when it can fail is during the bind operation. From guest
perspective a bind in vIOMMU just turns into a write to local table and a
invalidate will cause the host to update the real copy from the shadow.

There is no way to fail the bind? and Allocation of the PASID is also a
separate operation and has no clue how its going to be used in the guest.

> 
> This either means they are "SW mdevs" which does not involve the IOMMU
> layer and puts both the responsibility for isolation and idenfication
> on the mdev driver.

When you mean SW mdev, is it the GPU like case where mdev is purely a SW
construct? or SIOV type where RID+PASID case?

> 
> Or they are some "PASID mdev" which does allow the IOMMU to isolate
> them.
> 
> What can't happen is to comingle /dev/iommu control over the pdev
> between two mdevs.
> 
> ie we can't talk about faults for IOMMU on SW mdevs - faults do not
> come from the IOMMU layer, they have to come from inside the mdev it
> self, somehow.

Recoverable faults for guest needs to be sent to guest? A page-request from
mdev1 and from mdev2 will both look alike when the process is sharing it.

Do we have any isolation requirements here? its the same process. So if the
page-request it sent to guest and even if you report it for mdev1, after
the PRQ is resolved by guest, the request from mdev2 from the same guest
should simply work?


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-15 Thread Jason Gunthorpe via iommu
On Thu, Jul 15, 2021 at 06:57:57AM -0700, Raj, Ashok wrote:
> On Thu, Jul 15, 2021 at 09:48:13AM -0300, Jason Gunthorpe wrote:
> > On Thu, Jul 15, 2021 at 06:49:54AM +, Tian, Kevin wrote:
> > 
> > > No. You are right on this case. I don't think there is a way to 
> > > differentiate one mdev from the other if they come from the
> > > same parent and attached by the same guest process. In this
> > > case the fault could be reported on either mdev (e.g. the first
> > > matching one) to get it fixed in the guest.
> > 
> > If the IOMMU can't distinguish the two mdevs they are not isolated
> > and would have to share a group. Since group sharing is not supported
> > today this seems like a non-issue
> 
> Does this mean we have to prevent 2 mdev's from same pdev being assigned to
> the same guest? 

No, it means that the IOMMU layer has to be able to distinguish them.

This either means they are "SW mdevs" which does not involve the IOMMU
layer and puts both the responsibility for isolation and idenfication
on the mdev driver.

Or they are some "PASID mdev" which does allow the IOMMU to isolate
them.

What can't happen is to comingle /dev/iommu control over the pdev
between two mdevs.

ie we can't talk about faults for IOMMU on SW mdevs - faults do not
come from the IOMMU layer, they have to come from inside the mdev it
self, somehow.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-15 Thread Raj, Ashok
On Thu, Jul 15, 2021 at 09:48:13AM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 15, 2021 at 06:49:54AM +, Tian, Kevin wrote:
> 
> > No. You are right on this case. I don't think there is a way to 
> > differentiate one mdev from the other if they come from the
> > same parent and attached by the same guest process. In this
> > case the fault could be reported on either mdev (e.g. the first
> > matching one) to get it fixed in the guest.
> 
> If the IOMMU can't distinguish the two mdevs they are not isolated
> and would have to share a group. Since group sharing is not supported
> today this seems like a non-issue

Does this mean we have to prevent 2 mdev's from same pdev being assigned to
the same guest? 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-15 Thread Jason Gunthorpe via iommu
On Thu, Jul 15, 2021 at 06:49:54AM +, Tian, Kevin wrote:

> No. You are right on this case. I don't think there is a way to 
> differentiate one mdev from the other if they come from the
> same parent and attached by the same guest process. In this
> case the fault could be reported on either mdev (e.g. the first
> matching one) to get it fixed in the guest.

If the IOMMU can't distinguish the two mdevs they are not isolated
and would have to share a group. Since group sharing is not supported
today this seems like a non-issue

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-15 Thread Shenming Lu
On 2021/7/15 14:49, Tian, Kevin wrote:
>> From: Shenming Lu 
>> Sent: Thursday, July 15, 2021 2:29 PM
>>
>> On 2021/7/15 11:55, Tian, Kevin wrote:
 From: Shenming Lu 
 Sent: Thursday, July 15, 2021 11:21 AM

 On 2021/7/9 15:48, Tian, Kevin wrote:
> 4.6. I/O page fault
> +++
>
> uAPI is TBD. Here is just about the high-level flow from host IOMMU
>> driver
> to guest IOMMU driver and backwards. This flow assumes that I/O page
 faults
> are reported via IOMMU interrupts. Some devices report faults via
>> device
> specific way instead of going through the IOMMU. That usage is not
 covered
> here:
>
> -   Host IOMMU driver receives a I/O page fault with raw fault_data {rid,
> pasid, addr};
>
> -   Host IOMMU driver identifies the faulting I/O page table according to
> {rid, pasid} and calls the corresponding fault handler with an opaque
> object (registered by the handler) and raw fault_data (rid, pasid, 
> addr);
>
> -   IOASID fault handler identifies the corresponding ioasid and device
> cookie according to the opaque object, generates an user fault_data
> (ioasid, cookie, addr) in the fault region, and triggers eventfd to
> userspace;
>

 Hi, I have some doubts here:

 For mdev, it seems that the rid in the raw fault_data is the parent 
 device's,
 then in the vSVA scenario, how can we get to know the mdev(cookie) from
 the
 rid and pasid?

 And from this point of view,would it be better to register the mdev
 (iommu_register_device()) with the parent device info?

>>>
>>> This is what is proposed in this RFC. A successful binding generates a new
>>> iommu_dev object for each vfio device. For mdev this object includes
>>> its parent device, the defPASID marking this mdev, and the cookie
>>> representing it in userspace. Later it is iommu_dev being recorded in
>>> the attaching_data when the mdev is attached to an IOASID:
>>>
>>> struct iommu_attach_data *__iommu_device_attach(
>>> struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags);
>>>
>>> Then when a fault is reported, the fault handler just needs to figure out
>>> iommu_dev according to {rid, pasid} in the raw fault data.
>>>
>>
>> Yeah, we have the defPASID that marks the mdev and refers to the default
>> I/O address space, but how about the non-default I/O address spaces?
>> Is there a case that two different mdevs (on the same parent device)
>> are used by the same process in the guest, thus have a same pasid route
>> in the physical IOMMU? It seems that we can't figure out the mdev from
>> the rid and pasid in this case...
>>
>> Did I misunderstand something?... :-)
>>
> 
> No. You are right on this case. I don't think there is a way to 
> differentiate one mdev from the other if they come from the
> same parent and attached by the same guest process. In this
> case the fault could be reported on either mdev (e.g. the first
> matching one) to get it fixed in the guest.
> 

OK. Thanks,

Shenming
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

RE: [RFC v2] /dev/iommu uAPI proposal

2021-07-14 Thread Tian, Kevin
> From: Shenming Lu 
> Sent: Thursday, July 15, 2021 2:29 PM
> 
> On 2021/7/15 11:55, Tian, Kevin wrote:
> >> From: Shenming Lu 
> >> Sent: Thursday, July 15, 2021 11:21 AM
> >>
> >> On 2021/7/9 15:48, Tian, Kevin wrote:
> >>> 4.6. I/O page fault
> >>> +++
> >>>
> >>> uAPI is TBD. Here is just about the high-level flow from host IOMMU
> driver
> >>> to guest IOMMU driver and backwards. This flow assumes that I/O page
> >> faults
> >>> are reported via IOMMU interrupts. Some devices report faults via
> device
> >>> specific way instead of going through the IOMMU. That usage is not
> >> covered
> >>> here:
> >>>
> >>> -   Host IOMMU driver receives a I/O page fault with raw fault_data {rid,
> >>> pasid, addr};
> >>>
> >>> -   Host IOMMU driver identifies the faulting I/O page table according to
> >>> {rid, pasid} and calls the corresponding fault handler with an opaque
> >>> object (registered by the handler) and raw fault_data (rid, pasid, 
> >>> addr);
> >>>
> >>> -   IOASID fault handler identifies the corresponding ioasid and device
> >>> cookie according to the opaque object, generates an user fault_data
> >>> (ioasid, cookie, addr) in the fault region, and triggers eventfd to
> >>> userspace;
> >>>
> >>
> >> Hi, I have some doubts here:
> >>
> >> For mdev, it seems that the rid in the raw fault_data is the parent 
> >> device's,
> >> then in the vSVA scenario, how can we get to know the mdev(cookie) from
> >> the
> >> rid and pasid?
> >>
> >> And from this point of view,would it be better to register the mdev
> >> (iommu_register_device()) with the parent device info?
> >>
> >
> > This is what is proposed in this RFC. A successful binding generates a new
> > iommu_dev object for each vfio device. For mdev this object includes
> > its parent device, the defPASID marking this mdev, and the cookie
> > representing it in userspace. Later it is iommu_dev being recorded in
> > the attaching_data when the mdev is attached to an IOASID:
> >
> > struct iommu_attach_data *__iommu_device_attach(
> > struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags);
> >
> > Then when a fault is reported, the fault handler just needs to figure out
> > iommu_dev according to {rid, pasid} in the raw fault data.
> >
> 
> Yeah, we have the defPASID that marks the mdev and refers to the default
> I/O address space, but how about the non-default I/O address spaces?
> Is there a case that two different mdevs (on the same parent device)
> are used by the same process in the guest, thus have a same pasid route
> in the physical IOMMU? It seems that we can't figure out the mdev from
> the rid and pasid in this case...
> 
> Did I misunderstand something?... :-)
> 

No. You are right on this case. I don't think there is a way to 
differentiate one mdev from the other if they come from the
same parent and attached by the same guest process. In this
case the fault could be reported on either mdev (e.g. the first
matching one) to get it fixed in the guest.

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-14 Thread Shenming Lu
On 2021/7/15 11:55, Tian, Kevin wrote:
>> From: Shenming Lu 
>> Sent: Thursday, July 15, 2021 11:21 AM
>>
>> On 2021/7/9 15:48, Tian, Kevin wrote:
>>> 4.6. I/O page fault
>>> +++
>>>
>>> uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
>>> to guest IOMMU driver and backwards. This flow assumes that I/O page
>> faults
>>> are reported via IOMMU interrupts. Some devices report faults via device
>>> specific way instead of going through the IOMMU. That usage is not
>> covered
>>> here:
>>>
>>> -   Host IOMMU driver receives a I/O page fault with raw fault_data {rid,
>>> pasid, addr};
>>>
>>> -   Host IOMMU driver identifies the faulting I/O page table according to
>>> {rid, pasid} and calls the corresponding fault handler with an opaque
>>> object (registered by the handler) and raw fault_data (rid, pasid, 
>>> addr);
>>>
>>> -   IOASID fault handler identifies the corresponding ioasid and device
>>> cookie according to the opaque object, generates an user fault_data
>>> (ioasid, cookie, addr) in the fault region, and triggers eventfd to
>>> userspace;
>>>
>>
>> Hi, I have some doubts here:
>>
>> For mdev, it seems that the rid in the raw fault_data is the parent device's,
>> then in the vSVA scenario, how can we get to know the mdev(cookie) from
>> the
>> rid and pasid?
>>
>> And from this point of view,would it be better to register the mdev
>> (iommu_register_device()) with the parent device info?
>>
> 
> This is what is proposed in this RFC. A successful binding generates a new
> iommu_dev object for each vfio device. For mdev this object includes 
> its parent device, the defPASID marking this mdev, and the cookie 
> representing it in userspace. Later it is iommu_dev being recorded in
> the attaching_data when the mdev is attached to an IOASID:
> 
>   struct iommu_attach_data *__iommu_device_attach(
>   struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags);
> 
> Then when a fault is reported, the fault handler just needs to figure out 
> iommu_dev according to {rid, pasid} in the raw fault data.
> 

Yeah, we have the defPASID that marks the mdev and refers to the default
I/O address space, but how about the non-default I/O address spaces?
Is there a case that two different mdevs (on the same parent device)
are used by the same process in the guest, thus have a same pasid route
in the physical IOMMU? It seems that we can't figure out the mdev from
the rid and pasid in this case...

Did I misunderstand something?... :-)

Thanks,
Shenming
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

RE: [RFC v2] /dev/iommu uAPI proposal

2021-07-14 Thread Tian, Kevin
> From: Shenming Lu 
> Sent: Thursday, July 15, 2021 11:21 AM
> 
> On 2021/7/9 15:48, Tian, Kevin wrote:
> > 4.6. I/O page fault
> > +++
> >
> > uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> > to guest IOMMU driver and backwards. This flow assumes that I/O page
> faults
> > are reported via IOMMU interrupts. Some devices report faults via device
> > specific way instead of going through the IOMMU. That usage is not
> covered
> > here:
> >
> > -   Host IOMMU driver receives a I/O page fault with raw fault_data {rid,
> > pasid, addr};
> >
> > -   Host IOMMU driver identifies the faulting I/O page table according to
> > {rid, pasid} and calls the corresponding fault handler with an opaque
> > object (registered by the handler) and raw fault_data (rid, pasid, 
> > addr);
> >
> > -   IOASID fault handler identifies the corresponding ioasid and device
> > cookie according to the opaque object, generates an user fault_data
> > (ioasid, cookie, addr) in the fault region, and triggers eventfd to
> > userspace;
> >
> 
> Hi, I have some doubts here:
> 
> For mdev, it seems that the rid in the raw fault_data is the parent device's,
> then in the vSVA scenario, how can we get to know the mdev(cookie) from
> the
> rid and pasid?
> 
> And from this point of view,would it be better to register the mdev
> (iommu_register_device()) with the parent device info?
> 

This is what is proposed in this RFC. A successful binding generates a new
iommu_dev object for each vfio device. For mdev this object includes 
its parent device, the defPASID marking this mdev, and the cookie 
representing it in userspace. Later it is iommu_dev being recorded in
the attaching_data when the mdev is attached to an IOASID:

struct iommu_attach_data *__iommu_device_attach(
struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags);

Then when a fault is reported, the fault handler just needs to figure out 
iommu_dev according to {rid, pasid} in the raw fault data.

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-14 Thread Shenming Lu
On 2021/7/9 15:48, Tian, Kevin wrote:
> 4.6. I/O page fault
> +++
> 
> uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> to guest IOMMU driver and backwards. This flow assumes that I/O page faults
> are reported via IOMMU interrupts. Some devices report faults via device
> specific way instead of going through the IOMMU. That usage is not covered
> here:
> 
> -   Host IOMMU driver receives a I/O page fault with raw fault_data {rid, 
> pasid, addr};
> 
> -   Host IOMMU driver identifies the faulting I/O page table according to
> {rid, pasid} and calls the corresponding fault handler with an opaque
> object (registered by the handler) and raw fault_data (rid, pasid, addr);
> 
> -   IOASID fault handler identifies the corresponding ioasid and device 
> cookie according to the opaque object, generates an user fault_data 
> (ioasid, cookie, addr) in the fault region, and triggers eventfd to 
> userspace;
> 

Hi, I have some doubts here:

For mdev, it seems that the rid in the raw fault_data is the parent device's,
then in the vSVA scenario, how can we get to know the mdev(cookie) from the
rid and pasid?

And from this point of view,would it be better to register the mdev
(iommu_register_device()) with the parent device info?

Thanks,
Shenming
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

RE: [RFC v2] /dev/iommu uAPI proposal

2021-07-13 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, July 14, 2021 7:23 AM
> 
> On Tue, Jul 13, 2021 at 11:20:12PM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Wednesday, July 14, 2021 7:03 AM
> > >
> > > On Tue, Jul 13, 2021 at 10:48:38PM +, Tian, Kevin wrote:
> > >
> > > > We can still bind to the parent with cookie, but with
> > > > iommu_register_ sw_device() IOMMU fd knows that this binding
> doesn't
> > > > need to establish any security context via IOMMU API.
> > >
> > > AFAIK there is no reason to involve the parent PCI or other device in
> > > SW mode. The iommufd doesn't need to be aware of anything there.
> > >
> >
> > Yes. but does it makes sense to have an unified model in IOMMU fd
> > which always have a [struct device, cookie] with flags to indicate whether
> > the binding/attaching should be specially handled for sw mdev? Or
> > are you suggesting that lacking of struct device is actually the indicator
> > for such trick?
> 
> I think you've veered into such micro implementation details that it
> is better to wait and see how things look.
> 
> The important point here is that whatever physical device is under a
> SW mdev does not need to be passed to the iommufd because there is
> nothing it can do with that information.
> 

Make sense
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-13 Thread Jason Gunthorpe
On Tue, Jul 13, 2021 at 11:20:12PM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, July 14, 2021 7:03 AM
> > 
> > On Tue, Jul 13, 2021 at 10:48:38PM +, Tian, Kevin wrote:
> > 
> > > We can still bind to the parent with cookie, but with
> > > iommu_register_ sw_device() IOMMU fd knows that this binding doesn't
> > > need to establish any security context via IOMMU API.
> > 
> > AFAIK there is no reason to involve the parent PCI or other device in
> > SW mode. The iommufd doesn't need to be aware of anything there.
> > 
> 
> Yes. but does it makes sense to have an unified model in IOMMU fd
> which always have a [struct device, cookie] with flags to indicate whether 
> the binding/attaching should be specially handled for sw mdev? Or
> are you suggesting that lacking of struct device is actually the indicator
> for such trick?

I think you've veered into such micro implementation details that it
is better to wait and see how things look.

The important point here is that whatever physical device is under a
SW mdev does not need to be passed to the iommufd because there is
nothing it can do with that information.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC v2] /dev/iommu uAPI proposal

2021-07-13 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, July 14, 2021 7:03 AM
> 
> On Tue, Jul 13, 2021 at 10:48:38PM +, Tian, Kevin wrote:
> 
> > We can still bind to the parent with cookie, but with
> > iommu_register_ sw_device() IOMMU fd knows that this binding doesn't
> > need to establish any security context via IOMMU API.
> 
> AFAIK there is no reason to involve the parent PCI or other device in
> SW mode. The iommufd doesn't need to be aware of anything there.
> 

Yes. but does it makes sense to have an unified model in IOMMU fd
which always have a [struct device, cookie] with flags to indicate whether 
the binding/attaching should be specially handled for sw mdev? Or
are you suggesting that lacking of struct device is actually the indicator
for such trick?

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-13 Thread Jason Gunthorpe
On Tue, Jul 13, 2021 at 10:48:38PM +, Tian, Kevin wrote:

> We can still bind to the parent with cookie, but with
> iommu_register_ sw_device() IOMMU fd knows that this binding doesn't
> need to establish any security context via IOMMU API.

AFAIK there is no reason to involve the parent PCI or other device in
SW mode. The iommufd doesn't need to be aware of anything there.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC v2] /dev/iommu uAPI proposal

2021-07-13 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, July 14, 2021 12:33 AM
> 
> On Tue, Jul 13, 2021 at 10:26:07AM -0600, Alex Williamson wrote:
> > Quoting this proposal again:
> >
> > > 1)  A successful binding call for the first device in the group creates
> > > the security context for the entire group, by:
> > >
> > > * Verifying group viability in a similar way as VFIO does;
> > >
> > > * Calling IOMMU-API to move the group into a block-dma state,
> > >   which makes all devices in the group attached to an block-dma
> > >   domain with an empty I/O page table;
> > >
> > > VFIO should not allow the user to mmap the MMIO bar of the bound
> > > device until the binding call succeeds.
> >
> > The attach step is irrelevant to my question, the bind step is where
> > the device/group gets into a secure state for device access.
> 
> Binding is similar to attach, it will need to indicate the drivers
> intention and a SW driver will not attach to the PCI device underneath
> it.

Yes. I need to clarify this part in next version. In v1 the binding operation
was purely a software operation within IOMMU fd thus there was no
intention to differentiate device types in this step. But now with v2 the
binding actually involves calling IOMMU API for devices other than sw
mdev. Then we do need similar per-type binding wrappers as defined
for attaching calls.

> 
> > AIUI the operation of VFIO_DEVICE_BIND_IOMMU_FD looks like this:
> >
> > iommu_ctx = iommu_ctx_fdget(iommu_fd);
> >
> > mdev = mdev_from_dev(vdev->dev);
> > dev = mdev ? mdev_parent_dev(mdev) : vdev->dev;
> >
> > iommu_dev = iommu_register_device(iommu_ctx, dev, cookie);
> 
> A default of binding to vdev->dev might turn out to be OK, but this
> needs to be an overridable op in vfio_device and the SW mdevs will
> have to do some 'iommu_register_sw_device()' and not pass in a dev at
> all.
> 

We can still bind to the parent with cookie, but with iommu_register_
sw_device() IOMMU fd knows that this binding doesn't need to
establish any security context via IOMMU API. 

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-13 Thread Jason Gunthorpe
On Tue, Jul 13, 2021 at 10:26:07AM -0600, Alex Williamson wrote:
> Quoting this proposal again:
> 
> > 1)  A successful binding call for the first device in the group creates 
> > the security context for the entire group, by:
> > 
> > * Verifying group viability in a similar way as VFIO does;
> > 
> > * Calling IOMMU-API to move the group into a block-dma state,
> >   which makes all devices in the group attached to an block-dma
> >   domain with an empty I/O page table;
> > 
> > VFIO should not allow the user to mmap the MMIO bar of the bound
> > device until the binding call succeeds.
> 
> The attach step is irrelevant to my question, the bind step is where
> the device/group gets into a secure state for device access.

Binding is similar to attach, it will need to indicate the drivers
intention and a SW driver will not attach to the PCI device underneath
it.

> AIUI the operation of VFIO_DEVICE_BIND_IOMMU_FD looks like this:
> 
>   iommu_ctx = iommu_ctx_fdget(iommu_fd);
> 
>   mdev = mdev_from_dev(vdev->dev);
>   dev = mdev ? mdev_parent_dev(mdev) : vdev->dev;
> 
>   iommu_dev = iommu_register_device(iommu_ctx, dev, cookie);

A default of binding to vdev->dev might turn out to be OK, but this
needs to be an overridable op in vfio_device and the SW mdevs will
have to do some 'iommu_register_sw_device()' and not pass in a dev at
all.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-13 Thread Alex Williamson
On Tue, 13 Jul 2021 09:55:03 -0300
Jason Gunthorpe  wrote:

> On Mon, Jul 12, 2021 at 11:56:24PM +, Tian, Kevin wrote:
> 
> > Maybe I misunderstood your question. Are you specifically worried
> > about establishing the security context for a mdev vs. for its
> > parent?  
> 
> The way to think about the cookie, and the device bind/attach in
> general, is as taking control of a portion of the IOMMU routing:
> 
>  - RID
>  - RID + PASID
>  - "software"
> 
> For the first two there can be only one device attachment per value so
> the cookie is unambiguous.
> 
> For "software" the iommu layer has little to do with this - everything
> is constructed outside by the mdev. If the mdev wishes to communicate
> on /dev/iommu using the cookie then it has to do so using some iommufd
> api and we can convay the proper device at that point.
> 
> Kevin didn't show it, but along side the PCI attaches:
> 
> struct iommu_attach_data * iommu_pci_device_attach(
> struct iommu_dev *dev, struct pci_device *pdev,
> u32 ioasid);
> 
> There would also be a software attach for mdev:
> 
> struct iommu_attach_data * iommu_sw_device_attach(
> struct iommu_dev *dev, struct device *pdev, u32 ioasid);
> 
> Which does not connect anything to the iommu layer.
> 
> It would have to return something that allows querying the IO page
> table, and the mdev would use that API instead of vfio_pin_pages().


Quoting this proposal again:

> 1)  A successful binding call for the first device in the group creates 
> the security context for the entire group, by:
> 
> * Verifying group viability in a similar way as VFIO does;
> 
> * Calling IOMMU-API to move the group into a block-dma state,
>   which makes all devices in the group attached to an block-dma
>   domain with an empty I/O page table;
> 
> VFIO should not allow the user to mmap the MMIO bar of the bound
> device until the binding call succeeds.

The attach step is irrelevant to my question, the bind step is where
the device/group gets into a secure state for device access.

So for IGD we have two scenarios, direct assignment and software mdevs.

AIUI the operation of VFIO_DEVICE_BIND_IOMMU_FD looks like this:

iommu_ctx = iommu_ctx_fdget(iommu_fd);

mdev = mdev_from_dev(vdev->dev);
dev = mdev ? mdev_parent_dev(mdev) : vdev->dev;

iommu_dev = iommu_register_device(iommu_ctx, dev, cookie);

In either case, this last line is either registering the IGD itself
(ie. the struct device representing PCI device :00:02.0) or the
parent of the GVT-g mdev (ie. the struct device representing PCI device
:00:02.0).  They're the same!  AIUI, the cookie is simply an
arbitrary user generated value which they'll use to refer to this
device via the iommu_fd uAPI.

So what magic is iommu_register_device() doing to infer my intentions
as to whether I'm asking for the IGD RID to be isolated or I'm only
creating a software context for an mdev?  Thanks,

Alex

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-13 Thread Jason Gunthorpe
On Mon, Jul 12, 2021 at 11:56:24PM +, Tian, Kevin wrote:

> Maybe I misunderstood your question. Are you specifically worried
> about establishing the security context for a mdev vs. for its
> parent?

The way to think about the cookie, and the device bind/attach in
general, is as taking control of a portion of the IOMMU routing:

 - RID
 - RID + PASID
 - "software"

For the first two there can be only one device attachment per value so
the cookie is unambiguous.

For "software" the iommu layer has little to do with this - everything
is constructed outside by the mdev. If the mdev wishes to communicate
on /dev/iommu using the cookie then it has to do so using some iommufd
api and we can convay the proper device at that point.

Kevin didn't show it, but along side the PCI attaches:

struct iommu_attach_data * iommu_pci_device_attach(
struct iommu_dev *dev, struct pci_device *pdev,
u32 ioasid);

There would also be a software attach for mdev:

struct iommu_attach_data * iommu_sw_device_attach(
struct iommu_dev *dev, struct device *pdev, u32 ioasid);

Which does not connect anything to the iommu layer.

It would have to return something that allows querying the IO page
table, and the mdev would use that API instead of vfio_pin_pages().

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC v2] /dev/iommu uAPI proposal

2021-07-12 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Tuesday, July 13, 2021 2:42 AM
> 
> On Mon, 12 Jul 2021 01:22:11 +
> "Tian, Kevin"  wrote:
> > > From: Alex Williamson 
> > > Sent: Saturday, July 10, 2021 5:51 AM
> > > On Fri, 9 Jul 2021 07:48:44 +
> > > "Tian, Kevin"  wrote:
> 
> > > > For mdev the struct device should be the pointer to the parent device.
> > >
> > > I don't get how iommu_register_device() differentiates an mdev from a
> > > pdev in this case.
> >
> > via device cookie.
> 
> 
> Let me re-add this section for more context:
> 
> > 3. Sample structures and helper functions
> > 
> >
> > Three helper functions are provided to support VFIO_BIND_IOMMU_FD:
> >
> > struct iommu_ctx *iommu_ctx_fdget(int fd);
> > struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx,
> > struct device *device, u64 cookie);
> > int iommu_unregister_device(struct iommu_dev *dev);
> >
> > An iommu_ctx is created for each fd:
> >
> > struct iommu_ctx {
> > // a list of allocated IOASID data's
> > struct xarray   ioasid_xa;
> >
> > // a list of registered devices
> > struct xarray   dev_xa;
> > };
> >
> > Later some group-tracking fields will be also introduced to support
> > multi-devices group.
> >
> > Each registered device is represented by iommu_dev:
> >
> > struct iommu_dev {
> > struct iommu_ctx*ctx;
> > // always be the physical device
> > struct device   *device;
> > u64 cookie;
> > struct kref kref;
> > };
> >
> > A successful binding establishes a security context for the bound
> > device and returns struct iommu_dev pointer to the caller. After this
> > point, the user is allowed to query device capabilities via IOMMU_
> > DEVICE_GET_INFO.
> >
> > For mdev the struct device should be the pointer to the parent device.
> 
> 
> So we'll have a VFIO_DEVICE_BIND_IOMMU_FD ioctl where the user
> provides
> the iommu_fd and a cookie.  vfio will use iommu_ctx_fdget() to get an
> iommu_ctx* for that iommu_fd, then we'll call iommu_register_device()
> using that iommu_ctx* we got from the iommu_fd, the cookie provided by
> the user, and for an mdev, the parent of the device the user owns
> (the device_fd on which this ioctl is called)...
> 
> How does an arbitrary user provided cookie let you differentiate that
> the request is actually for an mdev versus the parent device itself?
> 

Maybe I misunderstood your question. Are you specifically worried
about establishing the security context for a mdev vs. for its parent?
At least in concept we should not change the security context of
the parent if this binding call is just for the mdev. And for mdev it will be
 in a security context as long as the associated PASID entry is disabled 
at the binding time. If this is the case, possibly we also need VFIO to 
provide defPASID marking the mdev when calling iommu_register_device()
then IOMMU fd also provides defPASID when calling IOMMU API to
establish the security context.

Thanks,
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC v2] /dev/iommu uAPI proposal

2021-07-12 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Tuesday, July 13, 2021 2:42 AM
> 
> On Mon, 12 Jul 2021 01:22:11 +
> "Tian, Kevin"  wrote:
> > > From: Alex Williamson 
> > > Sent: Saturday, July 10, 2021 5:51 AM
> > > On Fri, 9 Jul 2021 07:48:44 +
> > > "Tian, Kevin"  wrote:
> 
> > > > For mdev the struct device should be the pointer to the parent device.
> > >
> > > I don't get how iommu_register_device() differentiates an mdev from a
> > > pdev in this case.
> >
> > via device cookie.
> 
> 
> Let me re-add this section for more context:
> 
> > 3. Sample structures and helper functions
> > 
> >
> > Three helper functions are provided to support VFIO_BIND_IOMMU_FD:
> >
> > struct iommu_ctx *iommu_ctx_fdget(int fd);
> > struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx,
> > struct device *device, u64 cookie);
> > int iommu_unregister_device(struct iommu_dev *dev);
> >
> > An iommu_ctx is created for each fd:
> >
> > struct iommu_ctx {
> > // a list of allocated IOASID data's
> > struct xarray   ioasid_xa;
> >
> > // a list of registered devices
> > struct xarray   dev_xa;
> > };
> >
> > Later some group-tracking fields will be also introduced to support
> > multi-devices group.
> >
> > Each registered device is represented by iommu_dev:
> >
> > struct iommu_dev {
> > struct iommu_ctx*ctx;
> > // always be the physical device
> > struct device   *device;
> > u64 cookie;
> > struct kref kref;
> > };
> >
> > A successful binding establishes a security context for the bound
> > device and returns struct iommu_dev pointer to the caller. After this
> > point, the user is allowed to query device capabilities via IOMMU_
> > DEVICE_GET_INFO.
> >
> > For mdev the struct device should be the pointer to the parent device.
> 
> 
> So we'll have a VFIO_DEVICE_BIND_IOMMU_FD ioctl where the user
> provides
> the iommu_fd and a cookie.  vfio will use iommu_ctx_fdget() to get an
> iommu_ctx* for that iommu_fd, then we'll call iommu_register_device()
> using that iommu_ctx* we got from the iommu_fd, the cookie provided by
> the user, and for an mdev, the parent of the device the user owns
> (the device_fd on which this ioctl is called)...
> 
> How does an arbitrary user provided cookie let you differentiate that
> the request is actually for an mdev versus the parent device itself?
> 
> For instance, how can the IOMMU layer distinguish GVT-g (mdev) vs GVT-d
> (direct assignment) when both use the same struct device* and cookie is
> just a user provided value?  Still confused.  Thanks,
> 

GVT-g is a special case here since it's purely software-emulated mdev 
and reuse the default domain of the parent device. In this case IOASID
is treated as metadata for GVT-g device driver to conduct DMA isolation
in software. We won't install a new page table in the IOMMU just for
GVT-g mdev (this does reminds a missing flag in the attaching call to
indicate this requirement).

What you really care about is about SIOV mdev (with PASID-granular
DMA isolation in the IOMMU) and its parent. In this case mdev and
parent assignment are exclusive. When the parent is already assigned 
to an user, it's not managed by the kernel anymore thus no mdev
per se. If mdev is created then it implies that the parent must be
managed by the kernel. In either case the user-provided cookie is
contained only within IOMMU fd. When calling IOMMU-API, it's
always about the routing information (RID, or RID+PASID) provided 
in the attaching call.

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-12 Thread Alex Williamson
On Mon, 12 Jul 2021 01:22:11 +
"Tian, Kevin"  wrote:
> > From: Alex Williamson 
> > Sent: Saturday, July 10, 2021 5:51 AM
> > On Fri, 9 Jul 2021 07:48:44 +
> > "Tian, Kevin"  wrote:  
 
> > > For mdev the struct device should be the pointer to the parent device.  
> > 
> > I don't get how iommu_register_device() differentiates an mdev from a
> > pdev in this case.  
> 
> via device cookie.


Let me re-add this section for more context:

> 3. Sample structures and helper functions
> 
> 
> Three helper functions are provided to support VFIO_BIND_IOMMU_FD:
> 
>   struct iommu_ctx *iommu_ctx_fdget(int fd);
>   struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx,
>   struct device *device, u64 cookie);
>   int iommu_unregister_device(struct iommu_dev *dev);
> 
> An iommu_ctx is created for each fd:
> 
>   struct iommu_ctx {
>   // a list of allocated IOASID data's
>   struct xarray   ioasid_xa;
> 
>   // a list of registered devices
>   struct xarray   dev_xa;
>   };
> 
> Later some group-tracking fields will be also introduced to support 
> multi-devices group.
> 
> Each registered device is represented by iommu_dev:
> 
>   struct iommu_dev {
>   struct iommu_ctx*ctx;
>   // always be the physical device
>   struct device   *device;
>   u64 cookie;
>   struct kref kref;
>   };
> 
> A successful binding establishes a security context for the bound
> device and returns struct iommu_dev pointer to the caller. After this
> point, the user is allowed to query device capabilities via IOMMU_
> DEVICE_GET_INFO.
> 
> For mdev the struct device should be the pointer to the parent device. 


So we'll have a VFIO_DEVICE_BIND_IOMMU_FD ioctl where the user provides
the iommu_fd and a cookie.  vfio will use iommu_ctx_fdget() to get an
iommu_ctx* for that iommu_fd, then we'll call iommu_register_device()
using that iommu_ctx* we got from the iommu_fd, the cookie provided by
the user, and for an mdev, the parent of the device the user owns
(the device_fd on which this ioctl is called)...

How does an arbitrary user provided cookie let you differentiate that
the request is actually for an mdev versus the parent device itself?

For instance, how can the IOMMU layer distinguish GVT-g (mdev) vs GVT-d
(direct assignment) when both use the same struct device* and cookie is
just a user provided value?  Still confused.  Thanks,

Alex

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [RFC v2] /dev/iommu uAPI proposal

2021-07-11 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Saturday, July 10, 2021 5:51 AM
> 
> Hi Kevin,
> 
> A couple first pass comments...
> 
> On Fri, 9 Jul 2021 07:48:44 +
> "Tian, Kevin"  wrote:
> > 2.2. /dev/vfio device uAPI
> > ++
> >
> > /*
> >   * Bind a vfio_device to the specified IOMMU fd
> >   *
> >   * The user should provide a device cookie when calling this ioctl. The
> >   * cookie is later used in IOMMU fd for capability query, iotlb 
> > invalidation
> >   * and I/O fault handling.
> >   *
> >   * User is not allowed to access the device before the binding operation
> >   * is completed.
> >   *
> >   * Unbind is automatically conducted when device fd is closed.
> >   *
> >   * Input parameters:
> >   * - iommu_fd;
> >   * - cookie;
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */
> > #define VFIO_BIND_IOMMU_FD  _IO(VFIO_TYPE, VFIO_BASE + 22)
> 
> I believe this is an ioctl on the device fd, therefore it should be
> named VFIO_DEVICE_BIND_IOMMU_FD.

make sense.

> 
> >
> >
> > /*
> >   * Report vPASID info to userspace via VFIO_DEVICE_GET_INFO
> >   *
> >   * Add a new device capability. The presence indicates that the user
> >   * is allowed to create multiple I/O address spaces on this device. The
> >   * capability further includes following flags:
> >   *
> >   * - PASID_DELEGATED, if clear every vPASID must be registered to
> >   *   the kernel;
> >   * - PASID_CPU, if set vPASID is allowed to be carried in the CPU
> >   *   instructions (e.g. ENQCMD);
> >   * - PASID_CPU_VIRT, if set require vPASID translation in the CPU;
> >   *
> >   * The user must check that all devices with PASID_CPU set have the
> >   * same setting on PASID_CPU_VIRT. If mismatching, it should enable
> >   * vPASID only in one category (all set, or all clear).
> >   *
> >   * When the user enables vPASID on the device with PASID_CPU_VIRT
> >   * set, it must enable vPASID CPU translation via kvm fd before attempting
> >   * to use ENQCMD to submit work items. The command portal is blocked
> >   * by the kernel until the CPU translation is enabled.
> >   */
> > #define VFIO_DEVICE_INFO_CAP_PASID  5
> >
> >
> > /*
> >   * Attach a vfio device to the specified IOASID
> >   *
> >   * Multiple vfio devices can be attached to the same IOASID, and vice
> >   * versa.
> >   *
> >   * User may optionally provide a "virtual PASID" to mark an I/O page
> >   * table on this vfio device, if PASID_DELEGATED is not set in device info.
> >   * Whether the virtual PASID is physically used or converted to another
> >   * kernel-allocated PASID is a policy in the kernel.
> >   *
> >   * Because one device is allowed to bind to multiple IOMMU fd's, the
> >   * user should provide both iommu_fd and ioasid for this attach operation.
> >   *
> >   * Input parameter:
> >   * - iommu_fd;
> >   * - ioasid;
> >   * - flag;
> >   * - vpasid (if specified);
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */
> > #define VFIO_ATTACH_IOASID  _IO(VFIO_TYPE, VFIO_BASE +
> 23)
> > #define VFIO_DETACH_IOASID  _IO(VFIO_TYPE, VFIO_BASE +
> 24)
> 
> Likewise, VFIO_DEVICE_{ATTACH,DETACH}_IOASID
> 
> ...
> > 3. Sample structures and helper functions
> > 
> >
> > Three helper functions are provided to support VFIO_BIND_IOMMU_FD:
> >
> > struct iommu_ctx *iommu_ctx_fdget(int fd);
> > struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx,
> > struct device *device, u64 cookie);
> > int iommu_unregister_device(struct iommu_dev *dev);
> >
> > An iommu_ctx is created for each fd:
> >
> > struct iommu_ctx {
> > // a list of allocated IOASID data's
> > struct xarray   ioasid_xa;
> >
> > // a list of registered devices
> > struct xarray   dev_xa;
> > };
> >
> > Later some group-tracking fields will be also introduced to support
> > multi-devices group.
> >
> > Each registered device is represented by iommu_dev:
> >
> > struct iommu_dev {
> > struct iommu_ctx*ctx;
> > // always be the physical device
> > struct device   *device;
> > u64 cookie;
> > struct kref kref;
> > };
> >
> > A successful binding establishes a security context for the bound
> > device and returns struct iommu_dev pointer to the caller. After this
> > point, the user is allowed to query device capabilities via IOMMU_
> > DEVICE_GET_INFO.
> 
> If we have an initial singleton group only restriction, I assume that
> both iommu_register_device() would fail for any devices that are not in
> a singleton group and vfio would only expose direct device files for
> the devices in singleton groups.  The latter implementation could
> change when multi-device group support is added so that userspace can
> assume that if the vfio device file exists, this interface is avail

Re: [RFC v2] /dev/iommu uAPI proposal

2021-07-09 Thread Alex Williamson
Hi Kevin,

A couple first pass comments...

On Fri, 9 Jul 2021 07:48:44 +
"Tian, Kevin"  wrote:
> 2.2. /dev/vfio device uAPI
> ++
> 
> /*
>   * Bind a vfio_device to the specified IOMMU fd
>   *
>   * The user should provide a device cookie when calling this ioctl. The 
>   * cookie is later used in IOMMU fd for capability query, iotlb invalidation
>   * and I/O fault handling.
>   *
>   * User is not allowed to access the device before the binding operation
>   * is completed.
>   *
>   * Unbind is automatically conducted when device fd is closed.
>   *
>   * Input parameters:
>   *   - iommu_fd;
>   *   - cookie;
>   *
>   * Return: 0 on success, -errno on failure.
>   */
> #define VFIO_BIND_IOMMU_FD_IO(VFIO_TYPE, VFIO_BASE + 22)

I believe this is an ioctl on the device fd, therefore it should be
named VFIO_DEVICE_BIND_IOMMU_FD.

> 
> 
> /*
>   * Report vPASID info to userspace via VFIO_DEVICE_GET_INFO
>   *
>   * Add a new device capability. The presence indicates that the user
>   * is allowed to create multiple I/O address spaces on this device. The
>   * capability further includes following flags:
>   *
>   *   - PASID_DELEGATED, if clear every vPASID must be registered to 
>   * the kernel;
>   *   - PASID_CPU, if set vPASID is allowed to be carried in the CPU 
>   * instructions (e.g. ENQCMD);
>   *   - PASID_CPU_VIRT, if set require vPASID translation in the CPU; 
>   * 
>   * The user must check that all devices with PASID_CPU set have the 
>   * same setting on PASID_CPU_VIRT. If mismatching, it should enable 
>   * vPASID only in one category (all set, or all clear).
>   *
>   * When the user enables vPASID on the device with PASID_CPU_VIRT
>   * set, it must enable vPASID CPU translation via kvm fd before attempting
>   * to use ENQCMD to submit work items. The command portal is blocked 
>   * by the kernel until the CPU translation is enabled.
>   */
> #define VFIO_DEVICE_INFO_CAP_PASID5
> 
> 
> /*
>   * Attach a vfio device to the specified IOASID
>   *
>   * Multiple vfio devices can be attached to the same IOASID, and vice 
>   * versa. 
>   *
>   * User may optionally provide a "virtual PASID" to mark an I/O page 
>   * table on this vfio device, if PASID_DELEGATED is not set in device info. 
>   * Whether the virtual PASID is physically used or converted to another 
>   * kernel-allocated PASID is a policy in the kernel.
>   *
>   * Because one device is allowed to bind to multiple IOMMU fd's, the
>   * user should provide both iommu_fd and ioasid for this attach operation.
>   *
>   * Input parameter:
>   *   - iommu_fd;
>   *   - ioasid;
>   *   - flag;
>   *   - vpasid (if specified);
>   * 
>   * Return: 0 on success, -errno on failure.
>   */
> #define VFIO_ATTACH_IOASID_IO(VFIO_TYPE, VFIO_BASE + 23)
> #define VFIO_DETACH_IOASID_IO(VFIO_TYPE, VFIO_BASE + 24)

Likewise, VFIO_DEVICE_{ATTACH,DETACH}_IOASID

...
> 3. Sample structures and helper functions
> 
> 
> Three helper functions are provided to support VFIO_BIND_IOMMU_FD:
> 
>   struct iommu_ctx *iommu_ctx_fdget(int fd);
>   struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx,
>   struct device *device, u64 cookie);
>   int iommu_unregister_device(struct iommu_dev *dev);
> 
> An iommu_ctx is created for each fd:
> 
>   struct iommu_ctx {
>   // a list of allocated IOASID data's
>   struct xarray   ioasid_xa;
> 
>   // a list of registered devices
>   struct xarray   dev_xa;
>   };
> 
> Later some group-tracking fields will be also introduced to support 
> multi-devices group.
> 
> Each registered device is represented by iommu_dev:
> 
>   struct iommu_dev {
>   struct iommu_ctx*ctx;
>   // always be the physical device
>   struct device   *device;
>   u64 cookie;
>   struct kref kref;
>   };
> 
> A successful binding establishes a security context for the bound
> device and returns struct iommu_dev pointer to the caller. After this
> point, the user is allowed to query device capabilities via IOMMU_
> DEVICE_GET_INFO.

If we have an initial singleton group only restriction, I assume that
both iommu_register_device() would fail for any devices that are not in
a singleton group and vfio would only expose direct device files for
the devices in singleton groups.  The latter implementation could
change when multi-device group support is added so that userspace can
assume that if the vfio device file exists, this interface is available.
I think this is confirmed further below.

> For mdev the struct device should be the pointer to the parent device. 

I don't get how iommu_register_device() differentiates an mdev from a
pdev in this case.

...
> 4.3. IOASID nesting (software)
> 

[RFC v2] /dev/iommu uAPI proposal

2021-07-09 Thread Tian, Kevin
/dev/iommu provides an unified interface for managing I/O page tables for 
devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
etc.) are expected to use this interface instead of creating their own logic to 
isolate untrusted device DMAs initiated by userspace. 

This proposal describes the uAPI of /dev/iommu and also sample sequences 
with VFIO as example in typical usages. The driver-facing kernel API provided 
by the iommu layer is still TBD, which can be discussed after consensus is 
made on this uAPI.

It's based on a lengthy discussion starting from here:

https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/ 

v1 can be found here:

https://lore.kernel.org/linux-iommu/ph0pr12mb54811863b392c644e5365446dc...@ph0pr12mb5481.namprd12.prod.outlook.com/T/

This doc is also tracked on github, though it's not very useful for v1->v2 
given dramatic refactoring:
https://github.com/luxis1999/dev_iommu_uapi 

Changelog (v1->v2):
- Rename /dev/ioasid to /dev/iommu (Jason);
- Add a section for device-centric vs. group-centric design (many);
- Add a section for handling no-snoop DMA (Jason/Alex/Paolo);
- Add definition of user/kernel/shared I/O page tables (Baolu/Jason);
- Allow one device bound to multiple iommu fd's (Jason);
- No need to track user I/O page tables in kernel on ARM/AMD (Jean/Jason);
- Add a device cookie for iotlb invalidation and fault handling (Jean/Jason);
- Add capability/format query interface per device cookie (Jason);
- Specify format/attribute when creating an IOASID, leading to several v1
  uAPI commands removed (Jason);
- Explain the value of software nesting (Jean);
- Replace IOASID_REGISTER_VIRTUAL_MEMORY with software nesting (David/Jason);
- Cover software mdev usage (Jason);
- No restriction on map/unmap vs. bind/invalidate (Jason/David);
- Report permitted IOVA range instead of reserved range (David);
- Refine the sample structures and helper functions (Jason);
- Add definition of default and non-default I/O address spaces;
- Expand and clarify the design for PASID virtualization;
- and lots of subtle refinement according to above changes;

TOC

1. Terminologies and Concepts
1.1. Manage I/O address space
1.2. Attach device to I/O address space
1.3. Group isolation
1.4. PASID virtualization
1.4.1. Devices which don't support DMWr
1.4.2. Devices which support DMWr
1.4.3. Mix different types together
1.4.4. User sequence
1.5. No-snoop DMA
2. uAPI Proposal
2.1. /dev/iommu uAPI
2.2. /dev/vfio device uAPI
2.3. /dev/kvm uAPI
3. Sample Structures and Helper Functions
4. Use Cases and Flows
4.1. A simple example
4.2. Multiple IOASIDs (no nesting)
4.3. IOASID nesting (software)
4.4. IOASID nesting (hardware)
4.5. Guest SVA (vSVA)
4.6. I/O page fault


1. Terminologies and Concepts
-

IOMMU fd is the container holding multiple I/O address spaces. User 
manages those address spaces through fd operations. Multiple fd's are 
allowed per process, but with this proposal one fd should be sufficient for 
all intended usages.

IOASID is the fd-local software handle representing an I/O address space. 
Each IOASID is associated with a single I/O page table. IOASIDs can be 
nested together, implying the output address from one I/O page table 
(represented by child IOASID) must be further translated by another I/O 
page table (represented by parent IOASID).

An I/O address space takes effect only after it is attached by a device. 
One device is allowed to attach to multiple I/O address spaces. One I/O 
address space can be attached by multiple devices.

Device must be bound to an IOMMU fd before attach operation can be
conducted. Though not necessary, user could bind one device to multiple
IOMMU FD's. But no cross-FD IOASID nesting is allowed.

The format of an I/O page table must be compatible to the attached 
devices (or more specifically to the IOMMU which serves the DMA from
the attached devices). User is responsible for specifying the format
when allocating an IOASID, according to one or multiple devices which
will be attached right after. Attaching a device to an IOASID with 
incompatible format is simply rejected.

Relationship between IOMMU fd, VFIO fd and KVM fd:

-   IOMMU fd provides uAPI for managing IOASIDs and I/O page tables. 
It also provides an unified capability/format reporting interface for
each bound device. 

-   VFIO fd provides uAPI for device binding and attaching. In this proposal 
VFIO is used as the example of device passthrough frameworks. The
routing information that identifies an I/O address space in the wire is 
per-device and registered to IOMMU fd via VFIO uAPI.

-   KVM fd provides uAPI for handling no-snoop DMA and PASID virtualization
in CPU (when PASID is carried in instruction payload).

1.1. Manage I/O address space
+++