Re: [RFC v2] /dev/iommu uAPI proposal
On Fri, Aug 06, 2021 at 09:32:11AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 06, 2021 at 02:45:26PM +1000, David Gibson wrote:
>
> > Well, that's kind of what I'm doing. PCI currently has the notion of "default" address space for a RID, but there's no guarantee that other buses (or even future PCI extensions) will. The idea is that "endpoint" means exactly the (RID, PASID) or (SID, SSID) or whatever future variations are on that.
>
> This is already happening in this proposal; it is why I insisted that the driver-facing API has to be very explicit. That API specifies exactly what the device silicon is doing.
>
> However, that is placed at the IOASID level. There is no reason to create endpoint objects that are 1:1 with IOASID objects, e.g. for PASID.

They're not 1:1, though. You can have multiple endpoints in the same IOAS; that's the whole point.

> We need to have clear software layers and responsibilities; I think this is where the VFIO container design has fallen behind.
>
> The device driver is responsible to declare what TLPs the device it controls will issue.

Right... and I'm envisaging an endpoint as an abstraction to represent a single TLP source.

> The system layer is responsible to determine how those TLPs can be matched to IO page tables, if at all.
>
> The IO page table layer is responsible to map the TLPs to physical memory.
>
> Each must stay in its box and we should not create objects that smush together, say, the device and system layers, because it will only make a mess of the software design.

I agree... and endpoints are explicitly an attempt to do that. I don't see how you think they're smushing things together.

> Since the system layer doesn't have any concrete objects in our environment (which is based on devices and IO page tables), it has to exist as metadata attached to the other two objects.

Whereas I'm suggesting clarifying this by *creating* concrete objects to represent the concept we need.
--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you. NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [RFC v2] /dev/iommu uAPI proposal
> From: Eric Auger
> Sent: Tuesday, August 10, 2021 3:17 PM
>
> Hi Kevin,
>
> On 8/5/21 2:36 AM, Tian, Kevin wrote:
> >> From: Eric Auger
> >> Sent: Wednesday, August 4, 2021 11:59 PM
> >
> > [...]
> >>> 1.2. Attach Device to I/O address space
> >>> +++
> >>>
> >>> Device attach/bind is initiated through passthrough framework uAPI.
> >>>
> >>> Device attaching is allowed only after a device is successfully bound to the IOMMU fd. User should provide a device cookie when binding the device through VFIO uAPI. This cookie is used when the user queries device capability/format, issues per-device iotlb invalidation and receives per-device I/O page fault data via IOMMU fd.
> >>>
> >>> Successful binding puts the device into a security context which isolates its DMA from the rest system. VFIO should not allow user to access the
> >> s/from the rest system/from the rest of the system
> >>> device before binding is completed. Similarly, VFIO should prevent the user from unbinding the device before user access is withdrawn.
> >> With Intel Scalable IOV, I understand you could assign an RID/PASID to one VM and another one to another VM (which is not the case for ARM). Is it a targeted use case? How would it be handled? Is it related to the sub-groups evoked hereafter?
> > Not related to sub-group. Each mdev is bound to the IOMMU fd respectively with the defPASID which represents the mdev.
> But how does it work in terms of security? The device (RID) is bound to an IOMMU fd, but then each SID/PASID may be working for a different VM. How do you detect this is safe, as each SID can work safely for a different VM, versus the ARM case where it is not possible?

PASID is managed by the parent driver, which knows which PASID to use for a given mdev when later attaching it to an IOASID.

> 1.3 says
> "
> 1) A successful binding call for the first device in the group creates the security context for the entire group, by:
> "
> What does it mean for the above Scalable IOV use case?

This is a good question (as Alex raised) which needs more explanation in the next version:
https://lore.kernel.org/linux-iommu/20210712124150.2bf421d1.alex.william...@redhat.com/

In general we need to provide different helpers for binding pdev/mdev/sw mdev. 1.3 in v2 describes the behavior for pdev via iommu_register_device(). For mdev a new helper (e.g. iommu_register_device_pasid()) is required, and then the IOMMU API will also provide a pasid variation for creating a security context per pasid. sw mdev will also have its own binding helper to indicate that no routing info is required in ioasid attaching.

Thanks,
Kevin
Re: [RFC v2] /dev/iommu uAPI proposal
Hi Kevin,

On 8/5/21 2:36 AM, Tian, Kevin wrote:
>> From: Eric Auger
>> Sent: Wednesday, August 4, 2021 11:59 PM
>
> [...]
>>> 1.2. Attach Device to I/O address space
>>> +++
>>>
>>> Device attach/bind is initiated through passthrough framework uAPI.
>>>
>>> Device attaching is allowed only after a device is successfully bound to the IOMMU fd. User should provide a device cookie when binding the device through VFIO uAPI. This cookie is used when the user queries device capability/format, issues per-device iotlb invalidation and receives per-device I/O page fault data via IOMMU fd.
>>>
>>> Successful binding puts the device into a security context which isolates its DMA from the rest system. VFIO should not allow user to access the
>> s/from the rest system/from the rest of the system
>>> device before binding is completed. Similarly, VFIO should prevent the user from unbinding the device before user access is withdrawn.
>> With Intel Scalable IOV, I understand you could assign an RID/PASID to one VM and another one to another VM (which is not the case for ARM). Is it a targeted use case? How would it be handled? Is it related to the sub-groups evoked hereafter?
> Not related to sub-group. Each mdev is bound to the IOMMU fd respectively with the defPASID which represents the mdev.

But how does it work in terms of security? The device (RID) is bound to an IOMMU fd, but then each SID/PASID may be working for a different VM. How do you detect this is safe, as each SID can work safely for a different VM, versus the ARM case where it is not possible?

1.3 says
"
1) A successful binding call for the first device in the group creates the security context for the entire group, by:
"
What does it mean for the above Scalable IOV use case?

>> Actually all devices bound to an IOMMU fd should have the same parent I/O address space or root address space, am I correct? If so, maybe add this comment explicitly?
> in most cases yes but it's not mandatory. multiple roots are allowed (e.g. with vIOMMU but no nesting).

OK, right, this corresponds to example 4.2 for example. I misinterpreted the notion of security context. The security context does not match the IOMMU fd but is something implicit, created on first device binding.

> [...]
>>> The device in the /dev/iommu context always refers to a physical one (pdev) which is identifiable via RID. Physically each pdev can support one default I/O address space (routed via RID) and optionally multiple non-default I/O address spaces (via RID+PASID).
>>>
>>> The device in VFIO context is a logic concept, being either a physical device (pdev) or mediated device (mdev or subdev). Each vfio device is represented by RID+cookie in IOMMU fd. User is allowed to create one default I/O address space (routed by vRID from user p.o.v) per each vfio_device.
>> The concept of default address space is not fully clear to me. I currently understand this is a root address space (not nesting). Is that correct? This may need clarification.
> w/o PASID there is only one address space (either GPA or GIOVA) per device. This one is called default. whether it's root is orthogonal (e.g. GIOVA could be also nested) to the device view of this space.
>
> w/ PASID additional address spaces can be targeted by the device. those are called non-default.
>
> I could also rename default to RID address space and non-default to RID+PASID address space if doing so makes it clearer.

Yes, I think it is worth having a kind of glossary defining root and default, as you clearly defined child/parent.

>>> VFIO decides the routing information for this default space based on device type:
>>>
>>> 1) pdev, routed via RID;
>>>
>>> 2) mdev/subdev with IOMMU-enforced DMA isolation, routed via the parent's RID plus the PASID marking this mdev;
>>>
>>> 3) a purely sw-mediated device (sw mdev), no routing required, i.e. no need to install the I/O page table in the IOMMU. sw mdev just uses the metadata to assist its internal DMA isolation logic on top of the parent's IOMMU page table;
>> Maybe you should introduce this concept of SW mediated device earlier, because it seems to special-case the way the attach behaves. I am especially referring to
>>
>> "Successful attaching activates an I/O address space in the IOMMU, if the device is not purely software mediated"
> makes sense.
>
>>> In addition, VFIO may allow user to create additional I/O address spaces on a vfio_device based on the hardware capability. In such case the user has its own view of the virtual routing information (vPASID) when marking these non-default address spaces.
>> I do not catch what "marking these non-default address spaces" means.
> as explained above, those non-default address spaces are identified/routed via PASID.
>
>>> 1.3. Group isolation
>>>
RE: [RFC v2] /dev/iommu uAPI proposal
> From: David Gibson
> Sent: Tuesday, August 10, 2021 12:48 PM
>
> On Mon, Aug 09, 2021 at 08:34:06AM +, Tian, Kevin wrote:
> > > From: David Gibson
> > > Sent: Friday, August 6, 2021 12:45 PM
> > >
> > > > > > In concept I feel the purpose of DMA endpoint is equivalent to the routing info in this proposal.
> > > > >
> > > > > Maybe? I'm afraid I never quite managed to understand the role of the routing info in your proposal.
> > > >
> > > > the IOMMU routes incoming DMA packets to a specific I/O page table, according to RID or RID+PASID carried in the packet. RID or RID+PASID is the routing information (represented by device cookie + PASID in proposed uAPI) and what the iommu driver really cares about when activating the I/O page table in the iommu.
> > >
> > > Ok, so yes, endpoint is roughly equivalent to that. But my point is that the IOMMU layer really only cares about that (device+routing) combination, not other aspects of what the device is. So that's the concept we should give a name and put front and center in the API.
> >
> > This is how this proposal works, centered around the routing info. the uAPI doesn't care what the device is. It just requires the user to specify the user view of routing info (device fd + optional pasid) to tag an IOAS.
>
> Which works as long as (just device fd) and (device fd + PASID) covers all the options. If we have new possibilities we need new interfaces. And, that can't even handle the case of one endpoint for multiple devices (e.g. an ACS-incapable bridge).

why not? We have gone through a long debate in v1 to reach the conclusion that a device-centric uAPI can cover the above group scenario (see section 1.3).

> > the user view is then converted to the kernel view of routing (rid or rid+pasid) by the vfio driver and then passed to the iommu fd in the attaching operation. and a GET_INFO interface is provided for the user to check whether a device supports multiple IOASes and whether pasid space is delegated to the user. We just need a better name if pasid is considered too pci specific...
> >
> > But creating an endpoint per ioasid and making it centered in the uAPI is not what the IOMMU layer cares about.
>
> It's not an endpoint per ioasid. You can have multiple endpoints per ioasid, just not the other way around. As it is multiple IOASes per

you need to create one endpoint per device fd to attach to gpa_ioasid, then one endpoint per device fd to attach to pasidtbl_ioasid on arm/amd, then one endpoint per pasid to attach to gva_ioasid on intel. In the end you just create one endpoint per each attached ioasid given a device.

> device means *some* sort of disambiguation (generally by PASID) which is hard to describe generally. Having endpoints as a first-class concept makes that simpler.

I don't think pasid causes any disambiguation problem (except the name itself being pci specific). with multiple IOASes you always need an id to tag them. This id is what the iommu layer cares about. which endpoint on the device uses the id is no business of the iommu.

Thanks,
Kevin
Re: [RFC v2] /dev/iommu uAPI proposal
On Mon, Aug 09, 2021 at 08:34:06AM +, Tian, Kevin wrote:
> > From: David Gibson
> > Sent: Friday, August 6, 2021 12:45 PM
> >
> > > > > In concept I feel the purpose of DMA endpoint is equivalent to the routing info in this proposal.
> > > >
> > > > Maybe? I'm afraid I never quite managed to understand the role of the routing info in your proposal.
> > >
> > > the IOMMU routes incoming DMA packets to a specific I/O page table, according to RID or RID+PASID carried in the packet. RID or RID+PASID is the routing information (represented by device cookie + PASID in proposed uAPI) and what the iommu driver really cares about when activating the I/O page table in the iommu.
> >
> > Ok, so yes, endpoint is roughly equivalent to that. But my point is that the IOMMU layer really only cares about that (device+routing) combination, not other aspects of what the device is. So that's the concept we should give a name and put front and center in the API.
>
> This is how this proposal works, centered around the routing info. the uAPI doesn't care what the device is. It just requires the user to specify the user view of routing info (device fd + optional pasid) to tag an IOAS.

Which works as long as (just device fd) and (device fd + PASID) covers all the options. If we have new possibilities we need new interfaces. And, that can't even handle the case of one endpoint for multiple devices (e.g. an ACS-incapable bridge).

> the user view is then converted to the kernel view of routing (rid or rid+pasid) by the vfio driver and then passed to the iommu fd in the attaching operation. and a GET_INFO interface is provided for the user to check whether a device supports multiple IOASes and whether pasid space is delegated to the user. We just need a better name if pasid is considered too pci specific...
>
> But creating an endpoint per ioasid and making it centered in the uAPI is not what the IOMMU layer cares about.

It's not an endpoint per ioasid. You can have multiple endpoints per ioasid, just not the other way around. As it is, multiple IOASes per device means *some* sort of disambiguation (generally by PASID) which is hard to describe generally. Having endpoints as a first-class concept makes that simpler.

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you. NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
RE: [RFC v2] /dev/iommu uAPI proposal
> From: David Gibson
> Sent: Friday, August 6, 2021 12:45 PM
>
> > > > In concept I feel the purpose of DMA endpoint is equivalent to the routing info in this proposal.
> > >
> > > Maybe? I'm afraid I never quite managed to understand the role of the routing info in your proposal.
> >
> > the IOMMU routes incoming DMA packets to a specific I/O page table, according to RID or RID+PASID carried in the packet. RID or RID+PASID is the routing information (represented by device cookie + PASID in proposed uAPI) and what the iommu driver really cares about when activating the I/O page table in the iommu.
>
> Ok, so yes, endpoint is roughly equivalent to that. But my point is that the IOMMU layer really only cares about that (device+routing) combination, not other aspects of what the device is. So that's the concept we should give a name and put front and center in the API.

This is how this proposal works, centered around the routing info. the uAPI doesn't care what the device is. It just requires the user to specify the user view of routing info (device fd + optional pasid) to tag an IOAS. the user view is then converted to the kernel view of routing (rid or rid+pasid) by the vfio driver and then passed to the iommu fd in the attaching operation. and a GET_INFO interface is provided for the user to check whether a device supports multiple IOASes and whether pasid space is delegated to the user. We just need a better name if pasid is considered too pci specific...

But creating an endpoint per ioasid and making it centered in the uAPI is not what the IOMMU layer cares about.

Thanks,
Kevin
Re: [RFC v2] /dev/iommu uAPI proposal
On Fri, Aug 06, 2021 at 02:45:26PM +1000, David Gibson wrote:

> Well, that's kind of what I'm doing. PCI currently has the notion of "default" address space for a RID, but there's no guarantee that other buses (or even future PCI extensions) will. The idea is that "endpoint" means exactly the (RID, PASID) or (SID, SSID) or whatever future variations are on that.

This is already happening in this proposal; it is why I insisted that the driver-facing API has to be very explicit. That API specifies exactly what the device silicon is doing.

However, that is placed at the IOASID level. There is no reason to create endpoint objects that are 1:1 with IOASID objects, e.g. for PASID.

We need to have clear software layers and responsibilities; I think this is where the VFIO container design has fallen behind.

The device driver is responsible to declare what TLPs the device it controls will issue.

The system layer is responsible to determine how those TLPs can be matched to IO page tables, if at all.

The IO page table layer is responsible to map the TLPs to physical memory.

Each must stay in its box and we should not create objects that smush together, say, the device and system layers, because it will only make a mess of the software design.

Since the system layer doesn't have any concrete objects in our environment (which is based on devices and IO page tables), it has to exist as metadata attached to the other two objects.

Jason
Re: [RFC v2] /dev/iommu uAPI proposal
On Tue, Aug 03, 2021 at 03:19:26AM +, Tian, Kevin wrote:
> > From: David Gibson
> > Sent: Tuesday, August 3, 2021 9:51 AM
> >
> > On Wed, Jul 28, 2021 at 04:04:24AM +, Tian, Kevin wrote:
> > > Hi, David,
> > >
> > > > From: David Gibson
> > > > Sent: Monday, July 26, 2021 12:51 PM
> > > >
> > > > On Fri, Jul 09, 2021 at 07:48:44AM +, Tian, Kevin wrote:
> > > > > /dev/iommu provides an unified interface for managing I/O page tables for devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, etc.) are expected to use this interface instead of creating their own logic to isolate untrusted device DMAs initiated by userspace.
> > > > >
> > > > > This proposal describes the uAPI of /dev/iommu and also sample sequences with VFIO as example in typical usages. The driver-facing kernel API provided by the iommu layer is still TBD, which can be discussed after consensus is made on this uAPI.
> > > > >
> > > > > It's based on a lengthy discussion starting from here:
> > > > > https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/
> > > > >
> > > > > v1 can be found here:
> > > > > https://lore.kernel.org/linux-iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.namprd12.prod.outlook.com/T/
> > > > >
> > > > > This doc is also tracked on github, though it's not very useful for v1->v2 given dramatic refactoring:
> > > > > https://github.com/luxis1999/dev_iommu_uapi
> > > >
> > > > Thanks for all your work on this, Kevin. Apart from the actual semantic improvements, I'm finding v2 significantly easier to read and understand than v1.
> > > >
> > > > [snip]
> > > > > 1.2. Attach Device to I/O address space
> > > > > +++
> > > > >
> > > > > Device attach/bind is initiated through passthrough framework uAPI.
> > > > >
> > > > > Device attaching is allowed only after a device is successfully bound to the IOMMU fd. User should provide a device cookie when binding the device through VFIO uAPI. This cookie is used when the user queries device capability/format, issues per-device iotlb invalidation and receives per-device I/O page fault data via IOMMU fd.
> > > > >
> > > > > Successful binding puts the device into a security context which isolates its DMA from the rest system. VFIO should not allow user to access the device before binding is completed. Similarly, VFIO should prevent the user from unbinding the device before user access is withdrawn.
> > > > >
> > > > > When a device is in an iommu group which contains multiple devices, all devices within the group must enter/exit the security context together. Please check {1.3} for more info about group isolation via this device-centric design.
> > > > >
> > > > > Successful attaching activates an I/O address space in the IOMMU, if the device is not purely software mediated. VFIO must provide device specific routing information for where to install the I/O page table in the IOMMU for this device. VFIO must also guarantee that the attached device is configured to compose DMAs with the routing information that is provided in the attaching call. When handling DMA requests, IOMMU identifies the target I/O address space according to the routing information carried in the request. Misconfiguration breaks DMA isolation thus could lead to severe security vulnerability.
> > > > >
> > > > > Routing information is per-device and bus specific. For PCI, it is Requester ID (RID) identifying the device plus optional Process Address Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub-Stream ID (SSID). PASID or SSID is used when multiple I/O address spaces are enabled on a single device. For simplicity and continuity reason the following context uses RID+PASID though SID+SSID may sound a clearer naming from device p.o.v. We can decide the actual naming when coding.
> > > > >
> > > > > Because one I/O address space can be attached by multiple devices, per-device routing information (plus device cookie) is tracked under each IOASID and is used respectively when activating the I/O address space in the IOMMU for each attached device.
> > > > >
> > > > > The device in the /dev/iommu context always refers to a physical one (pdev) which is identifiable via RID. Physically each pdev can support one default I/O address space (routed via RID) and optionally multiple non-default I/O address spaces (via RID+PASID).
> > > > >
> > > > > The device in
Re: [RFC v2] /dev/iommu uAPI proposal
On Wed, Aug 04, 2021 at 11:04:47AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 02, 2021 at 02:49:44AM +, Tian, Kevin wrote:
>
> > Can you elaborate? IMO the user only cares about the label (device cookie plus optional vPASID) which is generated by itself when doing the attaching call, and expects this virtual label being used in various spots (invalidation, page fault, etc.). How the system labels the traffic (the physical RID or RID+PASID) should be completely invisible to userspace.
>
> I don't think that is true if the vIOMMU driver is also emulating PASID. Presumably the same is true for other PASID-like schemes.

Right. The idea for an SVA-capable vIOMMU in my scheme is that the hypervisor would set up an IOAS of address type "PASID+address" with the mappings made by the guest according to its vIOMMU semantics. Then SVA-capable devices would be plugged into that IOAS by using "PASID+address" type endpoints from those devices.

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you. NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [RFC v2] /dev/iommu uAPI proposal
On Wed, Aug 04, 2021 at 11:07:42AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 03, 2021 at 11:58:54AM +1000, David Gibson wrote:
>
> > > I'd rather deduce the endpoint from a collection of devices than the other way around...
> >
> > Which I think is confusing, and in any case doesn't cover the case of one "device" with multiple endpoints.
>
> Well, they are both confusing, and I'd prefer to focus on the common case without extra mandatory steps. Exposing optional endpoint-sharing information seems more in line with where everything is going than making endpoint sharing a first-class object.
>
> AFAIK a device with multiple endpoints where those endpoints are shared with other devices doesn't really exist or isn't useful? E.g. PASID has multiple RIDs but they are not shared.

No, I can't think of a (non-contrived) example where a device would have *both* multiple endpoints and those endpoints shared amongst multiple devices. I can easily think of examples where a device has multiple (non-shared) endpoints, and where multiple devices share a single endpoint.

The point is that making endpoints explicit separates the various options here from the logic of the IOMMU layer itself. New device types with new possibilities here mean new interfaces *on those devices*, but not new interfaces on /dev/iommu.

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you. NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
RE: [RFC v2] /dev/iommu uAPI proposal
> From: Jason Gunthorpe
> Sent: Thursday, August 5, 2021 7:27 PM
>
> On Wed, Aug 04, 2021 at 10:59:21PM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Wednesday, August 4, 2021 10:05 PM
> > >
> > > On Mon, Aug 02, 2021 at 02:49:44AM +, Tian, Kevin wrote:
> > >
> > > > Can you elaborate? IMO the user only cares about the label (device cookie plus optional vPASID) which is generated by itself when doing the attaching call, and expects this virtual label being used in various spots (invalidation, page fault, etc.). How the system labels the traffic (the physical RID or RID+PASID) should be completely invisible to userspace.
> > >
> > > I don't think that is true if the vIOMMU driver is also emulating PASID. Presumably the same is true for other PASID-like schemes.
> >
> > I'm getting even more confused with this comment. Isn't it the consensus from day one that physical PASID should not be exposed to userspace as doing so breaks live migration?
>
> Uh, no?
>
> > with PASID emulation vIOMMU only cares about vPASID instead of pPASID, and the uAPI only requires user to register vPASID instead of reporting pPASID back to userspace...
>
> vPASID is only a feature of one device in existence, so we can't make vPASID mandatory.

sure. my point is just that if vPASID is being emulated there is no need of exposing pPASID to user space. Can you give a concrete example where pPASID must be exposed and how the user wants to use this information?

Thanks,
Kevin
Re: [RFC v2] /dev/iommu uAPI proposal
On Wed, Aug 04, 2021 at 10:59:21PM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Wednesday, August 4, 2021 10:05 PM
> >
> > On Mon, Aug 02, 2021 at 02:49:44AM +, Tian, Kevin wrote:
> >
> > > Can you elaborate? IMO the user only cares about the label (device cookie plus optional vPASID) which is generated by itself when doing the attaching call, and expects this virtual label being used in various spots (invalidation, page fault, etc.). How the system labels the traffic (the physical RID or RID+PASID) should be completely invisible to userspace.
> >
> > I don't think that is true if the vIOMMU driver is also emulating PASID. Presumably the same is true for other PASID-like schemes.
>
> I'm getting even more confused with this comment. Isn't it the consensus from day one that physical PASID should not be exposed to userspace as doing so breaks live migration?

Uh, no?

> with PASID emulation vIOMMU only cares about vPASID instead of pPASID, and the uAPI only requires user to register vPASID instead of reporting pPASID back to userspace...

vPASID is only a feature of one device in existence, so we can't make vPASID mandatory.

Jason
RE: [RFC v2] /dev/iommu uAPI proposal
> From: Eric Auger > Sent: Wednesday, August 4, 2021 11:59 PM > [...] > > 1.2. Attach Device to I/O address space > > +++ > > > > Device attach/bind is initiated through passthrough framework uAPI. > > > > Device attaching is allowed only after a device is successfully bound to > > the IOMMU fd. User should provide a device cookie when binding the > > device through VFIO uAPI. This cookie is used when the user queries > > device capability/format, issues per-device iotlb invalidation and > > receives per-device I/O page fault data via IOMMU fd. > > > > Successful binding puts the device into a security context which isolates > > its DMA from the rest system. VFIO should not allow user to access the > s/from the rest system/from the rest of the system > > device before binding is completed. Similarly, VFIO should prevent the > > user from unbinding the device before user access is withdrawn. > With Intel scalable IOV, I understand you could assign an RID/PASID to > one VM and another one to another VM (which is not the case for ARM). Is > it a targetted use case?How would it be handled? Is it related to the > sub-groups evoked hereafter? Not related to sub-group. Each mdev is bound to the IOMMU fd respectively with the defPASID which represents the mdev. > > Actually all devices bound to an IOMMU fd should have the same parent > I/O address space or root address space, am I correct? If so, maybe add > this comment explicitly? in most cases yes but it's not mandatory. multiple roots are allowed (e.g. with vIOMMU but no nesting). [...] > > The device in the /dev/iommu context always refers to a physical one > > (pdev) which is identifiable via RID. Physically each pdev can support > > one default I/O address space (routed via RID) and optionally multiple > > non-default I/O address spaces (via RID+PASID). > > > > The device in VFIO context is a logic concept, being either a physical > > device (pdev) or mediated device (mdev or subdev). 
Each vfio device > > is represented by RID+cookie in IOMMU fd. User is allowed to create > > one default I/O address space (routed by vRID from user p.o.v) per > > each vfio_device. > The concept of default address space is not fully clear for me. I > currently understand this is a > root address space (not nesting). Is that correct? This may need > clarification. w/o PASID there is only one address space (either GPA or GIOVA) per device. This one is called default. whether it's root is orthogonal to the device view of this space (e.g. GIOVA could also be nested). w/ PASID additional address spaces can be targeted by the device. those are called non-default. I could also rename default to RID address space and non-default to RID+PASID address space if doing so makes it clearer. > > VFIO decides the routing information for this default > > space based on device type: > > > > 1) pdev, routed via RID; > > > > 2) mdev/subdev with IOMMU-enforced DMA isolation, routed via > > the parent's RID plus the PASID marking this mdev; > > > > 3) a purely sw-mediated device (sw mdev), no routing required i.e. no > > need to install the I/O page table in the IOMMU. sw mdev just uses > > the metadata to assist its internal DMA isolation logic on top of > > the parent's IOMMU page table; > Maybe you should introduce this concept of SW mediated device earlier > because it seems to special case the way the attach behaves. I am > especially referring to > > "Successful attaching activates an I/O address space in the IOMMU, if the > device is not purely software mediated" makes sense. > > > > > In addition, VFIO may allow user to create additional I/O address spaces > > on a vfio_device based on the hardware capability. In such case the user > > has its own view of the virtual routing information (vPASID) when marking > > these non-default address spaces. > I don't catch what "marking these non-default address spaces" means. 
as explained above, those non-default address spaces are identified/routed via PASID. > > > > 1.3. Group isolation > > [...] > > > > 1) A successful binding call for the first device in the group creates > > the security context for the entire group, by: > > > > * Verifying group viability in a similar way as VFIO does; > > > > * Calling IOMMU-API to move the group into a block-dma state, > > which makes all devices in the group attached to a block-dma > > domain with an empty I/O page table; > this block-dma state/domain would deserve to be better defined (I know > you already evoked it in 1.1 with the dma mapping protocol though) > activates an empty I/O page table in the IOMMU (if the device is not > purely SW mediated)? sure. some explanations are scattered in the following paragraph, but I can consider clarifying it further. > How does that relate to the default address space? Is it the same? different. this block-dma domain doesn't hold any valid mapping
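The group-isolation sequence discussed here (verify group viability, then park the whole group in a block-dma domain with an empty I/O page table on the first bind) can be sketched as a small, testable state model. All names and fields below are illustrative, not part of the proposed uAPI:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the group-isolation rule: the first successful bind
 * verifies group viability and moves the whole group into a block-dma
 * domain (an empty I/O page table); the last unbind releases it. */
enum grp_state { GRP_FREE, GRP_BLOCK_DMA };

struct iommu_group {
    enum grp_state state;
    int bound;     /* devices in this group currently bound to an IOMMU fd */
    bool viable;   /* stand-in for VFIO-style viability checking           */
};

static int bind_device(struct iommu_group *g)
{
    if (!g->viable)
        return -1;                /* binding fails for a non-viable group */
    if (g->bound++ == 0)
        g->state = GRP_BLOCK_DMA; /* first bind blocks DMA for the group  */
    return 0;
}

static void unbind_device(struct iommu_group *g)
{
    if (--g->bound == 0)
        g->state = GRP_FREE;      /* last unbind tears the context down   */
}

/* First bind on a viable group leaves it in the block-dma state. */
static enum grp_state demo_first_bind(void)
{
    struct iommu_group g = { GRP_FREE, 0, true };
    (void)bind_device(&g);
    return g.state;
}

/* Bind followed by the matching unbind returns the group to free. */
static enum grp_state demo_bind_unbind(void)
{
    struct iommu_group g = { GRP_FREE, 0, true };
    (void)bind_device(&g);
    unbind_device(&g);
    return g.state;
}
```

Note the asymmetry this models: the blocking domain is a property of the whole group, not of any single device, which is exactly why the first bind (not each bind) creates it.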
RE: [RFC v2] /dev/iommu uAPI proposal
> From: Jason Gunthorpe > Sent: Wednesday, August 4, 2021 10:05 PM > > On Mon, Aug 02, 2021 at 02:49:44AM +, Tian, Kevin wrote: > > > Can you elaborate? IMO the user only cares about the label (device cookie > > plus optional vPASID) which is generated by itself when doing the attaching > > call, and expects this virtual label being used in various spots > > (invalidation, > > page fault, etc.). How the system labels the traffic (the physical RID or > > RID+ > > PASID) should be completely invisible to userspace. > > I don't think that is true if the vIOMMU driver is also emulating > PASID. Presumably the same is true for other PASID-like schemes. > I'm getting even more confused with this comment. Isn't it the consensus from day one that physical PASID should not be exposed to userspace as doing so breaks live migration? with PASID emulation vIOMMU only cares about vPASID instead of pPASID, and the uAPI only requires user to register vPASID instead of reporting pPASID back to userspace... Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
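The vPASID/pPASID split Kevin describes, where the user only ever registers and observes virtual PASIDs while the kernel privately picks the physical ones, amounts to a per-device translation table. A minimal sketch, with purely illustrative values and function names (nothing here is from the proposal):

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_PASID UINT32_MAX

/* Hypothetical per-device vPASID -> pPASID table: the guest/user deals
 * only in vPASIDs; the kernel-chosen pPASID never leaks back out, which
 * is what keeps live migration workable. */
struct pasid_xlate {
    uint32_t vpasid;
    uint32_t ppasid;
};

static const struct pasid_xlate dev_table[] = {
    { 1, 0x2001 },   /* guest's vPASID 1, backed by a kernel-chosen pPASID */
    { 2, 0x2002 },
};

/* Kernel-internal direction: what the IOMMU is actually programmed with. */
static uint32_t to_ppasid(uint32_t vpasid)
{
    for (unsigned i = 0; i < sizeof(dev_table) / sizeof(dev_table[0]); i++)
        if (dev_table[i].vpasid == vpasid)
            return dev_table[i].ppasid;
    return INVALID_PASID;
}

/* User-visible direction: fault/invalidation data is reported back in
 * vPASID terms, so the physical value stays invisible to userspace. */
static uint32_t to_vpasid(uint32_t ppasid)
{
    for (unsigned i = 0; i < sizeof(dev_table) / sizeof(dev_table[0]); i++)
        if (dev_table[i].ppasid == ppasid)
            return dev_table[i].vpasid;
    return INVALID_PASID;
}
```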
Re: [RFC v2] /dev/iommu uAPI proposal
Hi Kevin, Few comments/questions below. On 7/9/21 9:48 AM, Tian, Kevin wrote: > /dev/iommu provides an unified interface for managing I/O page tables for > devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, > etc.) are expected to use this interface instead of creating their own logic > to > isolate untrusted device DMAs initiated by userspace. > > This proposal describes the uAPI of /dev/iommu and also sample sequences > with VFIO as example in typical usages. The driver-facing kernel API provided > by the iommu layer is still TBD, which can be discussed after consensus is > made on this uAPI. > > It's based on a lengthy discussion starting from here: > > https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/ > > v1 can be found here: > > https://lore.kernel.org/linux-iommu/ph0pr12mb54811863b392c644e5365446dc...@ph0pr12mb5481.namprd12.prod.outlook.com/T/ > > This doc is also tracked on github, though it's not very useful for v1->v2 > given dramatic refactoring: > https://github.com/luxis1999/dev_iommu_uapi > > Changelog (v1->v2): > - Rename /dev/ioasid to /dev/iommu (Jason); > - Add a section for device-centric vs. group-centric design (many); > - Add a section for handling no-snoop DMA (Jason/Alex/Paolo); > - Add definition of user/kernel/shared I/O page tables (Baolu/Jason); > - Allow one device bound to multiple iommu fd's (Jason); > - No need to track user I/O page tables in kernel on ARM/AMD (Jean/Jason); > - Add a device cookie for iotlb invalidation and fault handling (Jean/Jason); > - Add capability/format query interface per device cookie (Jason); > - Specify format/attribute when creating an IOASID, leading to several v1 > uAPI commands removed (Jason); > - Explain the value of software nesting (Jean); > - Replace IOASID_REGISTER_VIRTUAL_MEMORY with software nesting (David/Jason); > - Cover software mdev usage (Jason); > - No restriction on map/unmap vs. 
bind/invalidate (Jason/David); > - Report permitted IOVA range instead of reserved range (David); > - Refine the sample structures and helper functions (Jason); > - Add definition of default and non-default I/O address spaces; > - Expand and clarify the design for PASID virtualization; > - and lots of subtle refinement according to above changes; > > TOC > > 1. Terminologies and Concepts > 1.1. Manage I/O address space > 1.2. Attach device to I/O address space > 1.3. Group isolation > 1.4. PASID virtualization > 1.4.1. Devices which don't support DMWr > 1.4.2. Devices which support DMWr > 1.4.3. Mix different types together > 1.4.4. User sequence > 1.5. No-snoop DMA > 2. uAPI Proposal > 2.1. /dev/iommu uAPI > 2.2. /dev/vfio device uAPI > 2.3. /dev/kvm uAPI > 3. Sample Structures and Helper Functions > 4. Use Cases and Flows > 4.1. A simple example > 4.2. Multiple IOASIDs (no nesting) > 4.3. IOASID nesting (software) > 4.4. IOASID nesting (hardware) > 4.5. Guest SVA (vSVA) > 4.6. I/O page fault > > > 1. Terminologies and Concepts > - > > IOMMU fd is the container holding multiple I/O address spaces. User > manages those address spaces through fd operations. Multiple fd's are > allowed per process, but with this proposal one fd should be sufficient for > all intended usages. > > IOASID is the fd-local software handle representing an I/O address space. > Each IOASID is associated with a single I/O page table. IOASIDs can be > nested together, implying the output address from one I/O page table > (represented by child IOASID) must be further translated by another I/O > page table (represented by parent IOASID). > > An I/O address space takes effect only after it is attached by a device. > One device is allowed to attach to multiple I/O address spaces. One I/O > address space can be attached by multiple devices. > > Device must be bound to an IOMMU fd before attach operation can be > conducted. Though not necessary, user could bind one device to multiple > IOMMU FD's. 
But no cross-FD IOASID nesting is allowed. > > The format of an I/O page table must be compatible to the attached > devices (or more specifically to the IOMMU which serves the DMA from > the attached devices). User is responsible for specifying the format > when allocating an IOASID, according to one or multiple devices which > will be attached right after. Attaching a device to an IOASID with > incompatible format is simply rejected. > > Relationship between IOMMU fd, VFIO fd and KVM fd: > > - IOMMU fd provides uAPI for managing IOASIDs and I/O page tables. > It also provides an unified capability/format reporting interface for > each bound device. > > - VFIO fd provides uAPI for device binding and attaching. In this proposal > VFIO is used as the example of device passthrough frameworks. The > routing information that identifies an I/O addre
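The nesting semantics defined above (the output address of a child IOASID's page table is translated again by its parent's) can be sketched as a chain walk. The page tables are reduced to constant offsets purely so the example stays short and checkable; this is an illustration of the semantics, not of any real IOMMU format:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model of IOASID nesting: each IOASID owns one
 * I/O page table (modeled as a single linear offset for brevity), and a
 * child's output address must be translated again by its parent. */
struct ioasid {
    const struct ioasid *parent;  /* NULL for a root IOASID */
    uint64_t offset;              /* stand-in for a real I/O page table */
};

/* Resolve an IOVA through the whole nesting chain, child first. */
static uint64_t ioasid_translate(const struct ioasid *ias, uint64_t iova)
{
    uint64_t addr = iova;
    for (; ias; ias = ias->parent)
        addr += ias->offset;      /* apply this level's "page table" */
    return addr;
}

/* Two-level example: GIOVA -> GPA (child table) -> HPA (parent table). */
static uint64_t demo_nested(void)
{
    static const struct ioasid gpa   = { NULL, 0x100000 }; /* parent */
    static const struct ioasid giova = { &gpa, 0x2000 };   /* child  */
    return ioasid_translate(&giova, 0x10);
}
```

The cross-FD restriction quoted above maps naturally onto this model: a `parent` pointer may only reference an IOASID created on the same fd.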
Re: [RFC v2] /dev/iommu uAPI proposal
On Tue, Aug 03, 2021 at 11:58:54AM +1000, David Gibson wrote: > > I'd rather deduce the endpoint from a collection of devices than the > > other way around... > > Which I think is confusing, and in any case doesn't cover the case of > one "device" with multiple endpoints. Well they are both confusing, and I'd prefer to focus on the common case without extra mandatory steps. Exposing optional endpoint sharing information seems more in line with where everything is going than making endpoint sharing a first class object. AFAIK a device with multiple endpoints where those endpoints are shared with other devices doesn't really exist or isn't useful? Eg PASID has multiple RIDs but they are not shared. Jason 
Re: [RFC v2] /dev/iommu uAPI proposal
On Mon, Aug 02, 2021 at 02:49:44AM +, Tian, Kevin wrote: > Can you elaborate? IMO the user only cares about the label (device cookie > plus optional vPASID) which is generated by itself when doing the attaching > call, and expects this virtual label being used in various spots > (invalidation, > page fault, etc.). How the system labels the traffic (the physical RID or RID+ > PASID) should be completely invisible to userspace. I don't think that is true if the vIOMMU driver is also emulating PASID. Presumably the same is true for other PASID-like schemes. Jason 
RE: [RFC v2] /dev/iommu uAPI proposal
> From: David Gibson > Sent: Tuesday, August 3, 2021 9:51 AM > > On Wed, Jul 28, 2021 at 04:04:24AM +, Tian, Kevin wrote: > > Hi, David, > > > > > From: David Gibson > > > Sent: Monday, July 26, 2021 12:51 PM > > > > > > On Fri, Jul 09, 2021 at 07:48:44AM +, Tian, Kevin wrote: > > > > /dev/iommu provides an unified interface for managing I/O page tables > for > > > > devices assigned to userspace. Device passthrough frameworks (VFIO, > > > vDPA, > > > > etc.) are expected to use this interface instead of creating their own > logic to > > > > isolate untrusted device DMAs initiated by userspace. > > > > > > > > This proposal describes the uAPI of /dev/iommu and also sample > > > sequences > > > > with VFIO as example in typical usages. The driver-facing kernel API > > > provided > > > > by the iommu layer is still TBD, which can be discussed after consensus > is > > > > made on this uAPI. > > > > > > > > It's based on a lengthy discussion starting from here: > > > > https://lore.kernel.org/linux- > > > iommu/20210330132830.go2356...@nvidia.com/ > > > > > > > > v1 can be found here: > > > > https://lore.kernel.org/linux- > > > > iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.n > > > amprd12.prod.outlook.com/T/ > > > > > > > > This doc is also tracked on github, though it's not very useful for > > > > v1->v2 > > > > given dramatic refactoring: > > > > https://github.com/luxis1999/dev_iommu_uapi > > > > > > Thanks for all your work on this, Kevin. Apart from the actual > > > semantic improvements, I'm finding v2 significantly easier to read and > > > understand than v1. > > > > > > [snip] > > > > 1.2. Attach Device to I/O address space > > > > +++ > > > > > > > > Device attach/bind is initiated through passthrough framework uAPI. > > > > > > > > Device attaching is allowed only after a device is successfully bound to > > > > the IOMMU fd. User should provide a device cookie when binding the > > > > device through VFIO uAPI. 
This cookie is used when the user queries > > > > device capability/format, issues per-device iotlb invalidation and > > > > receives per-device I/O page fault data via IOMMU fd. > > > > > > > > Successful binding puts the device into a security context which > > > > isolates > > > > its DMA from the rest system. VFIO should not allow user to access the > > > > device before binding is completed. Similarly, VFIO should prevent the > > > > user from unbinding the device before user access is withdrawn. > > > > > > > > When a device is in an iommu group which contains multiple devices, > > > > all devices within the group must enter/exit the security context > > > > together. Please check {1.3} for more info about group isolation via > > > > this device-centric design. > > > > > > > > Successful attaching activates an I/O address space in the IOMMU, > > > > if the device is not purely software mediated. VFIO must provide device > > > > specific routing information for where to install the I/O page table in > > > > the IOMMU for this device. VFIO must also guarantee that the attached > > > > device is configured to compose DMAs with the routing information > that > > > > is provided in the attaching call. When handling DMA requests, IOMMU > > > > identifies the target I/O address space according to the routing > > > > information carried in the request. Misconfiguration breaks DMA > > > > isolation thus could lead to severe security vulnerability. > > > > > > > > Routing information is per-device and bus specific. For PCI, it is > > > > Requester ID (RID) identifying the device plus optional Process Address > > > > Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub- > Stream > > > > ID (SSID). PASID or SSID is used when multiple I/O address spaces are > > > > enabled on a single device. For simplicity and continuity reason the > > > > following context uses RID+PASID though SID+SSID may sound a clearer > > > > naming from device p.o.v. 
We can decide the actual naming when > coding. > > > > > > > > Because one I/O address space can be attached by multiple devices, > > > > per-device routing information (plus device cookie) is tracked under > > > > each IOASID and is used respectively when activating the I/O address > > > > space in the IOMMU for each attached device. > > > > > > > > The device in the /dev/iommu context always refers to a physical one > > > > (pdev) which is identifiable via RID. Physically each pdev can support > > > > one default I/O address space (routed via RID) and optionally multiple > > > > non-default I/O address spaces (via RID+PASID). > > > > > > > > The device in VFIO context is a logic concept, being either a physical > > > > device (pdev) or mediated device (mdev or subdev). Each vfio device > > > > is represented by RID+cookie in IOMMU fd. User is allowed to create > > > > one default I/O address space (routed by vRID from user p.o.v) per > > > > each vfio_device.
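The routing rules quoted above (RID only for a device's default address space, RID+PASID for its non-default ones, with one record tracked per attach) can be modeled as a small lookup table. All numbers and identifiers here are illustrative, not from the proposal:

```c
#include <assert.h>
#include <stdint.h>

#define NO_PASID UINT32_MAX  /* the DMA request carries a RID but no PASID */

/* Hypothetical model of the per-IOASID routing records: each attach
 * registers (RID [, PASID]) -> IOASID.  A RID-only request resolves to
 * the default address space; RID+PASID resolves to a non-default one. */
struct route {
    uint16_t rid;
    uint32_t pasid;      /* NO_PASID marks the default address space */
    int      ioasid;
};

static const struct route routes[] = {
    { 0x0010, NO_PASID, 1 },  /* pdev: default space, routed via RID     */
    { 0x0010, 5,        2 },  /* mdev: parent's RID + the PASID marking it */
};

/* Mimics how the IOMMU picks the target I/O address space from the
 * routing information carried in a DMA request. */
static int route_lookup(uint16_t rid, uint32_t pasid)
{
    for (unsigned i = 0; i < sizeof(routes) / sizeof(routes[0]); i++)
        if (routes[i].rid == rid && routes[i].pasid == pasid)
            return routes[i].ioasid;
    return -1;  /* unmatched traffic is blocked */
}
```

This also makes the misconfiguration warning in the quoted text concrete: if the device composes DMAs with routing information other than what was registered at attach time, the lookup either misses (DMA blocked) or, worse, hits someone else's address space.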
Re: [RFC v2] /dev/iommu uAPI proposal
On Fri, Jul 30, 2021 at 11:51:23AM -0300, Jason Gunthorpe wrote: > On Mon, Jul 26, 2021 at 02:50:48PM +1000, David Gibson wrote: > > > That said, I'm still finding the various ways a device can attach to > > an ioasid pretty confusing. Here are some thoughts on some extra > > concepts that might make it easier to handle [note, I haven't thought > > this all the way through so far, so there might be fatal problems with > > this approach]. > > I think you've summarized how I've been viewing this problem. All the > concepts you pointed to should show through in the various APIs at the > end, one way or another. > > How much we need to expose to userspace, I don't know. > > Does userspace need to care how the system labels traffic between DMA > endpoint and the IOASID? At some point maybe yes since stuff like > PASID does leak out in various spots Yeah, I'm not sure. I think it probably doesn't for the "main path" of the API, though we might want to expose that for debugging and some edge cases. We *should* however be exposing the address type for each IOAS, since that affects how your MAP operations will work, as well as what endpoints are compatible with the IOAS. > > /dev/iommu would work entirely (or nearly so) in terms of endpoint > > handles, not device handles. Endpoints are what get bound to an IOAS, > > and endpoints are what get the user chosen endpoint cookie. > > While an accurate modeling of groups, it feels like an > overcomplication at this point in history where new HW largely doesn't > need it. So.. first, is that really true across the board? I expect it's true of high end server hardware, but for consumer level and embedded hardware as well? Then there's virtual hardware - I could point to several things still routinely using emulated PCIe to PCI bridges in qemu. Second, we can't just ignore older hardware. 
> The user interface VFIO and others presents is device > centric, inserting a new endpoint object is going back to some > kind of group centric view of the world. Well, kind of, yeah, because I still think the concept has value. Part of the trouble is that "device" is pretty ambiguous. "Device" in the sense of a PCI address for the register interface is not the same as "device" in terms of RID / DMA identifiability, which is not the same as "device" in terms of Linux struct device. > I'd rather deduce the endpoint from a collection of devices than the > other way around... Which I think is confusing, and in any case doesn't cover the case of one "device" with multiple endpoints. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson 
Re: [RFC v2] /dev/iommu uAPI proposal
On Wed, Jul 28, 2021 at 04:04:24AM +, Tian, Kevin wrote: > Hi, David, > > > From: David Gibson > > Sent: Monday, July 26, 2021 12:51 PM > > > > On Fri, Jul 09, 2021 at 07:48:44AM +, Tian, Kevin wrote: > > > /dev/iommu provides an unified interface for managing I/O page tables for > > > devices assigned to userspace. Device passthrough frameworks (VFIO, > > vDPA, > > > etc.) are expected to use this interface instead of creating their own > > > logic to > > > isolate untrusted device DMAs initiated by userspace. > > > > > > This proposal describes the uAPI of /dev/iommu and also sample > > sequences > > > with VFIO as example in typical usages. The driver-facing kernel API > > provided > > > by the iommu layer is still TBD, which can be discussed after consensus is > > > made on this uAPI. > > > > > > It's based on a lengthy discussion starting from here: > > > https://lore.kernel.org/linux- > > iommu/20210330132830.go2356...@nvidia.com/ > > > > > > v1 can be found here: > > > https://lore.kernel.org/linux- > > iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.n > > amprd12.prod.outlook.com/T/ > > > > > > This doc is also tracked on github, though it's not very useful for v1->v2 > > > given dramatic refactoring: > > > https://github.com/luxis1999/dev_iommu_uapi > > > > Thanks for all your work on this, Kevin. Apart from the actual > > semantic improvements, I'm finding v2 significantly easier to read and > > understand than v1. > > > > [snip] > > > 1.2. Attach Device to I/O address space > > > +++ > > > > > > Device attach/bind is initiated through passthrough framework uAPI. > > > > > > Device attaching is allowed only after a device is successfully bound to > > > the IOMMU fd. User should provide a device cookie when binding the > > > device through VFIO uAPI. This cookie is used when the user queries > > > device capability/format, issues per-device iotlb invalidation and > > > receives per-device I/O page fault data via IOMMU fd. 
> > > > > > Successful binding puts the device into a security context which isolates > > > its DMA from the rest system. VFIO should not allow user to access the > > > device before binding is completed. Similarly, VFIO should prevent the > > > user from unbinding the device before user access is withdrawn. > > > > > > When a device is in an iommu group which contains multiple devices, > > > all devices within the group must enter/exit the security context > > > together. Please check {1.3} for more info about group isolation via > > > this device-centric design. > > > > > > Successful attaching activates an I/O address space in the IOMMU, > > > if the device is not purely software mediated. VFIO must provide device > > > specific routing information for where to install the I/O page table in > > > the IOMMU for this device. VFIO must also guarantee that the attached > > > device is configured to compose DMAs with the routing information that > > > is provided in the attaching call. When handling DMA requests, IOMMU > > > identifies the target I/O address space according to the routing > > > information carried in the request. Misconfiguration breaks DMA > > > isolation thus could lead to severe security vulnerability. > > > > > > Routing information is per-device and bus specific. For PCI, it is > > > Requester ID (RID) identifying the device plus optional Process Address > > > Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub-Stream > > > ID (SSID). PASID or SSID is used when multiple I/O address spaces are > > > enabled on a single device. For simplicity and continuity reason the > > > following context uses RID+PASID though SID+SSID may sound a clearer > > > naming from device p.o.v. We can decide the actual naming when coding. 
> > > > > > Because one I/O address space can be attached by multiple devices, > > > per-device routing information (plus device cookie) is tracked under > > > each IOASID and is used respectively when activating the I/O address > > > space in the IOMMU for each attached device. > > > > > > The device in the /dev/iommu context always refers to a physical one > > > (pdev) which is identifiable via RID. Physically each pdev can support > > > one default I/O address space (routed via RID) and optionally multiple > > > non-default I/O address spaces (via RID+PASID). > > > > > > The device in VFIO context is a logic concept, being either a physical > > > device (pdev) or mediated device (mdev or subdev). Each vfio device > > > is represented by RID+cookie in IOMMU fd. User is allowed to create > > > one default I/O address space (routed by vRID from user p.o.v) per > > > each vfio_device. VFIO decides the routing information for this default > > > space based on device type: > > > > > > 1) pdev, routed via RID; > > > > > > 2) mdev/subdev with IOMMU-enforced DMA isolation, routed via > > > the parent's RID plus the PASID marking this mdev; > > > > > > 3) a purely sw-
RE: [RFC v2] /dev/iommu uAPI proposal
> From: Jason Gunthorpe > Sent: Friday, July 30, 2021 10:51 PM > > On Mon, Jul 26, 2021 at 02:50:48PM +1000, David Gibson wrote: > > > That said, I'm still finding the various ways a device can attach to > > an ioasid pretty confusing. Here are some thoughts on some extra > > concepts that might make it easier to handle [note, I haven't thought > > this all the way through so far, so there might be fatal problems with > > this approach]. > > I think you've summarized how I've been viewing this problem. All the > concepts you pointed to should show through in the various APIs at the > end, one way or another. I still don't see the value of making the endpoint explicit in the /dev/iommu uAPI. From the IOMMU p.o.v it only cares how to route incoming DMA traffic to a specific I/O page table, according to RID or RID+PASID info carried in DMA packets. This has been covered by this proposal. Which DMA endpoint in the source device actually triggers the traffic is not a matter for /dev/iommu... > > How much we need to expose to userspace, I don't know. > > Does userspace need to care how the system labels traffic between DMA > endpoint and the IOASID? At some point maybe yes since stuff like > PASID does leak out in various spots > Can you elaborate? IMO the user only cares about the label (device cookie plus optional vPASID) which is generated by itself when doing the attaching call, and expects this virtual label being used in various spots (invalidation, page fault, etc.). How the system labels the traffic (the physical RID or RID+ PASID) should be completely invisible to userspace. Thanks Kevin 
Re: [RFC v2] /dev/iommu uAPI proposal
On Mon, Jul 26, 2021 at 02:50:48PM +1000, David Gibson wrote: > That said, I'm still finding the various ways a device can attach to > an ioasid pretty confusing. Here are some thoughts on some extra > concepts that might make it easier to handle [note, I haven't thought > this all the way through so far, so there might be fatal problems with > this approach]. I think you've summarized how I've been viewing this problem. All the concepts you pointed to should show through in the various APIs at the end, one way or another. How much we need to expose to userspace, I don't know. Does userspace need to care how the system labels traffic between DMA endpoint and the IOASID? At some point maybe yes since stuff like PASID does leak out in various spots > /dev/iommu would work entirely (or nearly so) in terms of endpoint > handles, not device handles. Endpoints are what get bound to an IOAS, > and endpoints are what get the user chosen endpoint cookie. While an accurate modeling of groups, it feels like an overcomplication at this point in history where new HW largely doesn't need it. The user interface VFIO and others presents is device centric, inserting a new endpoint object is going back to some kind of group centric view of the world. I'd rather deduce the endpoint from a collection of devices than the other way around... Jason 
RE: [RFC v2] /dev/iommu uAPI proposal
> From: Jean-Philippe Brucker > Sent: Monday, July 26, 2021 4:15 PM > > Hi Kevin, > > On Fri, Jul 09, 2021 at 07:48:44AM +, Tian, Kevin wrote: > > /dev/iommu provides an unified interface for managing I/O page tables for > > devices assigned to userspace. Device passthrough frameworks (VFIO, > vDPA, > > etc.) are expected to use this interface instead of creating their own > > logic to > > isolate untrusted device DMAs initiated by userspace. > > > > This proposal describes the uAPI of /dev/iommu and also sample > sequences > > with VFIO as example in typical usages. The driver-facing kernel API > provided > > by the iommu layer is still TBD, which can be discussed after consensus is > > made on this uAPI. > > The document looks good to me, I don't have other concerns at the moment > Thanks for your review. 
RE: [RFC v2] /dev/iommu uAPI proposal
Hi, David, > From: David Gibson > Sent: Monday, July 26, 2021 12:51 PM > > On Fri, Jul 09, 2021 at 07:48:44AM +, Tian, Kevin wrote: > > /dev/iommu provides an unified interface for managing I/O page tables for > > devices assigned to userspace. Device passthrough frameworks (VFIO, > vDPA, > > etc.) are expected to use this interface instead of creating their own > > logic to > > isolate untrusted device DMAs initiated by userspace. > > > > This proposal describes the uAPI of /dev/iommu and also sample > sequences > > with VFIO as example in typical usages. The driver-facing kernel API > provided > > by the iommu layer is still TBD, which can be discussed after consensus is > > made on this uAPI. > > > > It's based on a lengthy discussion starting from here: > > https://lore.kernel.org/linux- > iommu/20210330132830.go2356...@nvidia.com/ > > > > v1 can be found here: > > https://lore.kernel.org/linux- > iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.n > amprd12.prod.outlook.com/T/ > > > > This doc is also tracked on github, though it's not very useful for v1->v2 > > given dramatic refactoring: > > https://github.com/luxis1999/dev_iommu_uapi > > Thanks for all your work on this, Kevin. Apart from the actual > semantic improvements, I'm finding v2 significantly easier to read and > understand than v1. > > [snip] > > 1.2. Attach Device to I/O address space > > +++ > > > > Device attach/bind is initiated through passthrough framework uAPI. > > > > Device attaching is allowed only after a device is successfully bound to > > the IOMMU fd. User should provide a device cookie when binding the > > device through VFIO uAPI. This cookie is used when the user queries > > device capability/format, issues per-device iotlb invalidation and > > receives per-device I/O page fault data via IOMMU fd. > > > > Successful binding puts the device into a security context which isolates > > its DMA from the rest system. 
VFIO should not allow user to access the > > device before binding is completed. Similarly, VFIO should prevent the > > user from unbinding the device before user access is withdrawn. > > > > When a device is in an iommu group which contains multiple devices, > > all devices within the group must enter/exit the security context > > together. Please check {1.3} for more info about group isolation via > > this device-centric design. > > > > Successful attaching activates an I/O address space in the IOMMU, > > if the device is not purely software mediated. VFIO must provide device > > specific routing information for where to install the I/O page table in > > the IOMMU for this device. VFIO must also guarantee that the attached > > device is configured to compose DMAs with the routing information that > > is provided in the attaching call. When handling DMA requests, IOMMU > > identifies the target I/O address space according to the routing > > information carried in the request. Misconfiguration breaks DMA > > isolation thus could lead to severe security vulnerability. > > > > Routing information is per-device and bus specific. For PCI, it is > > Requester ID (RID) identifying the device plus optional Process Address > > Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub-Stream > > ID (SSID). PASID or SSID is used when multiple I/O address spaces are > > enabled on a single device. For simplicity and continuity reason the > > following context uses RID+PASID though SID+SSID may sound a clearer > > naming from device p.o.v. We can decide the actual naming when coding. > > > > Because one I/O address space can be attached by multiple devices, > > per-device routing information (plus device cookie) is tracked under > > each IOASID and is used respectively when activating the I/O address > > space in the IOMMU for each attached device. > > > > The device in the /dev/iommu context always refers to a physical one > > (pdev) which is identifiable via RID. 
Physically each pdev can support > > one default I/O address space (routed via RID) and optionally multiple > > non-default I/O address spaces (via RID+PASID). > > > > The device in VFIO context is a logic concept, being either a physical > > device (pdev) or mediated device (mdev or subdev). Each vfio device > > is represented by RID+cookie in IOMMU fd. User is allowed to create > > one default I/O address space (routed by vRID from user p.o.v) per > > each vfio_device. VFIO decides the routing information for this default > > space based on device type: > > > > 1) pdev, routed via RID; > > > > 2) mdev/subdev with IOMMU-enforced DMA isolation, routed via > > the parent's RID plus the PASID marking this mdev; > > > > 3) a purely sw-mediated device (sw mdev), no routing required i.e. no > > need to install the I/O page table in the IOMMU. sw mdev just uses > > the metadata to assist its internal DMA isolation logic on top of > > the parent's IOMMU page table; > > > > In a
Re: [RFC v2] /dev/iommu uAPI proposal
Hi Kevin, On Fri, Jul 09, 2021 at 07:48:44AM +, Tian, Kevin wrote: > /dev/iommu provides an unified interface for managing I/O page tables for > devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, > etc.) are expected to use this interface instead of creating their own logic > to > isolate untrusted device DMAs initiated by userspace. > > This proposal describes the uAPI of /dev/iommu and also sample sequences > with VFIO as example in typical usages. The driver-facing kernel API provided > by the iommu layer is still TBD, which can be discussed after consensus is > made on this uAPI. The document looks good to me, I don't have other concerns at the moment Thanks, Jean ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC v2] /dev/iommu uAPI proposal
On Fri, Jul 09, 2021 at 07:48:44AM +, Tian, Kevin wrote: > /dev/iommu provides an unified interface for managing I/O page tables for > devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, > etc.) are expected to use this interface instead of creating their own logic > to > isolate untrusted device DMAs initiated by userspace. > > This proposal describes the uAPI of /dev/iommu and also sample sequences > with VFIO as example in typical usages. The driver-facing kernel API provided > by the iommu layer is still TBD, which can be discussed after consensus is > made on this uAPI. > > It's based on a lengthy discussion starting from here: > > https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/ > > v1 can be found here: > > https://lore.kernel.org/linux-iommu/ph0pr12mb54811863b392c644e5365446dc...@ph0pr12mb5481.namprd12.prod.outlook.com/T/ > > This doc is also tracked on github, though it's not very useful for v1->v2 > given dramatic refactoring: > https://github.com/luxis1999/dev_iommu_uapi Thanks for all your work on this, Kevin. Apart from the actual semantic improvements, I'm finding v2 significantly easier to read and understand than v1. [snip] > 1.2. Attach Device to I/O address space > +++ > > Device attach/bind is initiated through passthrough framework uAPI. > > Device attaching is allowed only after a device is successfully bound to > the IOMMU fd. User should provide a device cookie when binding the > device through VFIO uAPI. This cookie is used when the user queries > device capability/format, issues per-device iotlb invalidation and > receives per-device I/O page fault data via IOMMU fd. > > Successful binding puts the device into a security context which isolates > its DMA from the rest system. VFIO should not allow user to access the > device before binding is completed. Similarly, VFIO should prevent the > user from unbinding the device before user access is withdrawn. 
> > When a device is in an iommu group which contains multiple devices, > all devices within the group must enter/exit the security context > together. Please check {1.3} for more info about group isolation via > this device-centric design. > > Successful attaching activates an I/O address space in the IOMMU, > if the device is not purely software mediated. VFIO must provide device > specific routing information for where to install the I/O page table in > the IOMMU for this device. VFIO must also guarantee that the attached > device is configured to compose DMAs with the routing information that > is provided in the attaching call. When handling DMA requests, IOMMU > identifies the target I/O address space according to the routing > information carried in the request. Misconfiguration breaks DMA > isolation thus could lead to severe security vulnerability. > > Routing information is per-device and bus specific. For PCI, it is > Requester ID (RID) identifying the device plus optional Process Address > Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub-Stream > ID (SSID). PASID or SSID is used when multiple I/O address spaces are > enabled on a single device. For simplicity and continuity reason the > following context uses RID+PASID though SID+SSID may sound a clearer > naming from device p.o.v. We can decide the actual naming when coding. > > Because one I/O address space can be attached by multiple devices, > per-device routing information (plus device cookie) is tracked under > each IOASID and is used respectively when activating the I/O address > space in the IOMMU for each attached device. > > The device in the /dev/iommu context always refers to a physical one > (pdev) which is identifiable via RID. Physically each pdev can support > one default I/O address space (routed via RID) and optionally multiple > non-default I/O address spaces (via RID+PASID). 
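The per-device routing information described above (RID, optional PASID, plus the user-provided cookie) can be pictured as a small record tracked under each IOASID at attach time. The sketch below is purely illustrative: the RFC leaves the kernel API TBD, and every name here (`ioasid_attach_info`, `routing_conflicts`, `PASID_NONE`) is invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define PASID_NONE 0xffffffffu  /* no PASID: route by RID alone (default space) */

/* Hypothetical per-device routing record tracked under an IOASID. */
struct ioasid_attach_info {
	uint32_t rid;    /* PCI Requester ID (or ARM Stream ID) */
	uint32_t pasid;  /* PASID/SSID, or PASID_NONE for the default space */
	uint64_t cookie; /* user-provided device cookie from the bind call */
};

/* Two attaches conflict if they occupy the same routing slot: the IOMMU
 * cannot disambiguate two devices behind one (RID, PASID) pair. */
static bool routing_conflicts(const struct ioasid_attach_info *a,
			      const struct ioasid_attach_info *b)
{
	return a->rid == b->rid && a->pasid == b->pasid;
}
```

This captures why a pdev's default space is keyed by RID alone while an mdev with IOMMU-enforced isolation is keyed by the parent's RID plus its PASID.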
> > The device in VFIO context is a logic concept, being either a physical > device (pdev) or mediated device (mdev or subdev). Each vfio device > is represented by RID+cookie in IOMMU fd. User is allowed to create > one default I/O address space (routed by vRID from user p.o.v) per > each vfio_device. VFIO decides the routing information for this default > space based on device type: > > 1) pdev, routed via RID; > > 2) mdev/subdev with IOMMU-enforced DMA isolation, routed via > the parent's RID plus the PASID marking this mdev; > > 3) a purely sw-mediated device (sw mdev), no routing required i.e. no > need to install the I/O page table in the IOMMU. sw mdev just uses > the metadata to assist its internal DMA isolation logic on top of > the parent's IOMMU page table; > > In addition, VFIO may allow user to create additional I/O address spaces > on a vfio_device based on the hardware capability. In such case the user > has its own view of the virtual routing information (vPASID) when
Re: [RFC v2] /dev/iommu uAPI proposal
On Wed, Jul 21, 2021 at 02:13:23AM +, Tian, Kevin wrote: > > From: Shenming Lu > > Sent: Friday, July 16, 2021 8:20 PM > > > > On 2021/7/16 9:20, Tian, Kevin wrote: > > > To summarize, for vIOMMU we can work with the spec owner to > > > define a proper interface to feedback such restriction into the guest > > > if necessary. For the kernel part, it's clear that IOMMU fd should > > > disallow two devices attached to a single [RID] or [RID, PASID] slot > > > in the first place. > > > > > > Then the next question is how to communicate such restriction > > > to the userspace. It sounds like a group, but different in concept. > > > An iommu group describes the minimal isolation boundary thus all > > > devices in the group can be only assigned to a single user. But this > > > case is opposite - the two mdevs (both support ENQCMD submission) > > > with the same parent have problem when assigned to a single VM > > > (in this case vPASID is vm-wide translated thus a same pPASID will be > > > used cross both mdevs) while they instead work pretty well when > > > assigned to different VMs (completely different vPASID spaces thus > > > different pPASIDs). > > > > > > One thought is to have vfio device driver deal with it. In this proposal > > > it is the vfio device driver to define the PASID virtualization policy and > > > report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands > > > the restriction thus could just hide the vPASID capability when the user > > > calls GET_INFO on the 2nd mdev in above scenario. In this way the > > > user even doesn't need to know such restriction at all and both mdevs > > > can be assigned to a single VM w/o any problem. > > > > > > > The restriction only probably happens when two mdevs are assigned to one > > VM, > > how could the vfio device driver get to know this info to accurately hide > > the vPASID capability for the 2nd mdev when VFIO_DEVICE_GET_INFO? > > There is no > > need to do this in other cases. 
> > > > I suppose the driver can detect it via whether two mdevs are opened by a > single process. Just have the kernel expose some ID for the PASID numberspace - devices with the same ID have to be represented as a single RID. Jason
RE: [RFC v2] /dev/iommu uAPI proposal
> From: Shenming Lu > Sent: Friday, July 16, 2021 8:20 PM > > On 2021/7/16 9:20, Tian, Kevin wrote: > > To summarize, for vIOMMU we can work with the spec owner to > > define a proper interface to feedback such restriction into the guest > > if necessary. For the kernel part, it's clear that IOMMU fd should > > disallow two devices attached to a single [RID] or [RID, PASID] slot > > in the first place. > > > > Then the next question is how to communicate such restriction > > to the userspace. It sounds like a group, but different in concept. > > An iommu group describes the minimal isolation boundary thus all > > devices in the group can be only assigned to a single user. But this > > case is opposite - the two mdevs (both support ENQCMD submission) > > with the same parent have problem when assigned to a single VM > > (in this case vPASID is vm-wide translated thus a same pPASID will be > > used cross both mdevs) while they instead work pretty well when > > assigned to different VMs (completely different vPASID spaces thus > > different pPASIDs). > > > > One thought is to have vfio device driver deal with it. In this proposal > > it is the vfio device driver to define the PASID virtualization policy and > > report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands > > the restriction thus could just hide the vPASID capability when the user > > calls GET_INFO on the 2nd mdev in above scenario. In this way the > > user even doesn't need to know such restriction at all and both mdevs > > can be assigned to a single VM w/o any problem. > > > > The restriction only probably happens when two mdevs are assigned to one > VM, > how could the vfio device driver get to know this info to accurately hide > the vPASID capability for the 2nd mdev when VFIO_DEVICE_GET_INFO? > There is no > need to do this in other cases. > I suppose the driver can detect it via whether two mdevs are opened by a single process. 
Thanks Kevin
RE: [RFC v2] /dev/iommu uAPI proposal
> From: Jason Gunthorpe > Sent: Saturday, July 17, 2021 2:30 AM > > On Fri, Jul 16, 2021 at 01:20:15AM +, Tian, Kevin wrote: > > > One thought is to have vfio device driver deal with it. In this proposal > > it is the vfio device driver to define the PASID virtualization policy and > > report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands > > the restriction thus could just hide the vPASID capability when the user > > calls GET_INFO on the 2nd mdev in above scenario. In this way the > > user even doesn't need to know such restriction at all and both mdevs > > can be assigned to a single VM w/o any problem. > > I think it makes more sense to expose some kind of "pasid group" to > qemu that identifies that each PASID must be unique across the > group. For vIOMMUs that are doing funky things with the RID This means > a single PASID group must not be exposed as two RIDs in the guest. > It's an interesting idea. Some aspects are still unclear to me now e.g. how to describe such restriction in a way that it's applied only to a single user owning the group (not the case when assigned to different users), whether it can be generalized cross subsystems (vPASID being a vfio-managed resource), etc. Let's refine it when working on the actual implementation. > If the kernel blocks it then it can never be fixed by updating the > vIOMMU design. > But the mdev driver can choose to do so. Should we prevent it? btw, just curious whether you have got a chance to have a full review on v2. I wonder when might be a good time to discuss the execution plan following this proposal, if no major open remains... Thanks Kevin
Re: [RFC v2] /dev/iommu uAPI proposal
On Fri, Jul 16, 2021 at 01:20:15AM +, Tian, Kevin wrote: > One thought is to have vfio device driver deal with it. In this proposal > it is the vfio device driver to define the PASID virtualization policy and > report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands > the restriction thus could just hide the vPASID capability when the user > calls GET_INFO on the 2nd mdev in above scenario. In this way the > user even doesn't need to know such restriction at all and both mdevs > can be assigned to a single VM w/o any problem. I think it makes more sense to expose some kind of "pasid group" to qemu that identifies that each PASID must be unique across the group. For vIOMMUs that are doing funky things with the RID, this means a single PASID group must not be exposed as two RIDs in the guest. If the kernel blocks it then it can never be fixed by updating the vIOMMU design. Jason
Re: [RFC v2] /dev/iommu uAPI proposal
On 2021/7/16 9:20, Tian, Kevin wrote: > To summarize, for vIOMMU we can work with the spec owner to > define a proper interface to feedback such restriction into the guest > if necessary. For the kernel part, it's clear that IOMMU fd should > disallow two devices attached to a single [RID] or [RID, PASID] slot > in the first place. > > Then the next question is how to communicate such restriction > to the userspace. It sounds like a group, but different in concept. > An iommu group describes the minimal isolation boundary thus all > devices in the group can be only assigned to a single user. But this > case is opposite - the two mdevs (both support ENQCMD submission) > with the same parent have problem when assigned to a single VM > (in this case vPASID is vm-wide translated thus a same pPASID will be > used cross both mdevs) while they instead work pretty well when > assigned to different VMs (completely different vPASID spaces thus > different pPASIDs). > > One thought is to have vfio device driver deal with it. In this proposal > it is the vfio device driver to define the PASID virtualization policy and > report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands > the restriction thus could just hide the vPASID capability when the user > calls GET_INFO on the 2nd mdev in above scenario. In this way the > user even doesn't need to know such restriction at all and both mdevs > can be assigned to a single VM w/o any problem. > The restriction probably only happens when two mdevs are assigned to one VM; how could the vfio device driver get to know this info to accurately hide the vPASID capability for the 2nd mdev at VFIO_DEVICE_GET_INFO time? There is no need to do this in other cases. Thanks, Shenming
RE: [RFC v2] /dev/iommu uAPI proposal
> From: Jason Gunthorpe > Sent: Friday, July 16, 2021 2:13 AM > > On Thu, Jul 15, 2021 at 11:05:45AM -0700, Raj, Ashok wrote: > > On Thu, Jul 15, 2021 at 02:53:36PM -0300, Jason Gunthorpe wrote: > > > On Thu, Jul 15, 2021 at 10:48:36AM -0700, Raj, Ashok wrote: > > > > > > > > > Do we have any isolation requirements here? its the same process. > So if the > > > > > > page-request it sent to guest and even if you report it for mdev1, > after > > > > > > the PRQ is resolved by guest, the request from mdev2 from the > same guest > > > > > > should simply work? > > > > > > > > > > I think we already talked about this and said it should not be done. > > > > > > > > I get the should not be done, I'm wondering where should that be > > > > implemented? > > > > > > The iommu layer cannot have ambiguity. Every RID or RID,PASID slot > > > must have only one device attached to it. Attempting to connect two > > > devices to the same slot fails on the iommu layer. > > > > I guess we are talking about two different things. I was referring to SVM > > side of things. Maybe you are referring to the mdev. > > I'm talking about in the hypervisor. > > As I've said already, the vIOMMU interface is the problem here. The > guest VM should be able to know that it cannot use PASID 1 with two > devices, like the hypervisor knows. At the very least it should be > able to know that the PASID binding has failed and relay that failure > back to the process. > > Ideally the guest would know it should allocate another PASID for > these cases. > > But yes, if mdevs are going to be modeled with RIDs in the guest then > with the current vIOMMU we cannot cause a single hypervisor RID to > show up as two RIDs in the guest without breaking the vIOMMU model. > To summarize, for vIOMMU we can work with the spec owner to define a proper interface to feedback such restriction into the guest if necessary. 
For the kernel part, it's clear that IOMMU fd should disallow two devices attached to a single [RID] or [RID, PASID] slot in the first place. Then the next question is how to communicate such restriction to the userspace. It sounds like a group, but different in concept. An iommu group describes the minimal isolation boundary thus all devices in the group can be only assigned to a single user. But this case is opposite - the two mdevs (both support ENQCMD submission) with the same parent have a problem when assigned to a single VM (in this case vPASID is vm-wide translated thus a same pPASID will be used across both mdevs) while they instead work pretty well when assigned to different VMs (completely different vPASID spaces thus different pPASIDs). One thought is to have vfio device driver deal with it. In this proposal it is the vfio device driver to define the PASID virtualization policy and report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands the restriction thus could just hide the vPASID capability when the user calls GET_INFO on the 2nd mdev in above scenario. In this way the user even doesn't need to know such restriction at all and both mdevs can be assigned to a single VM w/o any problem. Does it sound like a right approach? Thanks Kevin
Re: [RFC v2] /dev/iommu uAPI proposal
On Thu, Jul 15, 2021 at 11:05:45AM -0700, Raj, Ashok wrote: > On Thu, Jul 15, 2021 at 02:53:36PM -0300, Jason Gunthorpe wrote: > > On Thu, Jul 15, 2021 at 10:48:36AM -0700, Raj, Ashok wrote: > > > > > > > Do we have any isolation requirements here? its the same process. So > > > > > if the > > > > > page-request it sent to guest and even if you report it for mdev1, > > > > > after > > > > > the PRQ is resolved by guest, the request from mdev2 from the same > > > > > guest > > > > > should simply work? > > > > > > > > I think we already talked about this and said it should not be done. > > > > > > I get the should not be done, I'm wondering where should that be > > > implemented? > > > > The iommu layer cannot have ambiguity. Every RID or RID,PASID slot > > must have only one device attached to it. Attempting to connect two > > devices to the same slot fails on the iommu layer. > > I guess we are talking about two different things. I was referring to SVM > side of things. Maybe you are referring to the mdev. I'm talking about in the hypervisor. As I've said already, the vIOMMU interface is the problem here. The guest VM should be able to know that it cannot use PASID 1 with two devices, like the hypervisor knows. At the very least it should be able to know that the PASID binding has failed and relay that failure back to the process. Ideally the guest would know it should allocate another PASID for these cases. But yes, if mdevs are going to be modeled with RIDs in the guest then with the current vIOMMU we cannot cause a single hypervisor RID to show up as two RIDs in the guest without breaking the vIOMMU model. Jason
Re: [RFC v2] /dev/iommu uAPI proposal
On Thu, Jul 15, 2021 at 02:53:36PM -0300, Jason Gunthorpe wrote: > On Thu, Jul 15, 2021 at 10:48:36AM -0700, Raj, Ashok wrote: > > > > > Do we have any isolation requirements here? its the same process. So if > > > > the > > > > page-request it sent to guest and even if you report it for mdev1, after > > > > the PRQ is resolved by guest, the request from mdev2 from the same guest > > > > should simply work? > > > > > > I think we already talked about this and said it should not be done. > > > > I get the should not be done, I'm wondering where should that be > > implemented? > > The iommu layer cannot have ambiguity. Every RID or RID,PASID slot > must have only one device attached to it. Attempting to connect two > devices to the same slot fails on the iommu layer. I guess we are talking about two different things. I was referring to SVM side of things. Maybe you are referring to the mdev. A single guest process should be allowed to work with 2 different accelerators. The PASID for the process is just 1. Limiting that to just one accelerator per process seems wrong. Unless there is something else to prevent this, the best way seems to be to never expose more than 1 mdev from the same pdev to the same guest. I think this is a reasonable restriction compared to limiting a process to bind to no more than 1 accelerator. > > So the 2nd mdev will fail during IOASID binding when it tries to bind > to the same PASID that the first mdev is already bound to. > > Jason
Re: [RFC v2] /dev/iommu uAPI proposal
On Thu, Jul 15, 2021 at 10:48:36AM -0700, Raj, Ashok wrote: > > > Do we have any isolation requirements here? its the same process. So if > > > the > > > page-request it sent to guest and even if you report it for mdev1, after > > > the PRQ is resolved by guest, the request from mdev2 from the same guest > > > should simply work? > > > > I think we already talked about this and said it should not be done. > > I get the should not be done, I'm wondering where should that be > implemented? The iommu layer cannot have ambiguity. Every RID or RID,PASID slot must have only one device attached to it. Attempting to connect two devices to the same slot fails on the iommu layer. So the 2nd mdev will fail during IOASID binding when it tries to bind to the same PASID that the first mdev is already bound to. Jason
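The rule Jason describes — every (RID, PASID) slot in the iommu layer holds at most one device, and a second attach must fail — can be sketched as a toy model. This is not kernel code; the fixed-size table and the function name `iommu_attach_slot` are invented for illustration, but the -EBUSY behavior is the point: the 2nd mdev's binding to an occupied slot is rejected.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_SLOTS 64

struct slot {
	uint32_t rid, pasid;
	void *dev;
};

static struct slot slots[MAX_SLOTS];
static int nr_slots;

/* Returns 0 on success, -EBUSY if the (rid, pasid) slot is already
 * owned by another device, -ENOSPC if the toy table is full. */
static int iommu_attach_slot(uint32_t rid, uint32_t pasid, void *dev)
{
	for (int i = 0; i < nr_slots; i++)
		if (slots[i].rid == rid && slots[i].pasid == pasid)
			return -EBUSY; /* second device on the same slot fails */
	if (nr_slots == MAX_SLOTS)
		return -ENOSPC;
	slots[nr_slots++] = (struct slot){ rid, pasid, dev };
	return 0;
}
```

Under this model two mdevs sharing a parent RID can coexist only with distinct PASIDs, which is exactly why a VM-wide vPASID-to-pPASID translation breaks when both mdevs land on the same pPASID.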
Re: [RFC v2] /dev/iommu uAPI proposal
On Thu, Jul 15, 2021 at 02:18:26PM -0300, Jason Gunthorpe wrote: > On Thu, Jul 15, 2021 at 09:21:41AM -0700, Raj, Ashok wrote: > > On Thu, Jul 15, 2021 at 12:23:25PM -0300, Jason Gunthorpe wrote: > > > On Thu, Jul 15, 2021 at 06:57:57AM -0700, Raj, Ashok wrote: > > > > On Thu, Jul 15, 2021 at 09:48:13AM -0300, Jason Gunthorpe wrote: > > > > > On Thu, Jul 15, 2021 at 06:49:54AM +, Tian, Kevin wrote: > > > > > > > > > > > No. You are right on this case. I don't think there is a way to > > > > > > differentiate one mdev from the other if they come from the > > > > > > same parent and attached by the same guest process. In this > > > > > > case the fault could be reported on either mdev (e.g. the first > > > > > > matching one) to get it fixed in the guest. > > > > > > > > > > If the IOMMU can't distinguish the two mdevs they are not isolated > > > > > and would have to share a group. Since group sharing is not supported > > > > > today this seems like a non-issue > > > > > > > > Does this mean we have to prevent 2 mdev's from same pdev being > > > > assigned to > > > > the same guest? > > > > > > No, it means that the IOMMU layer has to be able to distinguish them. > > > > Ok, the guest has no control over it, as it see 2 separate pci devices and > > thinks they are all different. > > > > Only time when it can fail is during the bind operation. From guest > > perspective a bind in vIOMMU just turns into a write to local table and a > > invalidate will cause the host to update the real copy from the shadow. > > > > There is no way to fail the bind? and Allocation of the PASID is also a > > separate operation and has no clue how its going to be used in the guest. > > You can't attach the same RID to the same PASID twice. The IOMMU code > should prevent this. > > As we've talked about several times, it seems to me the vIOMMU > interface is misdesigned for the requirements you have. 
The hypervisor > should have a role in allocating the PASID since there are invisible > hypervisor restrictions. This is one of them. Allocating a PASID is a separate step from binding, isn't it? In vt-d we have a virtual command interface that can fail an allocation of PASID. But which device it's bound to is a dynamic thing that only gets set at bind_mm(), right? > > > Do we have any isolation requirements here? its the same process. So if the > > page-request it sent to guest and even if you report it for mdev1, after > > the PRQ is resolved by guest, the request from mdev2 from the same guest > > should simply work? > > I think we already talked about this and said it should not be done. I get the should not be done, I'm wondering where should that be implemented?
Re: [RFC v2] /dev/iommu uAPI proposal
On Thu, Jul 15, 2021 at 09:21:41AM -0700, Raj, Ashok wrote: > On Thu, Jul 15, 2021 at 12:23:25PM -0300, Jason Gunthorpe wrote: > > On Thu, Jul 15, 2021 at 06:57:57AM -0700, Raj, Ashok wrote: > > > On Thu, Jul 15, 2021 at 09:48:13AM -0300, Jason Gunthorpe wrote: > > > > On Thu, Jul 15, 2021 at 06:49:54AM +, Tian, Kevin wrote: > > > > > > > > > No. You are right on this case. I don't think there is a way to > > > > > differentiate one mdev from the other if they come from the > > > > > same parent and attached by the same guest process. In this > > > > > case the fault could be reported on either mdev (e.g. the first > > > > > matching one) to get it fixed in the guest. > > > > > > > > If the IOMMU can't distinguish the two mdevs they are not isolated > > > > and would have to share a group. Since group sharing is not supported > > > > today this seems like a non-issue > > > > > > Does this mean we have to prevent 2 mdev's from same pdev being assigned > > > to > > > the same guest? > > > > No, it means that the IOMMU layer has to be able to distinguish them. > > Ok, the guest has no control over it, as it see 2 separate pci devices and > thinks they are all different. > > Only time when it can fail is during the bind operation. From guest > perspective a bind in vIOMMU just turns into a write to local table and a > invalidate will cause the host to update the real copy from the shadow. > > There is no way to fail the bind? and Allocation of the PASID is also a > separate operation and has no clue how its going to be used in the guest. You can't attach the same RID to the same PASID twice. The IOMMU code should prevent this. As we've talked about several times, it seems to me the vIOMMU interface is misdesigned for the requirements you have. The hypervisor should have a role in allocating the PASID since there are invisible hypervisor restrictions. This is one of them. > Do we have any isolation requirements here? its the same process. 
So if the > page-request it sent to guest and even if you report it for mdev1, after > the PRQ is resolved by guest, the request from mdev2 from the same guest > should simply work? I think we already talked about this and said it should not be done. Jason
Re: [RFC v2] /dev/iommu uAPI proposal
On Thu, Jul 15, 2021 at 12:23:25PM -0300, Jason Gunthorpe wrote: > On Thu, Jul 15, 2021 at 06:57:57AM -0700, Raj, Ashok wrote: > > On Thu, Jul 15, 2021 at 09:48:13AM -0300, Jason Gunthorpe wrote: > > > On Thu, Jul 15, 2021 at 06:49:54AM +, Tian, Kevin wrote: > > > > > > > No. You are right on this case. I don't think there is a way to > > > > differentiate one mdev from the other if they come from the > > > > same parent and attached by the same guest process. In this > > > > case the fault could be reported on either mdev (e.g. the first > > > > matching one) to get it fixed in the guest. > > > > > > If the IOMMU can't distinguish the two mdevs they are not isolated > > > and would have to share a group. Since group sharing is not supported > > > today this seems like a non-issue > > > > Does this mean we have to prevent 2 mdev's from same pdev being assigned to > > the same guest? > > No, it means that the IOMMU layer has to be able to distinguish them. Ok, the guest has no control over it, as it see 2 separate pci devices and thinks they are all different. Only time when it can fail is during the bind operation. From guest perspective a bind in vIOMMU just turns into a write to local table and a invalidate will cause the host to update the real copy from the shadow. There is no way to fail the bind? and Allocation of the PASID is also a separate operation and has no clue how its going to be used in the guest. > > This either means they are "SW mdevs" which does not involve the IOMMU > layer and puts both the responsibility for isolation and idenfication > on the mdev driver. When you mean SW mdev, is it the GPU like case where mdev is purely a SW construct? or SIOV type where RID+PASID case? > > Or they are some "PASID mdev" which does allow the IOMMU to isolate > them. > > What can't happen is to comingle /dev/iommu control over the pdev > between two mdevs. 
> > ie we can't talk about faults for IOMMU on SW mdevs - faults do not > come from the IOMMU layer, they have to come from inside the mdev it > self, somehow. Recoverable faults for the guest need to be sent to the guest? A page-request from mdev1 and from mdev2 will both look alike when the process is sharing it. Do we have any isolation requirements here? It's the same process. So if the page-request is sent to the guest and even if you report it for mdev1, after the PRQ is resolved by the guest, the request from mdev2 from the same guest should simply work?
Re: [RFC v2] /dev/iommu uAPI proposal
On Thu, Jul 15, 2021 at 06:57:57AM -0700, Raj, Ashok wrote: > On Thu, Jul 15, 2021 at 09:48:13AM -0300, Jason Gunthorpe wrote: > > On Thu, Jul 15, 2021 at 06:49:54AM +, Tian, Kevin wrote: > > > > > No. You are right on this case. I don't think there is a way to > > > differentiate one mdev from the other if they come from the > > > same parent and attached by the same guest process. In this > > > case the fault could be reported on either mdev (e.g. the first > > > matching one) to get it fixed in the guest. > > > > If the IOMMU can't distinguish the two mdevs they are not isolated > > and would have to share a group. Since group sharing is not supported > > today this seems like a non-issue > > Does this mean we have to prevent 2 mdev's from same pdev being assigned to > the same guest? No, it means that the IOMMU layer has to be able to distinguish them. This either means they are "SW mdevs" which does not involve the IOMMU layer and puts both the responsibility for isolation and identification on the mdev driver. Or they are some "PASID mdev" which does allow the IOMMU to isolate them. What can't happen is to comingle /dev/iommu control over the pdev between two mdevs. ie we can't talk about faults for IOMMU on SW mdevs - faults do not come from the IOMMU layer, they have to come from inside the mdev itself, somehow. Jason
Re: [RFC v2] /dev/iommu uAPI proposal
On Thu, Jul 15, 2021 at 09:48:13AM -0300, Jason Gunthorpe wrote: > On Thu, Jul 15, 2021 at 06:49:54AM +0000, Tian, Kevin wrote: > > > No. You are right on this case. I don't think there is a way to > > differentiate one mdev from the other if they come from the > > same parent and attached by the same guest process. In this > > case the fault could be reported on either mdev (e.g. the first > > matching one) to get it fixed in the guest. > > If the IOMMU can't distinguish the two mdevs they are not isolated > and would have to share a group. Since group sharing is not supported > today this seems like a non-issue Does this mean we have to prevent 2 mdevs from the same pdev being assigned to the same guest?
Re: [RFC v2] /dev/iommu uAPI proposal
On Thu, Jul 15, 2021 at 06:49:54AM +0000, Tian, Kevin wrote: > No. You are right on this case. I don't think there is a way to > differentiate one mdev from the other if they come from the > same parent and attached by the same guest process. In this > case the fault could be reported on either mdev (e.g. the first > matching one) to get it fixed in the guest. If the IOMMU can't distinguish the two mdevs they are not isolated and would have to share a group. Since group sharing is not supported today this seems like a non-issue Jason
Re: [RFC v2] /dev/iommu uAPI proposal
On 2021/7/15 14:49, Tian, Kevin wrote: >> From: Shenming Lu >> Sent: Thursday, July 15, 2021 2:29 PM >> >> On 2021/7/15 11:55, Tian, Kevin wrote: From: Shenming Lu Sent: Thursday, July 15, 2021 11:21 AM On 2021/7/9 15:48, Tian, Kevin wrote: > 4.6. I/O page fault > +++ > > uAPI is TBD. Here is just about the high-level flow from host IOMMU >> driver > to guest IOMMU driver and backwards. This flow assumes that I/O page faults > are reported via IOMMU interrupts. Some devices report faults via >> device > specific way instead of going through the IOMMU. That usage is not covered > here: > > - Host IOMMU driver receives an I/O page fault with raw fault_data {rid, > pasid, addr}; > > - Host IOMMU driver identifies the faulting I/O page table according to > {rid, pasid} and calls the corresponding fault handler with an opaque > object (registered by the handler) and raw fault_data (rid, pasid, > addr); > > - IOASID fault handler identifies the corresponding ioasid and device > cookie according to the opaque object, generates a user fault_data > (ioasid, cookie, addr) in the fault region, and triggers eventfd to > userspace; > Hi, I have some doubts here: For mdev, it seems that the rid in the raw fault_data is the parent device's, then in the vSVA scenario, how can we get to know the mdev(cookie) from the rid and pasid? And from this point of view, would it be better to register the mdev (iommu_register_device()) with the parent device info? >>> >>> This is what is proposed in this RFC. A successful binding generates a new >>> iommu_dev object for each vfio device. For mdev this object includes >>> its parent device, the defPASID marking this mdev, and the cookie >>> representing it in userspace. 
Later it is iommu_dev being recorded in >>> the attaching_data when the mdev is attached to an IOASID: >>> >>> struct iommu_attach_data *__iommu_device_attach( >>> struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags); >>> >>> Then when a fault is reported, the fault handler just needs to figure out >>> iommu_dev according to {rid, pasid} in the raw fault data. >>> >> >> Yeah, we have the defPASID that marks the mdev and refers to the default >> I/O address space, but how about the non-default I/O address spaces? >> Is there a case that two different mdevs (on the same parent device) >> are used by the same process in the guest, thus have the same pasid route >> in the physical IOMMU? It seems that we can't figure out the mdev from >> the rid and pasid in this case... >> >> Did I misunderstand something?... :-) >> > > No. You are right on this case. I don't think there is a way to > differentiate one mdev from the other if they come from the > same parent and attached by the same guest process. In this > case the fault could be reported on either mdev (e.g. the first > matching one) to get it fixed in the guest. > OK. Thanks, Shenming
RE: [RFC v2] /dev/iommu uAPI proposal
> From: Shenming Lu > Sent: Thursday, July 15, 2021 2:29 PM > > On 2021/7/15 11:55, Tian, Kevin wrote: > >> From: Shenming Lu > >> Sent: Thursday, July 15, 2021 11:21 AM > >> > >> On 2021/7/9 15:48, Tian, Kevin wrote: > >>> 4.6. I/O page fault > >>> +++ > >>> > >>> uAPI is TBD. Here is just about the high-level flow from host IOMMU > driver > >>> to guest IOMMU driver and backwards. This flow assumes that I/O page > >> faults > >>> are reported via IOMMU interrupts. Some devices report faults via > device > >>> specific way instead of going through the IOMMU. That usage is not > >> covered > >>> here: > >>> > >>> - Host IOMMU driver receives an I/O page fault with raw fault_data {rid, > >>> pasid, addr}; > >>> > >>> - Host IOMMU driver identifies the faulting I/O page table according to > >>> {rid, pasid} and calls the corresponding fault handler with an opaque > >>> object (registered by the handler) and raw fault_data (rid, pasid, > >>> addr); > >>> > >>> - IOASID fault handler identifies the corresponding ioasid and device > >>> cookie according to the opaque object, generates a user fault_data > >>> (ioasid, cookie, addr) in the fault region, and triggers eventfd to > >>> userspace; > >>> > >> > >> Hi, I have some doubts here: > >> > >> For mdev, it seems that the rid in the raw fault_data is the parent > >> device's, > >> then in the vSVA scenario, how can we get to know the mdev(cookie) from > >> the > >> rid and pasid? > >> > >> And from this point of view, would it be better to register the mdev > >> (iommu_register_device()) with the parent device info? > >> > > > > This is what is proposed in this RFC. A successful binding generates a new > > iommu_dev object for each vfio device. For mdev this object includes > > its parent device, the defPASID marking this mdev, and the cookie > > representing it in userspace. 
Later it is iommu_dev being recorded in > > the attaching_data when the mdev is attached to an IOASID: > > > > struct iommu_attach_data *__iommu_device_attach( > > struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags); > > > > Then when a fault is reported, the fault handler just needs to figure out > > iommu_dev according to {rid, pasid} in the raw fault data. > > > > Yeah, we have the defPASID that marks the mdev and refers to the default > I/O address space, but how about the non-default I/O address spaces? > Is there a case that two different mdevs (on the same parent device) > are used by the same process in the guest, thus have the same pasid route > in the physical IOMMU? It seems that we can't figure out the mdev from > the rid and pasid in this case... > > Did I misunderstand something?... :-) > No. You are right on this case. I don't think there is a way to differentiate one mdev from the other if they come from the same parent and attached by the same guest process. In this case the fault could be reported on either mdev (e.g. the first matching one) to get it fixed in the guest. Thanks Kevin
Re: [RFC v2] /dev/iommu uAPI proposal
On 2021/7/15 11:55, Tian, Kevin wrote: >> From: Shenming Lu >> Sent: Thursday, July 15, 2021 11:21 AM >> >> On 2021/7/9 15:48, Tian, Kevin wrote: >>> 4.6. I/O page fault >>> +++ >>> >>> uAPI is TBD. Here is just about the high-level flow from host IOMMU driver >>> to guest IOMMU driver and backwards. This flow assumes that I/O page >> faults >>> are reported via IOMMU interrupts. Some devices report faults via device >>> specific way instead of going through the IOMMU. That usage is not >> covered >>> here: >>> >>> - Host IOMMU driver receives an I/O page fault with raw fault_data {rid, >>> pasid, addr}; >>> >>> - Host IOMMU driver identifies the faulting I/O page table according to >>> {rid, pasid} and calls the corresponding fault handler with an opaque >>> object (registered by the handler) and raw fault_data (rid, pasid, >>> addr); >>> >>> - IOASID fault handler identifies the corresponding ioasid and device >>> cookie according to the opaque object, generates a user fault_data >>> (ioasid, cookie, addr) in the fault region, and triggers eventfd to >>> userspace; >>> >> >> Hi, I have some doubts here: >> >> For mdev, it seems that the rid in the raw fault_data is the parent device's, >> then in the vSVA scenario, how can we get to know the mdev(cookie) from >> the >> rid and pasid? >> >> And from this point of view, would it be better to register the mdev >> (iommu_register_device()) with the parent device info? >> > > This is what is proposed in this RFC. A successful binding generates a new > iommu_dev object for each vfio device. For mdev this object includes > its parent device, the defPASID marking this mdev, and the cookie > representing it in userspace. 
Later it is iommu_dev being recorded in > the attaching_data when the mdev is attached to an IOASID: > > struct iommu_attach_data *__iommu_device_attach( > struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags); > > Then when a fault is reported, the fault handler just needs to figure out > iommu_dev according to {rid, pasid} in the raw fault data. > Yeah, we have the defPASID that marks the mdev and refers to the default I/O address space, but how about the non-default I/O address spaces? Is there a case that two different mdevs (on the same parent device) are used by the same process in the guest, thus have the same pasid route in the physical IOMMU? It seems that we can't figure out the mdev from the rid and pasid in this case... Did I misunderstand something?... :-) Thanks, Shenming
RE: [RFC v2] /dev/iommu uAPI proposal
> From: Shenming Lu > Sent: Thursday, July 15, 2021 11:21 AM > > On 2021/7/9 15:48, Tian, Kevin wrote: > > 4.6. I/O page fault > > +++ > > > > uAPI is TBD. Here is just about the high-level flow from host IOMMU driver > > to guest IOMMU driver and backwards. This flow assumes that I/O page > faults > > are reported via IOMMU interrupts. Some devices report faults via device > > specific way instead of going through the IOMMU. That usage is not > covered > > here: > > > > - Host IOMMU driver receives an I/O page fault with raw fault_data {rid, > > pasid, addr}; > > > > - Host IOMMU driver identifies the faulting I/O page table according to > > {rid, pasid} and calls the corresponding fault handler with an opaque > > object (registered by the handler) and raw fault_data (rid, pasid, > > addr); > > > > - IOASID fault handler identifies the corresponding ioasid and device > > cookie according to the opaque object, generates a user fault_data > > (ioasid, cookie, addr) in the fault region, and triggers eventfd to > > userspace; > > > > Hi, I have some doubts here: > > For mdev, it seems that the rid in the raw fault_data is the parent device's, > then in the vSVA scenario, how can we get to know the mdev(cookie) from > the > rid and pasid? > > And from this point of view, would it be better to register the mdev > (iommu_register_device()) with the parent device info? > This is what is proposed in this RFC. A successful binding generates a new iommu_dev object for each vfio device. For mdev this object includes its parent device, the defPASID marking this mdev, and the cookie representing it in userspace. Later it is iommu_dev being recorded in the attaching_data when the mdev is attached to an IOASID: struct iommu_attach_data *__iommu_device_attach( struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags); Then when a fault is reported, the fault handler just needs to figure out iommu_dev according to {rid, pasid} in the raw fault data. 
Thanks Kevin
Re: [RFC v2] /dev/iommu uAPI proposal
On 2021/7/9 15:48, Tian, Kevin wrote: > 4.6. I/O page fault > +++ > > uAPI is TBD. Here is just about the high-level flow from host IOMMU driver > to guest IOMMU driver and backwards. This flow assumes that I/O page faults > are reported via IOMMU interrupts. Some devices report faults via device > specific way instead of going through the IOMMU. That usage is not covered > here: > > - Host IOMMU driver receives an I/O page fault with raw fault_data {rid, > pasid, addr}; > > - Host IOMMU driver identifies the faulting I/O page table according to > {rid, pasid} and calls the corresponding fault handler with an opaque > object (registered by the handler) and raw fault_data (rid, pasid, addr); > > - IOASID fault handler identifies the corresponding ioasid and device > cookie according to the opaque object, generates a user fault_data > (ioasid, cookie, addr) in the fault region, and triggers eventfd to > userspace; > Hi, I have some doubts here: For mdev, it seems that the rid in the raw fault_data is the parent device's, then in the vSVA scenario, how can we get to know the mdev(cookie) from the rid and pasid? And from this point of view, would it be better to register the mdev (iommu_register_device()) with the parent device info? Thanks, Shenming
RE: [RFC v2] /dev/iommu uAPI proposal
> From: Jason Gunthorpe > Sent: Wednesday, July 14, 2021 7:23 AM > > On Tue, Jul 13, 2021 at 11:20:12PM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe > > > Sent: Wednesday, July 14, 2021 7:03 AM > > > > > > On Tue, Jul 13, 2021 at 10:48:38PM +0000, Tian, Kevin wrote: > > > > > > > We can still bind to the parent with cookie, but with > > > > iommu_register_sw_device() IOMMU fd knows that this binding > doesn't > > > > need to establish any security context via IOMMU API. > > > > > > AFAIK there is no reason to involve the parent PCI or other device in > > > SW mode. The iommufd doesn't need to be aware of anything there. > > > > > > > Yes, but does it make sense to have a unified model in IOMMU fd > > which always has a [struct device, cookie] with flags to indicate whether > > the binding/attaching should be specially handled for sw mdev? Or > > are you suggesting that the lack of a struct device is actually the indicator > > for such a trick? > > I think you've veered into such micro implementation details that it > is better to wait and see how things look. > > The important point here is that whatever physical device is under a > SW mdev does not need to be passed to the iommufd because there is > nothing it can do with that information. > Makes sense
Re: [RFC v2] /dev/iommu uAPI proposal
On Tue, Jul 13, 2021 at 11:20:12PM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Wednesday, July 14, 2021 7:03 AM > > > > On Tue, Jul 13, 2021 at 10:48:38PM +0000, Tian, Kevin wrote: > > > > > We can still bind to the parent with cookie, but with > > > iommu_register_sw_device() IOMMU fd knows that this binding doesn't > > > need to establish any security context via IOMMU API. > > > > AFAIK there is no reason to involve the parent PCI or other device in > > SW mode. The iommufd doesn't need to be aware of anything there. > > > > Yes, but does it make sense to have a unified model in IOMMU fd > which always has a [struct device, cookie] with flags to indicate whether > the binding/attaching should be specially handled for sw mdev? Or > are you suggesting that the lack of a struct device is actually the indicator > for such a trick? I think you've veered into such micro implementation details that it is better to wait and see how things look. The important point here is that whatever physical device is under a SW mdev does not need to be passed to the iommufd because there is nothing it can do with that information. Jason
RE: [RFC v2] /dev/iommu uAPI proposal
> From: Jason Gunthorpe > Sent: Wednesday, July 14, 2021 7:03 AM > > On Tue, Jul 13, 2021 at 10:48:38PM +0000, Tian, Kevin wrote: > > > We can still bind to the parent with cookie, but with > > iommu_register_sw_device() IOMMU fd knows that this binding doesn't > > need to establish any security context via IOMMU API. > > AFAIK there is no reason to involve the parent PCI or other device in > SW mode. The iommufd doesn't need to be aware of anything there. > Yes, but does it make sense to have a unified model in IOMMU fd which always has a [struct device, cookie] with flags to indicate whether the binding/attaching should be specially handled for sw mdev? Or are you suggesting that the lack of a struct device is actually the indicator for such a trick? Thanks Kevin
Re: [RFC v2] /dev/iommu uAPI proposal
On Tue, Jul 13, 2021 at 10:48:38PM +0000, Tian, Kevin wrote: > We can still bind to the parent with cookie, but with > iommu_register_sw_device() IOMMU fd knows that this binding doesn't > need to establish any security context via IOMMU API. AFAIK there is no reason to involve the parent PCI or other device in SW mode. The iommufd doesn't need to be aware of anything there. Jason
RE: [RFC v2] /dev/iommu uAPI proposal
> From: Jason Gunthorpe > Sent: Wednesday, July 14, 2021 12:33 AM > > On Tue, Jul 13, 2021 at 10:26:07AM -0600, Alex Williamson wrote: > > Quoting this proposal again: > > > > > 1) A successful binding call for the first device in the group creates > > > the security context for the entire group, by: > > > > > > * Verifying group viability in a similar way as VFIO does; > > > > > > * Calling IOMMU-API to move the group into a block-dma state, > > > which makes all devices in the group attached to a block-dma > > > domain with an empty I/O page table; > > > > > > VFIO should not allow the user to mmap the MMIO bar of the bound > > > device until the binding call succeeds. > > > > The attach step is irrelevant to my question, the bind step is where > > the device/group gets into a secure state for device access. > > Binding is similar to attach, it will need to indicate the driver's > intention and a SW driver will not attach to the PCI device underneath > it. Yes. I need to clarify this part in the next version. In v1 the binding operation was purely a software operation within IOMMU fd, thus there was no intention to differentiate device types in this step. But now with v2 the binding actually involves calling IOMMU API for devices other than sw mdev. Then we do need similar per-type binding wrappers as defined for attaching calls. > > > AIUI the operation of VFIO_DEVICE_BIND_IOMMU_FD looks like this: > > > > iommu_ctx = iommu_ctx_fdget(iommu_fd); > > > > mdev = mdev_from_dev(vdev->dev); > > dev = mdev ? mdev_parent_dev(mdev) : vdev->dev; > > > > iommu_dev = iommu_register_device(iommu_ctx, dev, cookie); > > A default of binding to vdev->dev might turn out to be OK, but this > needs to be an overridable op in vfio_device and the SW mdevs will > have to do some 'iommu_register_sw_device()' and not pass in a dev at > all. 
> We can still bind to the parent with cookie, but with iommu_register_sw_device() IOMMU fd knows that this binding doesn't need to establish any security context via IOMMU API. Thanks Kevin
Re: [RFC v2] /dev/iommu uAPI proposal
On Tue, Jul 13, 2021 at 10:26:07AM -0600, Alex Williamson wrote: > Quoting this proposal again: > > > 1) A successful binding call for the first device in the group creates > > the security context for the entire group, by: > > > > * Verifying group viability in a similar way as VFIO does; > > > > * Calling IOMMU-API to move the group into a block-dma state, > > which makes all devices in the group attached to a block-dma > > domain with an empty I/O page table; > > > > VFIO should not allow the user to mmap the MMIO bar of the bound > > device until the binding call succeeds. > > The attach step is irrelevant to my question, the bind step is where > the device/group gets into a secure state for device access. Binding is similar to attach, it will need to indicate the driver's intention and a SW driver will not attach to the PCI device underneath it. > AIUI the operation of VFIO_DEVICE_BIND_IOMMU_FD looks like this: > > iommu_ctx = iommu_ctx_fdget(iommu_fd); > > mdev = mdev_from_dev(vdev->dev); > dev = mdev ? mdev_parent_dev(mdev) : vdev->dev; > > iommu_dev = iommu_register_device(iommu_ctx, dev, cookie); A default of binding to vdev->dev might turn out to be OK, but this needs to be an overridable op in vfio_device and the SW mdevs will have to do some 'iommu_register_sw_device()' and not pass in a dev at all. Jason
Re: [RFC v2] /dev/iommu uAPI proposal
On Tue, 13 Jul 2021 09:55:03 -0300 Jason Gunthorpe wrote: > On Mon, Jul 12, 2021 at 11:56:24PM +0000, Tian, Kevin wrote: > > > Maybe I misunderstood your question. Are you specifically worried > > about establishing the security context for a mdev vs. for its > > parent? > > The way to think about the cookie, and the device bind/attach in > general, is as taking control of a portion of the IOMMU routing: > > - RID > - RID + PASID > - "software" > > For the first two there can be only one device attachment per value so > the cookie is unambiguous. > > For "software" the iommu layer has little to do with this - everything > is constructed outside by the mdev. If the mdev wishes to communicate > on /dev/iommu using the cookie then it has to do so using some iommufd > api and we can convey the proper device at that point. > > Kevin didn't show it, but alongside the PCI attaches: > > struct iommu_attach_data * iommu_pci_device_attach( > struct iommu_dev *dev, struct pci_device *pdev, > u32 ioasid); > > There would also be a software attach for mdev: > > struct iommu_attach_data * iommu_sw_device_attach( > struct iommu_dev *dev, struct device *pdev, u32 ioasid); > > Which does not connect anything to the iommu layer. > > It would have to return something that allows querying the IO page > table, and the mdev would use that API instead of vfio_pin_pages(). Quoting this proposal again: > 1) A successful binding call for the first device in the group creates > the security context for the entire group, by: > > * Verifying group viability in a similar way as VFIO does; > > * Calling IOMMU-API to move the group into a block-dma state, > which makes all devices in the group attached to a block-dma > domain with an empty I/O page table; > > VFIO should not allow the user to mmap the MMIO bar of the bound > device until the binding call succeeds. The attach step is irrelevant to my question, the bind step is where the device/group gets into a secure state for device access. 
So for IGD we have two scenarios, direct assignment and software mdevs. AIUI the operation of VFIO_DEVICE_BIND_IOMMU_FD looks like this: iommu_ctx = iommu_ctx_fdget(iommu_fd); mdev = mdev_from_dev(vdev->dev); dev = mdev ? mdev_parent_dev(mdev) : vdev->dev; iommu_dev = iommu_register_device(iommu_ctx, dev, cookie); In either case, this last line is either registering the IGD itself (ie. the struct device representing PCI device 0000:00:02.0) or the parent of the GVT-g mdev (ie. the struct device representing PCI device 0000:00:02.0). They're the same! AIUI, the cookie is simply an arbitrary user generated value which they'll use to refer to this device via the iommu_fd uAPI. So what magic is iommu_register_device() doing to infer my intentions as to whether I'm asking for the IGD RID to be isolated or I'm only creating a software context for an mdev? Thanks, Alex
Re: [RFC v2] /dev/iommu uAPI proposal
On Mon, Jul 12, 2021 at 11:56:24PM +0000, Tian, Kevin wrote: > Maybe I misunderstood your question. Are you specifically worried > about establishing the security context for a mdev vs. for its > parent? The way to think about the cookie, and the device bind/attach in general, is as taking control of a portion of the IOMMU routing: - RID - RID + PASID - "software" For the first two there can be only one device attachment per value so the cookie is unambiguous. For "software" the iommu layer has little to do with this - everything is constructed outside by the mdev. If the mdev wishes to communicate on /dev/iommu using the cookie then it has to do so using some iommufd api and we can convey the proper device at that point. Kevin didn't show it, but alongside the PCI attaches: struct iommu_attach_data * iommu_pci_device_attach( struct iommu_dev *dev, struct pci_device *pdev, u32 ioasid); There would also be a software attach for mdev: struct iommu_attach_data * iommu_sw_device_attach( struct iommu_dev *dev, struct device *pdev, u32 ioasid); Which does not connect anything to the iommu layer. It would have to return something that allows querying the IO page table, and the mdev would use that API instead of vfio_pin_pages(). Jason
RE: [RFC v2] /dev/iommu uAPI proposal
> From: Alex Williamson > Sent: Tuesday, July 13, 2021 2:42 AM > > On Mon, 12 Jul 2021 01:22:11 +0000 > "Tian, Kevin" wrote: > > > From: Alex Williamson > > > Sent: Saturday, July 10, 2021 5:51 AM > > > On Fri, 9 Jul 2021 07:48:44 +0000 > > > "Tian, Kevin" wrote: > > > > > For mdev the struct device should be the pointer to the parent device. > > > > > > I don't get how iommu_register_device() differentiates an mdev from a > > > pdev in this case. > > > > via device cookie. > > > Let me re-add this section for more context: > > > 3. Sample structures and helper functions > > > > > > Three helper functions are provided to support VFIO_BIND_IOMMU_FD: > > > > struct iommu_ctx *iommu_ctx_fdget(int fd); > > struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx, > > struct device *device, u64 cookie); > > int iommu_unregister_device(struct iommu_dev *dev); > > > > An iommu_ctx is created for each fd: > > > > struct iommu_ctx { > > // a list of allocated IOASID data's > > struct xarray ioasid_xa; > > > > // a list of registered devices > > struct xarray dev_xa; > > }; > > > > Later some group-tracking fields will be also introduced to support > > multi-devices group. > > > > Each registered device is represented by iommu_dev: > > > > struct iommu_dev { > > struct iommu_ctx *ctx; > > // always be the physical device > > struct device *device; > > u64 cookie; > > struct kref kref; > > }; > > > > A successful binding establishes a security context for the bound > > device and returns struct iommu_dev pointer to the caller. After this > > point, the user is allowed to query device capabilities via IOMMU_ > > DEVICE_GET_INFO. > > > > For mdev the struct device should be the pointer to the parent device. > > > So we'll have a VFIO_DEVICE_BIND_IOMMU_FD ioctl where the user > provides > the iommu_fd and a cookie. 
vfio will use iommu_ctx_fdget() to get an > iommu_ctx* for that iommu_fd, then we'll call iommu_register_device() > using that iommu_ctx* we got from the iommu_fd, the cookie provided by > the user, and for an mdev, the parent of the device the user owns > (the device_fd on which this ioctl is called)... > > How does an arbitrary user provided cookie let you differentiate that > the request is actually for an mdev versus the parent device itself? > Maybe I misunderstood your question. Are you specifically worried about establishing the security context for a mdev vs. for its parent? At least in concept we should not change the security context of the parent if this binding call is just for the mdev. And for mdev it will be in a security context as long as the associated PASID entry is disabled at the binding time. If this is the case, possibly we also need VFIO to provide defPASID marking the mdev when calling iommu_register_device() then IOMMU fd also provides defPASID when calling IOMMU API to establish the security context. Thanks, Kevin
RE: [RFC v2] /dev/iommu uAPI proposal
> From: Alex Williamson > Sent: Tuesday, July 13, 2021 2:42 AM > > On Mon, 12 Jul 2021 01:22:11 +0000 > "Tian, Kevin" wrote: > > > From: Alex Williamson > > > Sent: Saturday, July 10, 2021 5:51 AM > > > On Fri, 9 Jul 2021 07:48:44 +0000 > > > "Tian, Kevin" wrote: > > > > > For mdev the struct device should be the pointer to the parent device. > > > > > > I don't get how iommu_register_device() differentiates an mdev from a > > > pdev in this case. > > > > via device cookie. > > > Let me re-add this section for more context: > > > 3. Sample structures and helper functions > > > > > > Three helper functions are provided to support VFIO_BIND_IOMMU_FD: > > > > struct iommu_ctx *iommu_ctx_fdget(int fd); > > struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx, > > struct device *device, u64 cookie); > > int iommu_unregister_device(struct iommu_dev *dev); > > > > An iommu_ctx is created for each fd: > > > > struct iommu_ctx { > > // a list of allocated IOASID data's > > struct xarray ioasid_xa; > > > > // a list of registered devices > > struct xarray dev_xa; > > }; > > > > Later some group-tracking fields will be also introduced to support > > multi-devices group. > > > > Each registered device is represented by iommu_dev: > > > > struct iommu_dev { > > struct iommu_ctx *ctx; > > // always be the physical device > > struct device *device; > > u64 cookie; > > struct kref kref; > > }; > > > > A successful binding establishes a security context for the bound > > device and returns struct iommu_dev pointer to the caller. After this > > point, the user is allowed to query device capabilities via IOMMU_ > > DEVICE_GET_INFO. > > > > For mdev the struct device should be the pointer to the parent device. > > > So we'll have a VFIO_DEVICE_BIND_IOMMU_FD ioctl where the user > provides > the iommu_fd and a cookie. 
vfio will use iommu_ctx_fdget() to get an > iommu_ctx* for that iommu_fd, then we'll call iommu_register_device() > using that iommu_ctx* we got from the iommu_fd, the cookie provided by > the user, and for an mdev, the parent of the device the user owns > (the device_fd on which this ioctl is called)... > > How does an arbitrary user provided cookie let you differentiate that > the request is actually for an mdev versus the parent device itself? > > For instance, how can the IOMMU layer distinguish GVT-g (mdev) vs GVT-d > (direct assignment) when both use the same struct device* and cookie is > just a user provided value? Still confused. Thanks, > GVT-g is a special case here since it's a purely software-emulated mdev and reuses the default domain of the parent device. In this case IOASID is treated as metadata for the GVT-g device driver to conduct DMA isolation in software. We won't install a new page table in the IOMMU just for a GVT-g mdev (this does remind me of a missing flag in the attaching call to indicate this requirement). What you really care about is the SIOV mdev (with PASID-granular DMA isolation in the IOMMU) and its parent. In this case mdev and parent assignment are exclusive. When the parent is already assigned to a user, it's not managed by the kernel anymore thus no mdev per se. If an mdev is created then it implies that the parent must be managed by the kernel. In either case the user-provided cookie is contained only within IOMMU fd. When calling IOMMU-API, it's always about the routing information (RID, or RID+PASID) provided in the attaching call. Thanks Kevin
Re: [RFC v2] /dev/iommu uAPI proposal
On Mon, 12 Jul 2021 01:22:11 + "Tian, Kevin" wrote: > > From: Alex Williamson > > Sent: Saturday, July 10, 2021 5:51 AM > > On Fri, 9 Jul 2021 07:48:44 + > > "Tian, Kevin" wrote: > > > For mdev the struct device should be the pointer to the parent device. > > > > I don't get how iommu_register_device() differentiates an mdev from a > > pdev in this case. > > via device cookie. Let me re-add this section for more context: > 3. Sample structures and helper functions > > > Three helper functions are provided to support VFIO_BIND_IOMMU_FD: > > struct iommu_ctx *iommu_ctx_fdget(int fd); > struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx, > struct device *device, u64 cookie); > int iommu_unregister_device(struct iommu_dev *dev); > > An iommu_ctx is created for each fd: > > struct iommu_ctx { > // a list of allocated IOASID data's > struct xarray ioasid_xa; > > // a list of registered devices > struct xarray dev_xa; > }; > > Later some group-tracking fields will be also introduced to support > multi-devices group. > > Each registered device is represented by iommu_dev: > > struct iommu_dev { > struct iommu_ctx*ctx; > // always be the physical device > struct device *device; > u64 cookie; > struct kref kref; > }; > > A successful binding establishes a security context for the bound > device and returns struct iommu_dev pointer to the caller. After this > point, the user is allowed to query device capabilities via IOMMU_ > DEVICE_GET_INFO. > > For mdev the struct device should be the pointer to the parent device. So we'll have a VFIO_DEVICE_BIND_IOMMU_FD ioctl where the user provides the iommu_fd and a cookie. vfio will use iommu_ctx_fdget() to get an iommu_ctx* for that iommu_fd, then we'll call iommu_register_device() using that iommu_ctx* we got from the iommu_fd, the cookie provided by the user, and for an mdev, the parent of the device the user owns (the device_fd on which this ioctl is called)... 
How does an arbitrary user-provided cookie let you differentiate that the request is actually for an mdev versus the parent device itself? For instance, how can the IOMMU layer distinguish GVT-g (mdev) vs GVT-d (direct assignment) when both use the same struct device* and the cookie is just a user-provided value? Still confused. Thanks, Alex
RE: [RFC v2] /dev/iommu uAPI proposal
> From: Alex Williamson > Sent: Saturday, July 10, 2021 5:51 AM > > Hi Kevin, > > A couple first pass comments... > > On Fri, 9 Jul 2021 07:48:44 + > "Tian, Kevin" wrote: > > 2.2. /dev/vfio device uAPI > > ++ > > > > /* > > * Bind a vfio_device to the specified IOMMU fd > > * > > * The user should provide a device cookie when calling this ioctl. The > > * cookie is later used in IOMMU fd for capability query, iotlb > > invalidation > > * and I/O fault handling. > > * > > * User is not allowed to access the device before the binding operation > > * is completed. > > * > > * Unbind is automatically conducted when device fd is closed. > > * > > * Input parameters: > > * - iommu_fd; > > * - cookie; > > * > > * Return: 0 on success, -errno on failure. > > */ > > #define VFIO_BIND_IOMMU_FD _IO(VFIO_TYPE, VFIO_BASE + 22) > > I believe this is an ioctl on the device fd, therefore it should be > named VFIO_DEVICE_BIND_IOMMU_FD. make sense. > > > > > > > /* > > * Report vPASID info to userspace via VFIO_DEVICE_GET_INFO > > * > > * Add a new device capability. The presence indicates that the user > > * is allowed to create multiple I/O address spaces on this device. The > > * capability further includes following flags: > > * > > * - PASID_DELEGATED, if clear every vPASID must be registered to > > * the kernel; > > * - PASID_CPU, if set vPASID is allowed to be carried in the CPU > > * instructions (e.g. ENQCMD); > > * - PASID_CPU_VIRT, if set require vPASID translation in the CPU; > > * > > * The user must check that all devices with PASID_CPU set have the > > * same setting on PASID_CPU_VIRT. If mismatching, it should enable > > * vPASID only in one category (all set, or all clear). > > * > > * When the user enables vPASID on the device with PASID_CPU_VIRT > > * set, it must enable vPASID CPU translation via kvm fd before attempting > > * to use ENQCMD to submit work items. The command portal is blocked > > * by the kernel until the CPU translation is enabled. 
> > */ > > #define VFIO_DEVICE_INFO_CAP_PASID 5 > > > > > > /* > > * Attach a vfio device to the specified IOASID > > * > > * Multiple vfio devices can be attached to the same IOASID, and vice > > * versa. > > * > > * User may optionally provide a "virtual PASID" to mark an I/O page > > * table on this vfio device, if PASID_DELEGATED is not set in device info. > > * Whether the virtual PASID is physically used or converted to another > > * kernel-allocated PASID is a policy in the kernel. > > * > > * Because one device is allowed to bind to multiple IOMMU fd's, the > > * user should provide both iommu_fd and ioasid for this attach operation. > > * > > * Input parameter: > > * - iommu_fd; > > * - ioasid; > > * - flag; > > * - vpasid (if specified); > > * > > * Return: 0 on success, -errno on failure. > > */ > > #define VFIO_ATTACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 23) > > #define VFIO_DETACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 24) > > Likewise, VFIO_DEVICE_{ATTACH,DETACH}_IOASID > > ... > > 3. Sample structures and helper functions > > > > > > Three helper functions are provided to support VFIO_BIND_IOMMU_FD: > > > > struct iommu_ctx *iommu_ctx_fdget(int fd); > > struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx, > > struct device *device, u64 cookie); > > int iommu_unregister_device(struct iommu_dev *dev); > > > > An iommu_ctx is created for each fd: > > > > struct iommu_ctx { > > // a list of allocated IOASID data's > > struct xarray ioasid_xa; > > > > // a list of registered devices > > struct xarray dev_xa; > > }; > > > > Later some group-tracking fields will be also introduced to support > > multi-devices group. 
> > > > Each registered device is represented by iommu_dev: > > > > struct iommu_dev { > > struct iommu_ctx *ctx; > > // always be the physical device > > struct device *device; > > u64 cookie; > > struct kref kref; > > }; > > > > A successful binding establishes a security context for the bound > > device and returns struct iommu_dev pointer to the caller. After this > > point, the user is allowed to query device capabilities via IOMMU_ > > DEVICE_GET_INFO. > > If we have an initial singleton group only restriction, I assume that > iommu_register_device() would fail for any devices that are not in > a singleton group and vfio would only expose direct device files for > the devices in singleton groups. The latter implementation could > change when multi-device group support is added so that userspace can > assume that if the vfio device file exists, this interface is available.
Re: [RFC v2] /dev/iommu uAPI proposal
Hi Kevin, A couple first pass comments... On Fri, 9 Jul 2021 07:48:44 + "Tian, Kevin" wrote: > 2.2. /dev/vfio device uAPI > ++ > > /* > * Bind a vfio_device to the specified IOMMU fd > * > * The user should provide a device cookie when calling this ioctl. The > * cookie is later used in IOMMU fd for capability query, iotlb invalidation > * and I/O fault handling. > * > * User is not allowed to access the device before the binding operation > * is completed. > * > * Unbind is automatically conducted when device fd is closed. > * > * Input parameters: > * - iommu_fd; > * - cookie; > * > * Return: 0 on success, -errno on failure. > */ > #define VFIO_BIND_IOMMU_FD _IO(VFIO_TYPE, VFIO_BASE + 22) I believe this is an ioctl on the device fd, therefore it should be named VFIO_DEVICE_BIND_IOMMU_FD. > > > /* > * Report vPASID info to userspace via VFIO_DEVICE_GET_INFO > * > * Add a new device capability. The presence indicates that the user > * is allowed to create multiple I/O address spaces on this device. The > * capability further includes following flags: > * > * - PASID_DELEGATED, if clear every vPASID must be registered to > * the kernel; > * - PASID_CPU, if set vPASID is allowed to be carried in the CPU > * instructions (e.g. ENQCMD); > * - PASID_CPU_VIRT, if set require vPASID translation in the CPU; > * > * The user must check that all devices with PASID_CPU set have the > * same setting on PASID_CPU_VIRT. If mismatching, it should enable > * vPASID only in one category (all set, or all clear). > * > * When the user enables vPASID on the device with PASID_CPU_VIRT > * set, it must enable vPASID CPU translation via kvm fd before attempting > * to use ENQCMD to submit work items. The command portal is blocked > * by the kernel until the CPU translation is enabled. > */ > #define VFIO_DEVICE_INFO_CAP_PASID 5 > > > /* > * Attach a vfio device to the specified IOASID > * > * Multiple vfio devices can be attached to the same IOASID, and vice > * versa. 
> * > * User may optionally provide a "virtual PASID" to mark an I/O page > * table on this vfio device, if PASID_DELEGATED is not set in device info. > * Whether the virtual PASID is physically used or converted to another > * kernel-allocated PASID is a policy in the kernel. > * > * Because one device is allowed to bind to multiple IOMMU fd's, the > * user should provide both iommu_fd and ioasid for this attach operation. > * > * Input parameter: > * - iommu_fd; > * - ioasid; > * - flag; > * - vpasid (if specified); > * > * Return: 0 on success, -errno on failure. > */ > #define VFIO_ATTACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 23) > #define VFIO_DETACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 24) Likewise, VFIO_DEVICE_{ATTACH,DETACH}_IOASID ... > 3. Sample structures and helper functions > > > Three helper functions are provided to support VFIO_BIND_IOMMU_FD: > > struct iommu_ctx *iommu_ctx_fdget(int fd); > struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx, > struct device *device, u64 cookie); > int iommu_unregister_device(struct iommu_dev *dev); > > An iommu_ctx is created for each fd: > > struct iommu_ctx { > // a list of allocated IOASID data's > struct xarray ioasid_xa; > > // a list of registered devices > struct xarray dev_xa; > }; > > Later some group-tracking fields will be also introduced to support > multi-devices group. > > Each registered device is represented by iommu_dev: > > struct iommu_dev { > struct iommu_ctx *ctx; > // always be the physical device > struct device *device; > u64 cookie; > struct kref kref; > }; > > A successful binding establishes a security context for the bound > device and returns struct iommu_dev pointer to the caller. After this > point, the user is allowed to query device capabilities via IOMMU_ > DEVICE_GET_INFO. 
If we have an initial singleton group only restriction, I assume that iommu_register_device() would fail for any devices that are not in a singleton group and vfio would only expose direct device files for the devices in singleton groups. The latter implementation could change when multi-device group support is added so that userspace can assume that if the vfio device file exists, this interface is available. I think this is confirmed further below. > For mdev the struct device should be the pointer to the parent device. I don't get how iommu_register_device() differentiates an mdev from a pdev in this case. ... > 4.3. IOASID nesting (software) >
[RFC v2] /dev/iommu uAPI proposal
/dev/iommu provides a unified interface for managing I/O page tables for devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, etc.) are expected to use this interface instead of creating their own logic to isolate untrusted device DMAs initiated by userspace. This proposal describes the uAPI of /dev/iommu and also sample sequences, with VFIO as the example, in typical usages. The driver-facing kernel API provided by the iommu layer is still TBD, and can be discussed after consensus is reached on this uAPI.

It's based on a lengthy discussion starting from here:
https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/

v1 can be found here:
https://lore.kernel.org/linux-iommu/ph0pr12mb54811863b392c644e5365446dc...@ph0pr12mb5481.namprd12.prod.outlook.com/T/

This doc is also tracked on github, though it's not very useful for v1->v2 given the dramatic refactoring:
https://github.com/luxis1999/dev_iommu_uapi

Changelog (v1->v2):
- Rename /dev/ioasid to /dev/iommu (Jason);
- Add a section for device-centric vs. group-centric design (many);
- Add a section for handling no-snoop DMA (Jason/Alex/Paolo);
- Add definition of user/kernel/shared I/O page tables (Baolu/Jason);
- Allow one device bound to multiple iommu fd's (Jason);
- No need to track user I/O page tables in kernel on ARM/AMD (Jean/Jason);
- Add a device cookie for iotlb invalidation and fault handling (Jean/Jason);
- Add capability/format query interface per device cookie (Jason);
- Specify format/attribute when creating an IOASID, leading to several v1 uAPI commands removed (Jason);
- Explain the value of software nesting (Jean);
- Replace IOASID_REGISTER_VIRTUAL_MEMORY with software nesting (David/Jason);
- Cover software mdev usage (Jason);
- No restriction on map/unmap vs. bind/invalidate (Jason/David);
- Report permitted IOVA range instead of reserved range (David);
- Refine the sample structures and helper functions (Jason);
- Add definition of default and non-default I/O address spaces;
- Expand and clarify the design for PASID virtualization;
- and lots of subtle refinement according to above changes;

TOC
1. Terminologies and Concepts
 1.1. Manage I/O address space
 1.2. Attach device to I/O address space
 1.3. Group isolation
 1.4. PASID virtualization
  1.4.1. Devices which don't support DMWr
  1.4.2. Devices which support DMWr
  1.4.3. Mix different types together
  1.4.4. User sequence
 1.5. No-snoop DMA
2. uAPI Proposal
 2.1. /dev/iommu uAPI
 2.2. /dev/vfio device uAPI
 2.3. /dev/kvm uAPI
3. Sample Structures and Helper Functions
4. Use Cases and Flows
 4.1. A simple example
 4.2. Multiple IOASIDs (no nesting)
 4.3. IOASID nesting (software)
 4.4. IOASID nesting (hardware)
 4.5. Guest SVA (vSVA)
 4.6. I/O page fault

1. Terminologies and Concepts
-

IOMMU fd is the container holding multiple I/O address spaces. The user manages those address spaces through fd operations. Multiple fd's are allowed per process, but with this proposal one fd should be sufficient for all intended usages.

IOASID is the fd-local software handle representing an I/O address space. Each IOASID is associated with a single I/O page table. IOASIDs can be nested together, implying that the output address from one I/O page table (represented by a child IOASID) must be further translated by another I/O page table (represented by the parent IOASID).

An I/O address space takes effect only after it is attached by a device. One device is allowed to attach to multiple I/O address spaces. One I/O address space can be attached by multiple devices. A device must be bound to an IOMMU fd before the attach operation can be conducted. Though not necessary, the user could bind one device to multiple IOMMU fd's. But no cross-fd IOASID nesting is allowed.
The format of an I/O page table must be compatible with the attached devices (or more specifically with the IOMMU which serves the DMA from the attached devices). The user is responsible for specifying the format when allocating an IOASID, according to the one or multiple devices which will be attached right after. Attaching a device to an IOASID with an incompatible format is simply rejected.

Relationship between IOMMU fd, VFIO fd and KVM fd:

- IOMMU fd provides uAPI for managing IOASIDs and I/O page tables. It also provides a unified capability/format reporting interface for each bound device.

- VFIO fd provides uAPI for device binding and attaching. In this proposal VFIO is used as the example of device passthrough frameworks. The routing information that identifies an I/O address space on the wire is per-device and registered to the IOMMU fd via the VFIO uAPI.

- KVM fd provides uAPI for handling no-snoop DMA and PASID virtualization in the CPU (when the PASID is carried in the instruction payload).

1.1. Manage I/O address space
+++