Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Tue, Jun 08, 2021 at 10:44:31AM +1000, David Gibson wrote:

> When you say "not using a drivers/iommu IOMMU interface" do you basically mean the device doesn't do DMA?

No, I mean the device doesn't use iommu_map() to manage the DMA mappings.

vfio_iommu_type1 has a special code path that mdev triggers that doesn't allocate an IOMMU domain and doesn't call iommu_map() or anything related to that.

Instead a mdev driver calls vfio_pin_pages(), which "reads" a fake page table and returns back the CPU pages for the mdev to DMA map however it likes.

> Now, we could represent those different sorts of isolation separately, but at the time our thinking was that we should group together devices that can't be safely isolated for *any* reason, since the practical upshot is the same: you can't safely split those devices between different owners.

It is fine, but the direction is going the other way: devices have perfect isolation and rely on special interactions with the iommu to get it.

> > What I don't like is forcing certain things depending on how the vfio_device was created - for instance forcing an IOMMU group as part of it, and forcing an ugly "SW IOMMU" mode in the container, only as part of mdev_device.
>
> I don't really see how this is depending on how the device is created.

static int vfio_iommu_type1_attach_group(void *iommu_data,
                                         struct iommu_group *iommu_group)
{
        if (vfio_bus_is_mdev(bus)) {

What the iommu code does depends on how the device was created. This is really ugly.

This is happening because the three objects in the model - driver/group/domain - are not being linked together in a way that reflects the modern world. The group has no idea what the driver wants, but is in charge of creating the domain on behalf of the device. And so people have created complicated hackery to pass information from the driver to the group, through the device, so that the group can create the right domain.

I want to see the driver simply create the right domain directly. It is much simpler and scales to more domain complexity.

Jason
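As an illustration of that last point, here is a minimal, hedged sketch of a driver creating and populating its own domain with the existing drivers/iommu API. The function name and the trimmed error handling are invented for illustration; a real driver would pick the domain type and the mappings it actually needs.

#include <linux/err.h>
#include <linux/iommu.h>

static struct iommu_domain *example_setup_domain(struct device *dev,
                                                 unsigned long iova,
                                                 phys_addr_t paddr,
                                                 size_t size)
{
        struct iommu_domain *domain;
        int ret;

        /* The driver, not the group code, decides to create a domain. */
        domain = iommu_domain_alloc(dev->bus);
        if (!domain)
                return ERR_PTR(-ENOMEM);

        ret = iommu_attach_device(domain, dev);
        if (ret)
                goto err_free;

        /* And the driver manages the mappings it needs. */
        ret = iommu_map(domain, iova, paddr, size,
                        IOMMU_READ | IOMMU_WRITE);
        if (ret)
                goto err_detach;

        return domain;

err_detach:
        iommu_detach_device(domain, dev);
err_free:
        iommu_domain_free(domain);
        return ERR_PTR(ret);
}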
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Tue, Jun 01, 2021 at 09:57:12AM -0300, Jason Gunthorpe wrote:

> On Tue, Jun 01, 2021 at 02:03:33PM +1000, David Gibson wrote:
> > On Thu, May 27, 2021 at 03:48:47PM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 27, 2021 at 02:58:30PM +1000, David Gibson wrote:
> > > > On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote:
> > > > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > > > > > 2. iommu backed mdev devices for SRIOV where mdev device is created per VF (mdev device == VF device) then that mdev device has same iommu protection scope as VF associated to it.
> > > > > This doesn't require, and certainly shouldn't create, a fake group.
> > > > It's only fake if you start with a narrow view of what a group is.
> > > A group is connected to drivers/iommu. A group object without *any* relation to drivers/iommu is just a complete fiction, IMHO.
> > That might be where we differ. As I've said, by "group" I'm primarily meaning the fundamental hardware unit of isolation. *Usually* that's determined by the capabilities of an IOMMU, but in some cases it might not be. In either case, the boundaries still matter.
>
> As in my other email we absolutely need a group concept, it is just a question of how the user API is designed around it.
>
> > > The group mdev implicitly creates is just a fake proxy that comes along with the mdev API. It doesn't do anything and it doesn't mean anything.
> > But.. the case of multiple mdevs managed by a single PCI device with an internal IOMMU also exists, and then the mdev groups are *not* proxies but true groups independent of the parent device. Which means that the group structure of mdevs can vary, which is an argument *for* keeping it, not against.
>
> If VFIO becomes more "vfio_device" centric then the vfio_device itself has some properties. One of those can be "is it inside a drivers/iommu group, or not?".
>
> If the vfio_device is not using a drivers/iommu IOMMU interface then it can just have no group at all - no reason to lie. This would mean that the device has perfect isolation.

When you say "not using a drivers/iommu IOMMU interface" do you basically mean the device doesn't do DMA?

I can see some benefit to that, but some drawbacks too. The *main* form of isolation (or lack thereof) that groups are about is the IOMMU, but groups can also represent other forms of isolation failure: e.g. a multifunction device, where function 0 has some debug registers which affect other functions. That's relevant whether or not any of those functions use DMA.

Now, we could represent those different sorts of isolation separately, but at the time our thinking was that we should group together devices that can't be safely isolated for *any* reason, since the practical upshot is the same: you can't safely split those devices between different owners.

> What I don't like is forcing certain things depending on how the vfio_device was created - for instance forcing an IOMMU group as part of it, and forcing an ugly "SW IOMMU" mode in the container, only as part of mdev_device.

I don't really see how this is depending on how the device is created. The current VFIO model is that every device always belongs to a group - but that group might be a singleton. That seems less complicated to me than having some devices with a group and some without.

> These should all be properties of the vfio_device itself.
>
> Again this is all about the group fd - and how to fit in with the /dev/ioasid proposal from Kevin:
>
> https://lore.kernel.org/kvm/mwhpr11mb1886422d4839b372c6ab245f8c...@mwhpr11mb1886.namprd11.prod.outlook.com/
>
> Focusing on vfio_device and skipping the group fd smooths out some rough edges.
>
> Code wise we are not quite there, but I have mapped out eliminating the group from the vfio_device centric API and a few other places it has crept in.
>
> The group can exist in the background to enforce security without being a cornerstone of the API design.
>
> Jason

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Tue, Jun 01, 2021 at 02:03:33PM +1000, David Gibson wrote:

> On Thu, May 27, 2021 at 03:48:47PM -0300, Jason Gunthorpe wrote:
> > On Thu, May 27, 2021 at 02:58:30PM +1000, David Gibson wrote:
> > > On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > > > > 2. iommu backed mdev devices for SRIOV where mdev device is created per VF (mdev device == VF device) then that mdev device has same iommu protection scope as VF associated to it.
> > > > This doesn't require, and certainly shouldn't create, a fake group.
> > > It's only fake if you start with a narrow view of what a group is.
> > A group is connected to drivers/iommu. A group object without *any* relation to drivers/iommu is just a complete fiction, IMHO.
> That might be where we differ. As I've said, by "group" I'm primarily meaning the fundamental hardware unit of isolation. *Usually* that's determined by the capabilities of an IOMMU, but in some cases it might not be. In either case, the boundaries still matter.

As in my other email we absolutely need a group concept, it is just a question of how the user API is designed around it.

> > The group mdev implicitly creates is just a fake proxy that comes along with the mdev API. It doesn't do anything and it doesn't mean anything.
> But.. the case of multiple mdevs managed by a single PCI device with an internal IOMMU also exists, and then the mdev groups are *not* proxies but true groups independent of the parent device. Which means that the group structure of mdevs can vary, which is an argument *for* keeping it, not against.

If VFIO becomes more "vfio_device" centric then the vfio_device itself has some properties. One of those can be "is it inside a drivers/iommu group, or not?".

If the vfio_device is not using a drivers/iommu IOMMU interface then it can just have no group at all - no reason to lie. This would mean that the device has perfect isolation.

What I don't like is forcing certain things depending on how the vfio_device was created - for instance forcing an IOMMU group as part of it, and forcing an ugly "SW IOMMU" mode in the container, only as part of mdev_device.

These should all be properties of the vfio_device itself.

Again this is all about the group fd - and how to fit in with the /dev/ioasid proposal from Kevin:

https://lore.kernel.org/kvm/mwhpr11mb1886422d4839b372c6ab245f8c...@mwhpr11mb1886.namprd11.prod.outlook.com/

Focusing on vfio_device and skipping the group fd smooths out some rough edges.

Code wise we are not quite there, but I have mapped out eliminating the group from the vfio_device centric API and a few other places it has crept in.

The group can exist in the background to enforce security without being a cornerstone of the API design.

Jason
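Purely as a hypothetical illustration of such "properties of the vfio_device itself" - the structure and field names below are invented, not taken from any posted series:

/* Hypothetical illustration only - not from any posted series. */
struct example_vfio_device {
        struct device *dev;
        /*
         * NULL when the device is not backed by drivers/iommu at all,
         * i.e. it has "perfect isolation" and no group to speak of.
         */
        struct iommu_group *group;
};

static bool example_has_iommu_group(struct example_vfio_device *vdev)
{
        return vdev->group != NULL;
}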
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, May 27, 2021 at 11:55:00PM +0530, Kirti Wankhede wrote:

> On 5/27/2021 10:30 AM, David Gibson wrote:
> > On Wed, May 26, 2021 at 02:48:03AM +0530, Kirti Wankhede wrote:
> > > On 5/26/2021 1:22 AM, Jason Gunthorpe wrote:
> > > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > > > > 2. iommu backed mdev devices for SRIOV where mdev device is created per VF (mdev device == VF device) then that mdev device has same iommu protection scope as VF associated to it.
> > > > This doesn't require, and certainly shouldn't create, a fake group.
> > > > Only the VF's real IOMMU group should be used to model an iommu domain linked to a VF. Injecting fake groups that are proxies for real groups only opens the possibility of security problems like David is concerned with.
> > > I think this security issue should be addressed by letting the mdev device inherit its parent's iommu_group, i.e. the VF's iommu_group here.
> > No, that doesn't work. AIUI part of the whole point of mdevs is to allow chunks of a single PCI function to be handed out to different places, because they're isolated from each other not by the system IOMMU, but by a combination of MMU hardware in the device itself (e.g. in a GPU card) and software in the mdev driver.
>
> That's correct for non-iommu backed mdev devices.
>
> > If mdevs inherited the group of their parent device they wouldn't count as isolated from each other, which they should.
>
> For iommu backed mdev devices for SRIOV, where there can be a single mdev device for its parent (here the parent device is a VF), there can't be multiple mdev devices associated with that VF. In this case the mdev can inherit the group of the parent device.

Ah, yes, if there's just one mdev for the PCI function, and the function doesn't have an internal memory protection unit, then this makes sense.

Which means we *do* have at least two meaningfully different group configurations for mdev:

 * mdev is in a singleton group independent of the parent PCI device
 * mdev shares a group with its parent PCI device

Which means even in the case of mdevs, the group structure is *not* a meaningless fiction.

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
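The two configurations can be written down in terms of the existing drivers/iommu API. A hedged sketch follows - the function names are invented and error handling is trimmed; it mirrors, but is not copied from, what the mdev core and an inheriting driver would do:

#include <linux/device.h>
#include <linux/err.h>
#include <linux/iommu.h>

/* Configuration 1: internally isolated mdev, singleton group of its own. */
static int example_singleton_group(struct device *mdev_dev)
{
        struct iommu_group *group;
        int ret;

        group = iommu_group_alloc();
        if (IS_ERR(group))
                return PTR_ERR(group);
        ret = iommu_group_add_device(group, mdev_dev);
        iommu_group_put(group); /* the group keeps its own reference */
        return ret;
}

/* Configuration 2: one mdev per VF, share the VF's real group. */
static int example_inherit_group(struct device *mdev_dev,
                                 struct device *vf_dev)
{
        struct iommu_group *group = iommu_group_get(vf_dev);
        int ret;

        if (!group)
                return -ENODEV;
        ret = iommu_group_add_device(group, mdev_dev);
        iommu_group_put(group); /* drop the ref iommu_group_get() took */
        return ret;
}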
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, May 27, 2021 at 04:06:20PM -0300, Jason Gunthorpe wrote:

> On Thu, May 27, 2021 at 02:53:42PM +1000, David Gibson wrote:
> > > > If the physical device had a bug which meant the mdevs *weren't* properly isolated from each other, then those mdevs would share a group, and you *would* care about it. Depending on how the isolation failed the mdevs might or might not also share a group with the parent physical device.
> > > That isn't a real scenario.. mdevs that can't be isolated just wouldn't be useful to exist
> > Really? So what do you do when you discover some mdevs you thought were isolated actually aren't due to a hardware bug? Drop support from the driver entirely? In which case what do you say to the people who understandably complain "but... we had all the mdevs in one guest anyway, we don't care if they're not isolated"?
>
> I've never said to eliminate groups entirely.
>
> What I'm saying is that all the cases we have for mdev today do not require groups, but are forced to create a fake group anyhow just to satisfy the odd VFIO requirement to have a group FD.
>
> If some future mdev needs groups then sure, add the appropriate group stuff.
>
> But that doesn't affect the decision to have a VFIO group FD, or not.
>
> > > > It ensures that they're parked at the moment the group moves from kernel to userspace ownership, but it can't prevent dpdk from accessing and unparking those devices via peer to peer DMA.
> > > Right, and adding all this group stuff did nothing to alert the poor admin that is running DPDK to this risk.
> > Didn't it? Seems to me that in order to give the group to DPDK, the admin had to find and unbind all the things in it... so is therefore aware that they're giving everything in it to DPDK.
>
> Again, I've never said the *group* should be removed. I'm only concerned about the *group FD*.

Ok, that wasn't really clear to me. I still wouldn't say the group for mdevs is a fiction though.. rather that the group device used for (no internal IOMMU case) mdevs is just plain wrong.

> When the admin found and unbound they didn't use the *group FD* in any way.

No, they are likely to have changed permissions on the group device node as part of the process, though.

> > > You put the same security labels you'd put on the group to the devices that constitute the group. It is only more tricky in the sense that the script that would have to do this will need to do more than ID the group to label but also ID the device members of the group and label their char nodes.
> > Well, I guess, if you take the view that root is allowed to break the kernel. I tend to prefer that although root can obviously break the kernel if they intend to, we should make it hard to do by accident - which in this case would mean the kernel *enforcing* that the devices in the group have the same security labels, which I can't really see how to do without an exposed group.
>
> How is this "break the kernel"? It has nothing to do with the kernel. Security labels are a user space concern.

*thinks*... yeah, ok, that was much too strong an assertion. What I was thinking of is the fact that this means that guarantees you'd normally expect the kernel to enforce can be obviated by bad configuration: chown-ing a device to root doesn't actually protect it if there's another device in the same group exposed to other users.

But I guess you could say the same about, say, an unauthenticated nbd export of a root-owned block device, so I guess that's not something the kernel can reasonably enforce.

Ok.. you might be finally convincing me, somewhat.

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, May 27, 2021 at 03:48:47PM -0300, Jason Gunthorpe wrote:

> On Thu, May 27, 2021 at 02:58:30PM +1000, David Gibson wrote:
> > On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote:
> > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > > > 2. iommu backed mdev devices for SRIOV where mdev device is created per VF (mdev device == VF device) then that mdev device has same iommu protection scope as VF associated to it.
> > > This doesn't require, and certainly shouldn't create, a fake group.
> > It's only fake if you start with a narrow view of what a group is.
>
> A group is connected to drivers/iommu. A group object without *any* relation to drivers/iommu is just a complete fiction, IMHO.

That might be where we differ. As I've said, by "group" I'm primarily meaning the fundamental hardware unit of isolation. *Usually* that's determined by the capabilities of an IOMMU, but in some cases it might not be. In either case, the boundaries still matter.

> > > Only the VF's real IOMMU group should be used to model an iommu domain linked to a VF. Injecting fake groups that are proxies for real groups only opens the possibility of security problems like David is concerned with.
> > It's not a proxy for a real group, it's a group of its own. If you discover that (due to a hardware bug, for example) the mdev is *not*
>
> What Kirti is talking about here is the case where a mdev is wrapped around a VF and the DMA isolation stems directly from the SRIOV VF's inherent DMA isolation, not anything the mdev wrapper did.
>
> The group providing the isolation is the VF's group.

Yes, in that case the mdev absolutely should be in the VF's group - having its own group is not just messy but incorrect.

> The group mdev implicitly creates is just a fake proxy that comes along with the mdev API. It doesn't do anything and it doesn't mean anything.

But.. the case of multiple mdevs managed by a single PCI device with an internal IOMMU also exists, and then the mdev groups are *not* proxies but true groups independent of the parent device. Which means that the group structure of mdevs can vary, which is an argument *for* keeping it, not against.

> > properly isolated from its parent PCI device, then both the mdev virtual device *and* the physical PCI device are in the same group. Groups including devices of different types and on different buses were considered from the start, and are precedented, if rare.
>
> This is far too theoretical for me. A security-broken mdev is functionally useless.

Is it, though? Again, I'm talking about the case of multiple mdevs with a single parent device (because that's the only case I was aware of until recently). Isolation comes from a device-internal IOMMU... that turns out to be broken. But if your security domain happens to include all the mdevs on the device anyway, then you don't care. Are you really going to say people can't use their fancy hardware in this mode because it has a security flaw that's not relevant to their usecase?

And then.. there's Kirti's case. In that case the mdev should belong to its parent PCI device's group since that's what's providing isolation. But in that case the parent device can be in a multi-device group for any of the usual reasons (PCIe-to-PCI bridge, PCIe switch with broken ACS, multifunction device with crosstalk). Which means the mdev also shares a group with those other devices.

So again, the group structure matters and is not a fiction.

> We don't need to support it, and we don't need complicated software to model it.
>
> Jason

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, May 27, 2021 at 02:53:42PM +1000, David Gibson wrote:

> > > If the physical device had a bug which meant the mdevs *weren't* properly isolated from each other, then those mdevs would share a group, and you *would* care about it. Depending on how the isolation failed the mdevs might or might not also share a group with the parent physical device.
> > That isn't a real scenario.. mdevs that can't be isolated just wouldn't be useful to exist
> Really? So what do you do when you discover some mdevs you thought were isolated actually aren't due to a hardware bug? Drop support from the driver entirely? In which case what do you say to the people who understandably complain "but... we had all the mdevs in one guest anyway, we don't care if they're not isolated"?

I've never said to eliminate groups entirely.

What I'm saying is that all the cases we have for mdev today do not require groups, but are forced to create a fake group anyhow just to satisfy the odd VFIO requirement to have a group FD.

If some future mdev needs groups then sure, add the appropriate group stuff.

But that doesn't affect the decision to have a VFIO group FD, or not.

> > > It ensures that they're parked at the moment the group moves from kernel to userspace ownership, but it can't prevent dpdk from accessing and unparking those devices via peer to peer DMA.
> > Right, and adding all this group stuff did nothing to alert the poor admin that is running DPDK to this risk.
> Didn't it? Seems to me that in order to give the group to DPDK, the admin had to find and unbind all the things in it... so is therefore aware that they're giving everything in it to DPDK.

Again, I've never said the *group* should be removed. I'm only concerned about the *group FD*.

When the admin found and unbound they didn't use the *group FD* in any way.

> > You put the same security labels you'd put on the group to the devices that constitute the group. It is only more tricky in the sense that the script that would have to do this will need to do more than ID the group to label but also ID the device members of the group and label their char nodes.
> Well, I guess, if you take the view that root is allowed to break the kernel. I tend to prefer that although root can obviously break the kernel if they intend to, we should make it hard to do by accident - which in this case would mean the kernel *enforcing* that the devices in the group have the same security labels, which I can't really see how to do without an exposed group.

How is this "break the kernel"? It has nothing to do with the kernel. Security labels are a user space concern.

Jason
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, May 27, 2021 at 02:58:30PM +1000, David Gibson wrote:

> On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > > 2. iommu backed mdev devices for SRIOV where mdev device is created per VF (mdev device == VF device) then that mdev device has same iommu protection scope as VF associated to it.
> > This doesn't require, and certainly shouldn't create, a fake group.
> It's only fake if you start with a narrow view of what a group is.

A group is connected to drivers/iommu. A group object without *any* relation to drivers/iommu is just a complete fiction, IMHO.

> > Only the VF's real IOMMU group should be used to model an iommu domain linked to a VF. Injecting fake groups that are proxies for real groups only opens the possibility of security problems like David is concerned with.
> It's not a proxy for a real group, it's a group of its own. If you discover that (due to a hardware bug, for example) the mdev is *not*

What Kirti is talking about here is the case where a mdev is wrapped around a VF and the DMA isolation stems directly from the SRIOV VF's inherent DMA isolation, not anything the mdev wrapper did.

The group providing the isolation is the VF's group.

The group mdev implicitly creates is just a fake proxy that comes along with the mdev API. It doesn't do anything and it doesn't mean anything.

> properly isolated from its parent PCI device, then both the mdev virtual device *and* the physical PCI device are in the same group. Groups including devices of different types and on different buses were considered from the start, and are precedented, if rare.

This is far too theoretical for me. A security-broken mdev is functionally useless.

We don't need to support it, and we don't need complicated software to model it.

Jason
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On 5/27/2021 10:30 AM, David Gibson wrote:

> On Wed, May 26, 2021 at 02:48:03AM +0530, Kirti Wankhede wrote:
> > On 5/26/2021 1:22 AM, Jason Gunthorpe wrote:
> > > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > > > 2. iommu backed mdev devices for SRIOV where mdev device is created per VF (mdev device == VF device) then that mdev device has same iommu protection scope as VF associated to it.
> > > This doesn't require, and certainly shouldn't create, a fake group.
> > > Only the VF's real IOMMU group should be used to model an iommu domain linked to a VF. Injecting fake groups that are proxies for real groups only opens the possibility of security problems like David is concerned with.
> > I think this security issue should be addressed by letting the mdev device inherit its parent's iommu_group, i.e. the VF's iommu_group here.
> No, that doesn't work. AIUI part of the whole point of mdevs is to allow chunks of a single PCI function to be handed out to different places, because they're isolated from each other not by the system IOMMU, but by a combination of MMU hardware in the device itself (e.g. in a GPU card) and software in the mdev driver.

That's correct for non-iommu backed mdev devices.

> If mdevs inherited the group of their parent device they wouldn't count as isolated from each other, which they should.

For iommu backed mdev devices for SRIOV, where there can be a single mdev device for its parent (here the parent device is a VF), there can't be multiple mdev devices associated with that VF. In this case the mdev can inherit the group of the parent device.

Kirti
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Mon, May 24, 2021 at 08:37:44PM -0300, Jason Gunthorpe wrote:

> On Mon, May 24, 2021 at 05:52:58PM +1000, David Gibson wrote:
> > > > I don't really see a semantic distinction between "always one-device groups" and "groups don't matter". Really the only way you can afford to not care about groups is if they're singletons.
> > > The kernel driver under the mdev may not be in an "always one-device" group.
> > I don't really understand what you mean by that.
>
> I mean the group of the mdev's actual DMA device may have multiple things in it.
>
> > > It is a kernel driver so the only thing we know and care about is that all devices in the HW group are bound to kernel drivers.
> > > The vfio device that spawns from this kernel driver is really a "groups don't matter" vfio device because at the IOMMU layer it should be riding on the physical group of the kernel driver. At the VFIO layer we no longer care about the group abstraction because the system guarantees isolation in some other way.
> > Uh.. I don't really know how mdevs are isolated from each other. I thought it was because the physical device providing the mdevs effectively had an internal IOMMU (or at least DMA permissioning) to isolate the mdevs, even though the physical device may not be fully isolated.
> > In that case the virtual mdev is effectively in a singleton group, which is different from the group of its parent device.
>
> That is one way to view it, but it means creating a whole group infrastructure and abusing the IOMMU stack just to create this nonsense fiction.

It's a nonsense fiction until it's not, at which point it will bite you in the arse.

> We also abuse the VFIO container stuff to hackily create several different types of IOMMU uAPIs for the mdev - all of which are unrelated to drivers/iommu.
>
> Basically, there is no drivers/iommu thing involved, thus there is no real iommu group; for mdev it is all a big hacky lie.

Well, "iommu" group might not be the best name, but hardware isolation is still a real concern here, even if it's not entirely related to the IOMMU.

> > If the physical device had a bug which meant the mdevs *weren't* properly isolated from each other, then those mdevs would share a group, and you *would* care about it. Depending on how the isolation failed the mdevs might or might not also share a group with the parent physical device.
>
> That isn't a real scenario.. mdevs that can't be isolated just wouldn't be useful to exist

Really? So what do you do when you discover some mdevs you thought were isolated actually aren't due to a hardware bug? Drop support from the driver entirely? In which case what do you say to the people who understandably complain "but... we had all the mdevs in one guest anyway, we don't care if they're not isolated"?

> > > This is today's model, yes. When you run dpdk on a multi-group device vfio already ensures that all the device groups remained parked and inaccessible.
> > I'm not really following what you're saying there.
> > If you have a multi-device group, and dpdk is using one device in it, VFIO *does not* (and cannot) ensure that other devices in the group are parked and inaccessible.
>
> I mean in the sense that no other user space can open those devices and no kernel driver can later be attached to them.

Ok.

> > It ensures that they're parked at the moment the group moves from kernel to userspace ownership, but it can't prevent dpdk from accessing and unparking those devices via peer to peer DMA.
>
> Right, and adding all this group stuff did nothing to alert the poor admin that is running DPDK to this risk.

Didn't it? Seems to me that in order to give the group to DPDK, the admin had to find and unbind all the things in it... so is therefore aware that they're giving everything in it to DPDK.

> > > If the administrator configures the system with different security labels for different VFIO devices then yes removing groups makes this more tricky as all devices in the group should have the same label.
> > That seems a bigger problem than "more tricky". How would you propose addressing this with your device-first model?
>
> You put the same security labels you'd put on the group to the devices that constitute the group. It is only more tricky in the sense that the script that would have to do this will need to do more than ID the group to label but also ID the device members of the group and label their char nodes.

Well, I guess, if you take the view that root is allowed to break the kernel. I tend to prefer that although root can obviously break the kernel if they intend to, we should make it hard to do by accident - which in this case would mean the kernel *enforcing* that the devices in the group have the same security labels, which I can't really see how to do without an exposed group.

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote:

> On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > 2. iommu backed mdev devices for SRIOV where mdev device is created per VF (mdev device == VF device) then that mdev device has same iommu protection scope as VF associated to it.
>
> This doesn't require, and certainly shouldn't create, a fake group.

It's only fake if you start with a narrow view of what a group is. A group is a set of devices (in the kernel sense of "device", not necessarily the hardware sense) which can't be isolated from each other. The mdev device is a kernel device, and if working as intended it can be isolated from everything else, and is therefore in an absolute bona fide group of its own.

> Only the VF's real IOMMU group should be used to model an iommu domain linked to a VF. Injecting fake groups that are proxies for real groups only opens the possibility of security problems like David is concerned with.

It's not a proxy for a real group, it's a group of its own. If you discover that (due to a hardware bug, for example) the mdev is *not* properly isolated from its parent PCI device, then both the mdev virtual device *and* the physical PCI device are in the same group. Groups including devices of different types and on different buses were considered from the start, and are precedented, if rare.

> Max's series approaches this properly by fully linking the struct pci_device of the VF throughout the entire VFIO scheme, including the group and container, while still allowing override of various VFIO operations.
>
> Jason

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
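One concrete consequence of "can't be isolated from each other": before any device in a group is handed to a user, every other device in the group has to be accounted for. A simplified, hedged sketch of that kind of check, modeled loosely on VFIO's group viability test (the function names here are invented):

#include <linux/device.h>
#include <linux/errno.h>
#include <linux/iommu.h>
#include <linux/string.h>

static int example_dev_viable(struct device *dev, void *data)
{
        const char *allowed = data; /* e.g. "vfio-pci" */

        /* A device with no driver at all is parked; that is acceptable. */
        if (!dev->driver)
                return 0;
        return strcmp(dev->driver->name, allowed) ? -EBUSY : 0;
}

static bool example_group_viable(struct iommu_group *group)
{
        /* Every member must be parked or bound to the allowed driver. */
        return iommu_group_for_each_dev(group, (void *)"vfio-pci",
                                        example_dev_viable) == 0;
}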
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, May 26, 2021 at 02:48:03AM +0530, Kirti Wankhede wrote:

> On 5/26/2021 1:22 AM, Jason Gunthorpe wrote:
> > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > > 2. iommu backed mdev devices for SRIOV where mdev device is created per VF (mdev device == VF device) then that mdev device has same iommu protection scope as VF associated to it.
> > This doesn't require, and certainly shouldn't create, a fake group.
> > Only the VF's real IOMMU group should be used to model an iommu domain linked to a VF. Injecting fake groups that are proxies for real groups only opens the possibility of security problems like David is concerned with.
>
> I think this security issue should be addressed by letting the mdev device inherit its parent's iommu_group, i.e. the VF's iommu_group here.

No, that doesn't work. AIUI part of the whole point of mdevs is to allow chunks of a single PCI function to be handed out to different places, because they're isolated from each other not by the system IOMMU, but by a combination of MMU hardware in the device itself (e.g. in a GPU card) and software in the mdev driver. If mdevs inherited the group of their parent device they wouldn't count as isolated from each other, which they should.

> Kirti
>
> > Max's series approaches this properly by fully linking the struct pci_device of the VF throughout the entire VFIO scheme, including the group and container, while still allowing override of various VFIO operations.
> >
> > Jason

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, May 26, 2021 at 12:59:05PM -0600, Alex Williamson wrote:

> A driver-provided sysfs attribute would obviously fill the short-term gap; long term, maybe this would be standardized via netlink. It seems a bit analogous to setting the MAC address for a VF on an SR-IOV NIC or VF namespace configuration for an SR-IOV NVMe device. Thanks,

We have been doing a lot of work in netlink/devlink to program the VF settings before starting the VF driver.

I've long thought it would be good to standardize a VF lifecycle in Linux: e.g. VFs start their life in some 'unconfigured' state and drivers don't bind; then userspace can use the PF to perform configuration; finally the driver can start on a configured VF.

We are already doing this model for the mlx5 'sub function' interfaces, which have a lot of analogs to both VFs and Intel's SIOV ADIs. It seems to be working.

Jason
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, 26 May 2021 23:40:02 +0530 Kirti Wankhede wrote:

> On 5/26/2021 4:22 AM, Alex Williamson wrote:
> > On Wed, 26 May 2021 00:56:30 +0530 Kirti Wankhede wrote:
> > > On 5/25/2021 5:07 AM, Jason Gunthorpe wrote:
> > > > On Mon, May 24, 2021 at 05:52:58PM +1000, David Gibson wrote:
> > > > > > > I don't really see a semantic distinction between "always one-device groups" and "groups don't matter". Really the only way you can afford to not care about groups is if they're singletons.
> > > > > > The kernel driver under the mdev may not be in an "always one-device" group.
> > > > > I don't really understand what you mean by that.
> > > > I mean the group of the mdev's actual DMA device may have multiple things in it.
> > > > > > It is a kernel driver so the only thing we know and care about is that all devices in the HW group are bound to kernel drivers.
> > > > > > The vfio device that spawns from this kernel driver is really a "groups don't matter" vfio device because at the IOMMU layer it should be riding on the physical group of the kernel driver. At the VFIO layer we no longer care about the group abstraction because the system guarantees isolation in some other way.
> > > > > Uh.. I don't really know how mdevs are isolated from each other. I thought it was because the physical device providing the mdevs effectively had an internal IOMMU (or at least DMA permissioning) to isolate the mdevs, even though the physical device may not be fully isolated.
> > > > > In that case the virtual mdev is effectively in a singleton group, which is different from the group of its parent device.
> > >
> > > That's correct.
> > >
> > > > That is one way to view it, but it means creating a whole group infrastructure and abusing the IOMMU stack just to create this nonsense fiction.
> > >
> > > I really didn't get how this abuses the IOMMU stack. mdev can be used in 3 different ways:
> > >
> > > 1. non-iommu backed mdev devices where mdev vendor driver takes care to DMA map (iommu_map) and isolation is through device hardware internal MMU. Here vfio_iommu_type1 module provides a way to validate and pin pages required by mdev device for DMA mapping. Then IOMMU mapping is done by mdev vendor driver which is owner driver of physical device.
> > >
> > > 2. iommu backed mdev devices for SRIOV where mdev device is created per VF (mdev device == VF device) then that mdev device has same iommu protection scope as VF associated to it. Here mdev device is virtual device which uses features of mdev and represents underlying VF device, same as vfio-pci but with additional mdev features.
> >
> > What features would those be? There are no mdev specific parts of the vfio uAPI.
> >
> > The mdev device is a virtual device, but why is it virtual in this case? Aren't we effectively assigning the VF itself (mdev device == VF device) with a bunch of extra support code to fill in the gaps of the VF implementing the complete device model in hardware?
> >
> > We're effectively creating this virtual device, creating a fake IOMMU group, and trying to create this association of this virtual device to the real VF in order to shoehorn it into the mdev model. What do we get from that model other than lifecycle management (i.e. type selection) and re-use of a bunch of code from the driver supporting the 1) model above?
>
> Yes, the lifecycle management which is in mdev is not in the vfio-pci variant.
>
> > This specific model seems better served by a device specific peer driver to vfio-pci (i.e. a "vfio-pci variant"). You effectively already have the code for this driver, it's just in the format of an mdev driver rather than a vfio "bus driver". The work Jason references relative to Max aims to make these kinds of drivers easier to implement through re-use of vfio-pci code.
> >
> > There are certainly other solutions we could come up with for selecting a specific device type for a vfio-pci variant driver to implement other than pretending this model actually belongs in mdev, right? Thanks,
>
> Sure, and I would like to see the type selection mechanism implemented in a vfio-pci variant.

A driver-provided sysfs attribute would obviously fill the short-term gap; long term, maybe this would be standardized via netlink. It seems a bit analogous to setting the MAC address for a VF on an SR-IOV NIC or VF namespace configuration for an SR-IOV NVMe device. Thanks,

Alex
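A hedged sketch of the short-term option Alex describes - a driver-provided sysfs attribute for type selection. The attribute name and the example_set_type() helper are invented; only the sysfs plumbing (DEVICE_ATTR_WO(), sysfs_streq()) is standard kernel API:

#include <linux/device.h>
#include <linux/sysfs.h>

/* Hypothetical hook into the driver's device-model selection. */
static int example_set_type(struct device *dev, int type)
{
        return 0; /* a real driver would reconfigure itself here */
}

static ssize_t example_type_store(struct device *dev,
                                  struct device_attribute *attr,
                                  const char *buf, size_t count)
{
        if (sysfs_streq(buf, "type-a"))
                return example_set_type(dev, 0) ?: count;
        if (sysfs_streq(buf, "type-b"))
                return example_set_type(dev, 1) ?: count;
        return -EINVAL;
}
static DEVICE_ATTR_WO(example_type);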
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On 5/26/2021 4:22 AM, Alex Williamson wrote:

> On Wed, 26 May 2021 00:56:30 +0530 Kirti Wankhede wrote:
> > On 5/25/2021 5:07 AM, Jason Gunthorpe wrote:
> > > On Mon, May 24, 2021 at 05:52:58PM +1000, David Gibson wrote:
> > > > > > I don't really see a semantic distinction between "always one-device groups" and "groups don't matter". Really the only way you can afford to not care about groups is if they're singletons.
> > > > > The kernel driver under the mdev may not be in an "always one-device" group.
> > > > I don't really understand what you mean by that.
> > > I mean the group of the mdev's actual DMA device may have multiple things in it.
> > > > > It is a kernel driver so the only thing we know and care about is that all devices in the HW group are bound to kernel drivers.
> > > > > The vfio device that spawns from this kernel driver is really a "groups don't matter" vfio device because at the IOMMU layer it should be riding on the physical group of the kernel driver. At the VFIO layer we no longer care about the group abstraction because the system guarantees isolation in some other way.
> > > > Uh.. I don't really know how mdevs are isolated from each other. I thought it was because the physical device providing the mdevs effectively had an internal IOMMU (or at least DMA permissioning) to isolate the mdevs, even though the physical device may not be fully isolated.
> > > > In that case the virtual mdev is effectively in a singleton group, which is different from the group of its parent device.
> >
> > That's correct.
> >
> > > That is one way to view it, but it means creating a whole group infrastructure and abusing the IOMMU stack just to create this nonsense fiction.
> >
> > I really didn't get how this abuses the IOMMU stack. mdev can be used in 3 different ways:
> >
> > 1. non-iommu backed mdev devices where mdev vendor driver takes care to DMA map (iommu_map) and isolation is through device hardware internal MMU. Here vfio_iommu_type1 module provides a way to validate and pin pages required by mdev device for DMA mapping. Then IOMMU mapping is done by mdev vendor driver which is owner driver of physical device.
> >
> > 2. iommu backed mdev devices for SRIOV where mdev device is created per VF (mdev device == VF device) then that mdev device has same iommu protection scope as VF associated to it. Here mdev device is virtual device which uses features of mdev and represents underlying VF device, same as vfio-pci but with additional mdev features.
>
> What features would those be? There are no mdev specific parts of the vfio uAPI.
>
> The mdev device is a virtual device, but why is it virtual in this case? Aren't we effectively assigning the VF itself (mdev device == VF device) with a bunch of extra support code to fill in the gaps of the VF implementing the complete device model in hardware?
>
> We're effectively creating this virtual device, creating a fake IOMMU group, and trying to create this association of this virtual device to the real VF in order to shoehorn it into the mdev model. What do we get from that model other than lifecycle management (i.e. type selection) and re-use of a bunch of code from the driver supporting the 1) model above?

Yes, the lifecycle management which is in mdev is not in the vfio-pci variant.

> This specific model seems better served by a device specific peer driver to vfio-pci (i.e. a "vfio-pci variant"). You effectively already have the code for this driver, it's just in the format of an mdev driver rather than a vfio "bus driver". The work Jason references relative to Max aims to make these kinds of drivers easier to implement through re-use of vfio-pci code.
>
> There are certainly other solutions we could come up with for selecting a specific device type for a vfio-pci variant driver to implement other than pretending this model actually belongs in mdev, right? Thanks,

Sure, and I would like to see the type selection mechanism implemented in a vfio-pci variant.

Thanks, Kirti
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, 26 May 2021 00:56:30 +0530 Kirti Wankhede wrote:

> On 5/25/2021 5:07 AM, Jason Gunthorpe wrote:
> > On Mon, May 24, 2021 at 05:52:58PM +1000, David Gibson wrote:
> > > > > I don't really see a semantic distinction between "always one-device groups" and "groups don't matter". Really the only way you can afford to not care about groups is if they're singletons.
> > > > The kernel driver under the mdev may not be in an "always one-device" group.
> > > I don't really understand what you mean by that.
> > I mean the group of the mdev's actual DMA device may have multiple things in it.
> > > > It is a kernel driver so the only thing we know and care about is that all devices in the HW group are bound to kernel drivers.
> > > > The vfio device that spawns from this kernel driver is really a "groups don't matter" vfio device because at the IOMMU layer it should be riding on the physical group of the kernel driver. At the VFIO layer we no longer care about the group abstraction because the system guarantees isolation in some other way.
> > > Uh.. I don't really know how mdevs are isolated from each other. I thought it was because the physical device providing the mdevs effectively had an internal IOMMU (or at least DMA permissioning) to isolate the mdevs, even though the physical device may not be fully isolated.
> > > In that case the virtual mdev is effectively in a singleton group, which is different from the group of its parent device.
>
> That's correct.
>
> > That is one way to view it, but it means creating a whole group infrastructure and abusing the IOMMU stack just to create this nonsense fiction.
>
> I really didn't get how this abuses the IOMMU stack. mdev can be used in 3 different ways:
>
> 1. non-iommu backed mdev devices where mdev vendor driver takes care to DMA map (iommu_map) and isolation is through device hardware internal MMU. Here vfio_iommu_type1 module provides a way to validate and pin pages required by mdev device for DMA mapping. Then IOMMU mapping is done by mdev vendor driver which is owner driver of physical device.
>
> 2. iommu backed mdev devices for SRIOV where mdev device is created per VF (mdev device == VF device) then that mdev device has same iommu protection scope as VF associated to it. Here mdev device is virtual device which uses features of mdev and represents underlying VF device, same as vfio-pci but with additional mdev features.

What features would those be? There are no mdev specific parts of the vfio uAPI.

The mdev device is a virtual device, but why is it virtual in this case? Aren't we effectively assigning the VF itself (mdev device == VF device) with a bunch of extra support code to fill in the gaps of the VF implementing the complete device model in hardware?

We're effectively creating this virtual device, creating a fake IOMMU group, and trying to create this association of this virtual device to the real VF in order to shoehorn it into the mdev model. What do we get from that model other than lifecycle management (i.e. type selection) and re-use of a bunch of code from the driver supporting the 1) model above?

This specific model seems better served by a device specific peer driver to vfio-pci (i.e. a "vfio-pci variant"). You effectively already have the code for this driver, it's just in the format of an mdev driver rather than a vfio "bus driver". The work Jason references relative to Max aims to make these kinds of drivers easier to implement through re-use of vfio-pci code.

There are certainly other solutions we could come up with for selecting a specific device type for a vfio-pci variant driver to implement other than pretending this model actually belongs in mdev, right? Thanks,

Alex
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On 5/26/2021 1:22 AM, Jason Gunthorpe wrote:

> On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > 2. iommu backed mdev devices for SRIOV where mdev device is created per VF (mdev device == VF device) then that mdev device has same iommu protection scope as VF associated to it.
>
> This doesn't require, and certainly shouldn't create, a fake group.
>
> Only the VF's real IOMMU group should be used to model an iommu domain linked to a VF. Injecting fake groups that are proxies for real groups only opens the possibility of security problems like David is concerned with.

I think this security issue should be addressed by letting the mdev device inherit its parent's iommu_group, i.e. the VF's iommu_group here.

Kirti

> Max's series approaches this properly by fully linking the struct pci_device of the VF throughout the entire VFIO scheme, including the group and container, while still allowing override of various VFIO operations.
>
> Jason
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:

> 2. iommu backed mdev devices for SRIOV where mdev device is created per VF (mdev device == VF device) then that mdev device has same iommu protection scope as VF associated to it.

This doesn't require, and certainly shouldn't create, a fake group.

Only the VF's real IOMMU group should be used to model an iommu domain linked to a VF. Injecting fake groups that are proxies for real groups only opens the possibility of security problems like David is concerned with.

Max's series approaches this properly by fully linking the struct pci_device of the VF throughout the entire VFIO scheme, including the group and container, while still allowing override of various VFIO operations.

Jason
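For reference, a hedged sketch of this direction in the style of a "vfio-pci variant" driver: register the VF's own struct device with VFIO so that the VF's real IOMMU group flows through to the group and container, with no proxy group in between. This is inspired by, not copied from, Max's series; the ops structure and names are invented, and vfio_add_group_dev() is the registration entry point VFIO exposed around the time of this thread:

#include <linux/pci.h>
#include <linux/vfio.h>

/* Override only the operations the variant driver needs to change. */
static const struct vfio_device_ops example_vf_ops = {
        .name = "example-vf",
        /* .open/.release/.read/.write/.ioctl/.mmap as required */
};

static int example_vf_probe(struct pci_dev *pdev,
                            const struct pci_device_id *id)
{
        /* The VF's own device, and therefore its real group, is used. */
        return vfio_add_group_dev(&pdev->dev, &example_vf_ops, pdev);
}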
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On 5/25/2021 5:07 AM, Jason Gunthorpe wrote:

> On Mon, May 24, 2021 at 05:52:58PM +1000, David Gibson wrote:
> > > > I don't really see a semantic distinction between "always one-device groups" and "groups don't matter". Really the only way you can afford to not care about groups is if they're singletons.
> > > The kernel driver under the mdev may not be in an "always one-device" group.
> > I don't really understand what you mean by that.
> I mean the group of the mdev's actual DMA device may have multiple things in it.
> > > It is a kernel driver so the only thing we know and care about is that all devices in the HW group are bound to kernel drivers.
> > > The vfio device that spawns from this kernel driver is really a "groups don't matter" vfio device because at the IOMMU layer it should be riding on the physical group of the kernel driver. At the VFIO layer we no longer care about the group abstraction because the system guarantees isolation in some other way.
> > Uh.. I don't really know how mdevs are isolated from each other. I thought it was because the physical device providing the mdevs effectively had an internal IOMMU (or at least DMA permissioning) to isolate the mdevs, even though the physical device may not be fully isolated.
> > In that case the virtual mdev is effectively in a singleton group, which is different from the group of its parent device.

That's correct.

> That is one way to view it, but it means creating a whole group infrastructure and abusing the IOMMU stack just to create this nonsense fiction.

I really didn't get how this abuses the IOMMU stack. mdev can be used in 3 different ways:

1. non-iommu backed mdev devices where mdev vendor driver takes care to DMA map (iommu_map) and isolation is through device hardware internal MMU. Here vfio_iommu_type1 module provides a way to validate and pin pages required by mdev device for DMA mapping. Then IOMMU mapping is done by mdev vendor driver which is owner driver of physical device (see the sketch after this message).

2. iommu backed mdev devices for SRIOV where mdev device is created per VF (mdev device == VF device) then that mdev device has same iommu protection scope as VF associated to it. Here mdev device is virtual device which uses features of mdev and represents underlying VF device, same as vfio-pci but with additional mdev features.

3. iommu backed mdev devices for PASID with aux feature. I would not comment on this, there has been a long discussion on this.

I don't think this is abusing the IOMMU stack, at least for 1 and 2 above.

Thanks, Kirti

> We also abuse the VFIO container stuff to hackily create several different types of IOMMU uAPIs for the mdev - all of which are unrelated to drivers/iommu.
>
> Basically, there is no drivers/iommu thing involved, thus there is no real iommu group; for mdev it is all a big hacky lie.
>
> > If the physical device had a bug which meant the mdevs *weren't* properly isolated from each other, then those mdevs would share a group, and you *would* care about it. Depending on how the isolation failed the mdevs might or might not also share a group with the parent physical device.
>
> That isn't a real scenario.. mdevs that can't be isolated just wouldn't be useful to exist
>
> > > This is today's model, yes. When you run dpdk on a multi-group device vfio already ensures that all the device groups remained parked and inaccessible.
> >
> > I'm not really following what you're saying there.
> >
> > If you have a multi-device group, and dpdk is using one device in it, VFIO *does not* (and cannot) ensure that other devices in the group are parked and inaccessible.
>
> I mean in the sense that no other user space can open those devices and no kernel driver can later be attached to them.
>
> > It ensures that they're parked at the moment the group moves from kernel to userspace ownership, but it can't prevent dpdk from accessing and unparking those devices via peer to peer DMA.
>
> Right, and adding all this group stuff did nothing to alert the poor admin that is running DPDK to this risk.
>
> > > If the administrator configures the system with different security labels for different VFIO devices then yes removing groups makes this more tricky as all devices in the group should have the same label.
> >
> > That seems a bigger problem than "more tricky". How would you propose addressing this with your device-first model?
>
> You put the same security labels you'd put on the group to the devices that constitute the group. It is only more tricky in the sense that the script that would have to do this will need to do more than ID the group to label but also ID the device members of the group and label their char nodes.
>
> Jason
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Mon, May 24, 2021 at 05:52:58PM +1000, David Gibson wrote:
> > > I don't really see a semantic distinction between "always one-device groups" and "groups don't matter". Really the only way you can afford to not care about groups is if they're singletons.
> >
> > The kernel driver under the mdev may not be in an "always one-device" group.
>
> I don't really understand what you mean by that.

I mean the group of the mdev's actual DMA device may have multiple things in it.

> > It is a kernel driver so the only thing we know and care about is that all devices in the HW group are bound to kernel drivers.
> >
> > The vfio device that spawns from this kernel driver is really a "groups don't matter" vfio device because at the IOMMU layer it should be riding on the physical group of the kernel driver. At the VFIO layer we no longer care about the group abstraction because the system guarantees isolation in some other way.
>
> Uh.. I don't really know how mdevs are isolated from each other. I thought it was because the physical device providing the mdevs effectively had an internal IOMMU (or at least DMA permissioning) to isolate the mdevs, even though the physical device may not be fully isolated.
>
> In that case the virtual mdev is effectively in a singleton group, which is different from the group of its parent device.

That is one way to view it, but it means creating a whole group infrastructure and abusing the IOMMU stack just to create this nonsense fiction.

We also abuse the VFIO container stuff to hackily create several different types of IOMMU uAPIs for the mdev - all of which are unrelated to drivers/iommu.

Basically, there is no drivers/iommu thing involved, thus there is no real iommu group; for mdev it is all a big hacky lie.

> If the physical device had a bug which meant the mdevs *weren't* properly isolated from each other, then those mdevs would share a group, and you *would* care about it. Depending on how the isolation failed the mdevs might or might not also share a group with the parent physical device.

That isn't a real scenario.. mdevs that can't be isolated just wouldn't be useful to exist.

> > This is today's model, yes. When you run dpdk on a multi-group device vfio already ensures that all the device groups remained parked and inaccessible.
>
> I'm not really following what you're saying there.
>
> If you have a multi-device group, and dpdk is using one device in it, VFIO *does not* (and cannot) ensure that other devices in the group are parked and inaccessible.

I mean in the sense that no other user space can open those devices and no kernel driver can later be attached to them.

> It ensures that they're parked at the moment the group moves from kernel to userspace ownership, but it can't prevent dpdk from accessing and unparking those devices via peer to peer DMA.

Right, and adding all this group stuff did nothing to alert the poor admin that is running DPDK to this risk.

> > If the administrator configures the system with different security labels for different VFIO devices then yes removing groups makes this more tricky as all devices in the group should have the same label.
>
> That seems a bigger problem than "more tricky". How would you propose addressing this with your device-first model?

You put the same security labels you'd put on the group onto the devices that constitute the group.

It is only more tricky in the sense that the script that does this will need to do more than identify the group to label: it will also have to identify the device members of the group and label their char nodes.

Jason
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
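A sketch of what that labeling script would have to do in a device-first model: walk the device's IOMMU group through sysfs and apply the same label to every sibling's char node. The sysfs layout is the real one; the /dev/vfio/devices/* per-device nodes are hypothetical here (today only the per-group node exists), and plain DAC ownership stands in for whatever labeling scheme is actually in use.

```c
#include <dirent.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

/*
 * Given one PCI device (by BDF), find all members of its IOMMU group
 * via the iommu_group/devices sysfs directory and "label" the char
 * node of each one. The per-device node path is hypothetical.
 */
static void label_group_members(const char *bdf, uid_t uid, gid_t gid)
{
	char path[256];
	DIR *dir;
	struct dirent *ent;

	/* e.g. /sys/bus/pci/devices/0000:01:00.0/iommu_group/devices */
	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%s/iommu_group/devices", bdf);
	dir = opendir(path);
	if (!dir)
		return;

	while ((ent = readdir(dir)) != NULL) {
		char node[256];

		if (ent->d_name[0] == '.')
			continue;
		/* Hypothetical per-device char node in a device-first model */
		snprintf(node, sizeof(node), "/dev/vfio/devices/%s",
			 ent->d_name);
		chown(node, uid, gid);
	}
	closedir(dir);
}
```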
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, May 13, 2021 at 10:59:38AM -0300, Jason Gunthorpe wrote:
> On Thu, May 13, 2021 at 03:48:19PM +1000, David Gibson wrote:
> > On Mon, May 03, 2021 at 01:15:18PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Apr 29, 2021 at 01:04:05PM +1000, David Gibson wrote:
> > > > Again, I don't know enough about VDPA to make sense of that. Are we essentially talking non-PCI virtual devices here? In which case you could define the VDPA "bus" to always have one-device groups.
> > >
> > > It is much worse than that.
> > >
> > > What these non-PCI devices need is for the kernel driver to be part of the IOMMU group of the underlying PCI device but tell VFIO land that "groups don't matter"
> >
> > I don't really see a semantic distinction between "always one-device groups" and "groups don't matter". Really the only way you can afford to not care about groups is if they're singletons.
>
> The kernel driver under the mdev may not be in an "always one-device" group.

I don't really understand what you mean by that.

> It is a kernel driver so the only thing we know and care about is that all devices in the HW group are bound to kernel drivers.
>
> The vfio device that spawns from this kernel driver is really a "groups don't matter" vfio device because at the IOMMU layer it should be riding on the physical group of the kernel driver. At the VFIO layer we no longer care about the group abstraction because the system guarantees isolation in some other way.

Uh.. I don't really know how mdevs are isolated from each other. I thought it was because the physical device providing the mdevs effectively had an internal IOMMU (or at least DMA permissioning) to isolate the mdevs, even though the physical device may not be fully isolated.

In that case the virtual mdev is effectively in a singleton group, which is different from the group of its parent device.

If the physical device had a bug which meant the mdevs *weren't* properly isolated from each other, then those mdevs would share a group, and you *would* care about it. Depending on how the isolation failed the mdevs might or might not also share a group with the parent physical device.

> The issue is a software one of tightly coupling IOMMU HW groups to VFIO's API and then introducing an entire class of VFIO mdev devices that no longer care about IOMMU HW groups at all.

They don't necessarily care about the IOMMU groups of the parent physical hardware, but they have their own IOMMU groups as virtual hardware devices.

> Currently mdev tries to trick this by creating singleton groups, but it is very ugly and very tightly coupled to a specific expectation of the few existing mdev drivers. Trying to add PASID made it a lot worse.
>
> > Aside: I'm primarily using "group" to mean the underlying hardware unit, not the vfio construct on top of it, I'm not sure that's been clear throughout.
>
> Sure, that is obviously fixed, but I'm not interested in that.
>
> I'm interested in having a VFIO API that makes sense for vfio-pci which has a tight coupling to the HW notion of an IOMMU and also vfio mdevs that have no concept of a HW IOMMU group.
>
> > So.. your model assumes that every device has a safe quiescent state where it won't do any harm until poked, whether its group is currently kernel owned, or owned by a userspace that doesn't know anything about it.
>
> This is today's model, yes. When you run dpdk on a multi-group device vfio already ensures that all the device groups remained parked and inaccessible.

I'm not really following what you're saying there.

If you have a multi-device group, and dpdk is using one device in it, VFIO *does not* (and cannot) ensure that other devices in the group are parked and inaccessible.

It ensures that they're parked at the moment the group moves from kernel to userspace ownership, but it can't prevent dpdk from accessing and unparking those devices via peer to peer DMA.

> > At minimum this does mean that in order to use one device in the group you must have permission to use *all* the devices in the group - otherwise you may be able to operate a device you don't have permission to by DMAing to its registers from a device you do have permission to.
>
> If the administrator configures the system with different security labels for different VFIO devices then yes removing groups makes this more tricky as all devices in the group should have the same label.

That seems a bigger problem than "more tricky". How would you propose addressing this with your device-first model?

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, May 13, 2021 at 10:50:30AM -0300, Jason Gunthorpe wrote:
> On Thu, May 13, 2021 at 04:07:07PM +1000, David Gibson wrote:
> > On Wed, May 05, 2021 at 01:39:02PM -0300, Jason Gunthorpe wrote:
> > > On Wed, May 05, 2021 at 02:28:53PM +1000, Alexey Kardashevskiy wrote:
> > > > This is a good feature in general when let's say there is a linux supported device which has a proprietary device firmware update tool which only exists as an x86 binary and your hardware is not x86 - running qemu + vfio in full emulation would provide a way to run the tool to update a physical device.
> > >
> > > That specific use case doesn't really need a vIOMMU though, does it?
> >
> > Possibly not, but the mechanics needed to do vIOMMU on different host IOMMU aren't really different from what you need for a no-vIOMMU guest.
>
> For very simple vIOMMUs this might be true, but these new features of nesting, PASID, migration, etc, all make the vIOMMU complicated and emulating it completely a lot harder.

Well, sure, emulating a complex vIOMMU is complex. But "very simple vIOMMUs" covers the vast majority of currently deployed hardware, and several are already emulated by qemu.

> Stuffing a vfio-pci into a guest and creating a physical map using a single IOASID is comparably trivial.

Note that for PAPR (POWER guest) systems this is not an option: the PAPR platform *always* has a vIOMMU.

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, May 13, 2021 at 03:48:19PM +1000, David Gibson wrote:
> On Mon, May 03, 2021 at 01:15:18PM -0300, Jason Gunthorpe wrote:
> > On Thu, Apr 29, 2021 at 01:04:05PM +1000, David Gibson wrote:
> > > Again, I don't know enough about VDPA to make sense of that. Are we essentially talking non-PCI virtual devices here? In which case you could define the VDPA "bus" to always have one-device groups.
> >
> > It is much worse than that.
> >
> > What these non-PCI devices need is for the kernel driver to be part of the IOMMU group of the underlying PCI device but tell VFIO land that "groups don't matter"
>
> I don't really see a semantic distinction between "always one-device groups" and "groups don't matter". Really the only way you can afford to not care about groups is if they're singletons.

The kernel driver under the mdev may not be in an "always one-device" group.

It is a kernel driver so the only thing we know and care about is that all devices in the HW group are bound to kernel drivers.

The vfio device that spawns from this kernel driver is really a "groups don't matter" vfio device because at the IOMMU layer it should be riding on the physical group of the kernel driver. At the VFIO layer we no longer care about the group abstraction because the system guarantees isolation in some other way.

The issue is a software one of tightly coupling IOMMU HW groups to VFIO's API and then introducing an entire class of VFIO mdev devices that no longer care about IOMMU HW groups at all.

Currently mdev tries to trick this by creating singleton groups, but it is very ugly and very tightly coupled to a specific expectation of the few existing mdev drivers. Trying to add PASID made it a lot worse.

> Aside: I'm primarily using "group" to mean the underlying hardware unit, not the vfio construct on top of it, I'm not sure that's been clear throughout.

Sure, that is obviously fixed, but I'm not interested in that.

I'm interested in having a VFIO API that makes sense for vfio-pci which has a tight coupling to the HW notion of an IOMMU and also vfio mdevs that have no concept of a HW IOMMU group.

> So.. your model assumes that every device has a safe quiescent state where it won't do any harm until poked, whether its group is currently kernel owned, or owned by a userspace that doesn't know anything about it.

This is today's model, yes. When you run dpdk on a multi-group device vfio already ensures that all the device groups remained parked and inaccessible.

> At minimum this does mean that in order to use one device in the group you must have permission to use *all* the devices in the group - otherwise you may be able to operate a device you don't have permission to by DMAing to its registers from a device you do have permission to.

If the administrator configures the system with different security labels for different VFIO devices then yes removing groups makes this more tricky as all devices in the group should have the same label.

Jason
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
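For reference, this is the group-centric open sequence today's VFIO uAPI forces on userspace (per the kernel's VFIO documentation); it is exactly the coupling being debated above. The group number and device name are examples, and error handling is elided.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Every path to a device fd passes through the group today. */
int open_device_via_group(void)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/26", O_RDWR);	/* group 26: example */
	struct vfio_group_status status = { .argsz = sizeof(status) };

	ioctl(group, VFIO_GROUP_GET_STATUS, &status);
	if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE))
		return -1;	/* some group member isn't bound to vfio or a kernel stub */

	/* The *group*, not the device, is attached to the container... */
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);

	/* ...and only then can a device fd be pulled out of the group. */
	return ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
}
```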
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, May 13, 2021 at 04:07:07PM +1000, David Gibson wrote:
> On Wed, May 05, 2021 at 01:39:02PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 05, 2021 at 02:28:53PM +1000, Alexey Kardashevskiy wrote:
> > > This is a good feature in general when let's say there is a linux supported device which has a proprietary device firmware update tool which only exists as an x86 binary and your hardware is not x86 - running qemu + vfio in full emulation would provide a way to run the tool to update a physical device.
> >
> > That specific use case doesn't really need a vIOMMU though, does it?
>
> Possibly not, but the mechanics needed to do vIOMMU on different host IOMMU aren't really different from what you need for a no-vIOMMU guest.

For very simple vIOMMUs this might be true, but these new features of nesting, PASID, migration, etc, all make the vIOMMU complicated and emulating it completely a lot harder.

Stuffing a vfio-pci into a guest and creating a physical map using a single IOASID is comparably trivial.

Jason
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, May 13, 2021 at 04:01:20PM +1000, David Gibson wrote:
> But.. even if you're exposing page tables to userspace.. with hardware that has explicit support for nesting you can probably expose the hw tables directly which is great for the cases it works for. But surely for older IOMMUs which don't do nesting you must have some way of shadowing guest IO page tables to host IO page tables to translate GPA to HPA at least?

I expect this would be in qemu and would be part of the expensive emulation I suggested. Converting the guest's page table structure into a sequence of map/unmaps to a non-nestable IOASID.

> If you're doing that, I don't see that converting page table format is really any harder

It isn't, but it is a completely different flow and custom from the normal HW accelerated nesting.

> It might not be a theoretically complete emulation of the vIOMMU, but it can support in-practice usage. In particular it works pretty well if your backend has a nice big IOVA range (like x86 IOMMUs) but your guest platform typically uses relatively small IOVA windows. PAPR on x86 is exactly that... well.. possibly not the 64-bit window, but because of old PAPR platforms that didn't support that, we can choose not to advertise that and guests will cope.

So maybe this multi-window thing is generic API somehow. You'll have to check what Kevin comes up with to ensure it fits in.

Jason
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
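A sketch of that "expensive emulation": replaying the guest's IO page table leaves as ordinary maps into a stage-1-only address space. The guest-walk iterator and gpa_to_hva() are hypothetical stand-ins for vIOMMU-specific code; the VFIO_IOMMU_MAP_DMA ioctl and its struct are today's real type1 uAPI.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

struct guest_leaf { uint64_t iova, gpa, len; };

/* Hypothetical vIOMMU-specific helpers, one per guest page table format */
extern int guest_pgtable_next_leaf(void *guest_as, struct guest_leaf *l);
extern void *gpa_to_hva(uint64_t gpa);

/* Walk the guest IO page table and shadow each leaf into the host. */
static int shadow_guest_mappings(int container_fd, void *guest_as)
{
	struct guest_leaf l;

	while (guest_pgtable_next_leaf(guest_as, &l) == 0) {
		struct vfio_iommu_type1_dma_map map = {
			.argsz = sizeof(map),
			.flags = VFIO_DMA_MAP_FLAG_READ |
				 VFIO_DMA_MAP_FLAG_WRITE,
			.vaddr = (uintptr_t)gpa_to_hva(l.gpa),
			.iova  = l.iova,
			.size  = l.len,
		};
		if (ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map))
			return -1;
	}
	return 0;
}
```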
RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
> From: David Gibson
> Sent: Thursday, May 13, 2021 2:01 PM
>
> > But this definitely all becomes HW specific.
> >
> > For instance, I want to have an ARM vIOMMU driver; it needs to do something like:
> >
> > ret = ioctl(ioasid_fd, CREATE_NESTED_IOASID, [page table format is ARMvXXX])
> > if (ret == -EOPNOTSUPP)
> > ret = ioctl(ioasid_fd, CREATE_NORMAL_IOASID, ..)
> > // and do completely different and more expensive emulation
> >
> > I can get a little bit of generality, but at the end of the day the IOMMU must create a specific HW layout of the nested page table, if it can't, it can't.
>
> Erm.. I don't really know how your IOASID interface works here. I'm thinking about the VFIO interface where maps and unmaps are via explicit ioctl()s, which provides an obvious point to do translation between page table formats.

We are working on a draft uAPI proposal based on the discussions in this thread. Expect to send it out next week.

Thanks
Kevin
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, May 05, 2021 at 01:39:02PM -0300, Jason Gunthorpe wrote:
> On Wed, May 05, 2021 at 02:28:53PM +1000, Alexey Kardashevskiy wrote:
> > This is a good feature in general when let's say there is a linux supported device which has a proprietary device firmware update tool which only exists as an x86 binary and your hardware is not x86 - running qemu + vfio in full emulation would provide a way to run the tool to update a physical device.
>
> That specific use case doesn't really need a vIOMMU though, does it?

Possibly not, but the mechanics needed to do vIOMMU on different host IOMMU aren't really different from what you need for a no-vIOMMU guest.

With a vIOMMU you need to map guest IOVA space into the host IOVA space. With no vIOMMU you need to map guest physical addresses into the host IOVA space. In either case the GPA/gIOVA to userspace and userspace to HPA mappings are basically arbitrary.

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Mon, May 03, 2021 at 01:15:18PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 29, 2021 at 01:04:05PM +1000, David Gibson wrote:
> > Again, I don't know enough about VDPA to make sense of that. Are we essentially talking non-PCI virtual devices here? In which case you could define the VDPA "bus" to always have one-device groups.
>
> It is much worse than that.
>
> What these non-PCI devices need is for the kernel driver to be part of the IOMMU group of the underlying PCI device but tell VFIO land that "groups don't matter"

I don't really see a semantic distinction between "always one-device groups" and "groups don't matter". Really the only way you can afford to not care about groups is if they're singletons.

> Today mdev tries to fake this by using singleton iommu groups, but it is really horrible and directly hacks up the VFIO IOMMU code to understand these special cases. Intel was proposing more special hacking in the VFIO IOMMU code to extend this to PASID.

At this stage I don't really understand why that would end up so horrible.

> When we get to a /dev/ioasid this is all nonsense. The kernel device driver is going to have to tell drivers/iommu exactly what kind of ioasid it can accept, be it a PASID inside a kernel owned group, a SW emulated 'mdev' ioasid, or whatever.
>
> In these cases the "group" idea has become a fiction that just creates a pain.

I don't see how the group is a fiction in this instance. You can still have devices that can't be isolated, therefore you can have non-singleton groups.

> "Just reorganize VDPA to do something insane with the driver core so we can create a dummy group to satisfy an unnecessary uAPI restriction" is not a very compelling argument.
>
> So if the nonsensical groups go away for PASID/mdev, where does it leave the uAPI in other cases?
>
> > I don't think simplified-but-wrong is a good goal. The thing about groups is that if they're there, you can't just "not care" about them, they affect you whether you like it or not.
>
> You really can. If one thing claims the group then all the other group devices become locked out.

Aside: I'm primarily using "group" to mean the underlying hardware unit, not the vfio construct on top of it, I'm not sure that's been clear throughout.

So.. your model assumes that every device has a safe quiescent state where it won't do any harm until poked, whether its group is currently kernel owned, or owned by a userspace that doesn't know anything about it.

At minimum this does mean that in order to use one device in the group you must have permission to use *all* the devices in the group - otherwise you may be able to operate a device you don't have permission to by DMAing to its registers from a device you do have permission to.

Whatever scripts are managing ownership of devices also need to know about groups, because they need to put all the devices into that quiescent state before the group can change ownership.

> The main point to understand is that groups are NOT an application restriction! It is a whole system restriction that the operator needs to understand and deal with. This is why things like dpdk don't care about the group at all - there is nothing they can do with the information.
>
> If the operator says to run dpdk on a specific device then the operator is the one that has to deal with all the other devices in the group getting locked out.

Ok, I think I see your point there.

> At best the application can make it more obvious that the operator is doing something dangerous, but the current kernel API doesn't seem to really support that either.
>
> Jason

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Tue, May 04, 2021 at 03:15:37PM -0300, Jason Gunthorpe wrote:
> On Tue, May 04, 2021 at 01:54:55PM +1000, David Gibson wrote:
> > On Mon, May 03, 2021 at 01:05:30PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Apr 29, 2021 at 01:20:22PM +1000, David Gibson wrote:
> > > > > There is a certain appeal to having some 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra information like windows that can be optionally called by the viommu driver and it remains well defined and described.
> > > >
> > > > Windows really aren't ppc specific. They're absolutely there on x86 and everything else as well - it's just that people are used to having a window at 0.. that you can often get away with treating it sloppily.
> > >
> > > My point is this detailed control seems to go on to more than just windows. As you say the vIOMMU is emulating specific HW that needs to have kernel interfaces to match it exactly.
> >
> > It's really not that bad. The case of emulating the PAPR vIOMMU on something else is relatively easy, because all updates to the IO page tables go through hypercalls. So, as long as the backend IOMMU can map all the IOVAs that the guest IOMMU can, then qemu's implementation of those hypercalls just needs to put an equivalent mapping in the backend, which it can do with a generic VFIO_DMA_MAP.
>
> So you also want the PAPR vIOMMU driver to run on, say, an ARM IOMMU?

Well, I don't want to preclude it in the API. I'm not sure about that specific example, but in most cases it should be possible to run the PAPR vIOMMU on an x86 IOMMU backend. Obviously only something you'd want to do for testing and experimentation, but it could be quite useful for that.

> > vIOMMUs with page tables in guest memory are harder, but only really in the usual ways that a vIOMMU of that type is harder (needs cache mode or whatever). At whatever point you need to shadow from the guest IO page tables to the host backend, you can again do that with generic maps, as long as the backend supports the necessary IOVAs, and has an IO page size that's equal to or a submultiple of the vIOMMU page size.
>
> But this definitely all becomes HW specific.
>
> For instance, I want to have an ARM vIOMMU driver; it needs to do something like:
>
> ret = ioctl(ioasid_fd, CREATE_NESTED_IOASID, [page table format is ARMvXXX])
> if (ret == -EOPNOTSUPP)
> ret = ioctl(ioasid_fd, CREATE_NORMAL_IOASID, ..)
> // and do completely different and more expensive emulation
>
> I can get a little bit of generality, but at the end of the day the IOMMU must create a specific HW layout of the nested page table, if it can't, it can't.

Erm.. I don't really know how your IOASID interface works here. I'm thinking about the VFIO interface where maps and unmaps are via explicit ioctl()s, which provides an obvious point to do translation between page table formats.

But.. even if you're exposing page tables to userspace.. with hardware that has explicit support for nesting you can probably expose the hw tables directly which is great for the cases it works for. But surely for older IOMMUs which don't do nesting you must have some way of shadowing guest IO page tables to host IO page tables to translate GPA to HPA at least?

If you're doing that, I don't see that converting page table format is really any harder.

> > > I'm remarking that trying to unify every HW IOMMU implementation that ever has/will exist into a generic API complete enough to allow the vIOMMU to be created is likely to result in an API too complicated to understand..
> >
> > Maybe not every one, but I think we can get a pretty wide range with a reasonable interface.
>
> It sounds like a reasonable guideline is if the feature is actually general to all IOMMUs and can be used by qemu as part of a vIOMMU emulation when compatible vIOMMU HW is not available.
>
> Having 'requested window' support that isn't actually implemented in every IOMMU is going to mean the PAPR vIOMMU emulation won't work, defeating the whole point of making things general?

The trick is that you don't necessarily need dynamic window support in the backend to emulate it in the vIOMMU. If your backend has fixed windows, then you emulate request window as:

if (requested window is within backend windows)
no-op;
else
return ERROR;

It might not be a theoretically complete emulation of the vIOMMU, but it can support in-practice usage. In particular it works pretty well if your backend has a nice big IOVA range (like x86 IOMMUs) but your guest platform typically uses relatively small IOVA windows. PAPR on x86 is exactly that... well.. possibly not the 64-bit window, but because of old PAPR platforms that didn't support that, we can choose not to advertise that and guests will cope.

-- David Gibson
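The containment check David sketches in prose, written out. Types and names are illustrative; the point is just that the emulated "create window" call succeeds only when the requested IOVA range is already inside one of the backend's fixed windows, with nothing created dynamically.

```c
#include <stdint.h>
#include <stddef.h>
#include <errno.h>

/* [start, end) ranges the backend IOMMU supports; fixed at boot */
struct iova_window { uint64_t start, end; };

static int emulate_create_window(const struct iova_window *backend,
				 size_t nwin, uint64_t req_start,
				 uint64_t req_end)
{
	for (size_t i = 0; i < nwin; i++) {
		if (req_start >= backend[i].start &&
		    req_end <= backend[i].end)
			return 0;	/* no-op: already covered */
	}
	return -EINVAL;	/* guest asked for a window we cannot back */
}
```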
RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
> From: Jason Gunthorpe
> Sent: Wednesday, May 12, 2021 8:25 AM
>
> On Wed, May 12, 2021 at 12:21:24AM +0000, Tian, Kevin wrote:
> > > Basically each RID knows based on its kernel drivers if it is a local or global RID and the ioasid knob can further fine tune this for any other specialty cases.
> >
> > It's fine if you insist on this way. Then we leave it to userspace to ensure the same split range is used across devices when a vIOMMU is concerned.
>
> I'm still confused why there is a split range needed.

A device could support both ENQCMD and non-ENQCMD submissions.

For the ENQCMD path, the CPU provides a PASID translation mechanism (from guest PASID to host PASID).

For the non-ENQCMD path, the guest driver directly programs the untranslated guest PASID into the device MMIO register. The host kernel only sets up the host PASID entry, which is the hwid for a given ioasid page table.

If we don't split the range, we have to assume guest PASID == host PASID, otherwise the non-ENQCMD path will fail. But exposing the host PASID to the guest breaks migration. If we want to allow migration, we need to support guest PASID != host PASID and make sure both entries point to the same page table, so that ENQCMD (host PASID) and non-ENQCMD (guest PASID) can both work. That requires a range split to avoid conflicts between host/guest PASIDs in the same space.

> > Please note such range split has to be enforced through the vIOMMU which (e.g. on VT-d) includes a register to report the available PASID space size (applying to all devices behind this vIOMMU) to the guest. The kernel just follows the per-RID split info. If anything is broken, the userspace just shoots its own foot.
>
> Is it because this specific vIOMMU protocol is limiting things?

When the range split is enabled, we need a way to tell the guest about the local range size so that guest PASIDs are allocated only within this range. Then we use the vIOMMU to expose such information.

> > > > > It does need some user visible difference because SIOV/mdev is not migratable. Only the kernel can select a PASID, userspace (and hence the guest) shouldn't have the option to force a specific PASID as the PASID space is shared across the entire RID to all VMs using the mdev.
> > > >
> > > > Not migratable only when you choose to expose host-allocated PASIDs into the guest. However in this entire proposal we actually virtualize PASIDs, letting the guest manage its own PASID space in all scenarios
> > >
> > > PASID cannot be virtualized without also using ENQCMD.
> > >
> > > A mdev that is using PASID without ENQCMD is non-migratable and this needs to be made visible in the uAPI.
> >
> > No. Without ENQCMD the PASID must be programmed to a mdev MMIO register. This operation is mediated, then the mdev driver can translate the PASID from virtual to real.
>
> That is probably unworkable with real devices, but if you do this you need to explicitly expose the vPASID to the mdev API somehow, and still the device needs to declare if it supports this, and devices that don't should still work in a non-migratable mode.

It's not necessary. For real devices we use alias mapping for both guest/host PASID as explained above. Then we can have the guest always register its vPASID with the ioasid (just like map/unmap GPA to HVA), and then let host drivers figure out whether that vPASID can be used as a real hwid.

When it's considered virtual and a real hwid is allocated by the host, the mapping is saved under this ioasid to be queried by device drivers if translation is required.

From this angle, the previous IOASID_SET_HWID possibly should be renamed to IOASID_SET_USER_HWID.

Thanks
Kevin
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
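A sketch of the mediated non-ENQCMD path Kevin describes: the mdev driver intercepts the guest's write of a vPASID to the (emulated) PASID register and substitutes the host-allocated PASID before touching real hardware. The register offset, table size, and state structure are all hypothetical.

```c
#include <linux/types.h>

#define MDEV_PASID_REG	0x100	/* hypothetical MMIO register offset */

struct mdev_state {
	/* vPASID -> host PASID, filled when the mapping is registered */
	u32 vpasid_to_hpasid[1024];
};

/* Emulated control-path write handler for the mdev's MMIO space */
static void mdev_mmio_write(struct mdev_state *s, u32 offset, u32 val)
{
	if (offset == MDEV_PASID_REG)
		val = s->vpasid_to_hpasid[val & 0x3ff];	/* translate vPASID */

	/* ... forward the (possibly rewritten) write to real hardware ... */
}
```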
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, May 12, 2021 at 12:21:24AM +0000, Tian, Kevin wrote:
> > Basically each RID knows based on its kernel drivers if it is a local or global RID and the ioasid knob can further fine tune this for any other specialty cases.
>
> It's fine if you insist on this way. Then we leave it to userspace to ensure the same split range is used across devices when a vIOMMU is concerned.

I'm still confused why there is a split range needed.

> Please note such range split has to be enforced through the vIOMMU which (e.g. on VT-d) includes a register to report the available PASID space size (applying to all devices behind this vIOMMU) to the guest. The kernel just follows the per-RID split info. If anything is broken, the userspace just shoots its own foot.

Is it because this specific vIOMMU protocol is limiting things?

> > > > It does need some user visible difference because SIOV/mdev is not migratable. Only the kernel can select a PASID, userspace (and hence the guest) shouldn't have the option to force a specific PASID as the PASID space is shared across the entire RID to all VMs using the mdev.
> > >
> > > Not migratable only when you choose to expose host-allocated PASIDs into the guest. However in this entire proposal we actually virtualize PASIDs, letting the guest manage its own PASID space in all scenarios
> >
> > PASID cannot be virtualized without also using ENQCMD.
> >
> > A mdev that is using PASID without ENQCMD is non-migratable and this needs to be made visible in the uAPI.
>
> No. Without ENQCMD the PASID must be programmed to a mdev MMIO register. This operation is mediated, then the mdev driver can translate the PASID from virtual to real.

That is probably unworkable with real devices, but if you do this you need to explicitly expose the vPASID to the mdev API somehow, and still the device needs to declare if it supports this, and devices that don't should still work in a non-migratable mode.

Jason
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
> From: Jason Gunthorpe
> Sent: Wednesday, May 12, 2021 7:40 AM
>
> On Tue, May 11, 2021 at 10:51:40PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Tuesday, May 11, 2021 10:39 PM
> > >
> > > On Tue, May 11, 2021 at 09:10:03AM +0000, Tian, Kevin wrote:
> > > > 3) SRIOV, ENQCMD (Intel):
> > > > - "PASID global" with host-allocated PASIDs;
> > > > - PASID table managed by host (in HPA space);
> > > > - all RIDs bound to this ioasid_fd use the global pool;
> > > > - however, exposing global PASID into guest breaks migration;
> > > > - hybrid scheme: split local PASID range and global PASID range;
> > > > - force guest to use only local PASID range (through vIOMMU);
> > > > - for ENQCMD, configure CPU to translate local->global;
> > > > - for non-ENQCMD, setup both local/global pasid entries;
> > > > - uAPI for range split and CPU pasid mapping:
> > > >
> > > > // set to "PASID global"
> > > > ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_GLOBAL);
> > > >
> > > > // split local/global range, applying to all RIDs in this fd
> > > > // Example: local [0, 1024), global [1024, max)
> > > > // local PASID range is managed by guest and migrated as VM state
> > > > // global PASIDs are re-allocated and mapped to local PASIDs post migration
> > > > ioctl(ioasid_fd, IOASID_HWID_SET_GLOBAL_MIN, 1024);
> > >
> > > I'm still not sold that ranges are the best idea here, it just adds more state that has to match during migration. Keeping the global/local split per RID seems much cleaner to me
> >
> > With ENQCMD the PASID is kept in the CPU MSR, making it a process context within the guest. When a guest process is bound to two devices, the same local PASID must be usable on both devices. Having a per-RID split cannot guarantee it.
>
> That is only for ENQCMD. All drivers know if they are ENQCMD compatible drivers and can ensure they use the global allocator consistently for their RIDs.
>
> Basically each RID knows based on its kernel drivers if it is a local or global RID and the ioasid knob can further fine tune this for any other specialty cases.

It's fine if you insist on this way. Then we leave it to userspace to ensure the same split range is used across devices when a vIOMMU is concerned. Please note such range split has to be enforced through the vIOMMU which (e.g. on VT-d) includes a register to report the available PASID space size (applying to all devices behind this vIOMMU) to the guest. The kernel just follows the per-RID split info. If anything is broken, the userspace just shoots its own foot.

> > > It does need some user visible difference because SIOV/mdev is not migratable. Only the kernel can select a PASID, userspace (and hence the guest) shouldn't have the option to force a specific PASID as the PASID space is shared across the entire RID to all VMs using the mdev.
> >
> > Not migratable only when you choose to expose host-allocated PASIDs into the guest. However in this entire proposal we actually virtualize PASIDs, letting the guest manage its own PASID space in all scenarios
>
> PASID cannot be virtualized without also using ENQCMD.
>
> A mdev that is using PASID without ENQCMD is non-migratable and this needs to be made visible in the uAPI.

No. Without ENQCMD the PASID must be programmed to a mdev MMIO register. This operation is mediated, then the mdev driver can translate the PASID from virtual to real.

Thanks
Kevin
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
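A toy model of the range split, using the example numbers from this thread: guest-chosen PASIDs live in [0, 1024) and are migrated as VM state, while the host allocates from [1024, max) for ENQCMD translation. The constants and the linear-scan allocator are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

#define PASID_SPLIT	1024u		/* local: [0, 1024) */
#define PASID_MAX	(1u << 20)	/* global: [1024, PASID_MAX) */

static bool global_used[PASID_MAX - PASID_SPLIT];

/* Guest-managed PASIDs are only valid inside the local window. */
static bool local_pasid_valid(uint32_t pasid)
{
	return pasid < PASID_SPLIT;
}

/* Host-side allocation out of the shared global window. */
static int64_t global_pasid_alloc(void)
{
	for (uint32_t i = 0; i < PASID_MAX - PASID_SPLIT; i++) {
		if (!global_used[i]) {
			global_used[i] = true;
			return (int64_t)(PASID_SPLIT + i);
		}
	}
	return -1;	/* global window exhausted */
}
```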
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Tue, May 11, 2021 at 10:51:40PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Tuesday, May 11, 2021 10:39 PM
> >
> > On Tue, May 11, 2021 at 09:10:03AM +0000, Tian, Kevin wrote:
> > > 3) SRIOV, ENQCMD (Intel):
> > > - "PASID global" with host-allocated PASIDs;
> > > - PASID table managed by host (in HPA space);
> > > - all RIDs bound to this ioasid_fd use the global pool;
> > > - however, exposing global PASID into guest breaks migration;
> > > - hybrid scheme: split local PASID range and global PASID range;
> > > - force guest to use only local PASID range (through vIOMMU);
> > > - for ENQCMD, configure CPU to translate local->global;
> > > - for non-ENQCMD, setup both local/global pasid entries;
> > > - uAPI for range split and CPU pasid mapping:
> > >
> > > // set to "PASID global"
> > > ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_GLOBAL);
> > >
> > > // split local/global range, applying to all RIDs in this fd
> > > // Example: local [0, 1024), global [1024, max)
> > > // local PASID range is managed by guest and migrated as VM state
> > > // global PASIDs are re-allocated and mapped to local PASIDs post migration
> > > ioctl(ioasid_fd, IOASID_HWID_SET_GLOBAL_MIN, 1024);
> >
> > I'm still not sold that ranges are the best idea here, it just adds more state that has to match during migration. Keeping the global/local split per RID seems much cleaner to me
>
> With ENQCMD the PASID is kept in the CPU MSR, making it a process context within the guest. When a guest process is bound to two devices, the same local PASID must be usable on both devices. Having a per-RID split cannot guarantee it.

That is only for ENQCMD. All drivers know if they are ENQCMD compatible drivers and can ensure they use the global allocator consistently for their RIDs.

Basically each RID knows based on its kernel drivers if it is a local or global RID and the ioasid knob can further fine tune this for any other specialty cases.

> > It does need some user visible difference because SIOV/mdev is not migratable. Only the kernel can select a PASID, userspace (and hence the guest) shouldn't have the option to force a specific PASID as the PASID space is shared across the entire RID to all VMs using the mdev.
>
> Not migratable only when you choose to expose host-allocated PASIDs into the guest. However in this entire proposal we actually virtualize PASIDs, letting the guest manage its own PASID space in all scenarios

PASID cannot be virtualized without also using ENQCMD.

A mdev that is using PASID without ENQCMD is non-migratable and this needs to be made visible in the uAPI.

Jason
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
> From: Liu Yi L
> Sent: Tuesday, May 11, 2021 9:25 PM
>
> On Tue, 11 May 2021 09:10:03 +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Monday, May 10, 2021 8:37 PM
> > >
> > [...]
> > > > gPASID!=hPASID has a problem when assigning a physical device which supports both shared work queue (ENQCMD with PASID in MSR) and dedicated work queue (PASID in device register) to a guest process which is associated to a gPASID. Say the host kernel has setup the hPASID entry with nested translation though /dev/ioasid. For shared work queue the CPU is configured to translate gPASID in MSR into **hPASID** before the payload goes out to the wire. However for dedicated work queue the device MMIO register is directly mapped to and programmed by the guest, thus containing a **gPASID** value implying DMA requests through this interface will hit IOMMU faults due to invalid gPASID entry. Having gPASID==hPASID is a simple workaround here. mdev doesn't have this problem because the PASID register is in emulated control-path thus can be translated to hPASID manually by mdev driver.
> > >
> > > This all must be explicit too.
> > >
> > > If a PASID is allocated and it is going to be used with ENQCMD then everything needs to know it is actually quite different than a PASID that was allocated to be used with a normal SRIOV device, for instance.
> > >
> > > The former case can accept that the guest PASID is virtualized, while the latter can not.
> > >
> > > This is also why PASID per RID has to be an option. When I assign a full SRIOV function to the guest then that entire RID space needs to also be assigned to the guest. Upon migration I need to take all the physical PASIDs and rebuild them in another hypervisor exactly as is.
> > >
> > > If you force all RIDs into a global PASID pool then normal SRIOV migration w/PASID becomes impossible. ie ENQCMD breaks everything else that should work.
> > >
> > > This is why you need to sort all this out and why it feels like some of the specs here have been mis-designed.
> > >
> > > I'm not sure carving out ranges is really workable for migration.
> > >
> > > I think the real answer is to carve out entire RIDs as being in the global pool or not. Then the ENQCMD HW can be bundled together and everything else can live in the natural PASID per RID world.
> >
> > OK. Here is the revised scheme by making it explicit.
> >
> > There are three scenarios to be considered:
> >
> > 1) SR-IOV (AMD/ARM):
> > - "PASID per RID" with guest-allocated PASIDs;
> > - PASID table managed by guest (in GPA space);
> > - the entire PASID space delegated to guest;
> > - no need to explicitly register guest-allocated PASIDs to host;
> > - uAPI for attaching PASID table:
> >
> > // set to "PASID per RID"
> > ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);
> >
> > // When Qemu captures a new PASID table through vIOMMU;
> > pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > ioctl(device_fd, VFIO_ATTACH_IOASID, pasidtbl_ioasid);
> >
> > // Set the PASID table to the RID associated with pasidtbl_ioasid;
> > ioctl(ioasid_fd, IOASID_SET_PASID_TABLE, pasidtbl_ioasid, gpa_addr);
> >
> > 2) SR-IOV, no ENQCMD (Intel):
> > - "PASID per RID" with guest-allocated PASIDs;
> > - PASID table managed by host (in HPA space);
> > - the entire PASID space delegated to guest too;
> > - host must be explicitly notified for guest-allocated PASIDs;
> > - uAPI for binding user-allocated PASIDs:
> >
> > // set to "PASID per RID"
> > ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);
> >
> > // When Qemu captures a new PASID allocated through vIOMMU;
>
> Is this achieved by VCMD or by capturing guest's PASID cache invalidation?

The latter one.

> > pgtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > ioctl(device_fd, VFIO_ATTACH_IOASID, pgtbl_ioasid);
> >
> > // Tell the kernel to associate pasid to pgtbl_ioasid in internal structure;
> > // being a pointer due to a requirement in scenario-3
> > ioctl(ioasid_fd, IOASID_SET_HWID, pgtbl_ioasid, );
> >
> > // Set guest page table to the RID+pasid associated to pgtbl_ioasid
> > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, pgtbl_ioasid, gpa_addr);
> >
> > 3) SRIOV, ENQCMD (Intel):
> > - "PASID global" with host-allocated PASIDs;
> > - PASID table managed by host (in HPA space);
> > - all RIDs bound to this ioasid_fd use the global pool;
> > - however, exposing global PASID into guest breaks migration;
> > - hybrid scheme: split local PASID range and global PASID range;
> > - force guest to use only local PASID range (through vIOMMU);
> > - for ENQCMD, configure CPU to translate local->global;
> > - for non-ENQCMD, setup both
RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
> From: Jason Gunthorpe
> Sent: Tuesday, May 11, 2021 10:39 PM
>
> On Tue, May 11, 2021 at 09:10:03AM +0000, Tian, Kevin wrote:
> > 3) SRIOV, ENQCMD (Intel):
> > - "PASID global" with host-allocated PASIDs;
> > - PASID table managed by host (in HPA space);
> > - all RIDs bound to this ioasid_fd use the global pool;
> > - however, exposing global PASID into guest breaks migration;
> > - hybrid scheme: split local PASID range and global PASID range;
> > - force guest to use only local PASID range (through vIOMMU);
> > - for ENQCMD, configure CPU to translate local->global;
> > - for non-ENQCMD, setup both local/global pasid entries;
> > - uAPI for range split and CPU pasid mapping:
> >
> > // set to "PASID global"
> > ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_GLOBAL);
> >
> > // split local/global range, applying to all RIDs in this fd
> > // Example: local [0, 1024), global [1024, max)
> > // local PASID range is managed by guest and migrated as VM state
> > // global PASIDs are re-allocated and mapped to local PASIDs post migration
> > ioctl(ioasid_fd, IOASID_HWID_SET_GLOBAL_MIN, 1024);
>
> I'm still not sold that ranges are the best idea here, it just adds more state that has to match during migration. Keeping the global/local split per RID seems much cleaner to me

With ENQCMD the PASID is kept in the CPU MSR, making it a process context within the guest. When a guest process is bound to two devices, the same local PASID must be usable on both devices. Having a per-RID split cannot guarantee it.

> This is also why I don't really like having the global/local be global to the ioasid either. It would be better to specify global/local as part of each VFIO_ATTACH_IOASID so each device is moved to the correct allocator.

This was my original thought. But for the above reason this has to be a global enforcement in this ioasid fd.

> > When considering SIOV/mdev there is no change to the above uAPI sequence. It's n/a for 1) as SIOV requires the PASID table in HPA space, nor does it cause any change to 3) regarding the split range scheme. The only conceptual change is in 2), where although it's still "PASID per RID" the PASIDs must be managed by the host because the parent driver also allocates PASIDs from the per-RID space to mark mdevs (RID+PASID). But this difference doesn't change the uAPI flow - just treat the user-provisioned PASID as 'virtual' and then allocate a 'real' PASID at IOASID_SET_HWID. Later always use the real one when programming the PASID entry (IOASID_BIND_PGTABLE) or the device PASID register (converted in the mediation path).
>
> It does need some user visible difference because SIOV/mdev is not migratable. Only the kernel can select a PASID, userspace (and hence the guest) shouldn't have the option to force a specific PASID as the PASID space is shared across the entire RID to all VMs using the mdev.

Not migratable only when you choose to expose host-allocated PASIDs into the guest. However in this entire proposal we actually virtualize PASIDs, letting the guest manage its own PASID space in all scenarios (being SR-IOV or SIOV), though the size of the PASID space might be different. The PASID chosen by the guest may be used as the hw PASID when the PASID space is delegated to the guest (e.g. SR-IOV in scenario 1), or is mapped to a different PASID allocated by the host (e.g. in this mdev case or ENQCMD in scenario-3).

From the uAPI p.o.v., userspace just needs to attach its own PASID to the ioasid while the kernel decides the real hwid underneath (being the same or a different one). Migration only needs to cover guest-allocated PASIDs, with all host-side PASIDs hidden from userspace and reconstructed on the new machine post migration (following common virtualization practice). The only exception, where we return a host-allocated PASID to userspace in scenario-3, is because Qemu needs such information to update the CPU PASID translation table through KVM. Earlier you suggested that this must be explicitly done through userspace instead of an implicit notification between ioasid and kvm in the kernel.

> I don't see any alternative to telling every part if the PASID is going to be used by ENQCMD or not, too many important decisions rest on this detail.
>
> Jason

Thanks
Kevin
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
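A sketch of Kevin's "userspace attaches its vPASID, kernel decides the real hwid" rule, using the IOASID_SET_USER_HWID naming floated above. Everything here, including the context structure and allocator, is hypothetical plumbing for a uAPI that was never merged, not existing kernel code.

```c
#include <linux/gfp.h>
#include <linux/types.h>
#include <linux/xarray.h>

enum { IOASID_HWID_LOCAL, IOASID_HWID_GLOBAL };	/* hypothetical modes */

struct ioasid_ctx {
	int hwid_mode;
	u32 hwid;
	struct xarray xlate;	/* vPASID -> host PASID, for mediated paths */
};

extern u32 host_pasid_alloc(void);	/* hypothetical global allocator */

/* Userspace hands in a vPASID; the kernel picks the hw PASID. */
static int ioasid_set_user_hwid(struct ioasid_ctx *ctx, u32 vpasid)
{
	if (ctx->hwid_mode == IOASID_HWID_LOCAL) {
		/* PASID space delegated to the guest: vPASID is the hwid */
		ctx->hwid = vpasid;
	} else {
		/* host picks the hwid; record the pair for translation */
		ctx->hwid = host_pasid_alloc();
		xa_store(&ctx->xlate, vpasid, xa_mk_value(ctx->hwid),
			 GFP_KERNEL);
	}
	return 0;
}
```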
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Tue, May 11, 2021 at 09:10:03AM +0000, Tian, Kevin wrote:
> 3) SRIOV, ENQCMD (Intel):
> - "PASID global" with host-allocated PASIDs;
> - PASID table managed by host (in HPA space);
> - all RIDs bound to this ioasid_fd use the global pool;
> - however, exposing global PASID into guest breaks migration;
> - hybrid scheme: split local PASID range and global PASID range;
> - force guest to use only local PASID range (through vIOMMU);
> - for ENQCMD, configure CPU to translate local->global;
> - for non-ENQCMD, setup both local/global pasid entries;
> - uAPI for range split and CPU pasid mapping:
>
> // set to "PASID global"
> ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_GLOBAL);
>
> // split local/global range, applying to all RIDs in this fd
> // Example: local [0, 1024), global [1024, max)
> // local PASID range is managed by guest and migrated as VM state
> // global PASIDs are re-allocated and mapped to local PASIDs post migration
> ioctl(ioasid_fd, IOASID_HWID_SET_GLOBAL_MIN, 1024);

I'm still not sold that ranges are the best idea here, it just adds more state that has to match during migration. Keeping the global/local split per RID seems much cleaner to me.

This is also why I don't really like having the global/local be global to the ioasid either. It would be better to specify global/local as part of each VFIO_ATTACH_IOASID so each device is moved to the correct allocator.

> When considering SIOV/mdev there is no change to above uAPI sequence. It's n/a for 1) as SIOV requires PASID table in HPA space, nor does it cause any change to 3) regarding to the split range scheme. The only conceptual change is in 2), where although it's still "PASID per RID" the PASIDs must be managed by host because the parent driver also allocates PASIDs from per-RID space to mark mdev (RID+PASID). But this difference doesn't change the uAPI flow - just treat user-provisioned PASID as 'virtual' and then allocate a 'real' PASID at IOASID_SET_HWID. Later always use the real one when programming PASID entry (IOASID_BIND_PGTABLE) or device PASID register (converted in the mediation path).

It does need some user visible difference because SIOV/mdev is not migratable. Only the kernel can select a PASID, userspace (and hence the guest) shouldn't have the option to force a specific PASID as the PASID space is shared across the entire RID to all VMs using the mdev.

I don't see any alternative to telling every part if the PASID is going to be used by ENQCMD or not, too many important decisions rest on this detail.

Jason
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
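Jason's per-attachment alternative, sketched from the userspace side: the global/local choice travels with each VFIO_ATTACH_IOASID call instead of being an fd-wide knob, so each RID lands in the allocator its kernel driver needs. Only the VFIO_ATTACH_IOASID name comes from this thread; the struct, flags, and ioctl encoding are invented for illustration.

```c
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical uAPI: per-attachment allocator selection */
struct vfio_attach_ioasid {
	uint32_t argsz;
	uint32_t flags;
#define VFIO_ATTACH_PASID_LOCAL		(1u << 0)	/* hypothetical */
#define VFIO_ATTACH_PASID_GLOBAL	(1u << 1)	/* hypothetical */
	int32_t ioasid;
};

#define VFIO_ATTACH_IOASID _IOW(';', 100, struct vfio_attach_ioasid) /* hypothetical */

static int attach_with_allocator(int device_fd, int ioasid, int use_global)
{
	struct vfio_attach_ioasid arg = {
		.argsz  = sizeof(arg),
		.flags  = use_global ? VFIO_ATTACH_PASID_GLOBAL
				     : VFIO_ATTACH_PASID_LOCAL,
		.ioasid = ioasid,
	};

	return ioctl(device_fd, VFIO_ATTACH_IOASID, &arg);
}
```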
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Tue, 11 May 2021 09:10:03 +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Monday, May 10, 2021 8:37 PM
> >
> [...]
> > > gPASID!=hPASID has a problem when assigning a physical device which supports both shared work queue (ENQCMD with PASID in MSR) and dedicated work queue (PASID in device register) to a guest process which is associated to a gPASID. Say the host kernel has setup the hPASID entry with nested translation though /dev/ioasid. For shared work queue the CPU is configured to translate gPASID in MSR into **hPASID** before the payload goes out to the wire. However for dedicated work queue the device MMIO register is directly mapped to and programmed by the guest, thus containing a **gPASID** value implying DMA requests through this interface will hit IOMMU faults due to invalid gPASID entry. Having gPASID==hPASID is a simple workaround here. mdev doesn't have this problem because the PASID register is in emulated control-path thus can be translated to hPASID manually by mdev driver.
> >
> > This all must be explicit too.
> >
> > If a PASID is allocated and it is going to be used with ENQCMD then everything needs to know it is actually quite different than a PASID that was allocated to be used with a normal SRIOV device, for instance.
> >
> > The former case can accept that the guest PASID is virtualized, while the latter can not.
> >
> > This is also why PASID per RID has to be an option. When I assign a full SRIOV function to the guest then that entire RID space needs to also be assigned to the guest. Upon migration I need to take all the physical PASIDs and rebuild them in another hypervisor exactly as is.
> >
> > If you force all RIDs into a global PASID pool then normal SRIOV migration w/PASID becomes impossible. ie ENQCMD breaks everything else that should work.
> >
> > This is why you need to sort all this out and why it feels like some of the specs here have been mis-designed.
> >
> > I'm not sure carving out ranges is really workable for migration.
> >
> > I think the real answer is to carve out entire RIDs as being in the global pool or not. Then the ENQCMD HW can be bundled together and everything else can live in the natural PASID per RID world.
>
> OK. Here is the revised scheme by making it explicit.
>
> There are three scenarios to be considered:
>
> 1) SR-IOV (AMD/ARM):
> - "PASID per RID" with guest-allocated PASIDs;
> - PASID table managed by guest (in GPA space);
> - the entire PASID space delegated to guest;
> - no need to explicitly register guest-allocated PASIDs to host;
> - uAPI for attaching PASID table:
>
> // set to "PASID per RID"
> ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);
>
> // When Qemu captures a new PASID table through vIOMMU;
> pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> ioctl(device_fd, VFIO_ATTACH_IOASID, pasidtbl_ioasid);
>
> // Set the PASID table to the RID associated with pasidtbl_ioasid;
> ioctl(ioasid_fd, IOASID_SET_PASID_TABLE, pasidtbl_ioasid, gpa_addr);
>
> 2) SR-IOV, no ENQCMD (Intel):
> - "PASID per RID" with guest-allocated PASIDs;
> - PASID table managed by host (in HPA space);
> - the entire PASID space delegated to guest too;
> - host must be explicitly notified for guest-allocated PASIDs;
> - uAPI for binding user-allocated PASIDs:
>
> // set to "PASID per RID"
> ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);
>
> // When Qemu captures a new PASID allocated through vIOMMU;

Is this achieved by VCMD or by capturing guest's PASID cache invalidation?

> pgtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> ioctl(device_fd, VFIO_ATTACH_IOASID, pgtbl_ioasid);
>
> // Tell the kernel to associate pasid to pgtbl_ioasid in internal structure;
> // being a pointer due to a requirement in scenario-3
> ioctl(ioasid_fd, IOASID_SET_HWID, pgtbl_ioasid, );
>
> // Set guest page table to the RID+pasid associated to pgtbl_ioasid
> ioctl(ioasid_fd, IOASID_BIND_PGTABLE, pgtbl_ioasid, gpa_addr);
>
> 3) SRIOV, ENQCMD (Intel):
> - "PASID global" with host-allocated PASIDs;
> - PASID table managed by host (in HPA space);
> - all RIDs bound to this ioasid_fd use the global pool;
> - however, exposing global PASID into guest breaks migration;
> - hybrid scheme: split local PASID range and global PASID range;
> - force guest to use only local PASID range (through vIOMMU);
> - for ENQCMD, configure CPU to translate local->global;
> - for non-ENQCMD, setup both local/global pasid entries;
> - uAPI for range split and CPU pasid mapping:
>
> // set to "PASID global"
> ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_GLOBAL);
>
> // split local/global range, applying to
RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
> From: Jason Gunthorpe
> Sent: Monday, May 10, 2021 8:37 PM

> > [...] gPASID!=hPASID has a problem when assigning a physical device which supports both shared work queue (ENQCMD with PASID in MSR) and dedicated work queue (PASID in device register) to a guest process which is associated to a gPASID. Say the host kernel has set up the hPASID entry with nested translation through /dev/ioasid. For shared work queue the CPU is configured to translate gPASID in MSR into **hPASID** before the payload goes out to the wire. However for dedicated work queue the device MMIO register is directly mapped to and programmed by the guest, thus containing a **gPASID** value, implying DMA requests through this interface will hit IOMMU faults due to an invalid gPASID entry. Having gPASID==hPASID is a simple workaround here. mdev doesn't have this problem because the PASID register is in the emulated control path and thus can be translated to hPASID manually by the mdev driver.

> This all must be explicit too.

> If a PASID is allocated and it is going to be used with ENQCMD then everything needs to know it is actually quite different than a PASID that was allocated to be used with a normal SRIOV device, for instance.

> The former case can accept that the guest PASID is virtualized, while the latter cannot.

> This is also why PASID per RID has to be an option. When I assign a full SRIOV function to the guest then that entire RID space needs to also be assigned to the guest. Upon migration I need to take all the physical PASIDs and rebuild them in another hypervisor exactly as is.

> If you force all RIDs into a global PASID pool then normal SRIOV migration w/PASID becomes impossible, i.e. ENQCMD breaks everything else that should work.

> This is why you need to sort all this out and why it feels like some of the specs here have been mis-designed.

> I'm not sure carving out ranges is really workable for migration.

> I think the real answer is to carve out entire RIDs as being in the global pool or not. Then the ENQCMD HW can be bundled together and everything else can live in the natural PASID per RID world.

OK. Here is the revised scheme, making it explicit.
There are three scenarios to be considered:

1) SR-IOV (AMD/ARM):
   - "PASID per RID" with guest-allocated PASIDs;
   - PASID table managed by guest (in GPA space);
   - the entire PASID space delegated to guest;
   - no need to explicitly register guest-allocated PASIDs to host;
   - uAPI for attaching PASID table:

   // set to "PASID per RID"
   ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);

   // When Qemu captures a new PASID table through vIOMMU;
   pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
   ioctl(device_fd, VFIO_ATTACH_IOASID, pasidtbl_ioasid);

   // Set the PASID table to the RID associated with pasidtbl_ioasid;
   ioctl(ioasid_fd, IOASID_SET_PASID_TABLE, pasidtbl_ioasid, gpa_addr);

2) SR-IOV, no ENQCMD (Intel):
   - "PASID per RID" with guest-allocated PASIDs;
   - PASID table managed by host (in HPA space);
   - the entire PASID space delegated to guest too;
   - host must be explicitly notified for guest-allocated PASIDs;
   - uAPI for binding user-allocated PASIDs:

   // set to "PASID per RID"
   ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL);

   // When Qemu captures a new PASID allocated through vIOMMU;
   pgtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
   ioctl(device_fd, VFIO_ATTACH_IOASID, pgtbl_ioasid);

   // Tell the kernel to associate pasid to pgtbl_ioasid in internal structure;
   // being a pointer due to a requirement in scenario-3
   ioctl(ioasid_fd, IOASID_SET_HWID, pgtbl_ioasid, &pasid);

   // Set guest page table to the RID+pasid associated to pgtbl_ioasid
   ioctl(ioasid_fd, IOASID_BIND_PGTABLE, pgtbl_ioasid, gpa_addr);

3) SR-IOV, ENQCMD (Intel):
   - "PASID global" with host-allocated PASIDs;
   - PASID table managed by host (in HPA space);
   - all RIDs bound to this ioasid_fd use the global pool;
   - however, exposing global PASIDs into the guest breaks migration;
   - hybrid scheme: split local PASID range and global PASID range;
   - force guest to use only local PASID range (through vIOMMU);
   - for ENQCMD, configure CPU to translate local->global;
   - for non-ENQCMD, set up both local/global pasid entries;
   - uAPI for range split and CPU pasid mapping:

   // set to "PASID global"
   ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_GLOBAL);

   // split local/global range, applying to all RIDs in this fd
   // Example: local [0, 1024), global [1024, max)
   // local PASID range is managed by guest and migrated as VM state
   // global PASIDs are re-allocated and mapped to local PASIDs post migration
   ioctl(ioasid_fd, IOASID_HWID_SET_GLOBAL_MIN, 1024);

   // When Qemu captures a new
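To make scenario 2 concrete, this is roughly how the Qemu-side sequence could look as user code. It is a sketch only: none of these ioctls exist, the request numbers below are invented placeholders, and the guest-allocated pasid is assumed to arrive from the vIOMMU trap:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Placeholder request numbers; the real uAPI does not exist yet. */
#define IOASID_SET_HWID_MODE  _IO('i', 0)
#define IOASID_ALLOC          _IO('i', 1)
#define IOASID_SET_HWID       _IO('i', 2)
#define IOASID_BIND_PGTABLE   _IO('i', 3)
#define VFIO_ATTACH_IOASID    _IO('v', 0)
#define IOASID_HWID_LOCAL     0

static int bind_guest_pasid(int ioasid_fd, int device_fd,
                            uint32_t pasid, uint64_t gpa_addr)
{
    int pgtbl_ioasid;

    /* Scenario 2: "PASID per RID" with guest-allocated PASIDs. */
    if (ioctl(ioasid_fd, IOASID_SET_HWID_MODE, IOASID_HWID_LOCAL) < 0)
        return -1;

    /* One ioasid per guest page table captured via the vIOMMU. */
    pgtbl_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
    if (pgtbl_ioasid < 0)
        return -1;
    if (ioctl(device_fd, VFIO_ATTACH_IOASID, pgtbl_ioasid) < 0)
        return -1;

    /* Register the guest-allocated PASID for this ioasid. */
    if (ioctl(ioasid_fd, IOASID_SET_HWID, pgtbl_ioasid, &pasid) < 0)
        return -1;

    /* Point the RID+PASID at the guest page table. */
    return ioctl(ioasid_fd, IOASID_BIND_PGTABLE, pgtbl_ioasid, gpa_addr);
}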
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
Hi Jason,

On Mon, 10 May 2021 20:45:00 -0300, Jason Gunthorpe wrote:

> On Mon, May 10, 2021 at 03:28:54PM -0700, Jacob Pan wrote:

> > To satisfy your "give me a PASID for this RID" proposal, can we just use the RID's struct device as the token? Also add a type field to explicitly indicate global vs per-set(per-RID). i.e.

> You've got it backwards, the main behavior should be to allocate PASID per RID.

Sure, we can make the local PASID the default. My point was that the ioasid_set infrastructure's opaque token can support a RID-local allocation scheme. Anyway, this is a small detail compared to the uAPI.

> The special behavior is to bundle a bunch of PASIDs into a grouping and then say the PASID number space is shared between all the group members.

> /dev/ioasid should create and own this grouping either implicitly or explicitly. Jumping ahead to in-kernel APIs has missed the critical step of defining the uAPI and all the behaviors together in a completed RFC proposal.

Agreed, the requirements for the kernel API should come from the uAPI.

> Jason

Thanks,
Jacob
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Mon, May 10, 2021 at 03:28:54PM -0700, Jacob Pan wrote:

> To satisfy your "give me a PASID for this RID" proposal, can we just use the RID's struct device as the token? Also add a type field to explicitly indicate global vs per-set(per-RID). i.e.

You've got it backwards, the main behavior should be to allocate PASID per RID.

The special behavior is to bundle a bunch of PASIDs into a grouping and then say the PASID number space is shared between all the group members.

/dev/ioasid should create and own this grouping either implicitly or explicitly. Jumping ahead to in-kernel APIs has missed the critical step of defining the uAPI and all the behaviors together in a completed RFC proposal.

Jason
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
Hi Jason,

On Mon, 10 May 2021 13:39:56 -0300, Jason Gunthorpe wrote:

> I still think it is smarter to push a group of RID's into a global allocation group and accept there are potential downsides with that than to try to force a global allocation group on every RID and then try to fix the mess that makes for non-ENQCMD devices.

The proposed ioasid_set change in this set has a token for each set of IOASIDs.

/**
 * struct ioasid_set - Meta data about ioasid_set
 * @nh: List of notifiers private to that set
 * @xa: XArray to store ioasid_set private IDs, can be used for
 *      guest-host IOASID mapping, or just a private IOASID namespace.
 * @token: Unique to identify an IOASID set
 * @type: Token types
 * @quota: Max number of IOASIDs can be allocated within the set
 * @nr_ioasids: Number of IOASIDs currently allocated in the set
 * @id: ID of the set
 */
struct ioasid_set {
    struct atomic_notifier_head nh;
    struct xarray xa;
    void *token;
    int type;
    int quota;
    atomic_t nr_ioasids;
    int id;
    struct rcu_head rcu;
    struct misc_cg *misc_cg; /* For misc cgroup accounting */
};

To satisfy your "give me a PASID for this RID" proposal, can we just use the RID's struct device as the token? Also add a type field to explicitly indicate global vs per-set(per-RID). i.e.

ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
                      int type, void *private)

Where type can be:

enum ioasid_hwid_type {
    IOASID_HWID_GLOBAL,
    IOASID_HWID_PER_SET,
};

We are really talking about the HW IOASID, just a reminder.

Thanks,
Jacob
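To make the proposed global vs per-set distinction concrete, here is a tiny self-contained user-space model (all names invented for illustration; this is not the kernel allocator). Global IDs are unique across all sets, which is what ENQCMD-style usage needs; per-set IDs may repeat across sets, which is fine when each RID has its own PASID table:

#include <stdio.h>

enum hwid_type { HWID_GLOBAL, HWID_PER_SET };

struct set { unsigned int next; };   /* models one ioasid_set (e.g. one RID) */
static unsigned int global_next;     /* models the system-wide namespace */

/* Toy allocator: HWID_GLOBAL draws from one shared counter, HWID_PER_SET
 * draws from the set's private counter. */
static unsigned int hwid_alloc(struct set *s, enum hwid_type type)
{
    return type == HWID_GLOBAL ? global_next++ : s->next++;
}

int main(void)
{
    struct set rid0 = {0}, rid1 = {0};

    printf("per-set: rid0=%u rid1=%u (values may collide across RIDs)\n",
           hwid_alloc(&rid0, HWID_PER_SET), hwid_alloc(&rid1, HWID_PER_SET));
    printf("global:  rid0=%u rid1=%u (never collide)\n",
           hwid_alloc(&rid0, HWID_GLOBAL), hwid_alloc(&rid1, HWID_GLOBAL));
    return 0;
}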
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Mon, May 10, 2021 at 09:22:12AM -0700, Raj, Ashok wrote:

> Those vIOMMUs will have a capability indicating that they support a PASID allocation interface. When you allocate you can say what type of PASID you need (shared vs local) and Qemu will obtain one either within the local range or in the shared range.

Isn't this what I've been saying? This all has to be explicit, and it is all local to the iommu driver. At worst we'd have some single global API 'get me a global PASID' which co-ordinates with all the iommu drivers to actually implement it.

> > > When we have both SRIOV and shared WQ exposed to the same guest, we do have an issue. The simplest way that I thought of was to have a guest and host PASID separation, where the guest has its own PASID space and the host has its own carved out. The guest can do whatever it wants within that allocated space without fear of any collision with any other device.

> > And how do you reliably migrate if the target kernel has a PASID already allocated in that range?

> For any shared range, remember there is a mapping table. And since those ranges are always reserved in the new host it isn't a problem.

It is a smaller problem - all the ranges still need to match between hosts and so forth. It is tractable but this all needs to be API'd properly and nothing can be implicit, including the global/local range split.

Basically all this needs to come through in your /dev/ioasid API RFC proposal that I hope is being worked on.

I still think it is smarter to push a group of RID's into a global allocation group and accept there are potential downsides with that than to try to force a global allocation group on every RID and then try to fix the mess that makes for non-ENQCMD devices.

Jason
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Mon, May 10, 2021 at 12:31:11PM -0300, Jason Gunthorpe wrote:

> On Mon, May 10, 2021 at 08:25:02AM -0700, Raj, Ashok wrote:

> > Global PASID doesn't break anything, but giving that control to the vIOMMU doesn't seem right. When we have mixed use cases like hardware that supports shared wq and SRIOV devices that need PASIDs, we need to comprehend how they will work without having a backend to migrate PASIDs to a new destination.

> Why wouldn't there be a backend? SRIOV live migration is a real thing now (see Max's VFIO patches). The PASID space of the entire dedicated RID needs to be migratable, which means the destination vIOMMU must be able to program its local hardware with the same PASID numbers, and any kind of global PASID scheme at all will interfere with it.

The way I'm imagining it works is as follows. We have 2 types of platforms. Let me know if I missed something.

- No shared wq, meaning RID-local PASID allocation is all that's needed. Say platform type p1.
- Shared WQ configurations that require global PASIDs. Say platform type p2.

The vIOMMU might have a capability to indicate whether the vIOMMU supports PASID allocation. If there is none, the guest is free to obtain its own PASID per RID since they are fully isolated. For p1 type platforms this should work. No Qemu interface or /dev/iommu is even required in that instance. The guest kernel can manage it fully per-RID.

Platform type p2 has both SIOV (with ENQCMD) and SRIOV that requires PASIDs. My thought was to reserve, say, some number of PASIDs for per-RID use. When you request a RID-local PASID you get one in that range. When you ask for a global one, you get one in the upper PASID range. Say 0-4k is reserved for any RID-local allocation; this is also the guest PASID range. 4k->2^19 are for shared WQ. (I'm not implying the size, just that for discussion's sake we need a separation.)

Those vIOMMUs will have a capability indicating that they support a PASID allocation interface. When you allocate you can say what type of PASID you need (shared vs local) and Qemu will obtain one either within the local range or in the shared range. When they are allocated in the shared range, you still end up using one in the guest PASID range, but mapped to a different host PASID using the VMCS-like PASID redirection, per-guest (not per-RID). Only the shared allocation requires the /dev/iommu interface. All allocation in the guest range is fully under Qemu's control.

For supporting migration, the target also needs to have this capability.

> > When we have both SRIOV and shared WQ exposed to the same guest, we do have an issue. The simplest way that I thought of was to have a guest and host PASID separation, where the guest has its own PASID space and the host has its own carved out. The guest can do whatever it wants within that allocated space without fear of any collision with any other device.

> And how do you reliably migrate if the target kernel has a PASID already allocated in that range?

For any shared range, remember there is a mapping table. And since those ranges are always reserved in the new host it isn't a problem. For a shared WQ that requires a PASID remapping to a new host PASID, the VMCS remapping per guest that does gPASID->hPASID does this job. So the guest PASID remains unchanged.

Does this make sense?

Cheers,
Ashok
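A toy sketch of the split described above (the 0-4k and 2^19 numbers are the illustrative ones from this mail; a real implementation would keep the local counter per RID rather than in a single global):

#include <stdbool.h>
#include <stdint.h>

#define PASID_LOCAL_END 4096u        /* [0, 4k): guest/per-RID range */
#define PASID_MAX       (1u << 19)   /* [4k, 2^19): shared WQ range */

static uint32_t next_local;                    /* would be per-RID */
static uint32_t next_shared = PASID_LOCAL_END; /* system-wide */

/* shared=true: ENQCMD-style, must be unique system-wide.
 * shared=false: dedicated WQ, only unique within the RID. */
static int64_t pasid_alloc(bool shared)
{
    if (shared)
        return next_shared < PASID_MAX ? (int64_t)next_shared++ : -1;
    return next_local < PASID_LOCAL_END ? (int64_t)next_local++ : -1;
}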
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Mon, May 10, 2021 at 08:25:02AM -0700, Raj, Ashok wrote:

> Global PASID doesn't break anything, but giving that control to the vIOMMU doesn't seem right. When we have mixed use cases like hardware that supports shared wq and SRIOV devices that need PASIDs, we need to comprehend how they will work without having a backend to migrate PASIDs to a new destination.

Why wouldn't there be a backend? SRIOV live migration is a real thing now (see Max's VFIO patches). The PASID space of the entire dedicated RID needs to be migratable, which means the destination vIOMMU must be able to program its local hardware with the same PASID numbers, and any kind of global PASID scheme at all will interfere with it.

> When we have both SRIOV and shared WQ exposed to the same guest, we do have an issue. The simplest way that I thought of was to have a guest and host PASID separation, where the guest has its own PASID space and the host has its own carved out. The guest can do whatever it wants within that allocated space without fear of any collision with any other device.

And how do you reliably migrate if the target kernel has a PASID already allocated in that range?

ENQCMD must not assume it is the only thing in the platform. It needs to be compartmentalized to specific participating RIDs and made explicit, because it has a bad special requirement for cross-device PASIDs.

Jason
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Mon, May 10, 2021 at 09:37:29AM -0300, Jason Gunthorpe wrote:

> On Sat, May 08, 2021 at 09:56:59AM +0000, Tian, Kevin wrote:
> > > From: Raj, Ashok
> > > Sent: Friday, May 7, 2021 12:33 AM

> > > > Basically it means when the guest's top level IOASID is created for nesting that IOASID claims all PASID's on the RID and excludes any PASID IOASIDs from existing on the RID now or in future.

> > > The way to look at it is as follows:

> > > For platforms that do not need to support the shared work queue model (ENQCMD or similar), PASID space is naturally per RID. There is no complication with this. Every RID has the full range of PASID's and no need for the host to track which PASIDs are allocated now or in the future in the guest.

> > > For platforms that support ENQCMD, it is required to mandate PASIDs are global across the entire system. Maybe it's better to call them gPASID for guest and hPASID for host. Short reason being gPASID->hPASID is a guest wide mapping for ENQCMD and not a per-RID based mapping. (We covered that in earlier responses)

> > > In our current implementation we actually don't separate this space, and gPASID == hPASID. The iommu driver enforces that by using the custom allocator and the architected interface that allows all guest vIOMMU allocations to be proxied to the host. Nothing but a glorified hypercall-like interface. In fact some OS's do use a hypercall to get a hPASID vs using the vCMD style interface.

> > After more thinking about the new interface, I feel gPASID==hPASID actually causes some confusion in the uAPI design. In concept an ioasid is not active until it's attached to a device, because it's just an ID if w/o a device. So supposedly an ioasid should reject all user commands before attach. However a guest likely asks for a new gPASID before attaching it to devices and the vIOMMU. If gPASID==hPASID then Qemu must request /dev/ioasid to allocate a hw_id for an ioasid which hasn't been attached to any device, with the assumption based on kernel knowledge that this hw_id comes from a global allocator w/o dependency on any device. This doesn't sound like a clean design, not to mention it also conflicts with live migration.

> Everything must be explicit. The situation David pointed to of qemu emulating a vIOMMU while running on a host with a different platform/physical IOMMU must be considered.

> If the vIOMMU needs specific behavior it must use /dev/iommu to ask for it specifically and not just make wild assumptions about how the platform works.

I think the right way is for the pIOMMU to enforce the right behavior. The vIOMMU can ask for a PASID and the physical IOMMU driver would give what is optimal for the platform. If the vIOMMU says "give me a per-device PASID" but that can lead to conflicts in the PASID namespace, it's best to avoid it.

Global PASID doesn't break anything, but giving that control to the vIOMMU doesn't seem right. When we have mixed use cases like hardware that supports shared wq and SRIOV devices that need PASIDs, we need to comprehend how they will work without having a backend to migrate PASIDs to a new destination.

For ENQCMD we have the gPASID->hPASID translation in the VMCS control. For devices that support SIOV, programming a PASID to a device is also mediated, so it's possible for something like the mediated interface to assist with that migration for the dedicated WQ.
When we have both SRIOV and shared WQ exposed to the same guest, we do have an issue. The simplest way that I thought of was to have a guest and host PASID separation, where the guest has its own PASID space and the host has its own carved out. The guest can do whatever it wants within that allocated space without fear of any collision with any other device.

Cheers,
Ashok
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Sat, May 08, 2021 at 09:56:59AM +0000, Tian, Kevin wrote:

> > From: Raj, Ashok
> > Sent: Friday, May 7, 2021 12:33 AM

> > > Basically it means when the guest's top level IOASID is created for nesting that IOASID claims all PASID's on the RID and excludes any PASID IOASIDs from existing on the RID now or in future.

> > The way to look at it is as follows:

> > For platforms that do not need to support the shared work queue model (ENQCMD or similar), PASID space is naturally per RID. There is no complication with this. Every RID has the full range of PASID's and no need for the host to track which PASIDs are allocated now or in the future in the guest.

> > For platforms that support ENQCMD, it is required to mandate PASIDs are global across the entire system. Maybe it's better to call them gPASID for guest and hPASID for host. Short reason being gPASID->hPASID is a guest wide mapping for ENQCMD and not a per-RID based mapping. (We covered that in earlier responses)

> > In our current implementation we actually don't separate this space, and gPASID == hPASID. The iommu driver enforces that by using the custom allocator and the architected interface that allows all guest vIOMMU allocations to be proxied to the host. Nothing but a glorified hypercall-like interface. In fact some OS's do use a hypercall to get a hPASID vs using the vCMD style interface.

> After more thinking about the new interface, I feel gPASID==hPASID actually causes some confusion in the uAPI design. In concept an ioasid is not active until it's attached to a device, because it's just an ID if w/o a device. So supposedly an ioasid should reject all user commands before attach. However a guest likely asks for a new gPASID before attaching it to devices and the vIOMMU. If gPASID==hPASID then Qemu must request /dev/ioasid to allocate a hw_id for an ioasid which hasn't been attached to any device, with the assumption based on kernel knowledge that this hw_id comes from a global allocator w/o dependency on any device. This doesn't sound like a clean design, not to mention it also conflicts with live migration.

Everything must be explicit. The situation David pointed to of qemu emulating a vIOMMU while running on a host with a different platform/physical IOMMU must be considered.

If the vIOMMU needs specific behavior it must use /dev/iommu to ask for it specifically and not just make wild assumptions about how the platform works.

> gPASID!=hPASID has a problem when assigning a physical device which supports both shared work queue (ENQCMD with PASID in MSR) and dedicated work queue (PASID in device register) to a guest process which is associated to a gPASID. Say the host kernel has set up the hPASID entry with nested translation through /dev/ioasid. For shared work queue the CPU is configured to translate gPASID in MSR into **hPASID** before the payload goes out to the wire. However for dedicated work queue the device MMIO register is directly mapped to and programmed by the guest, thus containing a **gPASID** value, implying DMA requests through this interface will hit IOMMU faults due to an invalid gPASID entry. Having gPASID==hPASID is a simple workaround here. mdev doesn't have this problem because the PASID register is in the emulated control path and thus can be translated to hPASID manually by the mdev driver.

This all must be explicit too.
If a PASID is allocated and it is going to be used with ENQCMD then everything needs to know it is actually quite different than a PASID that was allocated to be used with a normal SRIOV device, for instance.

The former case can accept that the guest PASID is virtualized, while the latter cannot.

This is also why PASID per RID has to be an option. When I assign a full SRIOV function to the guest then that entire RID space needs to also be assigned to the guest. Upon migration I need to take all the physical PASIDs and rebuild them in another hypervisor exactly as is.

If you force all RIDs into a global PASID pool then normal SRIOV migration w/PASID becomes impossible, i.e. ENQCMD breaks everything else that should work.

This is why you need to sort all this out and why it feels like some of the specs here have been mis-designed.

I'm not sure carving out ranges is really workable for migration.

I think the real answer is to carve out entire RIDs as being in the global pool or not. Then the ENQCMD HW can be bundled together and everything else can live in the natural PASID per RID world.

Jason
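One way to picture the whole-RID opt-in, as opposed to carving ranges out of every RID (purely illustrative; no such interface exists):

#include <stdbool.h>
#include <stdint.h>

struct rid {
    bool global_pool;     /* decided once, when the RID is bound */
    uint32_t next_local;  /* private namespace, migrates as-is */
};

static uint32_t next_global;  /* one namespace for all opted-in RIDs */

/* ENQCMD-capable RIDs opt into the shared pool; everything else keeps a
 * private per-RID namespace and can be rebuilt exactly on migration. */
static uint32_t pasid_alloc(struct rid *r)
{
    return r->global_pool ? next_global++ : r->next_local++;
}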
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On 5/8/21 3:31 PM, Tian, Kevin wrote:

> > From: Alex Williamson
> > Sent: Saturday, May 8, 2021 1:06 AM

> > > > Those are the main ones I can think of. It is nice to have a simple map/unmap interface, I'd hope that a new /dev/ioasid interface wouldn't raise the barrier to entry too high, but the user needs to have the ability to have more control of their mappings and locked page accounting should probably be offloaded somewhere. Thanks,

> > > Based on your feedback I feel it's probably reasonable to start with type1v2 semantics for the new interface. Locked accounting could also start with the same VFIO restriction and then improve it incrementally, if a cleaner way is intrusive (if not affecting uAPI). But I didn't get the suggestion on "more control of their mappings". Can you elaborate?

> > Things like I note above, userspace cannot currently specify mapping granularity nor has any visibility to the granularity they get from the IOMMU. What actually happens in the IOMMU is pretty opaque to the user currently. Thanks,

> It's much clearer. Based on all the discussions so far I'm thinking about a staging approach when building the new interface, basically following the model that Jason pointed out - generic stuff first, then platform specific extension:

> Phase 1: /dev/ioasid with core ingredients and vfio type1v2 semantics
> - ioasid is the software handle representing an I/O page table

A trivial proposal: is it possible to use /dev/ioas? Conceptually, it's an IO address space representation and has nothing to do with any ID.

Best regards,
baolu
RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
> From: Raj, Ashok
> Sent: Friday, May 7, 2021 12:33 AM

> > Basically it means when the guest's top level IOASID is created for nesting that IOASID claims all PASID's on the RID and excludes any PASID IOASIDs from existing on the RID now or in future.

> The way to look at it is as follows:

> For platforms that do not need to support the shared work queue model (ENQCMD or similar), PASID space is naturally per RID. There is no complication with this. Every RID has the full range of PASID's and no need for the host to track which PASIDs are allocated now or in the future in the guest.

> For platforms that support ENQCMD, it is required to mandate PASIDs are global across the entire system. Maybe it's better to call them gPASID for guest and hPASID for host. Short reason being gPASID->hPASID is a guest wide mapping for ENQCMD and not a per-RID based mapping. (We covered that in earlier responses)

> In our current implementation we actually don't separate this space, and gPASID == hPASID. The iommu driver enforces that by using the custom allocator and the architected interface that allows all guest vIOMMU allocations to be proxied to the host. Nothing but a glorified hypercall-like interface. In fact some OS's do use a hypercall to get a hPASID vs using the vCMD style interface.

After more thinking about the new interface, I feel gPASID==hPASID actually causes some confusion in the uAPI design. In concept an ioasid is not active until it's attached to a device, because it's just an ID if w/o a device. So supposedly an ioasid should reject all user commands before attach. However a guest likely asks for a new gPASID before attaching it to devices and the vIOMMU. If gPASID==hPASID then Qemu must request /dev/ioasid to allocate a hw_id for an ioasid which hasn't been attached to any device, with the assumption based on kernel knowledge that this hw_id comes from a global allocator w/o dependency on any device. This doesn't sound like a clean design, not to mention it also conflicts with live migration.

I want to hear your and Jason's opinions about an alternative option that removes this restriction, thus allowing gPASID!=hPASID.

gPASID!=hPASID has a problem when assigning a physical device which supports both shared work queue (ENQCMD with PASID in MSR) and dedicated work queue (PASID in device register) to a guest process which is associated to a gPASID. Say the host kernel has set up the hPASID entry with nested translation through /dev/ioasid. For shared work queue the CPU is configured to translate gPASID in MSR into **hPASID** before the payload goes out to the wire. However for dedicated work queue the device MMIO register is directly mapped to and programmed by the guest, thus containing a **gPASID** value, implying DMA requests through this interface will hit IOMMU faults due to an invalid gPASID entry. Having gPASID==hPASID is a simple workaround here. mdev doesn't have this problem because the PASID register is in the emulated control path and thus can be translated to hPASID manually by the mdev driver.

Along this line, one possible option is having both gPASID and hPASID entries pointing to the same paging structure, sort of making gPASID an aliasing hw_id to hPASID. Then we also need to make sure the gPASID range does not collide with the hPASID range for this RID.
Something like below:

In the beginning Qemu specifies a minimal ID (say 1024) that hPASIDs must be allocated beyond (sort of delegating [0, 1023] of this RID to userspace):

ioctl(ioasid_fd, SET_IOASID_MIN_HWID, 1024);

The guest still uses the vIOMMU interface or a hypercall to allocate gPASIDs. Upon such a request, Qemu returns a gPASID from [0, 1023] to the guest and also allocates a new ioasid from /dev/ioasid. There is no hw_id allocated at this step:

ioasid = ioctl(ioasid_fd, ALLOC_IOASID);

hw_id (hPASID) is allocated when attaching the ioasid to the said device:

ioctl(device_fd, VFIO_ATTACH_IOASID, ioasid);

Then gPASID is provided as an aliasing hwid to this ioasid:

ioctl(device_fd, VFIO_ALIASING_IOASID, ioasid, gPASID);

Starting from this point the kernel should make sure that any ioasid operation is applied to both gPASID and hPASID for this RID (entry setup, tear down, etc.) and that both PASID entries point to the same paging structures. When a page fault happens, the IOMMU driver should also link a fault from either PASID back to the associated ioasid.

As explained earlier this aliasing requirement only applies to physical devices on Intel platforms. We may either have mdev ignore such an aliasing request, or have the vfio device report whether aliasing is allowed.

This is sort of a hybrid model: the gPASID range is reserved locally in the per-RID pasid space and delegated to userspace, while the hPASIDs are still managed globally and not exposed to userspace.

Does this sound like a cleaner approach (still w/ some complexity) compared to the restrictions of having gPASID==hPASID?

Thanks
Kevin
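Put together as a Qemu-side sketch (all four ioctl names are the hypothetical ones from this mail, with invented request numbers so the fragment is self-contained; error handling omitted):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Placeholder request numbers; none of this uAPI exists. */
#define SET_IOASID_MIN_HWID   _IO('i', 0)
#define ALLOC_IOASID          _IO('i', 1)
#define VFIO_ATTACH_IOASID    _IO('v', 0)
#define VFIO_ALIASING_IOASID  _IO('v', 1)

static int setup_aliased_pasid(int ioasid_fd, int device_fd, uint32_t gpasid)
{
    int ioasid;

    /* Delegate [0, 1023] of this RID's PASID space to userspace; the
     * kernel allocates hPASIDs at 1024 and above. */
    ioctl(ioasid_fd, SET_IOASID_MIN_HWID, 1024);

    /* gpasid was already handed to the guest from [0, 1023]; allocate
     * the ioasid now, with no hw_id yet. */
    ioasid = ioctl(ioasid_fd, ALLOC_IOASID);

    /* hPASID is allocated when the ioasid is attached to the device. */
    ioctl(device_fd, VFIO_ATTACH_IOASID, ioasid);

    /* Make gpasid an alias of the hPASID: both PASID entries of this
     * RID now point at the same paging structures. */
    ioctl(device_fd, VFIO_ALIASING_IOASID, ioasid, gpasid);
    return ioasid;
}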
RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
> From: Alex Williamson
> Sent: Saturday, May 8, 2021 1:06 AM

> > > Those are the main ones I can think of. It is nice to have a simple map/unmap interface, I'd hope that a new /dev/ioasid interface wouldn't raise the barrier to entry too high, but the user needs to have the ability to have more control of their mappings and locked page accounting should probably be offloaded somewhere. Thanks,

> > Based on your feedback I feel it's probably reasonable to start with type1v2 semantics for the new interface. Locked accounting could also start with the same VFIO restriction and then improve it incrementally, if a cleaner way is intrusive (if not affecting uAPI). But I didn't get the suggestion on "more control of their mappings". Can you elaborate?

> Things like I note above, userspace cannot currently specify mapping granularity nor has any visibility to the granularity they get from the IOMMU. What actually happens in the IOMMU is pretty opaque to the user currently. Thanks,

It's much clearer. Based on all the discussions so far I'm thinking about a staging approach when building the new interface, basically following the model that Jason pointed out - generic stuff first, then platform specific extension:

Phase 1: /dev/ioasid with core ingredients and vfio type1v2 semantics
- ioasid is the software handle representing an I/O page table
- uAPI accepts type1v2 map/unmap semantics per ioasid
- helpers for VFIO/VDPA to bind ioasid_fd and attach ioasids
- multiple ioasids are allowed without nesting (vIOMMU, or devices w/ incompatible iommu attributes)
- an ioasid disallows any operation before it's attached to a device
- an ioasid inherits iommu attributes from the 1st device attached to it
- userspace is expected to manage hardware restrictions and the kernel only returns an error when restrictions are broken
  * map/unmap on an ioasid will fail before every device in a group is attached to it
  * ioasid attach will fail if the new device has an incompatible iommu attribute to that of this ioasid
- thus no group semantics in the uAPI
- no change to vfio container/group/type1 logic, for running existing vfio applications
  * implies some duplication between vfio type1 and ioasid for some time
- new uAPI in vfio to allow explicit opening of a device and then binding it to the ioasid_fd
  * possibly require each device exposed in /dev/vfio/
- support both pdev and mdev

Phase 2: ioasid nesting
- allow bind/unbind_pgtable semantics per ioasid
- allow ioasid nesting
  * HW ioasid nesting if supported by platform
  * otherwise fall back to SW ioasid nesting (in-kernel shadowing)
- iotlb invalidation per ioasid
- I/O page fault handling per ioasid
- hw_id is not exposed in the uAPI. The vendor IOMMU driver decides when/how hw_id is allocated and programmed properly

Phase 3: optimizations and vendor extensions (order undefined, up to the specific feature owner):
- (Intel) ENQCMD support with hw_id exposure in the uAPI
- (ARM/AMD) RID-based pasid table assignment
- (PPC) window-based iova management
- optimizations:
  * replace vfio type1 with a shim driver to use the ioasid backend
  * mapping granularity
  * HW dirty page tracking
  * ...

Does the above sound like a sensible plan? If yes we'll start working on phase 1 then...

Thanks
Kevin
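For the phase 1 map/unmap piece, the uAPI could plausibly mirror vfio type1v2. The struct below is invented here purely to illustrate what "type1v2 map/unmap semantics per ioasid" might look like; it is not a real ABI:

#include <stdint.h>

/* Hypothetical, modeled on struct vfio_iommu_type1_dma_map. */
struct ioasid_dma_map {
    uint32_t argsz;
    uint32_t flags;    /* read/write permission bits */
    int32_t  ioasid;   /* which I/O page table to populate */
    uint64_t vaddr;    /* process virtual address */
    uint64_t iova;     /* I/O virtual address */
    uint64_t size;     /* bytes; unmap may not bisect prior maps (type1v2) */
};

As with type1v2, the kernel would reject an unmap that bisects an earlier mapping, which keeps the accounting and superpage handling tractable.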
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
Hi Jason,

On Wed, 5 May 2021 19:21:20 -0300, Jason Gunthorpe wrote:

> On Wed, May 05, 2021 at 01:04:46PM -0700, Jacob Pan wrote:
> > On Wed, 5 May 2021 15:00:23 -0300, Jason Gunthorpe wrote:
> > > On Wed, May 05, 2021 at 10:22:59AM -0700, Jacob Pan wrote:

> > > > Global and pluggable are for slightly separate reasons. - We need global PASID on VT-d in that we need to support shared workqueues (SWQ). E.g. one SWQ can be wrapped into two mdevs then assigned to two VMs. Each VM uses its private guest PASID to submit work but each guest PASID must be translated to a global (system-wide) host PASID to avoid conflict. Also, since PASID table storage is per PF, if two mdevs of the same PF are assigned to different VMs, the PASIDs must be unique.

> > > From a protocol perspective each RID has a unique PASID table, and RIDs can have overlapping PASIDs.

> > True, per RID or per PF as I was referring to.

> > > Since your SWQ is connected to a single RID the requirement that PASIDs are unique to the RID ensures they are sufficiently unique.

> > True, but one process can submit work to multiple mdevs from different RIDs/PFs. One process uses one PASID and the PASID translation table is per VM. The same PASID is used for all the PASID tables of each RID.

> If the model is "assign this PASID to this RID" then yes, there is a big problem keeping everything straight that can only be solved with a global table.

> But if the model is "give me a PASID for this RID" then it isn't such a problem.

Let me double-confirm that I'm understanding you correctly. So your suggestion is to have a per-RID PASID namespace, which can be maintained by the IOMMU driver, right? Take native SVM usage as an example: every time a process is bound to a device, a PASID within this RID will be allocated. Am I correct so far?

If yes, then there is a case in which IOTLB efficiency is really low. Let's say there is a process bound to multiple devices (RIDs) that has different PASIDs allocated for each RID. In such a case, the PASID values are different for each RID. As most vendors do, PASID will be used to tag IOTLB entries. So in such a case, there will be multiple IOTLB entries for a single VA->PA mapping, and the number of such duplicate IOTLB entries increases linearly with the number of devices. That seems bad from a performance perspective.

> Basically trying to enforce a uniform PASID for an IOASID across all RIDs attached to it is not such a nice choice.

> > > That is fine, but all this stuff should be inside the Intel vIOMMU driver not made into a global resource of the entire iommu subsystem.

> > Intel vIOMMU has to use a generic uAPI to allocate PASID so the generic code needs to have this option. I guess you are saying we should also have a per RID allocation option in addition to global?

> There always has to be a RID involvement for the PASID, for security, this issue really boils down to where the PASID lives.

> If you need the PASID attached to the IOASID then it has to be global because the IOASID can be attached to any RID and must keep the same PASID.

> If the PASID is learned when the IOASID is attached to a RID then the PASID is more flexible and isn't attached to the IOASID.

> Honestly I'm a little leery to bake into a UAPI a specific HW choice that Intel made here.
> I would advise making the "attach a global PASID to this IOASID" operation explicit and opt-in for the cases that actually need it.

> Which implies the API to the iommu driver should be more like:

> 'assign an IOASID to this RID and return the PASID'
> 'reserve a PASID from every RID'
> 'assign an IOASID to this RID and use this specific PASID'

> In all cases the scope of those operations is completely local to a certain IOMMU driver - 'reserve a PASID from every RID' is really every RID that driver can operate on.

Also, this reservation will fail if the PASID happens to be occupied by a previous user. As the PASID translation table is per-VM, ENQCMD in a VM will be a problem under such a PASID management model.

> So it is hard to see why the allocator should be a global resource and not something that is part of the iommu driver exclusively.

> Jason

--
Regards,
Yi Liu
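Spelled out as driver-facing declarations (hypothetical names and signatures; the mail above only describes the three operations in prose):

/* 'assign an IOASID to this RID and return the PASID' */
int iommu_attach_ioasid(struct device *dev, u32 ioasid, u32 *pasid);

/* 'reserve a PASID from every RID' (every RID this driver operates on) */
int iommu_reserve_global_pasid(u32 pasid);

/* 'assign an IOASID to this RID and use this specific PASID' */
int iommu_attach_ioasid_at(struct device *dev, u32 ioasid, u32 pasid);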
RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
> From: Jason Gunthorpe
> Sent: Saturday, May 8, 2021 1:11 AM

> On Fri, May 07, 2021 at 11:06:14AM -0600, Alex Williamson wrote:

> > We had tossed around an idea of a super-container with vfio, it's maybe something we'd want to incorporate into this design. For instance, if memory could be pre-registered with a super container, which would handle the locked memory accounting for that memory, then sub-containers could all handle the IOMMU context of their sets of devices relative to that common memory pool.

> This is where I suggested to David to use nesting of IOASIDs.

> Without HW support for nesting a SW nest is really just re-using the memory registration information stored in the parent when constructing the children.

Yes, this sounds like a sensible thing to do. It also unifies the user experience regardless of whether the underlying hw supports nesting, e.g. when a vIOMMU is present Qemu can always use the IOASID nesting uAPI. In the case of SW nesting the kernel will merge the two-level translations from two IOASIDs into a one-level shadow page table (unlike today's VFIO, which has userspace manage the shadow-based mapping).

But I want to remark that nesting IOASIDs alone cannot solve this accounting problem completely, as long as a process is allowed to have multiple ioasid FDs (unless there is a mechanism to allow nesting IOASIDs across FDs). This is probably not a big issue, though. With all the intended usages around the new interface, I think for most applications one ioasid FD should be sufficient to meet their requirements (multiple gpa_ioasids, ioasid nesting, etc.).

Thanks
Kevin
RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
> From: Jason Gunthorpe
> Sent: Thursday, April 29, 2021 4:46 AM

> > > I think the name IOASID is fine for the uAPI, the kernel version can be called ioasid_id or something.

> > ioasid is already an id and then ioasid_id just adds confusion. Another point is that ioasid is currently used to represent both PCI PASID and ARM substream ID in the kernel. It implies that if we want to separate ioasid and pasid in the uAPI the 'pasid' also needs to be replaced with another general term usable for substream ID. Are we making the terms too confusing here?

> This is why I also am not so sure about exposing the PASID in the API because it is ultimately a HW specific item.

> As I said to David, one avenue is to have some generic uAPI that is very general and keep all this deeply detailed stuff, that really only matters for qemu, as part of a more HW specific vIOMMU driver interface.

OK, so the general uAPI will not expose hw_id and will just provide everything generic for managing I/O page tables (map/unmap, nesting, etc.) through IOASID, while a specific uAPI is provided to handle platform-specific requirements (hw_id, iova windows, etc.).

Thanks
Kevin
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
Hi Jason,

On Fri, 7 May 2021 16:28:10 -0300, Jason Gunthorpe wrote:

> > The unanswered question is how do we plumb from vIOMMU without a custom allocator to get a system wide PASID?

> PASID allocation is part of the iommu driver, it really shouldn't be global.

In the current code, the pluggable custom allocator *is* part of the iommu vendor driver. If it decides the allocation is global then it should be suitable for the platform, since there will never be a VT-d IOMMU on another vendor's platform. It is true that the default allocator is global, which suits the current needs. I am just wondering if we are solving a problem that does not exist yet.

> When the architecture code goes to allocate a single PASID for the mm_struct it should flag that allocation request with a 'must work for all RIDs flag' and the iommu driver should take care of it. That might mean the iommu driver consults a global static xarray, or maybe it does a hypercall, but it should be done through that API, not a side-car global singleton.

Why do we need to flag the allocation every time if on a platform *every* PASID can potentially be global? At the time of allocation, we don't know if the PASID will be used for a shared (ENQCMD) or a dedicated workqueue.

Thanks,
Jacob
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Fri, May 07, 2021 at 12:23:25PM -0700, Raj, Ashok wrote:

> Hi Jason
> - Removed lizefan's email due to bounces...

> On Fri, May 07, 2021 at 03:20:50PM -0300, Jason Gunthorpe wrote:
> > On Fri, May 07, 2021 at 11:14:58AM -0700, Raj, Ashok wrote:
> > > On Fri, May 07, 2021 at 02:20:51PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, May 06, 2021 at 09:32:40AM -0700, Raj, Ashok wrote:

> > > > > For platforms that support ENQCMD, it is required to mandate PASIDs are global across the entire system. Maybe it's better to call them gPASID for guest and hPASID for host. Short reason being gPASID->hPASID is a guest wide mapping for ENQCMD and not a per-RID based mapping. (We covered that in earlier responses)

> > > > I don't think it is actually ENQCMD that forces this, ENQCMD can use a per-RID PASID in the translation table as well.

> > > When using ENQCMD the PASID that needs to be sent on the wire is picked from an MSR set up by the kernel. This is context switched along with the process. So each process has only 1 PASID that can go out when using ENQCMD. ENQCMD takes one mmio address specific to the accelerator and a source for the descriptor.

> > Oh. I forgot this also globally locked the PASID to a single MSR. Sigh. That makes the whole mechanism useless for anything except whole process SVA.

> Is there another kind of SVA? Our mapping is that each process requires a single mm, and the PASID for SVM was a direct map from that.

There are lots of potential applications for something like ENQCMD that are not whole process SVA. Linking it to a single PASID basically nukes any other use of it unfortunately.

> > I would have to ask for a PASID that has the property it needs. You are saying the property is even bigger than "usable on a group of RIDs" but is actually "global for every RID and IOMMU in the system so it can go into a MSR". Gross, but fine, ask for that explicitly when allocating the PASID.

> If one process has a single mm, is that also gross? :-) So a single process having a PASID is just an identifier for the IOMMU. It just seems like what a mm is for a process == PASID for SVM-IOMMU support.

> The unanswered question is how do we plumb from vIOMMU without a custom allocator to get a system wide PASID?

PASID allocation is part of the iommu driver, it really shouldn't be global.

When the architecture code goes to allocate a single PASID for the mm_struct it should flag that allocation request with a 'must work for all RIDs flag' and the iommu driver should take care of it. That might mean the iommu driver consults a global static xarray, or maybe it does a hypercall, but it should be done through that API, not a side-car global singleton.

Jason
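The "must work for all RIDs" flag might look something like this at the arch/iommu boundary (a hypothetical sketch; neither the flag nor the function exists):

/* Hypothetical: usable with every RID/IOMMU, e.g. for the ENQCMD MSR. */
#define IOMMU_PASID_FLAG_GLOBAL  (1 << 0)

ioasid_t iommu_alloc_mm_pasid(struct mm_struct *mm, unsigned long flags);

/* arch code, on first ENQCMD use by a process: */
pasid = iommu_alloc_mm_pasid(current->mm, IOMMU_PASID_FLAG_GLOBAL);

Whether the driver satisfies the flag with a static xarray, a hypercall, or a reserved range then stays entirely inside the driver.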
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
Hi Jason

- Removed lizefan's email due to bounces...

On Fri, May 07, 2021 at 03:20:50PM -0300, Jason Gunthorpe wrote:

> On Fri, May 07, 2021 at 11:14:58AM -0700, Raj, Ashok wrote:
> > On Fri, May 07, 2021 at 02:20:51PM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 06, 2021 at 09:32:40AM -0700, Raj, Ashok wrote:

> > > > For platforms that support ENQCMD, it is required to mandate PASIDs are global across the entire system. Maybe it's better to call them gPASID for guest and hPASID for host. Short reason being gPASID->hPASID is a guest wide mapping for ENQCMD and not a per-RID based mapping. (We covered that in earlier responses)

> > > I don't think it is actually ENQCMD that forces this, ENQCMD can use a per-RID PASID in the translation table as well.

> > When using ENQCMD the PASID that needs to be sent on the wire is picked from an MSR set up by the kernel. This is context switched along with the process. So each process has only 1 PASID that can go out when using ENQCMD. ENQCMD takes one mmio address specific to the accelerator and a source for the descriptor.

> Oh. I forgot this also globally locked the PASID to a single MSR. Sigh. That makes the whole mechanism useless for anything except whole process SVA.

Is there another kind of SVA? Our mapping is that each process requires a single mm, and the PASID for SVM was a direct map from that.

> It also makes it a general kernel problem and not just related to the vIOMMU scenario.

> > > I think at the uAPI level the callpaths that require allocating a PASID from a group of RIDs should be explicit in their intention and not implicitly rely on a certain allocator behavior.

> > The difficult part I see is, when one application establishes a path to one accelerator, we have no knowledge if it's going to connect to a second, third or such. I don't see how this can work reasonably well. What if PASIDx is allocated for one, but the second RID it's trying to attach already has this PASID allocated?

> You mean like some kind of vIOMMU hot plug?

Not vIOMMU hot plug. But an application opens accel1 and does a bind to allocate a PASID. What I meant was that the kernel has no information whether this needs to be a per-RID PASID or a global PASID. Keeping this global solves that problem and avoids more complex mechanisms to say "reserve this PASID on all accelerators", which seems pretty complicated to implement.

Now are we losing anything by keeping the PASIDs global? As we discussed there is no security issue, since the PASID table that hosts these PASIDs for SVM is still per-RID.

For e.g. an app establishes a connection to accel1 and allocates PASID-X. The RID for accel1 now has PASID-X and the process mm plumbed. Later the app also connects with accel2, and now PASID-X is plumbed in for the RID of accel2.

> > > If you want to get a PASID that can be used with every RID in your /dev/ioasid then ask for that exactly.

> > Correct, but how does the guest, through the vIOMMU driver, communicate that intent so the uAPI plumbing can do this? I mean architecturally via IOMMU interfaces?

> I would have to ask for a PASID that has the property it needs. You are saying the property is even bigger than "usable on a group of RIDs" but is actually "global for every RID and IOMMU in the system so it can go into a MSR". Gross, but fine, ask for that explicitly when allocating the PASID.

If one process has a single mm, is that also gross?
:-) So a single process having a PASID is just an identifier for the IOMMU. It just seems like what a mm is for a process == PASID for SVM-IOMMU support.

The unanswered question is how do we plumb from vIOMMU without a custom allocator to get a system wide PASID?

The way it works today is, if we have a custom allocator registered, that's the mechanics to get PASIDs allocated. For Intel vIOMMU it happens to be a globally unique allocation. If a particular vIOMMU doesn't require it, it does not have the vIOMMU interface, and those naturally get a guest-local PASID namespace. (I'm not sure if that's how the allocator works today, but I guess it's extensible to accomplish a RID-local PASID if that's exactly what is required.)

Cheers,
Ashok
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Fri, May 07, 2021 at 11:14:58AM -0700, Raj, Ashok wrote:

> On Fri, May 07, 2021 at 02:20:51PM -0300, Jason Gunthorpe wrote:
> > On Thu, May 06, 2021 at 09:32:40AM -0700, Raj, Ashok wrote:

> > > For platforms that support ENQCMD, it is required to mandate PASIDs are global across the entire system. Maybe it's better to call them gPASID for guest and hPASID for host. Short reason being gPASID->hPASID is a guest wide mapping for ENQCMD and not a per-RID based mapping. (We covered that in earlier responses)

> > I don't think it is actually ENQCMD that forces this, ENQCMD can use a per-RID PASID in the translation table as well.

> When using ENQCMD the PASID that needs to be sent on the wire is picked from an MSR set up by the kernel. This is context switched along with the process. So each process has only 1 PASID that can go out when using ENQCMD. ENQCMD takes one mmio address specific to the accelerator and a source for the descriptor.

Oh. I forgot this also globally locked the PASID to a single MSR. Sigh. That makes the whole mechanism useless for anything except whole process SVA.

It also makes it a general kernel problem and not just related to the vIOMMU scenario.

> > I think at the uAPI level the callpaths that require allocating a PASID from a group of RIDs should be explicit in their intention and not implicitly rely on a certain allocator behavior.

> The difficult part I see is, when one application establishes a path to one accelerator, we have no knowledge if it's going to connect to a second, third or such. I don't see how this can work reasonably well. What if PASIDx is allocated for one, but the second RID it's trying to attach already has this PASID allocated?

You mean like some kind of vIOMMU hot plug?

> > If you want to get a PASID that can be used with every RID in your /dev/ioasid then ask for that exactly.

> Correct, but how does the guest, through the vIOMMU driver, communicate that intent so the uAPI plumbing can do this? I mean architecturally via IOMMU interfaces?

I would have to ask for a PASID that has the property it needs. You are saying the property is even bigger than "usable on a group of RIDs" but is actually "global for every RID and IOMMU in the system so it can go into a MSR". Gross, but fine, ask for that explicitly when allocating the PASID.

Jason
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Fri, May 07, 2021 at 02:20:51PM -0300, Jason Gunthorpe wrote:

> On Thu, May 06, 2021 at 09:32:40AM -0700, Raj, Ashok wrote:

> > For platforms that support ENQCMD, it is required to mandate PASIDs are global across the entire system. Maybe it's better to call them gPASID for guest and hPASID for host. Short reason being gPASID->hPASID is a guest wide mapping for ENQCMD and not a per-RID based mapping. (We covered that in earlier responses)

> I don't think it is actually ENQCMD that forces this, ENQCMD can use a per-RID PASID in the translation table as well.

When using ENQCMD the PASID that needs to be sent on the wire is picked from an MSR set up by the kernel. This is context switched along with the process. So each process has only 1 PASID that can go out when using ENQCMD. ENQCMD takes one mmio address specific to the accelerator and a source for the descriptor.

When one application connects to more than one accelerator, the fact that the PASID is MSR-based and filled in automatically by the CPU instruction requires both accelerators to use the same PASID.

Did you refer to this implementation, or something else?

> You get forced here only based on the design of the vIOMMU communication channel. If the guest can demand any RID is attached to a specific guest determined PASID then the hypervisor must accommodate that.

True, but when we have a guest using vSVM and enabling vENQCMD the requirement is the same inside the guest.

> > > Which would be a different behavior than something like Intel's top level IOASID that doesn't claim all the PASIDs.

> > Isn't this simple, if we can say the ioasid allocator can provide:
> > - system wide PASID
> > - RID local PASID
> > based on platform capabilities that require such differentiation?

> I think at the uAPI level the callpaths that require allocating a PASID from a group of RIDs should be explicit in their intention and not implicitly rely on a certain allocator behavior.

The difficult part I see is, when one application establishes a path to one accelerator, we have no knowledge if it's going to connect to a second, third or such. I don't see how this can work reasonably well. What if PASIDx is allocated for one, but the second RID it's trying to attach already has this PASID allocated?

> If you want to get a PASID that can be used with every RID in your /dev/ioasid then ask for that exactly.

Correct, but how does the guest, through the vIOMMU driver, communicate that intent so the uAPI plumbing can do this? I mean architecturally via IOMMU interfaces?

Cheers,
Ashok
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, May 06, 2021 at 09:32:40AM -0700, Raj, Ashok wrote: > For platforms that support ENQCMD, it is required to mandate PASIDs are > global across the entire system. Maybe its better to call them gPASID for > guest and hPASID for host. Short reason being gPASID->hPASID is a guest > wide mapping for ENQCMD and not a per-RID based mapping. (We covered that > in earlier responses) I don't think it is actually ENQCMD that forces this, ENQCMD can use a per-RID PASID in the translation table as well. You get forced here only based on the design of the vIOMMU communication channel. If the guest can demand any RID is attached to a specific guest determined PASID then the hypervisor must accommodate that. > > Which would be a different behavior than something like Intel's top > > level IOASID that doesn't claim all the PASIDs. > > isn't this simple, if we can say ioasid allocator can provide > > - system wide PASID > - RID local PASID > > Based on platform capabilities that require such differentiation? I think at the uAPI level the callpaths that require allocating a PASID from a group of RIDs should be explicit in their intention and not implicitly rely on a certain allocator behavior. If you want to get a PASID that can be used with every RID in your /dev/ioasid then ask for that exactly. It makes the uAPI much more understandable to be explicit. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
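A sketch of what "ask for that exactly" could look like at the uAPI level: the caller states the required PASID scope instead of relying on allocator behavior. Every name here is a hypothetical assumption, not part of any posted /dev/ioasid ABI:

    #include <linux/types.h>
    #include <linux/ioctl.h>

    enum ioasid_pasid_scope {
            IOASID_PASID_SCOPE_RID = 0,     /* unique within one RID only */
            IOASID_PASID_SCOPE_SYSTEM = 1,  /* usable with every RID */
    };

    struct ioasid_alloc_pasid {
            __u32 argsz;
            __u32 scope;    /* enum ioasid_pasid_scope: caller's intent */
            __u32 pasid;    /* out: allocated value */
    };
    #define IOASID_ALLOC_PASID _IOWR('i', 0x40, struct ioasid_alloc_pasid)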
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Fri, May 07, 2021 at 11:06:14AM -0600, Alex Williamson wrote: > We had tossed around an idea of a super-container with vfio, it's maybe > something we'd want to incorporate into this design. For instance, if > memory could be pre-registered with a super container, which would > handle the locked memory accounting for that memory, then > sub-containers could all handle the IOMMU context of their sets of > devices relative to that common memory pool. This is where I suggested to David to use nesting of IOASIDs. Without HW support for nesting, a SW nest is really just re-using the memory registration information stored in the parent when constructing the children. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
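A rough sketch of that SW nesting idea, with invented types: the parent holds the pre-registered (pinned) memory, and constructing a child mapping is just a lookup plus an iommu_map() of a page the parent already pinned. Purely illustrative, not an existing kernel interface:

    struct ioasid_parent {
            struct xarray pinned;           /* uva pfn -> struct page * */
    };

    struct ioasid_child {
            struct ioasid_parent *parent;
            struct iommu_domain *domain;    /* this child's IO page table */
    };

    /* Map into a child without pinning again: reuse the parent's
     * registration records. */
    static int ioasid_child_map(struct ioasid_child *c, unsigned long iova,
                                unsigned long uva, int prot)
    {
            struct page *page = xa_load(&c->parent->pinned, uva >> PAGE_SHIFT);

            if (!page)
                    return -ENOENT;         /* was never pre-registered */
            return iommu_map(c->domain, iova, page_to_phys(page),
                             PAGE_SIZE, prot);
    }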
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Fri, 7 May 2021 07:36:49 + "Tian, Kevin" wrote: > > From: Alex Williamson > > Sent: Wednesday, April 28, 2021 11:06 PM > > > > On Wed, 28 Apr 2021 06:34:11 + > > "Tian, Kevin" wrote: > > > > > > Can you or Alex elaborate where the complexity and performance problem > > > locate in VFIO map/umap? We'd like to understand more detail and see > > how > > > to avoid it in the new interface. > > > > > > The map/unmap interface is really only good for long lived mappings, > > the overhead is too high for things like vIOMMU use cases or any case > > where the mapping is intended to be dynamic. Userspace drivers must > > make use of a long lived buffer mapping in order to achieve performance. > > This is not a limitation of VFIO map/unmap. It's the limitation of any > map/unmap semantics since the fact of long-lived vs. short-lived is > imposed by userspace. Nested translation is the only viable optimization > allowing 2nd-level to be a long-lived mapping even w/ vIOMMU. From > this angle I'm not sure how a new map/unmap implementation could > address this perf limitation alone. Sure, we don't need to try to tackle every problem at once, a map/unmap interface compatible with what we have is a good place to start and nested translation may provide the high performance option. That's not to say that we couldn't, in the future, extend the map/unmap with memory pre-registration like done in the spapr IOMMU to see how that could reduce latency. > > The mapping and unmapping granularity has been a problem as well, > > type1v1 allowed arbitrary unmaps to bisect the original mapping, with > > the massive caveat that the caller relies on the return value of the > > unmap to determine what was actually unmapped because the IOMMU use > > of > > superpages is transparent to the caller. This led to type1v2 that > > simply restricts the user to avoid ever bisecting mappings. That still > > leaves us with problems for things like virtio-mem support where we > > need to create initial mappings with a granularity that allows us to > > later remove entries, which can prevent effective use of IOMMU > > superpages. > > We could start with a semantics similar to type1v2. > > btw why does virtio-mem require a smaller granularity? Can we split > superpages in-the-fly when removal actually happens (just similar > to page split in VM live migration for efficient dirty page tracking)? The IOMMU API doesn't currently support those semantics. If the IOMMU used a superpage, then a superpage gets unmapped, it doesn't get atomically broken down into smaller pages. Therefore virtio-mem proposes a fixed mapping granularity to allow for that same unmapping granularity. > and isn't it another problem imposed by userspace? How could a new > map/unmap implementation mitigate this problem if the userspace > insists on a smaller granularity for initial mappings? Currently if userspace wants to guarantee unmap granularity, they need to make the same restriction themselves on the mapping granularity. For instance, userspace cannot currently map a 1GB IOVA range while guaranteeing 2MB unmap granularity of that range with a single ioctl. Instead userspace would need to make 512, 2MB mappings calls. > > Locked page accounting has been another constant issue. We perform > > locked page accounting at the container level, where each container > > accounts independently. A user may require multiple containers, the > > containers may pin the same physical memory, but be accounted against > > the user once per container. 
> > for /dev/ioasid there is still an open whether a process is allowed to > open /dev/ioasid once or multiple times. If there is only one ioasid_fd > per process, the accounting can be made accurately. Otherwise the > same problem still exists as each ioasid_fd is akin to the container, then > we need to find a better solution. We had tossed around an idea of a super-container with vfio, it's maybe something we'd want to incorporate into this design. For instance, if memory could be pre-registered with a super container, which would handle the locked memory accounting for that memory, then sub-containers could all handle the IOMMU context of their sets of devices relative to that common memory pool. > > Those are the main ones I can think of. It is nice to have a simple > map/unmap interface, I'd hope that a new /dev/ioasid interface wouldn't > raise the barrier to entry too high, but the user needs to have the > ability to have more control of their mappings and locked page > accounting should probably be offloaded somewhere. Thanks, > > Based on your feedback I feel it's probably reasonable to start with > a type1v2 semantics for the new interface. Locked accounting could > also start with the same VFIO restriction and then improve it > incrementally, if a cleaner way is intrusive (if not affecting uAPI). > But I didn't get the suggestion on "more control of their > mappings". Can you elaborate?
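The unmap-granularity point above in concrete terms: under type1v2-style rules, mapping 1GiB while preserving the ability to later unmap any 2MiB chunk means issuing 512 separate map calls up front. A minimal sketch against the existing VFIO type1 uAPI:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    static int map_1g_in_2m_chunks(int container, void *vaddr, uint64_t iova)
    {
            const uint64_t chunk = 2ULL << 20;              /* 2 MiB */

            for (uint64_t off = 0; off < (1ULL << 30); off += chunk) {
                    struct vfio_iommu_type1_dma_map map = {
                            .argsz = sizeof(map),
                            .flags = VFIO_DMA_MAP_FLAG_READ |
                                     VFIO_DMA_MAP_FLAG_WRITE,
                            .vaddr = (uint64_t)(uintptr_t)vaddr + off,
                            .iova  = iova + off,
                            .size  = chunk,
                    };
                    if (ioctl(container, VFIO_IOMMU_MAP_DMA, &map))
                            return -1;      /* 512 ioctls for the full 1 GiB */
            }
            return 0;
    }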
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Fri, May 07, 2021 at 07:36:49AM +0000, Tian, Kevin wrote: > for /dev/ioasid there is still an open whether a process is allowed to > open /dev/ioasid once or multiple times. If there is only one ioasid_fd > per process, the accounting can be made accurately. Otherwise the > same problem still exists as each ioasid_fd is akin to the container, then > we need to find a better solution. You can't really do tricks like 'FD once per process' in linux. The locked page accounting problem is much bigger than vfio and I don't really know of any solution.. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
> From: Jason Gunthorpe > Sent: Wednesday, May 5, 2021 1:13 AM > > On Wed, Apr 28, 2021 at 06:58:19AM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe > > > Sent: Wednesday, April 28, 2021 1:12 AM > > > > > [...] > > > > As Alex says, if this line fails because of the group restrictions, > > > > that's not great because it's not very obvious what's gone wrong. > > > > > > Okay, that is fair, but let's solve that problem directly. For > > > instance netlink has been going in the direction of adding an "extack" > > > from the kernel which is a descriptive error string. If the failing > > > ioctl returned the string: > > > > > > "cannot join this device to the IOASID because device XXX in the > > > same group #10 is in use" > > > > > > Would you agree it is now obvious what has gone wrong? In fact would > > > you agree this is a lot better user experience than what applications > > > do today even though they have the group FD? > > > > > > > Currently all the discussions are around implicit vs. explicit uAPI > > semantics > > on the group restriction. However if we look beyond group the implicit > > semantics might be inevitable when dealing with incompatible iommu > > domains. An existing example of iommu incompatibility is IOMMU_ > > CACHE. > > I still think we need to get rid of these incompatibilities > somehow. Having multiple HW incompatible IOASIDs in the same platform > is just bad all around. > > When modeling in userspace IOMMU_CACHE sounds like it is a property of > each individual IOASID, not an attribute that requires a new domain. Sure. The iommu domain is a kernel-internal concept. The userspace should focus everything on IOASID. > > People that want to create cache bypass IOASID's should just ask for > that directly. > Yes, in an earlier discussion we agreed on a scheme where the ioasid module will return an error to userspace indicating that an incompatibility was detected when binding a device to an ioasid; the userspace should then create a new IOASID for this device. This has to be done 'explicitly'. When I used it as the example for 'implicit semantics' the point is that the kernel won't create another group-like object to contain devices with compatible attributes and 'explicitly' manage it in uAPI like group_fd. If we anyway rely on the userspace to have more intelligence on those hardware restrictions, there's little sense in only explicitly handling group_fd in uAPI. Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
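The "implicit" flow Kevin describes, sketched as userspace logic: try to attach the device to an existing IOASID, and create a fresh one only when the kernel reports incompatibility. The ioctl names are invented for illustration (needs <sys/ioctl.h> and <errno.h>):

    /* Hypothetical ioctls: IOASID_ATTACH_DEVICE, IOASID_CREATE. */
    static int attach_or_create(int ioasid_fd, int device_fd, int ioasid)
    {
            if (ioctl(device_fd, IOASID_ATTACH_DEVICE, &ioasid) == 0)
                    return ioasid;          /* compatible, share the IOASID */
            if (errno != EINVAL)
                    return -1;              /* real error, give up */

            ioasid = ioctl(ioasid_fd, IOASID_CREATE, 0);    /* new, empty */
            if (ioasid < 0)
                    return -1;
            return ioctl(device_fd, IOASID_ATTACH_DEVICE, &ioasid) ?
                   -1 : ioasid;
    }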
RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
> From: Alex Williamson > Sent: Wednesday, April 28, 2021 11:06 PM > > On Wed, 28 Apr 2021 06:34:11 + > "Tian, Kevin" wrote: > > > > From: Jason Gunthorpe > > > Sent: Monday, April 26, 2021 8:38 PM > > > > > [...] > > > > Want to hear your opinion for one open here. There is no doubt that > > > > an ioasid represents a HW page table when the table is constructed by > > > > userspace and then linked to the IOMMU through the bind/unbind > > > > API. But I'm not very sure about whether an ioasid should represent > > > > the exact pgtable or the mapping metadata when the underlying > > > > pgtable is indirectly constructed through map/unmap API. VFIO does > > > > the latter way, which is why it allows multiple incompatible domains > > > > in a single container which all share the same mapping metadata. > > > > > > I think VFIO's map/unmap is way too complex and we know it has bad > > > performance problems. > > > > Can you or Alex elaborate where the complexity and performance problem > > locate in VFIO map/umap? We'd like to understand more detail and see > how > > to avoid it in the new interface. > > > The map/unmap interface is really only good for long lived mappings, > the overhead is too high for things like vIOMMU use cases or any case > where the mapping is intended to be dynamic. Userspace drivers must > make use of a long lived buffer mapping in order to achieve performance. This is not a limitation of VFIO map/unmap. It's the limitation of any map/unmap semantics since the fact of long-lived vs. short-lived is imposed by userspace. Nested translation is the only viable optimization allowing 2nd-level to be a long-lived mapping even w/ vIOMMU. From this angle I'm not sure how a new map/unmap implementation could address this perf limitation alone. > > The mapping and unmapping granularity has been a problem as well, > type1v1 allowed arbitrary unmaps to bisect the original mapping, with > the massive caveat that the caller relies on the return value of the > unmap to determine what was actually unmapped because the IOMMU use > of > superpages is transparent to the caller. This led to type1v2 that > simply restricts the user to avoid ever bisecting mappings. That still > leaves us with problems for things like virtio-mem support where we > need to create initial mappings with a granularity that allows us to > later remove entries, which can prevent effective use of IOMMU > superpages. We could start with a semantics similar to type1v2. btw why does virtio-mem require a smaller granularity? Can we split superpages in-the-fly when removal actually happens (just similar to page split in VM live migration for efficient dirty page tracking)? and isn't it another problem imposed by userspace? How could a new map/unmap implementation mitigate this problem if the userspace insists on a smaller granularity for initial mappings? > > Locked page accounting has been another constant issue. We perform > locked page accounting at the container level, where each container > accounts independently. A user may require multiple containers, the > containers may pin the same physical memory, but be accounted against > the user once per container. for /dev/ioasid there is still an open whether an process is allowed to open /dev/ioasid once or multiple times. If there is only one ioasid_fd per process, the accounting can be made accurately. otherwise the same problem still exists as each ioasid_fd is akin to the container, then we need find a better solution. > > Those are the main ones I can think of. 
It is nice to have a simple > map/unmap interface, I'd hope that a new /dev/ioasid interface wouldn't > raise the barrier to entry too high, but the user needs to have the > ability to have more control of their mappings and locked page > accounting should probably be offloaded somewhere. Thanks, > Based on your feedbacks I feel it's probably reasonable to start with a type1v2 semantics for the new interface. Locked accounting could also start with the same VFIO restriction and then improve it incrementally, if a cleaner way is intrusive (if not affecting uAPI). But I didn't get the suggestion on "more control of their mappings". Can you elaborate? Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
Hi Jason On Thu, May 06, 2021 at 09:27:30AM -0300, Jason Gunthorpe wrote: > On Thu, May 06, 2021 at 09:23:48AM +0200, Jean-Philippe Brucker wrote: > > On Wed, May 05, 2021 at 01:04:46PM -0700, Jacob Pan wrote: > > > > > For ARM, since the guest owns the per device PASID table. There is no > > > > > need to allocate PASIDs from the host nor the hypervisor. Without SWQ, > > > > > there is no need for global PASID/SSID either. So PASID being global > > > > > for ARM is for simplicity in case of host PASID/SSID. > > > > > > > > It isn't clear how ARM can support PASID and mdev but that is an > > > > unrelated issue.. > > > > > > > AFAIK, the current SMMU device assignment is per RID, since only one > > > stage2 > > > page tables per RID, not per PASID. This is equivalent to the older VT-d > > > spec. prior to scalable mode. > > > > Yes that's right. Since SMMUv3 has a single level-2 page table per RID, it > > doesn't support assigning level-1 page tables to guests for mdevs (sub-VF > > devices). So no PASIDs for mdevs, which also means each guest has its own > > PASID space and the host doesn't track guest PASIDs. > > Basically it means when the guest's top level IOASID is created for > nesting that IOASID claims all PASID's on the RID and excludes any > PASID IOASIDs from existing on the RID now or in future. The way to look at it this is as follows: For platforms that do not have a need to support shared work queue model support for ENQCMD or similar, PASID space is naturally per RID. There is no complication with this. Every RID has the full range of PASID's and no need for host to track which PASIDs are allocated now or in future in the guest. For platforms that support ENQCMD, it is required to mandate PASIDs are global across the entire system. Maybe its better to call them gPASID for guest and hPASID for host. Short reason being gPASID->hPASID is a guest wide mapping for ENQCMD and not a per-RID based mapping. (We covered that in earlier responses) In our current implementation we actually don't separate this space, and gPASID == hPASID. The iommu driver enforces that by using the custom allocator and the architected interface that allows all guest vIOMMU allocations to be proxied to host. Nothing but a glorified hypercall like interface. In fact some OS's do use hypercall to get a hPASID vs using the vCMD style interface. For cases where there is full PASID range for every RID and completely managed by the guest that requires no assist from host to ensure uniqueness, they don't need to have a custom allocator. Maybe the general allocator can have ways to ensure global uniqueness vs. RID wide uniqueness. This is still managed by the iommu driver (vIOMMU) + the backend for vCMD in the host IOMMU driver. > > Which would be a different behavior than something like Intel's top > level IOASID that doesn't claim all the PASIDs. isn't this simple, if we can say ioasid allocator can provide - system wide PASID - RID local PASID Based on platform capabilities that require such differentiation? And based on the other threads, if ioasid is just a pgtable representation, it doesn't need a PASID per-se. But when you want to use SVM or such, you can associate a PASID with it for the IOMMU to plumb things with hardware. Cheers, Ashok ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
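The custom allocator Ashok refers to has roughly this shape against the existing ioasid allocator hooks; the vcmd_* proxy helpers below are stand-ins for the real VT-d virtual command plumbing, not actual kernel functions:

    #include <linux/ioasid.h>

    /* Stand-in proxies: in reality these issue VT-d virtual commands that
     * the host IOMMU driver services, so gPASID == hPASID system-wide. */
    static ioasid_t vcmd_alloc(ioasid_t min, ioasid_t max, void *data)
    {
            return vcmd_proxy_alloc_pasid(data, min, max);  /* hypothetical */
    }

    static void vcmd_free(ioasid_t ioasid, void *data)
    {
            vcmd_proxy_free_pasid(data, ioasid);            /* hypothetical */
    }

    static struct ioasid_allocator_ops vcmd_allocator_ops = {
            .alloc = vcmd_alloc,
            .free  = vcmd_free,
    };

    /* ioasid_register_allocator(&vcmd_allocator_ops) then routes every
     * guest-side ioasid_alloc() through the host allocator. */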
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, May 06, 2021 at 09:23:48AM +0200, Jean-Philippe Brucker wrote: > On Wed, May 05, 2021 at 01:04:46PM -0700, Jacob Pan wrote: > > > > For ARM, since the guest owns the per device PASID table. There is no > > > > need to allocate PASIDs from the host nor the hypervisor. Without SWQ, > > > > there is no need for global PASID/SSID either. So PASID being global > > > > for ARM is for simplicity in case of host PASID/SSID. > > > > > > It isn't clear how ARM can support PASID and mdev but that is an > > > unrelated issue.. > > > > > AFAIK, the current SMMU device assignment is per RID, since only one stage2 > > page tables per RID, not per PASID. This is equivalent to the older VT-d > > spec. prior to scalable mode. > > Yes that's right. Since SMMUv3 has a single level-2 page table per RID, it > doesn't support assigning level-1 page tables to guests for mdevs (sub-VF > devices). So no PASIDs for mdevs, which also means each guest has its own > PASID space and the host doesn't track guest PASIDs. Basically it means when the guest's top level IOASID is created for nesting that IOASID claims all PASID's on the RID and excludes any PASID IOASIDs from existing on the RID now or in future. Which would be a different behavior than something like Intel's top level IOASID that doesn't claim all the PASIDs. Lots of little special flags in here :| Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, May 05, 2021 at 04:23:19PM -0700, Raj, Ashok wrote: > > Which implies the API to the iommu driver should be more like: > > > > 'assign an IOASID to this RID and return the PASID' > > 'reserve a PASID from every RID' > > I don't think this has any decent chance of success. It's a rather roundabout > way to get a global PASID namespace. > > > 'assign an IOASID to this RID and use this specific PASID' > > This seems a bit complicated. Another way to specify this. Maybe, but I don't like that the driver-based iommu API has been corrupted by injecting a global 'first driver to claim it' resource. It is not properly layered anymore. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, May 05, 2021 at 01:04:46PM -0700, Jacob Pan wrote: > > > For ARM, since the guest owns the per device PASID table. There is no > > > need to allocate PASIDs from the host nor the hypervisor. Without SWQ, > > > there is no need for global PASID/SSID either. So PASID being global > > > for ARM is for simplicity in case of host PASID/SSID. > > > > It isn't clear how ARM can support PASID and mdev but that is an > > unrelated issue.. > > > AFAIK, the current SMMU device assignment is per RID, since only one stage2 > page tables per RID, not per PASID. This is equivalent to the older VT-d > spec. prior to scalable mode. Yes that's right. Since SMMUv3 has a single level-2 page table per RID, it doesn't support assigning level-1 page tables to guests for mdevs (sub-VF devices). So no PASIDs for mdevs, which also means each guest has its own PASID space and the host doesn't track guest PASIDs. Thanks, Jean ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, May 05, 2021 at 07:21:20PM -0300, Jason Gunthorpe wrote: > On Wed, May 05, 2021 at 01:04:46PM -0700, Jacob Pan wrote: > > Hi Jason, > > > > On Wed, 5 May 2021 15:00:23 -0300, Jason Gunthorpe wrote: > > > > > On Wed, May 05, 2021 at 10:22:59AM -0700, Jacob Pan wrote: > > > > > > > Global and pluggable are for slightly separate reasons. > > > > - We need global PASID on VT-d in that we need to support shared > > > > workqueues (SWQ). E.g. One SWQ can be wrapped into two mdevs then > > > > assigned to two VMs. Each VM uses its private guest PASID to submit > > > > work but each guest PASID must be translated to a global (system-wide) > > > > host PASID to avoid conflict. Also, since PASID table storage is per > > > > PF, if two mdevs of the same PF are assigned to different VMs, the > > > > PASIDs must be unique. > > > > > > From a protocol perspective each RID has a unique PASID table, and > > > RIDs can have overlapping PASIDs. > > > > > True, per RID or per PF as I was referring to. > > > > > Since your SWQ is connected to a single RID the requirement that > > > PASIDs are unique to the RID ensures they are sufficiently unique. > > > > > True, but one process can submit work to multiple mdevs from different > > RIDs/PFs. One process uses one PASID and PASID translation table is per VM. > > The same PASID is used for all the PASID tables of each RID. > > If the model is "assign this PASID to this RID" then yes, there is a > big problem keeping everything straight that can only be solved with a > global table. > > But if the model is "give me a PASID for this RID" then it isn't such > a problem. Correct, since we have usage with ENQCMD, its more like - Give me a PASID1 (not attached to any RID) - Bind/attach PASID1 with RID1 - Bind/attach PASID1 with RID2 and ENQCMD isn't just for Intel, with the DMWr spec in PCI, it brings it to all devices as long as routing is supported by interim switches and such. > > Basically trying to enforce a uniform PASID for an IOASID across all > RIDs attached to it is not such a nice choice. > > > > That is fine, but all this stuff should be inside the Intel vIOMMU > > > driver not made into a global resource of the entire iommu subsystem. > > > > > Intel vIOMMU has to use a generic uAPI to allocate PASID so the generic > > code need to have this option. I guess you are saying we should also have a > > per RID allocation option in addition to global? > > There always has to be a RID involvement for the PASID, for security, > this issue really boils down to where the PASID lives. We do have a RID involvement with PASID always for security. Every RID has its own PASID table, but the PASID name space is global. So if you have RID1 associated with PASID1, another RID2 doesn't have the PASID1 in its PASID table. Until when the app binds PASID1 with RID2 as well. Then you have PASID1 plumbed in the PASID table for RID2. Is this what you refer to for security? > > If you need the PASID attached to the IOASID then it has to be global > because the IOASID can be attached to any RID and must keep the same > PASID. > > If the PASID is learned when the IOASID is attached to a RID then the > PASID is more flexible and isn't attached to the IOASID. > > Honestly I'm a little leary to bake into a UAPI a specific HW choice > that Intel made here. Like I mentioned, this isn't just Intel going forward. The specs are public in PCIe. I just can't comment which other vendors are adopting it. 
> > I would advise making the "attach a global PASID to this IOASID" > operation explicit and opt-in for cases that actually need it. > > Which implies the API to the iommu driver should be more like: > > 'assign an IOASID to this RID and return the PASID' > 'reserve a PASID from every RID' I don't think this has any decent chance of success. It's a rather roundabout way to get a global PASID namespace. > 'assign an IOASID to this RID and use this specific PASID' This seems a bit complicated. Another way to specify this. - IOASID is a logical construct to specify a page table. - You can bind a global PASID to an IOASID We aren't losing any security by using a global PASID name space. A PASID is not bound to any other RID until the application explicitly asks for it. -- Cheers, Ashok ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
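Ashok's allocate-then-bind sequence from earlier in this mail, written out as hypothetical uAPI calls (none of these ioctl names exist as-is):

    uint32_t pasid;

    /* Step 1: get a PASID that is not yet tied to any RID. */
    ioctl(ioasid_fd, IOASID_ALLOC_PASID, &pasid);

    /* Steps 2 and 3: plumb the same PASID into each RID's PASID table
     * as the process touches each accelerator. */
    ioctl(dev1_fd, IOASID_BIND_PASID, &pasid);      /* RID1 */
    ioctl(dev2_fd, IOASID_BIND_PASID, &pasid);      /* RID2 */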
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, May 05, 2021 at 01:04:46PM -0700, Jacob Pan wrote: > Hi Jason, > > On Wed, 5 May 2021 15:00:23 -0300, Jason Gunthorpe wrote: > > > On Wed, May 05, 2021 at 10:22:59AM -0700, Jacob Pan wrote: > > > > > Global and pluggable are for slightly separate reasons. > > > - We need global PASID on VT-d in that we need to support shared > > > workqueues (SWQ). E.g. One SWQ can be wrapped into two mdevs then > > > assigned to two VMs. Each VM uses its private guest PASID to submit > > > work but each guest PASID must be translated to a global (system-wide) > > > host PASID to avoid conflict. Also, since PASID table storage is per > > > PF, if two mdevs of the same PF are assigned to different VMs, the > > > PASIDs must be unique. > > > > From a protocol perspective each RID has a unique PASID table, and > > RIDs can have overlapping PASIDs. > > > True, per RID or per PF as I was referring to. > > > Since your SWQ is connected to a single RID the requirement that > > PASIDs are unique to the RID ensures they are sufficiently unique. > > > True, but one process can submit work to multiple mdevs from different > RIDs/PFs. One process uses one PASID and PASID translation table is per VM. > The same PASID is used for all the PASID tables of each RID. If the model is "assign this PASID to this RID" then yes, there is a big problem keeping everything straight that can only be solved with a global table. But if the model is "give me a PASID for this RID" then it isn't such a problem. Basically trying to enforce a uniform PASID for an IOASID across all RIDs attached to it is not such a nice choice. > > That is fine, but all this stuff should be inside the Intel vIOMMU > > driver not made into a global resource of the entire iommu subsystem. > > > Intel vIOMMU has to use a generic uAPI to allocate PASID so the generic > code need to have this option. I guess you are saying we should also have a > per RID allocation option in addition to global? There always has to be a RID involvement for the PASID, for security, this issue really boils down to where the PASID lives. If you need the PASID attached to the IOASID then it has to be global because the IOASID can be attached to any RID and must keep the same PASID. If the PASID is learned when the IOASID is attached to a RID then the PASID is more flexible and isn't attached to the IOASID. Honestly I'm a little leery to bake into a UAPI a specific HW choice that Intel made here. I would advise making the "attach a global PASID to this IOASID" operation explicit and opt-in for cases that actually need it. Which implies the API to the iommu driver should be more like: 'assign an IOASID to this RID and return the PASID' 'reserve a PASID from every RID' 'assign an IOASID to this RID and use this specific PASID' In all cases the scope of those operations is completely local to a certain IOMMU driver - 'reserve a PASID from every RID' is really every RID that the driver can operate on. So it is hard to see why the allocator should be a global resource and not something that is part of the iommu driver exclusively. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
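Jason's three operations, written as hypothetical iommu-driver entry points; the signatures are invented purely for illustration:

    /* Driver picks any free PASID on this RID and reports it. */
    int iommu_attach_ioasid(struct device *dev, struct ioasid *ioasid,
                            u32 *out_pasid);

    /* Reserve one PASID value across every RID this driver operates on. */
    int iommu_reserve_global_pasid(struct iommu_device *iommu,
                                   u32 *out_pasid);

    /* Caller supplies the PASID, e.g. one reserved globally above. */
    int iommu_attach_ioasid_pasid(struct device *dev, struct ioasid *ioasid,
                                  u32 pasid);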
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
Hi Jason, On Wed, 5 May 2021 15:00:23 -0300, Jason Gunthorpe wrote: > On Wed, May 05, 2021 at 10:22:59AM -0700, Jacob Pan wrote: > > > Global and pluggable are for slightly separate reasons. > > - We need global PASID on VT-d in that we need to support shared > > workqueues (SWQ). E.g. One SWQ can be wrapped into two mdevs then > > assigned to two VMs. Each VM uses its private guest PASID to submit > > work but each guest PASID must be translated to a global (system-wide) > > host PASID to avoid conflict. Also, since PASID table storage is per > > PF, if two mdevs of the same PF are assigned to different VMs, the > > PASIDs must be unique. > > From a protocol perspective each RID has a unique PASID table, and > RIDs can have overlapping PASIDs. > True, per RID or per PF as I was referring to. > Since your SWQ is connected to a single RID the requirement that > PASIDs are unique to the RID ensures they are sufficiently unique. > True, but one process can submit work to multiple mdevs from different RIDs/PFs. One process uses one PASID and PASID translation table is per VM. The same PASID is used for all the PASID tables of each RID. For example: VM1 has two mdevs: mdev1 and mdev2. mdev1's parent is RID1, mdev2's parent is RID2. The guest process A allocates PASID_A and bind to both mdev1 and mdev2. PASID_A must be present in the PASID tables for both RID1 and RID2. If the allocator is per RID, it is not possible to ensure PASID_A is available for both RIDs. Right? Sorry I missed this point in my earlier explanation. > If the IOMMU driver has additional restrictions then it should raise > the PASID table up higher in the hierarchy than at the RID. > That higher level in the hierarchy is global, right? I am a little concerned about expanding PASID table sharing from security perspective. Even though, VMs already share PASID table for mdevs. > I think what you are trying to explain is that the Intel vIOMMU has a > single PASID address space shared globally by the vCPU because ENQCMD > uses the global vGPU translation table. > Yes, PASID translation table is per VM, global in terms of the guest. That combined with the case of two mdevs from different RIDs can be used by the same guest process/PASID requires global PASID. > That is fine, but all this stuff should be inside the Intel vIOMMU > driver not made into a global resource of the entire iommu subsystem. > Intel vIOMMU has to use a generic uAPI to allocate PASID so the generic code need to have this option. I guess you are saying we should also have a per RID allocation option in addition to global? > Systems that work this way just cannot have multiple iommu drivers > competing for PASID. > Sorry, I am not following. There would not be mixed iommu drivers on one platform, I must have missed your point. Could you explain a little? > > - The pluggable allocator is to support the option where the guest > > PASIDs are allocated by the hypervisor. > > And if the hypervisor allocates the PASID then again the specific > vIOMMU itself is concerned with this and it has nothing to do with > global behavior of the iommu subsystem. > > > For ARM, since the guest owns the per device PASID table. There is no > > need to allocate PASIDs from the host nor the hypervisor. Without SWQ, > > there is no need for global PASID/SSID either. So PASID being global > > for ARM is for simplicity in case of host PASID/SSID. > > It isn't clear how ARM can support PASID and mdev but that is an > unrelated issue.. 
AFAIK, the current SMMU device assignment is per RID, since there is only one stage2 page table per RID, not per PASID. This is equivalent to the older VT-d spec prior to scalable mode. Eric/Jean, can you help? > Jason Thanks, Jacob ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, May 05, 2021 at 10:22:59AM -0700, Jacob Pan wrote: > Global and pluggable are for slightly separate reasons. > - We need global PASID on VT-d in that we need to support shared > workqueues (SWQ). E.g. One SWQ can be wrapped into two mdevs then assigned > to two VMs. Each VM uses its private guest PASID to submit work but > each guest PASID must be translated to a global (system-wide) host PASID to > avoid conflict. Also, since PASID table storage is per PF, if two mdevs of > the same PF are assigned to different VMs, the PASIDs must be unique. >From a protocol perspective each RID has a unique PASID table, and RIDs can have overlapping PASIDs. Since your SWQ is connected to a single RID the requirement that PASIDs are unique to the RID ensures they are sufficiently unique. If the IOMMU driver has additional restrictions then it should raise the PASID table up higher in the hierarchy than at the RID. I think what you are trying to explain is that the Intel vIOMMU has a single PASID address space shared globally by the vCPU because ENQCMD uses the global vGPU translation table. That is fine, but all this stuff should be inside the Intel vIOMMU driver not made into a global resource of the entire iommu subsystem. Systems that work this way just cannot have multiple iommu drivers competing for PASID. > - The pluggable allocator is to support the option where the guest PASIDs > are allocated by the hypervisor. And if the hypervisor allocates the PASID then again the specific vIOMMU itself is concerned with this and it has nothing to do with global behavior of the iommu subsystem. > For ARM, since the guest owns the per device PASID table. There is no need > to allocate PASIDs from the host nor the hypervisor. Without SWQ, there is > no need for global PASID/SSID either. So PASID being global for ARM is for > simplicity in case of host PASID/SSID. It isn't clear how ARM can support PASID and mdev but that is an unrelated issue.. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
Hi Jason, On Tue, 4 May 2021 20:15:30 -0300, Jason Gunthorpe wrote: > On Tue, May 04, 2021 at 03:11:54PM -0700, Jacob Pan wrote: > > > > It is a weird way to use xarray to have a structure which > > > itself is just a wrapper around another RCU protected structure. > > > > > > Make the caller supply the ioasid_data memory, embedded in its own > > > element, get rid of the void * and rely on XA_ZERO_ENTRY to hold > > > allocated but not active entries. > > > > > Let me try to paraphrase to make sure I understand. Currently > > struct ioasid_data is private to the iasid core, its memory is > > allocated by the ioasid core. > > > > You are suggesting the following: > > 1. make struct ioasid_data public > > 2. caller allocates memory for ioasid_data, initialize it then pass it > > to ioasid_alloc to store in the xarray > > 3. caller will be responsible for setting private data inside > > ioasid_data and do call_rcu after update if needed. > > Basically, but you probably won't need a "private data" once the > caller has this struct as it can just embed it in whatever larger > struct makes sense for it and use container_of/etc > that makes sense. thanks! > I didn't look too closely at the whole thing though. Honestly I'm a > bit puzzled why we need a pluggable global allocator framework.. The > whole framework went to some trouble to isolate everything into iommu > drivers then that whole design is disturbed by this global thing. > Global and pluggable are for slightly separate reasons. - We need global PASID on VT-d in that we need to support shared workqueues (SWQ). E.g. One SWQ can be wrapped into two mdevs then assigned to two VMs. Each VM uses its private guest PASID to submit work but each guest PASID must be translated to a global (system-wide) host PASID to avoid conflict. Also, since PASID table storage is per PF, if two mdevs of the same PF are assigned to different VMs, the PASIDs must be unique. - The pluggable allocator is to support the option where the guest PASIDs are allocated by the hypervisor. Let it be the same as the host PASID or some arbitrary number cooked up by the hypervisor but backed by a host HW PASID. VT-d spec has this virtual command interface that requires the guest to use it instead of allocating from the guest ioasid xarray. This is the reason why it has to go down to iommu vendor driver. I guess that is what you meant by "went to some trouble to isolate everything into iommu"? For ARM, since the guest owns the per device PASID table. There is no need to allocate PASIDs from the host nor the hypervisor. Without SWQ, there is no need for global PASID/SSID either. So PASID being global for ARM is for simplicity in case of host PASID/SSID. > Jason Thanks, Jacob ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
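Jacob's mdev1/mdev2 example in pseudo-code form, showing why independent per-RID allocators break the one-PASID-per-process model; all calls are hypothetical:

    /* Guest process A binds to two mdevs whose parents are RID1 and RID2. */
    pasid_a = pasid_alloc(rid1);            /* say this returns 7 */
    pasid_table_set(rid1, pasid_a, pgd_a);  /* fine */

    /* With a per-RID allocator, 7 may already be taken on RID2, but the
     * process has already advertised PASID 7 via its per-VM translation
     * table, so the bind below cannot be honored -- hence the need for a
     * global allocator. */
    pasid_table_set(rid2, pasid_a, pgd_a);  /* may collide on RID2 */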
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, May 05, 2021 at 02:28:53PM +1000, Alexey Kardashevskiy wrote: > This is a good feature in general when let's say there is a linux supported > device which has a proprietary device firmware update tool which only exists > as an x86 binary and your hardware is not x86 - running qemu + vfio in full > emulation would provide a way to run the tool to update a physical device. That specific use case doesn't really need a vIOMMU though, does it? Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
Hi Jason, On 4/29/21 10:04 PM, Jason Gunthorpe wrote: > On Thu, Apr 29, 2021 at 03:26:55PM +0200, Auger Eric wrote: >> From the pseudo code, >> >> gpa_ioasid_id = ioctl(ioasid_fd, CREATE_IOASID, ..) >> ioctl(ioasid_fd, SET_IOASID_PAGE_TABLES, ..) >> >> I fail to understand whether the SET_IOASID_PAGE_TABLES would apply to >> the whole IOASIDs within /dev/ioasid or to a specific one. > > Sorry, nearly every IOCTL would be scoped to a specific IOASID as one > of the arguments. OK thank you for the clarification. > >> Also in subsequent emails when you talk about IOASID, is it the >> ioasid_id, just to double check the terminology. > > I am referring to IOASID as 'handle of the page table object inside the > /dev/ioasid fd'. If that is equal to some HW value or not I think > remains a decision point. OK > > Basically the fd has an xarray of 'struct [something] *' and the > IOASID is index to that FD's private xarray. This is necessary to > create proper security as even if we have global PASID numbers or > something they still need to be isolated to only the FD that has > been authorized access. > >>> nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID, gpa_ioasid_id); >>> ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_ioasid, ..) >> is the nested_ioasid the allocated PASID id or is it a completely >> different object id. > > It is the IOASID handle above. OK, as per the following emails and the comment below, IOASID and PASID are different. The first would be a logical ID while the second is the HW ID. Thanks Eric > >>> >>>// IOMMU will match on the device RID, no PASID: >>> ioctl(vfio_device, ATTACH_IOASID, nested_ioasid); >>> >>>// IOMMU will match on the device RID and PASID: >>> ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid); >> here I see you pass a different pasid, so I guess they are different, in >> which case you would need to have an allocator function for this pasid, >> right? > > Yes, the underlying HW ID (PASID or substream id or whatever) is > something slightly different > > Jason > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On 05/05/2021 04:15, Jason Gunthorpe wrote: On Tue, May 04, 2021 at 01:54:55PM +1000, David Gibson wrote: On Mon, May 03, 2021 at 01:05:30PM -0300, Jason Gunthorpe wrote: On Thu, Apr 29, 2021 at 01:20:22PM +1000, David Gibson wrote: There is a certain appeal to having some 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra information like windows that can be optionally called by the viommu driver and it remains well defined and described. Windows really aren't ppc specific. They're absolutely there on x86 and everything else as well - it's just that people are used to having a window at 0.. that you can often get away with treating it sloppily. My point is this detailed control seems to go on to more than just windows. As you say the vIOMMU is emulating specific HW that needs to have kernel interfaces to match it exactly. It's really not that bad. The case of emulating the PAPR vIOMMU on something else is relatively easy, because all updates to the IO page tables go through hypercalls. So, as long as the backend IOMMU can map all the IOVAs that the guest IOMMU can, then qemu's implementation of those hypercalls just needs to put an equivalent mapping in the backend, which it can do with a generic VFIO_DMA_MAP. So you also want the PAPR vIOMMU driver to run on, say, an ARM IOMMU? This is a good feature in general when let's say there is a linux supported device which has a proprietary device firmware update tool which only exists as an x86 binary and your hardware is not x86 - running qemu + vfio in full emulation would provide a way to run the tool to update a physical device. -- Alexey ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Tue, May 04, 2021 at 03:11:54PM -0700, Jacob Pan wrote: > > It is a weird way to use xarray to have a structure which > > itself is just a wrapper around another RCU protected structure. > > > > Make the caller supply the ioasid_data memory, embedded in its own > > element, get rid of the void * and rely on XA_ZERO_ENTRY to hold > > allocated but not active entries. > > > Let me try to paraphrase to make sure I understand. Currently > struct ioasid_data is private to the iasid core, its memory is allocated by > the ioasid core. > > You are suggesting the following: > 1. make struct ioasid_data public > 2. caller allocates memory for ioasid_data, initialize it then pass it to > ioasid_alloc to store in the xarray > 3. caller will be responsible for setting private data inside ioasid_data > and do call_rcu after update if needed. Basically, but you probably won't need a "private data" once the caller has this struct as it can just embed it in whatever larger struct makes sense for it and use container_of/etc I didn't look too closely at the whole thing though. Honestly I'm a bit puzzled why we need a pluggable global allocator framework.. The whole framework went to some trouble to isolate everything into iommu drivers then that whole design is disturbed by this global thing. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
Hi Jason, On Tue, 4 May 2021 15:00:50 -0300, Jason Gunthorpe wrote: > On Tue, May 04, 2021 at 08:41:48AM -0700, Jacob Pan wrote: > > > > > > > > (also looking at ioasid.c, why do we need such a thin and odd > > > > wrapper around xarray?) > > > > > > > > > > I'll leave it to Jean and Jacob. > > > Could you elaborate? > > I mean stuff like this: > > int ioasid_set_data(ioasid_t ioasid, void *data) > { > struct ioasid_data *ioasid_data; > int ret = 0; > > spin_lock(&ioasid_allocator_lock); > ioasid_data = xa_load(&active_allocator->xa, ioasid); > if (ioasid_data) > rcu_assign_pointer(ioasid_data->private, data); > else > ret = -ENOENT; > spin_unlock(&ioasid_allocator_lock); > > /* > * Wait for readers to stop accessing the old private data, so the > * caller can free it. > */ > if (!ret) > synchronize_rcu(); > > return ret; > } > EXPORT_SYMBOL_GPL(ioasid_set_data); > > It is a weird way to use xarray to have a structure which > itself is just a wrapper around another RCU protected structure. > > Make the caller supply the ioasid_data memory, embedded in its own > element, get rid of the void * and rely on XA_ZERO_ENTRY to hold > allocated but not active entries. > Let me try to paraphrase to make sure I understand. Currently struct ioasid_data is private to the ioasid core, its memory is allocated by the ioasid core. You are suggesting the following: 1. make struct ioasid_data public 2. caller allocates memory for ioasid_data, initialize it then pass it to ioasid_alloc to store in the xarray 3. caller will be responsible for setting private data inside ioasid_data and do call_rcu after update if needed. Correct? > Make the synchronize_rcu() the caller responsibility, and callers > should really be able to use call_rcu() > > Jason Thanks, Jacob ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Tue, May 04, 2021 at 01:54:55PM +1000, David Gibson wrote: > On Mon, May 03, 2021 at 01:05:30PM -0300, Jason Gunthorpe wrote: > > On Thu, Apr 29, 2021 at 01:20:22PM +1000, David Gibson wrote: > > > > There is a certain appeal to having some > > > > 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra > > > > information like windows that can be optionally called by the viommu > > > > driver and it remains well defined and described. > > > > > > Windows really aren't ppc specific. They're absolutely there on x86 > > > and everything else as well - it's just that people are used to having > > > a window at 0.. that you can often get away with > > > treating it sloppily. > > > > My point is this detailed control seems to go on to more than just > > windows. As you say the vIOMMU is emulating specific HW that needs to > > have kernel interfaces to match it exactly. > > It's really not that bad. The case of emulating the PAPR vIOMMU on > something else is relatively easy, because all updates to the IO page > tables go through hypercalls. So, as long as the backend IOMMU can > map all the IOVAs that the guest IOMMU can, then qemu's implementation > of those hypercalls just needs to put an equivalent mapping in the > backend, which it can do with a generic VFIO_DMA_MAP. So you also want the PAPR vIOMMU driver to run on, say, an ARM IOMMU? > vIOMMUs with page tables in guest memory are harder, but only really > in the usual ways that a vIOMMU of that type is harder (needs cache > mode or whatever). At whatever point you need to shadow from the > guest IO page tables to the host backend, you can again do that with > generic maps, as long as the backend supports the necessary IOVAs, and > has an IO page size that's equal to or a submultiple of the vIOMMU > page size. But this definitely all becomes HW specific. For instance I want to have an ARM vIOMMU driver it needs to do some ret = ioctl(ioasid_fd, CREATE_NESTED_IOASID, [page table format is ARMvXXX]) if (ret == -EOPNOTSUPP) ret = ioctl(ioasid_fd, CREATE_NORMAL_IOASID, ..) // and do completely different and more expensive emulation I can get a little bit of generality, but at the end of the day the IOMMU must create a specific HW layout of the nested page table, if it can't, it can't. > > I'm remarking that trying to unify every HW IOMMU implementation that > > ever has/will exist into a generic API complete enough to allow the > > vIOMMU to be created is likely to result in an API too complicated to > > understand.. > > Maybe not every one, but I think we can get a pretty wide range with a > reasonable interface. It sounds like a reasonable guideline is if the feature is actually general to all IOMMUs and can be used by qemu as part of a vIOMMU emulation when compatible vIOMMU HW is not available. Having 'requested window' support that isn't actually implemented in every IOMMU is going to mean the PAPR vIOMMU emulation won't work, defeating the whole point of making things general? Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Tue, May 04, 2021 at 08:41:48AM -0700, Jacob Pan wrote: > > > > > > (also looking at ioasid.c, why do we need such a thin and odd wrapper > > > around xarray?) > > > > > > > I'll leave it to Jean and Jacob. > Could you elaborate? I mean stuff like this: int ioasid_set_data(ioasid_t ioasid, void *data) { struct ioasid_data *ioasid_data; int ret = 0; spin_lock(&ioasid_allocator_lock); ioasid_data = xa_load(&active_allocator->xa, ioasid); if (ioasid_data) rcu_assign_pointer(ioasid_data->private, data); else ret = -ENOENT; spin_unlock(&ioasid_allocator_lock); /* * Wait for readers to stop accessing the old private data, so the * caller can free it. */ if (!ret) synchronize_rcu(); return ret; } EXPORT_SYMBOL_GPL(ioasid_set_data); It is a weird way to use xarray to have a structure which itself is just a wrapper around another RCU protected structure. Make the caller supply the ioasid_data memory, embedded in its own element, get rid of the void * and rely on XA_ZERO_ENTRY to hold allocated but not active entries. Make the synchronize_rcu() the caller responsibility, and callers should really be able to use call_rcu() Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
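What "caller supplies the memory, embedded in its own element" looks like in practice, as a sketch; the caller-side type is invented:

    /* The core stores only what it owns... */
    struct ioasid_data {
            ioasid_t id;
            struct rcu_head rcu;
    };

    /* ...and the caller embeds it, so the xarray holds caller memory and
     * the void *private indirection plus its extra RCU layer go away. */
    struct my_pasid_state {
            struct ioasid_data base;        /* entry stored in the xarray */
            struct my_domain *domain;       /* caller's own payload */
    };

    static struct my_domain *to_domain(struct ioasid_data *data)
    {
            return container_of(data, struct my_pasid_state, base)->domain;
    }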
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, Apr 28, 2021 at 06:58:19AM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Wednesday, April 28, 2021 1:12 AM > > > [...] > > > As Alex says, if this line fails because of the group restrictions, > > > that's not great because it's not very obvious what's gone wrong. > > > > Okay, that is fair, but let's solve that problem directly. For > > instance netlink has been going in the direction of adding an "extack" > > from the kernel which is a descriptive error string. If the failing > > ioctl returned the string: > > > > "cannot join this device to the IOASID because device XXX in the > > same group #10 is in use" > > > > Would you agree it is now obvious what has gone wrong? In fact would > > you agree this is a lot better user experience than what applications > > do today even though they have the group FD? > > > > Currently all the discussions are around implicit vs. explicit uAPI semantics > on the group restriction. However if we look beyond group the implicit > semantics might be inevitable when dealing with incompatible iommu > domains. An existing example of iommu incompatibility is IOMMU_ > CACHE. I still think we need to get rid of these incompatibilities somehow. Having multiple HW incompatible IOASIDs in the same platform is just bad all around. When modeling in userspace IOMMU_CACHE sounds like it is a property of each individual IOASID, not an attribute that requires a new domain. People that want to create cache bypass IOASID's should just ask for that directly. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
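One reading of "ask for that directly": cache coherency becomes a creation attribute of the IOASID rather than a reason to spawn an incompatible domain. The flag and ioctl below are hypothetical:

    struct ioasid_create {
            __u32 argsz;
            __u32 flags;
    #define IOASID_CREATE_CACHE_BYPASS (1U << 0)    /* default: coherent */
            __u32 out_ioasid;
    };
    #define IOASID_CREATE _IOWR('i', 0x41, struct ioasid_create)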
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Tue, May 04, 2021 at 09:22:55AM -0700, Jacob Pan wrote: > Hi Jason, > > On Wed, 28 Apr 2021 17:46:06 -0300, Jason Gunthorpe wrote: > > > > > I think the name IOASID is fine for the uAPI, the kernel version can > > > > be called ioasid_id or something. > > > > > > ioasid is already an id and then ioasid_id just adds confusion. Another > > > point is that ioasid is currently used to represent both PCI PASID and > > > ARM substream ID in the kernel. It implies that if we want to separate > > > ioasid and pasid in the uAPI the 'pasid' also needs to be replaced with > > > another general term usable for substream ID. Are we making the > > > terms too confusing here? > > > > This is why I also am not so sure about exposing the PASID in the API > > because it is ultimately a HW specific item. > > > > As I said to David, one avenue is to have some generic uAPI that is > > very general and keep all this deeply detailed stuff, that really only > > matters for qemu, as part of a more HW specific vIOMMU driver > > interface. > I think it is not just for QEMU. I am assuming you meant PASID is > needed for the guest driver to program assigned but not mediated devices. Anything that directly operates the device and tries to instantiate PASIDs for vfio-pci devices will need to understand the PASID. > User space drivers may also need to get the real HW PASID to program it on > to the HW. So this uAPI needs to provide some lookup functionality. Perhaps > the kernel generic version can be called ioasid_hw_id? > > So we have the following per my understanding: > - IOASID: a userspace logical number which identifies a page table, this can > be a first level (GVA->GPA), or a second level (GPA->HPA) page table. > - PASID: strictly defined in PCIe terms > - Substream ID: strictly defined in ARM SMMUv3 spec. > - IOASID_HW_ID: a generic ID backed by PASID, Substream ID, or any other > HW IDs used to tag DMA > > Is that right? It is reasonable. If an IOASID_HW_ID IOCTL can be backed with an enum that qualifies its exact nature it might be perfectly fine. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
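That enum-qualified lookup might look like the sketch below; all names are illustrative assumptions only:

    enum ioasid_hw_id_type {
            IOASID_HW_ID_NONE = 0,
            IOASID_HW_ID_PCI_PASID,         /* PCIe PASID */
            IOASID_HW_ID_ARM_SSID,          /* SMMUv3 substream ID */
    };

    struct ioasid_hw_id_info {
            __u32 argsz;
            __u32 ioasid;   /* in: userspace IOASID handle */
            __u32 type;     /* out: enum ioasid_hw_id_type */
            __u32 hw_id;    /* out: value to program into hardware */
    };
    #define IOASID_GET_HW_ID _IOWR('i', 0x42, struct ioasid_hw_id_info)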
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
Hi Jason, On Wed, 28 Apr 2021 17:46:06 -0300, Jason Gunthorpe wrote: > > > I think the name IOASID is fine for the uAPI, the kernel version can > > > be called ioasid_id or something. > > > > ioasid is already an id and then ioasid_id just adds confusion. Another > > point is that ioasid is currently used to represent both PCI PASID and > > ARM substream ID in the kernel. It implies that if we want to separate > > ioasid and pasid in the uAPI the 'pasid' also needs to be replaced with > > another general term usable for substream ID. Are we making the > > terms too confusing here? > > This is why I also am not so sure about exposing the PASID in the API > because it is ultimately a HW specific item. > > As I said to David, one avenue is to have some generic uAPI that is > very general and keep all this deeply detailed stuff, that really only > matters for qemu, as part of a more HW specific vIOMMU driver > interface. I think it is not just for QEMU. I am assuming you meant PASID is needed for guest driver to program assigned but not mediated devices. User space drivers may also need to get the real HW PASID to program it on to the HW. So this uAPI need to provide some lookup functionality. Perhaps the kernel generic version can be called ioasid_hw_id? So we have the following per my understanding: - IOASID: a userspace logical number which identifies a page table, this can be a first level (GVA-GPA), or a second level (GPA->HPA) page table. - PASID: strictly defined in PCIe term - Substream ID: strictly defined in ARM SMMUv3 spec. - IOASID_HW_ID: a generic ID backed by PASID, Substream ID, or any other HW IDs used to tag DMA Is that right? Thanks, Jacob ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
Hi Kevin, On Wed, 28 Apr 2021 06:34:11 +0000, "Tian, Kevin" wrote: > > > > (also looking at ioasid.c, why do we need such a thin and odd wrapper > > around xarray?) > > > > I'll leave it to Jean and Jacob. I am not sure whether you are referring to the current ioasid.c or the changes proposed in this patchset. I added a per VM/ioasid_set (also per /dev/ioasid fd) xarray to store guest-host PASID mapping. The current code has an xarray for the allocators.

struct ioasid_allocator_data {
	struct ioasid_allocator_ops *ops;
	struct list_head list;
	struct list_head slist;
#define IOASID_ALLOCATOR_CUSTOM BIT(0) /* Needs framework to track results */
	unsigned long flags;
	struct xarray xa;
	struct rcu_head rcu;
};

Could you elaborate? Thanks, Jacob ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
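For reference, the per-set mapping described above reduces to fairly direct xarray usage, which is presumably what prompts the "thin wrapper" question. A rough kernel-side sketch of that assumed shape (not the actual ioasid.c code):

#include <linux/xarray.h>
#include <linux/ioasid.h>

/* Assumed shape of the per-VM set: one xarray keyed by guest PASID */
struct ioasid_set_sketch {
	struct xarray gpasid_map;	/* guest PASID -> host PASID */
};

static int set_map_gpasid(struct ioasid_set_sketch *s, u32 gpasid, u32 hpasid)
{
	/* xa_mk_value() stores the host PASID directly, no allocation */
	return xa_err(xa_store(&s->gpasid_map, gpasid,
			       xa_mk_value(hpasid), GFP_KERNEL));
}

static u32 set_find_hpasid(struct ioasid_set_sketch *s, u32 gpasid)
{
	void *entry = xa_load(&s->gpasid_map, gpasid);

	return entry ? (u32)xa_to_value(entry) : INVALID_IOASID;
}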
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Mon, May 03, 2021 at 01:05:30PM -0300, Jason Gunthorpe wrote: > On Thu, Apr 29, 2021 at 01:20:22PM +1000, David Gibson wrote: > > > There is a certain appeal to having some > > > 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra > > > information like windows that can be optionally called by the viommu > > > driver and it remains well defined and described. > > > > Windows really aren't ppc specific. They're absolutely there on x86 > > and everything else as well - it's just that people are used to having > > a window at 0.. that you can often get away with > > treating it sloppily. > > My point is this detailed control seems to go on to more than just > windows. As you say the vIOMMU is emulating specific HW that needs to > have kernel interfaces to match it exactly. It's really not that bad. The case of emulating the PAPR vIOMMU on something else is relatively easy, because all updates to the IO page tables go through hypercalls. So, as long as the backend IOMMU can map all the IOVAs that the guest IOMMU can, then qemu's implementation of those hypercalls just needs to put an equivalent mapping in the backend, which it can do with a generic VFIO_DMA_MAP. vIOMMUs with page tables in guest memory are harder, but only really in the usual ways that a vIOMMU of that type is harder (needs cache mode or whatever). At whatever point you need to shadow from the guest IO page tables to the host backend, you can again do that with generic maps, as long as the backend supports the necessary IOVAs, and has an IO page size that's equal to or a submultiple of the vIOMMU page size. > I'm remarking that trying to unify every HW IOMMU implementation that > ever has/will exist into a generic API complete enough to allow the > vIOMMU to be created is likely to result in an API too complicated to > understand.. Maybe not every one, but I think we can get a pretty wide range with a reasonable interface. Explicitly handling IOVA windows does most of it. And we kind of need to handle that anyway to expose what ranges the IOMMU is capable of translating. I think making the handling of valid IOVA windows explicit makes things simpler than having per-backend-family interfaces to expose the limits of their translation ranges, which is what's likely to happen without it. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
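To make the "equivalent mapping in the backend" point concrete: a qemu-style H_PUT_TCE handler only needs the generic type1 ioctls. A rough sketch, where gpa_to_hva() is a hypothetical helper standing in for qemu's real memory API:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

extern uint64_t gpa_to_hva(uint64_t gpa);	/* hypothetical helper */

static int shadow_put_tce(int container_fd, uint64_t ioba,
			  uint64_t tce, uint64_t page_size)
{
	if (tce & 3) {		/* low bits carry read/write permission: map */
		struct vfio_iommu_type1_dma_map map = {
			.argsz = sizeof(map),
			.flags = VFIO_DMA_MAP_FLAG_READ |
				 VFIO_DMA_MAP_FLAG_WRITE,
			.vaddr = gpa_to_hva(tce & ~3ULL),
			.iova  = ioba,
			.size  = page_size,
		};
		return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
	} else {		/* entry cleared by the guest: unmap */
		struct vfio_iommu_type1_dma_unmap unmap = {
			.argsz = sizeof(unmap),
			.iova  = ioba,
			.size  = page_size,
		};
		return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
	}
}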
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, Apr 29, 2021 at 01:04:05PM +1000, David Gibson wrote: > Again, I don't know enough about VDPA to make sense of that. Are we > essentially talking non-PCI virtual devices here? In which case you > could define the VDPA "bus" to always have one-device groups. It is much worse than that. What these non-PCI devices need is for the kernel driver to be part of the IOMMU group of the underlying PCI device but tell VFIO land that "groups don't matter". Today mdev tries to fake this by using singleton iommu groups, but it is really horrible and directly hacks up the VFIO IOMMU code to understand these special cases. Intel was proposing more special hacking in the VFIO IOMMU code to extend this to PASID. When we get to a /dev/ioasid this is all nonsense. The kernel device driver is going to have to tell drivers/iommu exactly what kind of ioasid it can accept, be it a PASID inside a kernel owned group, a SW emulated 'mdev' ioasid, or whatever. In these cases the "group" idea has become a fiction that just creates a pain. "Just reorganize VDPA to do something insane with the driver core so we can create a dummy group to satisfy an unnecessary uAPI restriction" is not a very compelling argument. So if the nonsensical groups go away for PASID/mdev, where does it leave the uAPI in other cases? > I don't think simplified-but-wrong is a good goal. The thing about > groups is that if they're there, you can't just "not care" about them, > they affect you whether you like it or not. You really can. If one thing claims the group then all the other group devices become locked out. The main point to understand is that groups are NOT an application restriction! It is a whole system restriction that the operator needs to understand and deal with. This is why things like dpdk don't care about the group at all - there is nothing they can do with the information. If the operator says to run dpdk on a specific device then the operator is the one that has to deal with all the other devices in the group getting locked out. At best the application can make it more obvious that the operator is doing something dangerous, but the current kernel API doesn't seem to really support that either. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
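For what it's worth, the information an application could surface is already there in sysfs; something like the sketch below (a hypothetical helper, not dpdk code) is about the limit of what it can do - warn, but not prevent:

#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* List the other devices sharing the iommu group of a given PCI BDF,
 * so the operator at least sees what will get locked out. */
static void warn_group_cohabitants(const char *bdf)
{
	char path[256];
	struct dirent *de;
	DIR *dir;

	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%s/iommu_group/devices", bdf);
	dir = opendir(path);
	if (!dir)
		return;
	while ((de = readdir(dir)) != NULL) {
		if (de->d_name[0] == '.')
			continue;
		if (strcmp(de->d_name, bdf))
			fprintf(stderr,
				"warning: %s shares an iommu group with %s and will be locked out\n",
				de->d_name, bdf);
	}
	closedir(dir);
}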
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, Apr 29, 2021 at 01:20:22PM +1000, David Gibson wrote: > > There is a certain appeal to having some > > 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra > > information like windows that can be optionally called by the viommu > > driver and it remains well defined and described. > > Windows really aren't ppc specific. They're absolutely there on x86 > and everything else as well - it's just that people are used to having > a window at 0.. that you can often get away with > treating it sloppily. My point is this detailed control seems to go on to more than just windows. As you say the vIOMMU is emulating specific HW that needs to have kernel interfaces to match it exactly. I'm remarking that trying to unify every HW IOMMU implementation that ever has/will exist into a generic API complete enough to allow the vIOMMU to be created is likely to result in an API too complicated to understand.. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Thu, Apr 29, 2021 at 03:26:55PM +0200, Auger Eric wrote: > From the pseudo code, > > gpa_ioasid_id = ioctl(ioasid_fd, CREATE_IOASID, ..) > ioctl(ioasid_fd, SET_IOASID_PAGE_TABLES, ..) > > I fail to understand whether the SET_IOASID_PAGE_TABLES would apply to > all the IOASIDs within /dev/ioasid or to a specific one. Sorry, nearly every IOCTL would be scoped to a specific IOASID as one of the arguments. > Also in subsequent emails when you talk about IOASID, is it the > ioasid_id, just to double check the terminology. I am referring to IOASID as 'handle of the page table object inside the /dev/ioasid fd'. Whether that is equal to some HW value or not I think remains a decision point. Basically the fd has an xarray of 'struct [something] *' and the IOASID is an index into that FD's private xarray. This is necessary to create proper security as even if we have global PASID numbers or something they still need to be isolated to only the FD that has been authorized access. > > nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID, gpa_ioasid_id); > > ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_ioasid, ..) > is the nested_ioasid the allocated PASID id or is it a completely > different object id. It is the IOASID handle above. > > > >// IOMMU will match on the device RID, no PASID: > > ioctl(vfio_device, ATTACH_IOASID, nested_ioasid); > > > >// IOMMU will match on the device RID and PASID: > > ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid); > here I see you pass a different pasid, so I guess they are different, in > which case you would need to have an allocator function for this pasid, > right? Yes, the underlying HW ID (PASID or substream id or whatever) is something slightly different. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
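A rough kernel-side sketch of that fd-local xarray, with assumed type names, shows why handles from one fd are meaningless on another:

#include <linux/xarray.h>

struct ioasid_pgtable;		/* hypothetical page table object */

struct ioasid_fd_ctx {
	struct xarray objects;	/* IOASID handle -> page table object */
};

/* The returned *out_id only means something relative to this fd's
 * xarray, so possessing a global PASID number grants nothing by
 * itself - the fd must have been authorized access first. */
static int ioasid_fd_new_handle(struct ioasid_fd_ctx *ictx,
				struct ioasid_pgtable *pgt, u32 *out_id)
{
	return xa_alloc(&ictx->objects, out_id, pgt, xa_limit_32b,
			GFP_KERNEL);
}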
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
Hi, On 4/22/21 2:10 PM, Jason Gunthorpe wrote: > On Thu, Apr 22, 2021 at 08:34:32AM +0000, Tian, Kevin wrote: > >> The shim layer could be considered as a new iommu backend in VFIO, >> which connects VFIO iommu ops to the internal helpers in >> drivers/ioasid. > > It may be the best we can do because of SPAPR, but the ideal outcome > should be to remove the entire pluggable IOMMU stuff from vfio > entirely and have it only use /dev/ioasid > > We should never add another pluggable IOMMU type to vfio - everything > should be done through drivers/iommu now that it is much more capable. > >> Another tricky thing is that a container may be linked to multiple iommu >> domains in VFIO, as devices in the container may locate behind different >> IOMMUs with inconsistent capability (commit 1ef3e2bc). > > Frankly this sounds over complicated. I would think /dev/ioasid should > select the IOMMU when the first device is joined, and all future joins > must be compatible with the original IOMMU - ie there is only one set > of IOMMU capabilities in a /dev/ioasid. > > This means qemu might have multiple /dev/ioasid's if the system has > multiple incompatible IOMMUs (is this actually a thing?) The platform > should design its IOMMU domains to minimize the number of > /dev/ioasid's required. > > Is there a reason we need to share IOASIDs between completely > divergent IOMMU implementations? I don't expect the HW should be able > to physically share page tables?? > > That decision point alone might be the thing that just says we can't > ever have /dev/vfio/vfio == /dev/ioasid > >> Just to confirm. Above flow is for current map/unmap flavor as what >> VFIO/vDPA do today. Later when nested translation is supported, >> there is no need to detach gpa_ioasid_fd. Instead, a new cmd will >> be introduced to nest rid_ioasid_fd on top of gpa_ioasid_fd: > > Sure.. The tricky bit will be to define both of the common nested > operating modes. From the pseudo code, gpa_ioasid_id = ioctl(ioasid_fd, CREATE_IOASID, ..) ioctl(ioasid_fd, SET_IOASID_PAGE_TABLES, ..) I fail to understand whether the SET_IOASID_PAGE_TABLES would apply to all the IOASIDs within /dev/ioasid or to a specific one. Also in subsequent emails when you talk about IOASID, is it the ioasid_id, just to double check the terminology. > nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID, gpa_ioasid_id); > ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_ioasid, ..) is the nested_ioasid the allocated PASID id or is it a completely different object id. > >// IOMMU will match on the device RID, no PASID: > ioctl(vfio_device, ATTACH_IOASID, nested_ioasid); > >// IOMMU will match on the device RID and PASID: > ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid); here I see you pass a different pasid, so I guess they are different, in which case you would need to have an allocator function for this pasid, right? Thanks Eric > > Notice that ATTACH (or bind, whatever) is always done on the > vfio_device FD. ATTACH tells the IOMMU HW to link the PCI BDF to > a specific page table defined by an IOASID. > > I expect we have many flavours of IOASID tables, eg we have normal, > and 'nested with table controlled by hypervisor'. ARM has 'nested with > table controlled by guest' right? So like this? 
> > nested_ioasid = ioctl(ioasid_fd, CREATE_DELEGATED_IOASID, > gpa_ioasid_id, ) > // PASID now goes to > ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid); > > Where is some internal to the guest handle of the viommu > page table scoped within gpa_ioasid_id? Like maybe it is GPA of the > base of the page table? > > The guest can't select its own PASIDs without telling the hypervisor, > right? > >> I also feel hiding group from uAPI is a good thing and is interested in >> the rationale behind explicitly managing group in vfio (which is >> essentially the same boundary as provided by iommu group), e.g. for >> better user experience when group security is broken? > > Indeed, I can see how things might have just evolved into this, but if > it has a purpose it seems pretty hidden. > Whether we need it or not seems pretty hidden. > > Jason > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
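Pulling the pseudo-code from this exchange into one sequence may help; the ioctl names are the ones used in the thread and remain hypothetical:

int ioasid_fd = open("/dev/ioasid", O_RDWR);

/* Second level: GPA -> HPA, owned by the hypervisor */
int gpa_ioasid_id = ioctl(ioasid_fd, CREATE_IOASID, ..);
ioctl(ioasid_fd, SET_IOASID_PAGE_TABLES, gpa_ioasid_id, ..);

/* First level nested on it, in whichever flavour applies:
 * hypervisor-controlled table, or (ARM) guest-controlled table */
int nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID, gpa_ioasid_id);
ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_ioasid, ..);

/* Attach is always on the vfio_device FD; the variant decides whether
 * the IOMMU matches on the RID alone or on (RID, PASID) */
ioctl(vfio_device, ATTACH_IOASID, nested_ioasid);
ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid);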
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
Hi, On 4/22/21 2:10 PM, Jason Gunthorpe wrote: > On Thu, Apr 22, 2021 at 08:34:32AM +0000, Tian, Kevin wrote: > >> The shim layer could be considered as a new iommu backend in VFIO, >> which connects VFIO iommu ops to the internal helpers in >> drivers/ioasid. > > It may be the best we can do because of SPAPR, but the ideal outcome > should be to remove the entire pluggable IOMMU stuff from vfio > entirely and have it only use /dev/ioasid > > We should never add another pluggable IOMMU type to vfio - everything > should be done through drivers/iommu now that it is much more capable. > >> Another tricky thing is that a container may be linked to multiple iommu >> domains in VFIO, as devices in the container may locate behind different >> IOMMUs with inconsistent capability (commit 1ef3e2bc). > > Frankly this sounds over complicated. I would think /dev/ioasid should > select the IOMMU when the first device is joined, and all future joins > must be compatible with the original IOMMU - ie there is only one set > of IOMMU capabilities in a /dev/ioasid. > > This means qemu might have multiple /dev/ioasid's if the system has > multiple incompatible IOMMUs (is this actually a thing?) The platform > should design its IOMMU domains to minimize the number of > /dev/ioasid's required. > > Is there a reason we need to share IOASIDs between completely > divergent IOMMU implementations? I don't expect the HW should be able > to physically share page tables?? > > That decision point alone might be the thing that just says we can't > ever have /dev/vfio/vfio == /dev/ioasid > >> Just to confirm. Above flow is for current map/unmap flavor as what >> VFIO/vDPA do today. Later when nested translation is supported, >> there is no need to detach gpa_ioasid_fd. Instead, a new cmd will >> be introduced to nest rid_ioasid_fd on top of gpa_ioasid_fd: > > Sure.. The tricky bit will be to define both of the common nested > operating modes. > > nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID, gpa_ioasid_id); > ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_ioasid, ..) > >// IOMMU will match on the device RID, no PASID: > ioctl(vfio_device, ATTACH_IOASID, nested_ioasid); > >// IOMMU will match on the device RID and PASID: > ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid); > > Notice that ATTACH (or bind, whatever) is always done on the > vfio_device FD. ATTACH tells the IOMMU HW to link the PCI BDF to > a specific page table defined by an IOASID. > > I expect we have many flavours of IOASID tables, eg we have normal, > and 'nested with table controlled by hypervisor'. ARM has 'nested with > table controlled by guest' right? So like this? yes the PASID table is fully controlled by the guest. Same for the stage 1 table. > > nested_ioasid = ioctl(ioasid_fd, CREATE_DELEGATED_IOASID, > gpa_ioasid_id, ) > // PASID now goes to > ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid); > > Where is some internal to the guest handle of the viommu > page table scoped within gpa_ioasid_id? Like maybe it is GPA of the > base of the page table? Yes the GPA of the first level page table + some misc info like the max number of IOASIDs. > > The guest can't select its own PASIDs without telling the hypervisor, > right? on ARM there is no system wide IOASID allocator as there is on x86. So the guest can select its own PASID without telling the hyp. 
Thanks Eric > >> I also feel hiding group from uAPI is a good thing and is interested in >> the rationale behind explicitly managing group in vfio (which is >> essentially the same boundary as provided by iommu group), e.g. for >> better user experience when group security is broken? > > Indeed, I can see how things might have just evolved into this, but if > it has a purpose it seems pretty hidden. > Whether we need it or not seems pretty hidden. > > Jason > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
Hi, On 4/23/21 1:49 PM, Jason Gunthorpe wrote: > On Fri, Apr 23, 2021 at 09:06:44AM +0000, Tian, Kevin wrote: > >> Or could we still have just one /dev/ioasid but allow userspace to create >> multiple gpa_ioasid_id's each associated to a different iommu domain? >> Then the compatibility check will be done at ATTACH_IOASID instead of >> JOIN_IOASID_FD. > > To my mind what makes sense is that /dev/ioasid presents a single > IOMMU behavior that is basically the same. This may ultimately not be > what we call a domain today. > > We may end up with a middle object which is a group of domains that > all have the same capabilities, and we define capabilities in a way > that most platforms have a single group of domains. > > The key capability of a group of domains is they can all share the HW > page table representation, so if an IOASID instantiates a page table > it can be assigned to any device on any domain in the group of domains. > > If you try to say that /dev/ioasid has many domains and they can't > have their HW page tables shared then I think the implementation > complexity will explode. > >> This does impose one burden to userspace though, to understand the >> IOMMU compatibilities and figure out which incompatible features may >> affect the page table management (while such knowledge is IOMMU >> vendor specific) and then explicitly manage multiple /dev/ioasid's or >> multiple gpa_ioasid_id's. > > Right, this seems very hard in the general case.. > >> Alternatively is it a good design by having the kernel return error at >> attach/join time to indicate that incompatibility is detected then the >> userspace should open a new /dev/ioasid or creates a new gpa_ioasid_id >> for the failing device upon such failure, w/o constructing its own >> compatibility knowledge? > > Yes, this feels workable too > >>> This means qemue might have multiple /dev/ioasid's if the system has >>> multiple incompatible IOMMUs (is this actually a thing?) The platform >> >> One example is Intel platform with igd. Typically there is one IOMMU >> dedicated for igd and the other IOMMU serving all the remaining devices. >> The igd IOMMU may not support IOMMU_CACHE while the other one >> does. > > If we can do as above the two domains may be in the same group of > domains and the IOMMU_CACHE is not exposed at the /dev/ioasid level. > > For instance the API could specify IOMMU_CACHE during attach, not > during IOASID creation. > > Getting all the data model right in the API is going to be the trickiest > part of this. > >> yes, e.g. in vSVA both devices (behind divergent IOMMUs) are bound >> to a single guest process which has an unique PASID and 1st-level page >> table. Earlier incompatibility example is only for 2nd-level. > > Because when we get to here, things become inscrutable as an API if > you are trying to say two different IOMMU presentations can actually > be nested. > >>> Sure.. The tricky bit will be to define both of the common nested >>> operating modes. >>> >>> nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID, gpa_ioasid_id); >>> ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_ioasid, ..) >>> >>>// IOMMU will match on the device RID, no PASID: >>> ioctl(vfio_device, ATTACH_IOASID, nested_ioasid); >>> >>>// IOMMU will match on the device RID and PASID: >>> ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid); >> >> I'm a bit confused here why we have both pasid and ioasid notations together. >> Why not use nested_ioasid as pasid directly (i.e. 
every pasid in nested mode >> is created by CREATE_NESTED_IOASID)? > > The IOASID is not a PASID, it is just a page table. > > A generic IOMMU matches on either RID or (RID,PASID), so you should > specify the PASID when establishing the match. > > IOASID only specifies the page table. > > So you read the above as configuring the path > > PCI_DEVICE -> (RID,PASID) -> nested_ioasid -> gpa_ioasid_id -> physical > > Where (RID,PASID) indicate values taken from the PCI packet. > > In principle the IOMMU could also be commanded to reuse the same > ioasid page table with a different PASID: > > PCI_DEVICE_B -> (RID_B,PASID_B) -> nested_ioasid -> gpa_ioasid_id -> > physical > > This is impossible if the ioasid == PASID in the API. > >> Below I list different scenarios for ATTACH_IOASID in my view. Here >> vfio_device could be a real PCI function (RID), or a subfunction device >> (RID+def_ioasid). > > What is RID+def_ioasid? The IOMMU does not match on IOASID's. > > A subfunction device always needs to use PASID, or an internal IOMMU, > confused what you are trying to explain? > >> If the whole PASID table is delegated to the guest in ARM case, the guest >> can select its own PASIDs w/o telling the hypervisor. > > The hypervisor has to route the PASID's to the guest at some point - a > guest can't just claim a PASID unilaterally, that would not be secure. AFAIU on ARM the stage 2 table is uniquely defined per
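Written out as calls, the reuse case Jason spells out - one IOASID page table, two devices, two PASIDs - looks like this (same hypothetical names as the thread's pseudo-code):

/* PCI_DEVICE_A -> (RID_A, PASID_A) -> nested_ioasid -> gpa_ioasid_id */
ioctl(vfio_device_a, ATTACH_IOASID_PASID, pasid_a, nested_ioasid);

/* PCI_DEVICE_B -> (RID_B, PASID_B) -> the same nested_ioasid; this is
 * exactly the routing that becomes impossible if the API forces
 * ioasid == PASID */
ioctl(vfio_device_b, ATTACH_IOASID_PASID, pasid_b, nested_ioasid);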
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, Apr 28, 2021 at 11:56:22AM -0300, Jason Gunthorpe wrote: > On Wed, Apr 28, 2021 at 10:58:29AM +1000, David Gibson wrote: > > On Tue, Apr 27, 2021 at 02:12:12PM -0300, Jason Gunthorpe wrote: > > > On Tue, Apr 27, 2021 at 03:08:46PM +1000, David Gibson wrote: > > > > > Starting from a BDF the general pseudo code is: > > > > > device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/") > > > > > device_fd = open("/dev/vfio/"+device_name) > > > > > ioasidfd = open("/dev/ioasid") > > > > > ioctl(device_fd, JOIN_IOASID_FD, ioasidfd) > > > > > > > > This line is the problem. > > > > > > > > [Historical aside: Alex's early drafts for the VFIO interface looked > > > > quite similar to this. Ben Herrenschmidt and myself persuaded him it > > > > was a bad idea, and groups were developed instead. I still think it's > > > > a bad idea, and not just for POWER] > > > > > > Spawning the VFIO device FD from the group FD is incredibly gross from > > > a kernel design perspective. Since that was done the struct > > > vfio_device missed out on a sysfs presence and doesn't have the > > > typical 'struct device' member or dedicated char device you'd expect a > > > FD based subsystem to have. > > > > > > This basically traded normal usage of the driver core for something > > > that doesn't serve a technical usage. Given we are now nearly 10 years > > > later and see that real widely deployed applications are not doing > > > anything special with the group FD it makes me question the wisdom of > > > this choice. > > > > I'm really not sure what "anything special" would constitute here. > > Well, really anything actually. All I see in, say, dpdk, is open the > group fd, get a device fd, do the container dance and never touch the > group fd again or care about groups in any way. It seems typical of > this class of application. Well, sure, the only operation you do on the group itself is attach it to the container (and then every container operation can be thought of as applying to all its attached groups). But that attach operation really is fundamentally about the group. It always, unavoidably, fundamentally affects every device in the group - including devices you may not typically think about, like bridges and switches. That is *not* true of the other device operations, like poking IO. > If dpdk is exposing other devices to a risk it certainly hasn't done > anything to make that obvious. And in practice I suspect it will just break if you give it a >1 device group. > > > Okay, that is fair, but let's solve that problem directly. For > > > instance netlink has been going in the direction of adding a "extack" > > > from the kernel which is a descriptive error string. If the failing > > > ioctl returned the string: > > > > > > "cannot join this device to the IOASID because device XXX in the > > >same group #10 is in use" > > > > Um.. is there a sane way to return strings from an ioctl()? > > Yes, it can be done, a string buffer pointer and length in the input > for instance. I suppose. Rare enough that I expect everyone will ignore it, alas :/. > > Getting the device fds from the group fd kind of follows, because it's > > unsafe to do basically anything on the device unless you already > > control the group (which in this case means attaching it to a > > container/ioasid). I'm entirely open to ways of doing that that are > > less inelegant from a sysfs integration point of view, but the point > > is you must manage the group before you can do anything at all with > > individual devices. 
> > I think most likely VFIO is going to be the only thing to manage a > multi-device group. You don't get to choose that. You could explicitly limit other things to only one-device groups, but that would be an explicit limitation. Essentially any device can end up in a multi-device group, if you put it behind a PCIe to PCI bridge, or a PCIe switch which doesn't support access controls. The groups are still there, whether or not other things want to deal with them. > I see things like VDPA being primarily about PASID, and an IOASID that > is matched to a PASID is inherently a single device IOMMU group. I don't know enough about PASID to make sense of that. > > I don't see why. I mean, sure, you don't want explicitly the *vfio* > > group as such. But IOMMU group is already a cross-subsystem concept > > and you can explicitly expose that in a different way. > > Yes, and no, the kernel drivers in something like VDPA have decided > what device and group they are in before we get to IOASID. It is > illogical to try to retro-actively bolt in a group concept to their > APIs. Again, I don't know enough about VDPA to make sense of that. Are we essentially talking non-PCI virtual devices here? In which case you could define the VDPA "bus" to always have one-device groups. > > Again, I really think this is necessary complexity. You're right that > > far too little of the userspace properly
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, Apr 28, 2021 at 09:21:49PM -0300, Jason Gunthorpe wrote: > On Wed, Apr 28, 2021 at 11:23:39AM +1000, David Gibson wrote: > > > Yes. My proposed model for a unified interface would be that when you > > create a new container/IOASID, *no* IOVAs are valid. > > Hurm, it is quite tricky. All IOMMUs seem to have a dead zone around > the MSI window, so negotiating this all in a general way is not going > to be a very simple API. > > To be general it would be nicer to say something like 'I need XXGB of > IOVA space' 'I need 32 bit IOVA space' etc and have the kernel return > ranges that sum up to at least that big. Then the kernel can do > all its optimizations. Ah, yes, sorry. We do need an API that lets the kernel make more of the decisions too. For userspace drivers it would generally be sufficient to just ask for XXX size of IOVA space wherever you can get it. Handling guests requires more precision. So, maybe a request interface with a bunch of hint variables and a matching set of MAP_FIXED-like flags to assert which ones aren't negotiable. > I guess you are going to say that the qemu PPC vIOMMU driver needs > more exact control.. *Every* vIOMMU driver needs more exact control. The guest drivers will expect to program the guest devices with IOVAs matching the guest platform's IOMMU model. Therefore the backing host IOMMU has to be programmed to respond to those IOVAs. If it can't be, there's no way around it, and you want to fail out early. With this model that will happen when qemu (say) requests the host IOMMU window(s) to match the guest's expected IOVA ranges. Actually, come to that even guests without a vIOMMU need more exact control: they'll expect IOVA to match GPA, so if your host IOMMU can't be set up to translate the full range of GPAs, again, you're out of luck. The only reason x86 has been able to ignore this is that the assumption has been that all IOMMUs can translate IOVAs from 0... Once you really start to look at what the limits are, you need the exact window control I'm describing. > > I expect we'd need some kind of query operation to expose limitations > > on the number of windows, addresses for them, available pagesizes etc. > > Is page size an assumption that hugetlbfs will always be used for backing > memory or something? So for TCEs (and maybe other IOMMUs out there), the IO page tables are independent of the CPU page tables. They don't have the same format, and they don't necessarily have the same page size. In the case of a bare metal kernel working in physical addresses they can use that TCE page size however they like. For userspace you get another layer of complexity. Essentially to implement things correctly the backing IOMMU needs to have a page size granularity that's the minimum of whatever granularity the userspace or guest driver expects and the host page size backing the memory. > > > As an ideal, only things like the HW specific qemu vIOMMU driver > > > should be reaching for all the special stuff. > > > > I'm hoping we can even avoid that, usually. With the explicitly > > created windows model I propose above, it should be able to: qemu will > > create the windows according to the IOVA windows the guest platform > > expects to see and they either will or won't work on the host platform > > IOMMU. If they do, generic maps/unmaps should be sufficient. If they > > don't well, the host IOMMU simply cannot emulate the vIOMMU so you're > > out of luck anyway. 
> > It is not just P9 that has special stuff, and this whole area of PASID > seems to be quite different on every platform > > If things fit very naturally and generally then maybe, but I've been > down this road before of trying to make a general description of a > group of very special HW. It ended in tears after 10 years when nobody > could understand the "general" API after it was Frankenstein'd up with > special cases for everything. Cautionary tale > > There is a certain appeal to having some > 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra > information like windows that can be optionally called by the viommu > driver and it remains well defined and described. Windows really aren't ppc specific. They're absolutely there on x86 and everything else as well - it's just that people are used to having a window at 0.. that you can often get away with treating it sloppily. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
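One way to reconcile the two positions above is a window request that is loose by default but can be pinned down MAP_FIXED-style; a sketch with entirely hypothetical names:

#include <linux/types.h>
#include <linux/ioctl.h>

/* Hypothetical uAPI: ask for IOVA space, either "this much, anywhere"
 * (userspace drivers) or "exactly here" (vIOMMU emulation). */
struct ioasid_iova_request {
	__u32 argsz;
	__u32 flags;
#define IOVA_REQ_FIXED	(1 << 0)	/* base/size are not negotiable */
#define IOVA_REQ_32BIT	(1 << 1)	/* window must sit below 4G */
	__u64 base;	/* in: hint or required base; out: chosen base */
	__u64 size;	/* in: required size of the window */
	__u64 pgsize;	/* in: required IO page size, 0 = any */
};
#define IOASID_REQUEST_IOVA_WINDOW \
	_IOWR('i', 0x41, struct ioasid_iova_request)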
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, Apr 28, 2021 at 11:23:39AM +1000, David Gibson wrote: > Yes. My proposed model for a unified interface would be that when you > create a new container/IOASID, *no* IOVAs are valid. Hurm, it is quite tricky. All IOMMUs seem to have a dead zone around the MSI window, so negotiating this all in a general way is not going to be a very simple API. To be general it would be nicer to say something like 'I need XXGB of IOVA space' 'I need 32 bit IOVA space' etc and have the kernel return ranges that sum up to at least that big. Then the kernel can do all its optimizations. I guess you are going to say that the qemu PPC vIOMMU driver needs more exact control.. > I expect we'd need some kind of query operation to expose limitations > on the number of windows, addresses for them, available pagesizes etc. Is page size an assumption that hugetlbfs will always be used for backing memory or something? > > As an ideal, only things like the HW specific qemu vIOMMU driver > > should be reaching for all the special stuff. > > I'm hoping we can even avoid that, usually. With the explicitly > created windows model I propose above, it should be able to: qemu will > create the windows according to the IOVA windows the guest platform > expects to see and they either will or won't work on the host platform > IOMMU. If they do, generic maps/unmaps should be sufficient. If they > don't well, the host IOMMU simply cannot emulate the vIOMMU so you're > out of luck anyway. It is not just P9 that has special stuff, and this whole area of PASID seems to be quite different on every platform. If things fit very naturally and generally then maybe, but I've been down this road before of trying to make a general description of a group of very special HW. It ended in tears after 10 years when nobody could understand the "general" API after it was Frankenstein'd up with special cases for everything. Cautionary tale. There is a certain appeal to having some 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra information like windows that can be optionally called by the viommu driver and it remains well defined and described. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, Apr 28, 2021 at 06:34:11AM +0000, Tian, Kevin wrote: > > If /dev/ioasid is single HW page table only then I would focus on that > > implementation and leave it to userspace to span different > > /dev/ioasids if needed. > > > > > OK, now I see where the disconnection comes from. In my context ioasid > > > is the identifier that is actually used in the wire, but seems you treat > > > it as > > > a sw-defined namespace purely for representing page tables. We should > > > clear this concept first before further discussing other details. > > > > There is no general HW requirement that every IO page table be > > referred to by the same PASID and this API would have to support > > Yes, but what is the value of allowing multiple PASIDs referring to > the same I/O page table (except the nesting pgtable case)? Doesn't it > lead to a poor iotlb efficiency issue similar to multiple iommu domains > referring to the same page table? I think iotlb efficiency is up to the platform. The general use case is to make an IOASID for something like the GPA and use it concurrently with, say, three devices: - VFIO (not PASID) - VDPA (PASID capable HW) - 'Future VDPA storage' (PASID capable HW) The uAPI for this should be very general and the kernel should decide the optimal way to configure the HW. Maybe it is one page table and one PASID, or maybe it is something else. Allowing the kernel to choose the PASID once it knows the RID is the highest generality. > > non-PASID IO page tables as well. So I'd keep the two things > > separated in the uAPI - even though the kernel today has a global > > PASID pool. > > for non-PASID usages the allocated PASID will be wasted if we don't > separate ioasid from pasid. But it may be worthwhile given 1m available > pasids and the simplification in the uAPI which only needs to care about > one id space then. I'd prefer this be a platform choice and not forced in the uAPI, because we can never go back on it if we see that yes we need to optimize here. I understand many platforms have different available PASID spaces already. > > Simple things like DPDK can use #2 and potentially have better PASID > > limits. hypervisors will most likely have to use #1, but it depends on > > how their vIOMMU interface works. > > Can you elaborate why DPDK wants to use #2 i.e. not using a global > PASID? It gives the kernel an option to make the decision about the PASID when it has the full information, including the RID. > > I think the name IOASID is fine for the uAPI, the kernel version can > > be called ioasid_id or something. > > ioasid is already an id and then ioasid_id just adds confusion. Another > point is that ioasid is currently used to represent both PCI PASID and > ARM substream ID in the kernel. It implies that if we want to separate > ioasid and pasid in the uAPI the 'pasid' also needs to be replaced with > another general term usable for substream ID. Are we making the > terms too confusing here? This is why I also am not so sure about exposing the PASID in the API because it is ultimately a HW specific item. As I said to David, one avenue is to have some generic uAPI that is very general and keep all this deeply detailed stuff, that really only matters for qemu, as part of a more HW specific vIOMMU driver interface. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
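The two PASID control modes sketched earlier in the thread could look roughly like this on the wire; all struct and ioctl names here are hypothetical:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/types.h>

struct ioasid_alloc_pasid {
	__u32 argsz;
	__u32 ioasid;	/* in */
	__u32 pasid;	/* out: the global PASID now fixed on the IOASID */
};

struct device_bind_ioasid {
	__u32 argsz;
	__u32 ioasid;	/* in */
	__u32 pasid;	/* out: kernel-chosen for this device alone */
};

/* Mode 1: fix one global PASID on the IOASID itself; all future
 * PASID-based uses of the IOASID then use that value */
struct ioasid_alloc_pasid alloc = { .argsz = sizeof(alloc), .ioasid = id };
ioctl(ioasid_fd, IOASID_ALLOC_GLOBAL_PASID, &alloc);

/* Mode 2: no global PASID; the kernel picks one per device at bind
 * time, with full knowledge of the RID, and reports it back */
struct device_bind_ioasid bind = { .argsz = sizeof(bind), .ioasid = id };
ioctl(device_fd, DEVICE_BIND_IOASID, &bind);
printf("kernel chose PASID %u for this device\n", bind.pasid);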
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, Apr 28, 2021 at 07:47:56AM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Wednesday, April 28, 2021 1:12 AM > > > [...] > > One option is VFIO can keep its group FD but nothing else will have > > anything like it. However I don't much like the idea that VFIO will > > have a special and unique programming model to do the same things > > other subsystems will do. That will make it harder for userspace to > > implement. > > Hi, Jason, > > I have a question here. Based on discussions so far, it's clear that the > new ioasid uAPI will differ from existing VFIO uAPI a lot, e.g. ioasid- > centric operations, no group fd, no incompatible domains, etc. Then > I wonder how we plan to support legacy VFIO applications in this > transition phase. I suspect the VFIO group fd will have to be registered with /dev/ioasid in addition to each device if we are to retain the same model. > Earlier you mentioned the desire of directly replacing > /dev/vfio/vfio with /dev/ioasid and having ioasid to present both > VFIO and new uAPI. Doesn't it imply that we have to copy the VFIO > container/group semantics into /dev/ioasid although it's a special > programming model only for VFIO? I gave that as something to think about, if it doesn't work out then it is just a bad idea to discard. > Alternatively we could keep all the container/group legacy within VFIO > and having /dev/ioasid support only the new uAPI semantics. In this case > VFIO will include a shim iommu backend to connect its legacy uAPI into > drivers/ioasid backend functions for backward compatibility. Then VFIO > will also support a new model which only uses its device uAPI to bind > to new ioasid fd w/o using any legacy container/group/iommu uAPI. > Does this sound like a plan? It may be where we end up.. Though I fear it will make it overly complex inside VFIO to access the new stuff. It would be very nice if we could see a path where VFIO's insides could only deal with the in-kernel ioasid handles, whatever they are. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, 28 Apr 2021 06:34:11 +0000 "Tian, Kevin" wrote: > > From: Jason Gunthorpe > > Sent: Monday, April 26, 2021 8:38 PM > > > [...] > > > Want to hear your opinion for one open here. There is no doubt that > > > an ioasid represents a HW page table when the table is constructed by > > > userspace and then linked to the IOMMU through the bind/unbind > > > API. But I'm not very sure about whether an ioasid should represent > > > the exact pgtable or the mapping metadata when the underlying > > > pgtable is indirectly constructed through map/unmap API. VFIO does > > > the latter way, which is why it allows multiple incompatible domains > > > in a single container which all share the same mapping metadata. > > > > I think VFIO's map/unmap is way too complex and we know it has bad > > performance problems. > > Can you or Alex elaborate where the complexity and performance problems > lie in VFIO map/unmap? We'd like to understand more detail and see how > to avoid it in the new interface. The map/unmap interface is really only good for long lived mappings; the overhead is too high for things like vIOMMU use cases or any case where the mapping is intended to be dynamic. Userspace drivers must make use of a long lived buffer mapping in order to achieve performance. The mapping and unmapping granularity has been a problem as well, type1v1 allowed arbitrary unmaps to bisect the original mapping, with the massive caveat that the caller relies on the return value of the unmap to determine what was actually unmapped because the IOMMU use of superpages is transparent to the caller. This led to type1v2 that simply restricts the user to avoid ever bisecting mappings. That still leaves us with problems for things like virtio-mem support where we need to create initial mappings with a granularity that allows us to later remove entries, which can prevent effective use of IOMMU superpages. Locked page accounting has been another constant issue. We perform locked page accounting at the container level, where each container accounts independently. A user may require multiple containers, the containers may pin the same physical memory, but be accounted against the user once per container. Those are the main ones I can think of. It is nice to have a simple map/unmap interface, I'd hope that a new /dev/ioasid interface wouldn't raise the barrier to entry too high, but the user needs to have the ability to have more control of their mappings, and locked page accounting should probably be offloaded somewhere. Thanks, Alex ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
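The bisection restriction Alex mentions is easy to show with the existing type1 uAPI; assuming a single 2MB mapping was created at iova 0x100000, the following unmap fails on a type1v2 container rather than splitting the mapping:

#include <sys/ioctl.h>
#include <linux/vfio.h>

static int try_bisect(int container_fd)
{
	struct vfio_iommu_type1_dma_unmap unmap = {
		.argsz = sizeof(unmap),
		.iova  = 0x100000 + 0x1000,	/* starts inside the mapping */
		.size  = 0x1000,		/* ...and ends inside it too */
	};

	/* type1v2 refuses to bisect an existing mapping: -EINVAL */
	return ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
}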
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
On Wed, Apr 28, 2021 at 10:58:29AM +1000, David Gibson wrote: > On Tue, Apr 27, 2021 at 02:12:12PM -0300, Jason Gunthorpe wrote: > > On Tue, Apr 27, 2021 at 03:08:46PM +1000, David Gibson wrote: > > > > Starting from a BDF the general pseudo code is: > > > > device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/") > > > > device_fd = open("/dev/vfio/"+device_name) > > > > ioasidfd = open("/dev/ioasid") > > > > ioctl(device_fd, JOIN_IOASID_FD, ioasidfd) > > > > > > This line is the problem. > > > > > > [Historical aside: Alex's early drafts for the VFIO interface looked > > > quite similar to this. Ben Herrenschmidt and myself persuaded him it > > > was a bad idea, and groups were developed instead. I still think it's > > > a bad idea, and not just for POWER] > > > > Spawning the VFIO device FD from the group FD is incredibly gross from > > a kernel design perspective. Since that was done the struct > > vfio_device missed out on a sysfs presence and doesn't have the > > typical 'struct device' member or dedicated char device you'd expect a > > FD based subsystem to have. > > > > This basically traded normal usage of the driver core for something > > that doesn't serve a technical usage. Given we are now nearly 10 years > > later and see that real widely deployed applications are not doing > > anything special with the group FD it makes me question the wisdom of > > this choice. > > I'm really not sure what "anything special" would constitute here. Well, really anything actually. All I see in, say, dpdk, is open the group fd, get a device fd, do the container dance and never touch the group fd again or care about groups in any way. It seems typical of this class of application. If dpdk is exposing other devices to a risk it certainly hasn't done anything to make that obvious. > > Okay, that is fair, but let's solve that problem directly. For > > instance netlink has been going in the direction of adding a "extack" > > from the kernel which is a descriptive error string. If the failing > > ioctl returned the string: > > > > "cannot join this device to the IOASID because device XXX in the > >same group #10 is in use" > > Um.. is there a sane way to return strings from an ioctl()? Yes, it can be done, a string buffer pointer and length in the input for instance. > Getting the device fds from the group fd kind of follows, because it's > unsafe to do basically anything on the device unless you already > control the group (which in this case means attaching it to a > container/ioasid). I'm entirely open to ways of doing that that are > less inelegant from a sysfs integration point of view, but the point > is you must manage the group before you can do anything at all with > individual devices. I think most likely VFIO is going to be the only thing to manage a multi-device group. I see things like VDPA being primarily about PASID, and an IOASID that is matched to a PASID is inherently a single device IOMMU group. > I don't see why. I mean, sure, you don't want explicitly the *vfio* > group as such. But IOMMU group is already a cross-subsystem concept > and you can explicitly expose that in a different way. Yes, and no, the kernel drivers in something like VDPA have decided what device and group they are in before we get to IOASID. It is illogical to try to retro-actively bolt in a group concept to their APIs. > Again, I really think this is necessary complexity. You're right that > far too little of the userspace properly understands group > restrictions.. 
but these come from real hardware limitations, and I > don't feel like making it *less* obvious in the interface is going to > help that. The appeal of making it less obvious is we can have a single simplified API flow so that an application that doesn't understand or care about groups can have uniformity. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
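For the extack-style idea, the input struct just needs to carry an optional buffer the kernel can fill with the human-readable reason; a sketch with hypothetical names:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/ioctl.h>

struct ioasid_attach {
	uint32_t argsz;
	uint32_t ioasid;
	uint64_t err_msg;	/* optional: userspace buffer pointer */
	uint32_t err_msg_len;	/* size of that buffer */
};

char why[256] = "";
struct ioasid_attach at = {
	.argsz = sizeof(at), .ioasid = id,
	.err_msg = (uintptr_t)why, .err_msg_len = sizeof(why),
};
/* On failure the kernel can leave something like "cannot join this
 * device to the IOASID because device XXX in the same group #10 is
 * in use" in the buffer */
if (ioctl(device_fd, ATTACH_IOASID, &at) < 0)
	fprintf(stderr, "attach failed: %s\n",
		why[0] ? why : strerror(errno));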
RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
> From: Jason Gunthorpe > Sent: Wednesday, April 28, 2021 1:12 AM > [...] > One option is VFIO can keep its group FD but nothing else will have > anything like it. However I don't much like the idea that VFIO will > have a special and unique programming model to do the same things > other subsystems will do. That will make it harder for userspace to > implement. Hi, Jason, I have a question here. Based on discussions so far, it's clear that the new ioasid uAPI will differ from existing VFIO uAPI a lot, e.g. ioasid- centric operations, no group fd, no incompatible domains, etc. Then I wonder how we plan to support legacy VFIO applications in this transition phase. Earlier you mentioned the desire of directly replacing /dev/vfio/vfio with /dev/ioasid and having ioasid to present both VFIO and new uAPI. Doesn't it imply that we have to copy the VFIO container/group semantics into /dev/ioasid although it's a special programming model only for VFIO? Alternatively we could keep all the container/group legacy within VFIO and having /dev/ioasid support only the new uAPI semantics. In this case VFIO will include a shim iommu backend to connect its legacy uAPI into drivers/ioasid backend functions for backward compatibility. Then VFIO will also support a new model which only uses its device uAPI to bind to new ioasid fd w/o using any legacy container/group/iommu uAPI. Does this sound like a plan? Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
> From: Jason Gunthorpe > Sent: Wednesday, April 28, 2021 1:12 AM > [...] > > As Alex says, if this line fails because of the group restrictions, > > that's not great because it's not very obvious what's gone wrong. > > Okay, that is fair, but let's solve that problem directly. For > instance netlink has been going in the direction of adding a "extack" > from the kernel which is a descriptive error string. If the failing > ioctl returned the string: > > "cannot join this device to the IOASID because device XXX in the > same group #10 is in use" > > Would you agree it is now obvious what has gone wrong? In fact would > you agree this is a lot better user experience than what applications > do today even though they have the group FD? > Currently all the discussions are around implicit vs. explicit uAPI semantics on the group restriction. However if we look beyond group the implicit semantics might be inevitable when dealing with incompatible iommu domains. An existing example of iommu incompatibility is IOMMU_CACHE. In the future there could be other incompatibilities such as whether nested translation is supported. In the end the userspace has to do some due diligence on understanding iommu topology and attributes to decide how many VFIO containers or ioasid fds should be created. It does push some burden to userspace but it's difficult to define a group-like kernel object to enforce such restriction for iommu compatibility. Then the best that the kernel can do is to return an informational error message in case an incompatible device is attached to the existing domain. If this is the perceived way to move forward anyway, I feel that removing explicit group FD from uAPI doesn't make userspace worse... Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
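The resulting userspace flow is simple enough to state in code: try to attach, and on failure fall back to a fresh fd instead of pre-computing iommu compatibility. A sketch, reusing the JOIN_IOASID_FD name from earlier in the thread (still hypothetical):

#include <fcntl.h>
#include <sys/ioctl.h>

/* Try each existing /dev/ioasid fd; on incompatibility with all of
 * them, open a fresh one for this device. Returns the fd used, or -1. */
static int attach_with_fallback(int device_fd, int *ioasid_fds, int *nfds)
{
	for (int i = 0; i < *nfds; i++)
		if (ioctl(device_fd, JOIN_IOASID_FD, ioasid_fds[i]) == 0)
			return ioasid_fds[i];

	int fd = open("/dev/ioasid", O_RDWR);
	if (fd >= 0 && ioctl(device_fd, JOIN_IOASID_FD, fd) == 0)
		return ioasid_fds[(*nfds)++] = fd;
	return -1;
}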
RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
> From: Jason Gunthorpe > Sent: Monday, April 26, 2021 8:38 PM > [...] > > Want to hear your opinion for one open here. There is no doubt that > > an ioasid represents a HW page table when the table is constructed by > > userspace and then linked to the IOMMU through the bind/unbind > > API. But I'm not very sure about whether an ioasid should represent > > the exact pgtable or the mapping metadata when the underlying > > pgtable is indirectly constructed through map/unmap API. VFIO does > > the latter way, which is why it allows multiple incompatible domains > > in a single container which all share the same mapping metadata. > > I think VFIO's map/unmap is way too complex and we know it has bad > performance problems. Can you or Alex elaborate where the complexity and performance problems lie in VFIO map/unmap? We'd like to understand more detail and see how to avoid it in the new interface. > > If /dev/ioasid is single HW page table only then I would focus on that > implementation and leave it to userspace to span different > /dev/ioasids if needed. > > > OK, now I see where the disconnection comes from. In my context ioasid > > is the identifier that is actually used in the wire, but seems you treat it > > as > > a sw-defined namespace purely for representing page tables. We should > > clear this concept first before further discussing other details. > > There is no general HW requirement that every IO page table be > referred to by the same PASID and this API would have to support Yes, but what is the value of allowing multiple PASIDs referring to the same I/O page table (except the nesting pgtable case)? Doesn't it lead to a poor iotlb efficiency issue similar to multiple iommu domains referring to the same page table? > non-PASID IO page tables as well. So I'd keep the two things > separated in the uAPI - even though the kernel today has a global > PASID pool. for non-PASID usages the allocated PASID will be wasted if we don't separate ioasid from pasid. But it may be worthwhile given 1m available pasids and the simplification in the uAPI which only needs to care about one id space then. > > > Then following your proposal, does it mean that we need another > > interface for allocating PASID? and since ioasid means a different > > thing in uAPI and in-kernel API, possibly a new name is required to > > avoid confusion? > > I would suggest have two ways to control the PASID > > 1) Over /dev/ioasid allocate a PASID for an IOASID. All future PASID > based usages of the IOASID will use that global PASID > > 2) Over the device FD, when the IOASID is bound return the PASID that > was selected. If the IOASID does not have a global PASID then the > kernel is free to make something up. In this mode a single IOASID > can have multiple PASIDs. > > Simple things like DPDK can use #2 and potentially have better PASID > limits. hypervisors will most likely have to use #1, but it depends on > how their vIOMMU interface works. Can you elaborate why DPDK wants to use #2 i.e. not using a global PASID? > > I think the name IOASID is fine for the uAPI, the kernel version can > be called ioasid_id or something. ioasid is already an id and then ioasid_id just adds confusion. Another point is that ioasid is currently used to represent both PCI PASID and ARM substream ID in the kernel. It implies that if we want to separate ioasid and pasid in the uAPI the 'pasid' also needs to be replaced with another general term usable for substream ID. Are we making the terms too confusing here? 
> > (also looking at ioasid.c, why do we need such a thin and odd wrapper > around xarray?) > I'll leave it to Jean and Jacob. Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu