RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces
Hi, Jason,

> From: Jason Gunthorpe
> Sent: Wednesday, September 29, 2021 8:59 PM
>
> On Wed, Sep 29, 2021 at 12:38:35AM +, Tian, Kevin wrote:
>
> > /* If set the driver must call iommu_XX as the first action in probe() or
> >  * before it attempts to do DMA
> >  */
> > bool suppress_dma_owner:1;
>
> It is not "attempts to do DMA" but more "operates the physical device
> in any way"
>
> Not having ownership means another entity could be using user space
> DMA to manipulate the device state and attack the integrity of the
> kernel's programming of the device.

Does suppress_kernel_dma sound better than suppress_dma_owner? We found
the latter causing some confusion when doing internal code review.
Somehow this flag represents "don't claim kernel DMA ownership during
driver binding", while suppress_dma_owner sounds like the entire
ownership mechanism is disabled...

Another thing is about DMA_OWNER_SHARED, which is set to indicate no DMA
at all. Thinking more, we feel that this flag is meaningless. Its sole
purpose is to show compatibility with any USER/KERNEL ownership, which
is essentially the same semantics as a device that is not bound to any
driver. So we plan to remove it; then pci-stub just needs a one-line
change to set the suppress flag. But we want to check with you first in
case of any oversight.

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
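[Editor's note] The ownership rules being debated above can be captured in a small userspace model. This is purely illustrative: the enum, struct and function names below are invented, not the proposed iommu_device_init/exit_user_dma() kernel API; it only models the claim/suppress semantics under discussion.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of DMA ownership for an iommu group. */
enum dma_owner { OWNER_NONE, OWNER_KERNEL, OWNER_USER };

struct group {
	enum dma_owner owner;
	int users;	/* number of active claims */
};

/* Driver bind: claim kernel DMA ownership unless the driver sets the
 * suppress flag (a pci-stub-like driver makes no claim at bind time). */
bool bind_driver(struct group *g, bool suppress_kernel_dma)
{
	if (suppress_kernel_dma)
		return true;		/* no claim made */
	if (g->owner == OWNER_USER)
		return false;		/* user space owns the group's DMA */
	g->owner = OWNER_KERNEL;
	g->users++;
	return true;
}

/* User DMA claim (e.g. on behalf of a VFIO-style user). */
bool claim_user_dma(struct group *g)
{
	if (g->owner == OWNER_KERNEL)
		return false;		/* a kernel driver already owns DMA */
	g->owner = OWNER_USER;
	g->users++;
	return true;
}
```

The point of the model: a suppressing driver is compatible with either ownership state, which is exactly why a separate DMA_OWNER_SHARED state adds nothing over "no claim".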
RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
> From: Jason Gunthorpe
> Sent: Thursday, October 14, 2021 11:43 PM
>
> > > > I think the key is whether other archs allow driver to decide DMA
> > > > coherency and indirectly the underlying I/O page table format.
> > > > If yes, then I don't see a reason why such decision should not be
> > > > given to userspace for passthrough case.
> > >
> > > The choice all comes down to if the other arches have cache
> > > maintenance instructions in the VM that *don't work*
> >
> > Looks vfio always sets IOMMU_CACHE on all platforms as long as
> > iommu supports it (true on all platforms except intel iommu which
> > is dedicated for GPU):
> >
> > vfio_iommu_type1_attach_group()
> > {
> >     ...
> >     if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
> >         domain->prot |= IOMMU_CACHE;
> >     ...
> > }
> >
> > Should above be set according to whether a device is coherent?
>
> For IOMMU_CACHE there are two questions related to the overloaded
> meaning:
>
> - Should VFIO ask the IOMMU to use non-coherent DMA (ARM meaning)
>   This depends on how the VFIO user expects to operate the DMA.
>   If the VFIO user can issue cache maintenance ops then IOMMU_CACHE
>   should be controlled by the user. I have no idea what platforms
>   support user space cache maintenance ops.

But just like you said for the Intel meaning below, even if those ops
are privileged a uAPI can be provided to support such usage if
necessary.

> - Should VFIO ask the IOMMU to suppress no-snoop (Intel meaning)
>   This depends if the VFIO user has access to wbinvd or not.
>
>   wbinvd is a privileged instruction so normally userspace will not
>   be able to access it.
>
>   Per Paolo recommendation there should be a uAPI someplace that
>   allows userspace to issue wbinvd - basically the suppress no-snoop
>   is also user controllable.
>
> The two things are very similar and ultimately are a choice userspace
> should be making.

yes

> From something like a qemu perspective things are more murky - eg on
> ARM qemu needs to co-ordinate with the guest.
> Whatever IOMMU_CACHE
> mode VFIO is using must match the device coherent flag in the Linux
> guest. I'm guessing all Linux guest VMs only use coherent DMA for all
> devices today. I don't know if the cache maintenance ops are even
> permitted in an ARM VM.

I'll leave it to Jean to confirm. If only coherent DMA can be used in
the guest on other platforms, I suppose VFIO should not blindly set
IOMMU_CACHE, and in concept it should deny assigning a non-coherent
device since no coordination with the guest exists today.

So the bottom line is that we'll keep this no-snoop thing
Intel-specific. For the basic skeleton we'll not support no-snoop, thus
the user needs to set the enforce-snoop flag when creating an IOAS,
like this RFC v1 does. We also need to introduce a new flag instead of
abusing IOMMU_CACHE in the kernel. For other platforms it may need a
fix to deny non-coherent devices (based on the above open) for now.

Thanks
Kevin
Re: [RFC] iommu: Use put_pages_list
On Thu, Oct 14, 2021 at 05:17:18PM +0100, Robin Murphy wrote:
> On 2021-10-14 12:52, John Garry wrote:
> > On 14/10/2021 12:20, Matthew Wilcox wrote:
> > > I'm going to keep pinging this patch weekly.
> > >
> > > On Thu, Oct 07, 2021 at 07:17:02PM +0100, Matthew Wilcox wrote:
> > > > ping?
> >
> > Robin, Were you checking this? You mentioned "I got side-tracked
> > trying to make io-pgtable use that freelist properly" in another
> > thread, which seems related.
>
> Ooh, thanks for the heads-up John - I'm still only just starting to
> catch up on my mailing list folders since I got back off holiday.
>
> Indeed I already started untangling the freelist handling in the flush
> queue code (to make the move into iommu-dma smaller). Once I'd figured
> out how it worked I did wonder whether there was any more "standard"
> field to borrow, since page->freelist did seem very much in the
> minority. If page->lru is it then great! From a quick skim of the patch
> I think I'd only have a few trivial review comments to make - certainly
> no objection to the fundamental change itself (indeed I hit a point in
> io-pgtable-arm where adding to the pointer chain got rather awkward, so
> having proper lists to splice would be lovely).

Great to hear!

> Matthew - is this something getting in the way of mm development, or
> just a nice cleanup? I'd be happy either to pursue merging it on its
> own, or to pick it up and work it into a series with my stuff.

This is probably going to get in the way of MM development in ~6 months
time. I'm happy for you to pick it up and put it in a series of your
own!

BTW, the optimisation of the implementation of put_pages_list() is
sitting in akpm's tree, so if you see a performance problem, please
give that a try.
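[Editor's note] The conversion under discussion - chaining pages through page->freelist versus collecting them on a list_head via page->lru - can be illustrated with a small userspace model. The struct below is a stand-in, not the kernel's struct page, and list_add_tail_model() only mimics the kernel list API; the point is that the doubly linked form supports splicing sublists, which the ad-hoc pointer chain makes awkward.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for struct page, carrying both linkage styles. */
struct page {
	struct page *freelist;		/* old style: ad-hoc single chain */
	struct page *prev, *next;	/* new style: models page->lru */
};

/* Old style: each level pushes its page onto the head and returns it. */
struct page *chain_page(struct page *p, struct page *head)
{
	p->freelist = head;
	return p;
}

/* New style: append to a circular list head (like list_add_tail()). */
void list_add_tail_model(struct page *p, struct page *head)
{
	p->prev = head->prev;
	p->next = head;
	head->prev->next = p;
	head->prev = p;
}

int count_list(struct page *head)
{
	int n = 0;
	for (struct page *p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```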
Re: [RFC] iommu: Use put_pages_list
On 2021-10-14 12:52, John Garry wrote:
> On 14/10/2021 12:20, Matthew Wilcox wrote:
> > I'm going to keep pinging this patch weekly.
> >
> > On Thu, Oct 07, 2021 at 07:17:02PM +0100, Matthew Wilcox wrote:
> > > ping?
>
> Robin, Were you checking this? You mentioned "I got side-tracked trying
> to make io-pgtable use that freelist properly" in another thread, which
> seems related.

Ooh, thanks for the heads-up John - I'm still only just starting to
catch up on my mailing list folders since I got back off holiday.

Indeed I already started untangling the freelist handling in the flush
queue code (to make the move into iommu-dma smaller). Once I'd figured
out how it worked I did wonder whether there was any more "standard"
field to borrow, since page->freelist did seem very much in the
minority. If page->lru is it then great! From a quick skim of the patch
I think I'd only have a few trivial review comments to make - certainly
no objection to the fundamental change itself (indeed I hit a point in
io-pgtable-arm where adding to the pointer chain got rather awkward, so
having proper lists to splice would be lovely).

Matthew - is this something getting in the way of mm development, or
just a nice cleanup? I'd be happy either to pursue merging it on its
own, or to pick it up and work it into a series with my stuff.

Cheers,
Robin.

> Thanks,
> John
>
> > On Thu, Sep 30, 2021 at 05:20:42PM +0100, Matthew Wilcox (Oracle) wrote:
> > > page->freelist is for the use of slab. We already have the ability
> > > to free a list of pages in the core mm, but it requires the use of
> > > a list_head and for the pages to be chained together through
> > > page->lru. Switch the iommu code over to using free_pages_list().
> > > Signed-off-by: Matthew Wilcox (Oracle)
> > > ---
> > >  drivers/iommu/amd/io_pgtable.c | 99 +++---
> > >  drivers/iommu/dma-iommu.c      | 11 +---
> > >  drivers/iommu/intel/iommu.c    | 89 +++---
> > >  include/linux/iommu.h          |  3 +-
> > >  4 files changed, 77 insertions(+), 125 deletions(-)
Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
On Thu, Oct 14, 2021 at 09:11:58AM +, Tian, Kevin wrote:
> But in both cases cache maintenance instructions are available from
> guest p.o.v and no coherency semantics would be violated.

You've described how Intel's solution papers over the problem.

In part wbinvd is defined to restore CPU cache coherence after a
no-snoop DMA. Having wbinvd NOP breaks this contract.

To counter-act the broken wbinvd the IOMMU completely prevents the use
of no-snoop DMA. It converts them to snoop instead.

The driver thinks it has no-snoop. The platform appears to support
no-snoop. The driver issues wbinvd - but all of it does nothing.

I don't think any of this is even remotely related to what ARM is doing
here. ARM has neither the broken VM cache ops, nor the IOMMU ability to
suppress no-snoop.

> > > I think the key is whether other archs allow driver to decide DMA
> > > coherency and indirectly the underlying I/O page table format.
> > > If yes, then I don't see a reason why such decision should not be
> > > given to userspace for passthrough case.
> >
> > The choice all comes down to if the other arches have cache
> > maintenance instructions in the VM that *don't work*
>
> Looks vfio always sets IOMMU_CACHE on all platforms as long as
> iommu supports it (true on all platforms except intel iommu which
> is dedicated for GPU):
>
> vfio_iommu_type1_attach_group()
> {
>     ...
>     if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
>         domain->prot |= IOMMU_CACHE;
>     ...
> }
>
> Should above be set according to whether a device is coherent?

For IOMMU_CACHE there are two questions related to the overloaded
meaning:

- Should VFIO ask the IOMMU to use non-coherent DMA (ARM meaning)

  This depends on how the VFIO user expects to operate the DMA.
  If the VFIO user can issue cache maintenance ops then IOMMU_CACHE
  should be controlled by the user. I have no idea what platforms
  support user space cache maintenance ops.
- Should VFIO ask the IOMMU to suppress no-snoop (Intel meaning)

  This depends if the VFIO user has access to wbinvd or not.

  wbinvd is a privileged instruction so normally userspace will not
  be able to access it.

  Per Paolo recommendation there should be a uAPI someplace that
  allows userspace to issue wbinvd - basically the suppress no-snoop
  is also user controllable.

The two things are very similar and ultimately are a choice userspace
should be making.

From something like a qemu perspective things are more murky - eg on
ARM qemu needs to co-ordinate with the guest. Whatever IOMMU_CACHE
mode VFIO is using must match the device coherent flag in the Linux
guest. I'm guessing all Linux guest VMs only use coherent DMA for all
devices today. I don't know if the cache maintenance ops are even
permitted in an ARM VM.

Jason
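[Editor's note] Jason's two questions reduce to a small decision table. The sketch below is illustrative only - the struct and function names are invented and do not correspond to any real VFIO or iommufd interface - but it makes the two overloaded meanings of IOMMU_CACHE concrete.

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the two separate questions hiding behind IOMMU_CACHE. */
struct user_caps {
	bool can_cache_maintenance;	/* ARM-style CMOs usable by the user */
	bool has_wbinvd;		/* x86 wbinvd reachable (e.g. via a uAPI) */
};

/* ARM meaning: non-coherent DMA is only safe if the user can flush. */
bool allow_noncoherent_dma(const struct user_caps *c)
{
	return c->can_cache_maintenance;
}

/* Intel meaning: no-snoop TLPs are only safe if wbinvd works for the
 * user; otherwise the IOMMU should enforce snoop (suppress no-snoop). */
bool must_enforce_snoop(const struct user_caps *c)
{
	return !c->has_wbinvd;
}
```

An unprivileged DPDK-style user has neither capability, so it gets coherent DMA with snoop enforced; a VM given a wbinvd uAPI could legitimately run with no-snoop allowed.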
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Thu, Oct 14, 2021 at 03:33:21PM +1100, da...@gibson.dropbear.id.au wrote:
> > If the HW can attach multiple non-overlapping IOAS's to the same
> > device then the HW is routing to the correct IOAS by using the
> > address bits. This is not much different from the prior discussion we
> > had where we were thinking of the PASID as an 80 bit address
>
> Ah... that might be a workable approach. And it even helps me get my
> head around multiple attachment which I was struggling with before.
>
> So, the rule would be that you can attach multiple IOASes to a device,
> as long as none of them overlap. The non-overlapping could be because
> each IOAS covers a disjoint address range, or it could be because
> there's some attached information - such as a PASID - to disambiguate.

Right exactly - it is very parallel to PASID

And obviously HW support is required to have multiple page table
pointers per RID - which sounds like PPC does (high/low pointer?)

> What remains a question is where the disambiguating information comes
> from in each case: does it come from properties of the IOAS,
> properties of the device, or from extra parameters supplied at attach
> time. IIUC, the current draft suggests it always comes at attach time
> for the PASID information. Obviously the more consistency we can have
> here the better.

From a generic view point I'd say all are fair game. It is up to the
IOMMU driver to take the requested set of IOAS's, the "at attachment"
information (like PASID) and decide what to do, or fail.

> I can also see an additional problem in implementation, once we start
> looking at hot-adding devices to existing address spaces.

I won't pretend to guess how to implement this :)

Just from a modeling perspective it is something that works logically.
If the kernel implementation is too hard then PPC should do one of the
other ideas.

Personally I'd probably try for a nice multi-domain attachment model
like PASID and not try to create/destroy domains.
As I said in my last email I think it is up to each IOMMU HW driver to
make these decisions, the iommufd framework just provides a
standardized API toward the attaching driver that the IOMMU HW must fit
into.

Jason
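[Editor's note] The rule David states - multiple IOASes may be attached to one device as long as they don't overlap, with either disjoint address ranges or a PASID disambiguating - is easy to model. This is a userspace sketch with invented names (NO_PASID, attach_ioas, etc.), not the proposed uAPI; it only demonstrates the conflict check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NO_PASID UINT32_MAX

/* An IOAS covers an IOVA window and may carry a PASID. */
struct ioas {
	uint64_t base, size;
	uint32_t pasid;		/* NO_PASID if routed purely by address */
};

struct device_attach {
	const struct ioas *ioas[4];
	int n;
};

bool conflicts(const struct ioas *a, const struct ioas *b)
{
	if (a->pasid != b->pasid)
		return false;	/* PASID disambiguates the routing */
	/* Same PASID (or none): address ranges must be disjoint. */
	return a->base < b->base + b->size && b->base < a->base + a->size;
}

/* Attach succeeds only if the new IOAS conflicts with none attached. */
bool attach_ioas(struct device_attach *d, const struct ioas *new_ioas)
{
	for (int i = 0; i < d->n; i++)
		if (conflicts(d->ioas[i], new_ioas))
			return false;
	d->ioas[d->n++] = new_ioas;
	return true;
}
```

With POWER-like numbers, a low window and a window at 2^59 coexist by address, while two overlapping windows need distinct PASIDs.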
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Thu, Oct 14, 2021 at 03:53:33PM +1100, David Gibson wrote:
> > My feeling is that qemu should be dealing with the host != target
> > case, not the kernel.
> >
> > The kernel's job should be to expose the IOMMU HW it has, with all
> > features accessible, to userspace.
>
> See... to me this is contrary to the point we agreed on above.

I'm not thinking of these as exclusive ideas. The IOCTL interface in
iommufd can quite happily expose:

  Create IOAS generically
  Manipulate IOAS generically
  Create IOAS with IOMMU driver specific attributes
  HW specific Manipulate IOAS IOCTL commands

all together. So long as everything is focused on a generic in-kernel
IOAS object it is fine to have multiple ways in the uAPI to create and
manipulate the objects.

When I speak about a generic interface I mean "Create IOAS generically"
- ie a set of IOCTLs that work on most IOMMU HW and can be relied upon
by things like DPDK/etc to always work and be portable. This is why I
like "hints" to provide some limited widely applicable
micro-optimization.

When I said "expose the IOMMU HW it has with all features accessible" I
meant also providing "Create IOAS with IOMMU driver specific
attributes". These other IOCTLs would allow the IOMMU driver to expose
every configuration knob its HW has, in a natural HW centric language.
There is no pretense of genericness here, no crazy foo=A, foo=B hidden
device specific interface.

Think of it as a high level/low level interface to the same thing.

> Those are certainly wrong, but they came about explicitly by *not*
> being generic rather than by being too generic. So I'm really
> confused as to what you're arguing for / against.

IMHO it is not having a PPC specific interface that was the problem, it
was making the PPC specific interface exclusive to the type 1
interface. If type 1 had continued to work on PPC then DPDK/etc would
never have learned PPC specific code.
For iommufd with the high/low interface each IOMMU HW should ask basic
questions:

- What should the generic high level interface do on this HW?
  For instance what should 'Create IOAS generically' do for PPC?
  It should not fail, it should create *something*.
  What is the best thing for DPDK?
  I guess the 64 bit window is most broadly useful.

- How to accurately describe the HW in terms of standard IOAS objects
  and where to put HW specific structs to support this.
  This is where PPC would decide how best to expose a control over its
  low/high window (eg 1,2,3 IOAS).
  Whatever the IOMMU driver wants, so long as it fits into the kernel
  IOAS model facing the connected device driver.

QEMU would have IOMMU userspace drivers. One would be the "generic
driver" using only the high level generic interface. It should work as
best it can on all HW devices. This is the fallback path you talked of.

QEMU would also have HW specific IOMMU userspace drivers that know how
to operate the exact HW. eg these drivers would know how to use
userspace page tables, how to form IOPTEs and how to access the special
features. This is how QEMU could use an optimized path with nested page
tables, for instance.

Jason
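[Editor's note] The high/low split Jason describes can be sketched as two entry points over the same IOAS object. Everything here is hypothetical - the enum, struct and function names, and the PPC window sizes, are invented for illustration; the RFC defines none of them. The only claim modeled is that the generic path never fails and hands out each driver's most broadly useful window, while the HW-specific path names an exact window.

```c
#include <assert.h>
#include <stdint.h>

enum hw { HW_X86, HW_PPC };

struct ioas_info {
	uint64_t base, size;
};

/* Generic path: each driver picks something that always works. A PPC
 * driver would hand out its 64-bit high window for DPDK-style users. */
struct ioas_info create_ioas_generic(enum hw hw)
{
	struct ioas_info info;
	switch (hw) {
	case HW_PPC:
		info.base = 1ULL << 59;	/* high window (illustrative) */
		info.size = 1ULL << 40;
		break;
	default:
		info.base = 0;		/* flat address space */
		info.size = ~0ULL;
		break;
	}
	return info;
}

/* HW specific path: the caller asks for the exact window it wants,
 * in the driver's own HW-centric language. */
struct ioas_info create_ioas_ppc(int window)
{
	struct ioas_info low = { 0, 1ULL << 31 };
	struct ioas_info high = { 1ULL << 59, 1ULL << 40 };
	return window == 0 ? low : high;
}
```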
Re: [PATCH RFC] virtio: wrap config->reset calls
On 13.10.2021 12:55, Michael S. Tsirkin wrote:
> This will enable cleanups down the road.
> The idea is to disable cbs, then add "flush_queued_cbs" callback
> as a parameter, this way drivers can flush any work
> queued after callbacks have been disabled.
>
> Signed-off-by: Michael S. Tsirkin
> ---
>  sound/virtio/virtio_card.c | 4 ++--

Reviewed-by: Anton Yakovlev

--
Anton Yakovlev
Senior Software Engineer
OpenSynergy GmbH
Rotherstr. 20, 10245 Berlin
Re: [RFC] iommu: Use put_pages_list
On 14/10/2021 12:20, Matthew Wilcox wrote:
> I'm going to keep pinging this patch weekly.
>
> On Thu, Oct 07, 2021 at 07:17:02PM +0100, Matthew Wilcox wrote:
> > ping?

Robin, Were you checking this? You mentioned "I got side-tracked trying
to make io-pgtable use that freelist properly" in another thread, which
seems related.

Thanks,
John

> On Thu, Sep 30, 2021 at 05:20:42PM +0100, Matthew Wilcox (Oracle) wrote:
> > page->freelist is for the use of slab. We already have the ability
> > to free a list of pages in the core mm, but it requires the use of a
> > list_head and for the pages to be chained together through
> > page->lru. Switch the iommu code over to using free_pages_list().
> >
> > Signed-off-by: Matthew Wilcox (Oracle)
> > ---
> >  drivers/iommu/amd/io_pgtable.c | 99 +++---
> >  drivers/iommu/dma-iommu.c      | 11 +---
> >  drivers/iommu/intel/iommu.c    | 89 +++---
> >  include/linux/iommu.h          |  3 +-
> >  4 files changed, 77 insertions(+), 125 deletions(-)
Re: [RFC] iommu: Use put_pages_list
I'm going to keep pinging this patch weekly. On Thu, Oct 07, 2021 at 07:17:02PM +0100, Matthew Wilcox wrote: > ping? > > On Thu, Sep 30, 2021 at 05:20:42PM +0100, Matthew Wilcox (Oracle) wrote: > > page->freelist is for the use of slab. We already have the ability > > to free a list of pages in the core mm, but it requires the use of a > > list_head and for the pages to be chained together through page->lru. > > Switch the iommu code over to using free_pages_list(). > > > > Signed-off-by: Matthew Wilcox (Oracle) > > --- > > drivers/iommu/amd/io_pgtable.c | 99 +++--- > > drivers/iommu/dma-iommu.c | 11 +--- > > drivers/iommu/intel/iommu.c| 89 +++--- > > include/linux/iommu.h | 3 +- > > 4 files changed, 77 insertions(+), 125 deletions(-) > > > > diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c > > index 182c93a43efd..8dfa6ee58b76 100644 > > --- a/drivers/iommu/amd/io_pgtable.c > > +++ b/drivers/iommu/amd/io_pgtable.c > > @@ -74,49 +74,37 @@ static u64 *first_pte_l7(u64 *pte, unsigned long > > *page_size, > > * > > > > / > > > > -static void free_page_list(struct page *freelist) > > -{ > > - while (freelist != NULL) { > > - unsigned long p = (unsigned long)page_address(freelist); > > - > > - freelist = freelist->freelist; > > - free_page(p); > > - } > > -} > > - > > -static struct page *free_pt_page(unsigned long pt, struct page *freelist) > > +static void free_pt_page(unsigned long pt, struct list_head *list) > > { > > struct page *p = virt_to_page((void *)pt); > > > > - p->freelist = freelist; > > - > > - return p; > > + list_add_tail(>lru, list); > > } > > > > #define DEFINE_FREE_PT_FN(LVL, FN) > > \ > > -static struct page *free_pt_##LVL (unsigned long __pt, struct page > > *freelist) \ > > -{ > > \ > > - unsigned long p; > > \ > > - u64 *pt; > > \ > > - int i; > > \ > > - > > \ > > - pt = (u64 *)__pt; > > \ > > - > > \ > > - for (i = 0; i < 512; ++i) { > > \ > > - /* PTE present? 
*/ > > \ > > - if (!IOMMU_PTE_PRESENT(pt[i])) > > \ > > - continue; > > \ > > - > > \ > > - /* Large PTE? */ > > \ > > - if (PM_PTE_LEVEL(pt[i]) == 0 || > > \ > > - PM_PTE_LEVEL(pt[i]) == 7) > > \ > > - continue; > > \ > > - > > \ > > - p = (unsigned long)IOMMU_PTE_PAGE(pt[i]); > > \ > > - freelist = FN(p, freelist); > > \ > > - } > > \ > > - > > \ > > - return free_pt_page((unsigned long)pt, freelist); > > \ > > +static void free_pt_##LVL (unsigned long __pt, struct list_head *list) > > \ > > +{ \ > > + unsigned long p;\ > > + u64 *pt;\ > > + int i; \ > > + \ > > + pt = (u64 *)__pt; \ > > + \ > > + for (i = 0; i < 512; ++i) { \ > > + /* PTE present? */ \ > > + if (!IOMMU_PTE_PRESENT(pt[i])) \ > > + continue; \ > > + \ > > + /* Large PTE? */\ > > + if
RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
> From: Jason Gunthorpe
> Sent: Friday, October 1, 2021 6:24 AM
>
> On Thu, Sep 30, 2021 at 09:35:45AM +, Tian, Kevin wrote:
>
> > > The Intel functional issue is that Intel blocks the cache
> > > maintenance ops from the VM and the VM has no way to self-discover
> > > that the cache maintenance ops don't work.
> >
> > the VM doesn't need to know whether the maintenance ops
> > actually works.
>
> Which is the whole problem.
>
> Intel has a design where the device driver tells the device to issue
> non-cachable TLPs.
>
> The driver is supposed to know if it can issue the cache maintenance
> instructions - if it can then it should ask the device to issue
> no-snoop TLPs.
>
> For instance the same PCI driver on non-x86 should never ask the
> device to issue no-snoop TLPs because it has no idea how to restore
> cache coherence on eg ARM.
>
> Do you see the issue? This configuration where the hypervisor silently
> makes wbinvd a NOP breaks the x86 architecture because the guest has
> no idea it can no longer use no-snoop features.

Thanks for the explanation. But I still have one puzzle about the
'break' part.

If the hypervisor makes wbinvd a NOP then it will also set the
enforce_snoop bit in the PTE to convert non-snoop packets to snoop. No
function in the guest is broken, just the performance may lag.

If performance matters then the hypervisor configures the IOMMU to
allow non-snoop packets and then emulates wbinvd properly.

The contract between vfio and kvm is to convey the above requirement on
how wbinvd is handled.

But in both cases cache maintenance instructions are available from
guest p.o.v and no coherency semantics would be violated.

> Using the IOMMU to forcibly prevent the device from issuing no-snoop
> makes this whole issue of the broken wbinvd moot.

it's not preventing the issuing. Instead, the IOMMU converts non-snoop
requests to snoop.
> It is important to be really clear on what this is about - this is
> not some idealized nice iommu feature - it is working around a lot of
> backwards compatibility baggage that is probably completely unique to
> x86.
>
> > > Other arches don't seem to have this specific problem...
> >
> > I think the key is whether other archs allow driver to decide DMA
> > coherency and indirectly the underlying I/O page table format.
> > If yes, then I don't see a reason why such decision should not be
> > given to userspace for passthrough case.
>
> The choice all comes down to if the other arches have cache
> maintenance instructions in the VM that *don't work*

Looks like vfio always sets IOMMU_CACHE on all platforms as long as the
iommu supports it (true on all platforms except the intel iommu which
is dedicated for GPU):

vfio_iommu_type1_attach_group()
{
    ...
    if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
        domain->prot |= IOMMU_CACHE;
    ...
}

Should the above be set according to whether a device is coherent?

Thanks
Kevin
Re: [PATCH RFC] virtio: wrap config->reset calls
On Wed, Oct 13, 2021 at 06:55:31AM -0400, Michael S. Tsirkin wrote:
> This will enable cleanups down the road.
> The idea is to disable cbs, then add "flush_queued_cbs" callback
> as a parameter, this way drivers can flush any work
> queued after callbacks have been disabled.
>
> Signed-off-by: Michael S. Tsirkin
> ---
>  drivers/iommu/virtio-iommu.c | 2 +-

Reviewed-by: Jean-Philippe Brucker
Re: [PATCH] iommu/arm: fix ARM_SMMU_QCOM compilation
On Wed, Oct 13, 2021 at 09:31:40PM +0200, Arnd Bergmann wrote:
> On Wed, Oct 13, 2021 at 6:20 PM Will Deacon wrote:
> > On Wed, Oct 13, 2021 at 10:33:55AM +0200, Arnd Bergmann wrote:
> > > On Wed, Oct 13, 2021 at 9:58 AM Will Deacon wrote:
> > > > On Tue, Oct 12, 2021 at 05:18:00PM +0200, Arnd Bergmann wrote:
> > > > I was hoping you and Joerg could just pick your preferred patch
> > > > into the iommu fixes tree for v5.15.
> > >
> > > I currently have nothing else pending for my asm-generic tree that
> > > introduced the regression, but I can take it through there if that
> > > helps you.
> >
> > I also don't have any fixes pending, and I don't see any in Joerg's
> > tree so it's probably quickest if you send it on yourself. Is that
> > ok?
>
> Sure, no problem. I ended up adding it to the arm/fixes branch of the
> soc tree, as I just merged some other fixes there, and it seems as
> good as any of the other trees.

Thanks, Arnd!

Will
RE: [RFC 13/20] iommu: Extend iommu_at[de]tach_device() for multiple devices group
> From: David Gibson
> Sent: Thursday, October 14, 2021 1:24 PM
>
> On Sun, Sep 19, 2021 at 02:38:41PM +0800, Liu Yi L wrote:
> > From: Lu Baolu
> >
> > These two helpers could be used when 1) the iommu group is singleton,
> > or 2) the upper layer has put the iommu group into the secure state
> > by calling iommu_device_init_user_dma().
> >
> > As we want the iommufd design to be a device-centric model, we want
> > to remove any group knowledge in iommufd. Given that we already have
> > iommu_at[de]tach_device() interface, we could extend it for iommufd
> > simply by doing below:
> >
> > - first device in a group does group attach;
> > - last device in a group does group detach.
> >
> > as long as the group has been put into the secure context.
> >
> > The commit <426a273834eae> ("iommu: Limit iommu_attach/detach_device
> > to device with their own group") deliberately restricts the two
> > interfaces to single-device group. To avoid the conflict with
> > existing usages, we keep this policy and put the new extension only
> > when the group has been marked for user_dma.
>
> I still kind of hate this interface because it means an operation that
> appears to be explicitly on a single device has an implicit effect on
> other devices.

I still don't get why this is such a big deal. With this proposal the
group restriction will be 'explicitly' documented in the attach uAPI
comment and the sample flow in iommufd.rst. A sane user should read all
that information to understand how this new sub-system works and follow
whatever constraints are claimed there. In the end the user has to
maintain the same group knowledge regardless of whether it uses an
explicit group uAPI or a device uAPI with a group constraint...

Thanks,
Kevin
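[Editor's note] The "first device in a group does group attach; last device does group detach" rule quoted above is a plain refcount pattern. A userspace model (struct and function names invented, not the proposed kernel interface):

```c
#include <assert.h>

/* Model of the proposed rule: only the first attach and the last
 * detach touch the real group/domain state. */
struct model_group {
	int attached_devices;
	int hw_attach_calls;	/* times the real group attach ran */
	int hw_detach_calls;	/* times the real group detach ran */
};

void device_attach(struct model_group *g)
{
	if (g->attached_devices++ == 0)
		g->hw_attach_calls++;	/* first device: real group attach */
}

void device_detach(struct model_group *g)
{
	if (--g->attached_devices == 0)
		g->hw_detach_calls++;	/* last device: real group detach */
}
```

This also makes David's objection concrete: attaching the first device changes translation for every device in the group, even though the call names only one device.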
RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
> From: David Gibson
> Sent: Thursday, October 14, 2021 1:00 PM
>
> On Wed, Oct 13, 2021 at 07:00:58AM +, Tian, Kevin wrote:
> > > From: David Gibson
> > > Sent: Friday, October 1, 2021 2:11 PM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > This patch adds IOASID allocation/free interface per iommufd.
> > > > When allocating an IOASID, userspace is expected to specify the
> > > > type and format information for the target I/O page table.
> > > >
> > > > This RFC supports only one type
> > > > (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), implying a kernel-managed I/O
> > > > page table with vfio type1v2 mapping semantics. For this type the
> > > > user should specify the addr_width of the I/O address space and
> > > > whether the I/O page table is created in an iommu enforce_snoop
> > > > format. enforce_snoop must be true at this point, as the false
> > > > setting requires additional contract with KVM on handling WBINVD
> > > > emulation, which can be added later.
> > > >
> > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next
> > > > patch) for what formats can be specified when allocating an
> > > > IOASID.
> > > >
> > > > Open:
> > > > - Devices on PPC platform currently use a different iommu driver
> > > >   in vfio. Per previous discussion they can also use vfio type1v2
> > > >   as long as there is a way to claim a specific iova range from a
> > > >   system-wide address space. This requirement doesn't sound PPC
> > > >   specific, as addr_width for pci devices can be also represented
> > > >   by a range [0, 2^addr_width-1]. This RFC hasn't adopted this
> > > >   design yet. We hope to have formal alignment in v1 discussion
> > > >   and then decide how to incorporate it in v2.
> > >
> > > Ok, there are several things we need for ppc. None of which are
> > > inherently ppc specific and some of which will I think be useful
> > > for most platforms.
> > > So, starting from most general to most specific, here's basically
> > > what's needed:
> > >
> > > 1. We need to represent the fact that the IOMMU can only translate
> > >    *some* IOVAs, not a full 64-bit range. You have the addr_width
> > >    already, but I'm not entirely sure if the translatable range on
> > >    ppc (or other platforms) is always a power-of-2 size. It usually
> > >    will be, of course, but I'm not sure that's a hard requirement.
> > >    So using a size/max rather than just a number of bits might be
> > >    safer.
> > >
> > >    I think basically every platform will need this. Most platforms
> > >    don't actually implement full 64-bit translation in any case,
> > >    but rather some smaller number of bits that fits their page
> > >    table format.
> > >
> > > 2. The translatable range of IOVAs may not begin at 0. So we need
> > >    to advertise to userspace what the base address is, as well as
> > >    the size. POWER's main IOVA range begins at 2^59 (at least on
> > >    the models I know about).
> > >
> > >    I think a number of platforms are likely to want this, though I
> > >    couldn't name them apart from POWER. Putting the translated IOVA
> > >    window at some huge address is a pretty obvious approach to
> > >    making an IOMMU which can translate a wide address range without
> > >    colliding with any legacy PCI addresses down low (the IOMMU can
> > >    check if this transaction is for it by just looking at some high
> > >    bits in the address).
> > >
> > > 3. There might be multiple translatable ranges. So, on POWER the
> > >    IOMMU can typically translate IOVAs from 0..2GiB, and also from
> > >    2^59..2^59+. The two ranges have completely separate IO
> > >    page tables, with (usually) different layouts. (The low range
> > >    will nearly always be a single-level page table with 4kiB or
> > >    64kiB entries, the high one will be multiple levels depending on
> > >    the size of the range and pagesize).
> > >
> > >    This may be less common, but I suspect POWER won't be the only
> > >    platform to do something like this.
As above, using a high range > > >is a pretty obvious approach, but clearly won't handle older > > >devices which can't do 64-bit DMA. So adding a smaller range for > > >those devices is again a pretty obvious solution. Any platform > > >with an "IO hole" can be treated as having two ranges, one below > > >the hole and one above it (although in that case they may well not > > >have separate page tables > > > > 1-3 are common on all platforms with fixed reserved ranges. Current > > vfio already reports permitted iova ranges to user via VFIO_IOMMU_ > > TYPE1_INFO_CAP_IOVA_RANGE and the user is expected to construct > > maps only in those ranges. iommufd can follow the same logic for the > > baseline uAPI. > > > > For above cases a [base, max] hint can be provided by the user per > > Jason's recommendation. > > Provided at which stage? IOMMU_IOASID_ALLOC > > > It is a hint as no additional
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Wed, Oct 13, 2021 at 07:00:58AM +, Tian, Kevin wrote: > > From: David Gibson > > Sent: Friday, October 1, 2021 2:11 PM > > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > > This patch adds IOASID allocation/free interface per iommufd. When > > > allocating an IOASID, userspace is expected to specify the type and > > > format information for the target I/O page table. > > > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > > semantics. For this type the user should specify the addr_width of > > > the I/O address space and whether the I/O page table is created in > > > an iommu enfore_snoop format. enforce_snoop must be true at this point, > > > as the false setting requires additional contract with KVM on handling > > > WBINVD emulation, which can be added later. > > > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) > > > for what formats can be specified when allocating an IOASID. > > > > > > Open: > > > - Devices on PPC platform currently use a different iommu driver in vfio. > > > Per previous discussion they can also use vfio type1v2 as long as there > > > is a way to claim a specific iova range from a system-wide address > > > space. > > > This requirement doesn't sound PPC specific, as addr_width for pci > > devices > > > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't > > > adopted this design yet. We hope to have formal alignment in v1 > > discussion > > > and then decide how to incorporate it in v2. > > > > Ok, there are several things we need for ppc. None of which are > > inherently ppc specific and some of which will I think be useful for > > most platforms. So, starting from most general to most specific > > here's basically what's needed: > > > > 1. We need to represent the fact that the IOMMU can only translate > >*some* IOVAs, not a full 64-bit range. 
You have the addr_width > >already, but I'm not entirely sure if the translatable range on ppc > >(or other platforms) is always a power-of-2 size. It usually will > >be, of course, but I'm not sure that's a hard requirement. So > >using a size/max rather than just a number of bits might be safer. > > > >I think basically every platform will need this. Most platforms > >don't actually implement full 64-bit translation in any case, but > >rather some smaller number of bits that fits their page table > >format. > > > > 2. The translatable range of IOVAs may not begin at 0. So we need to > >advertise to userspace what the base address is, as well as the > >size. POWER's main IOVA range begins at 2^59 (at least on the > >models I know about). > > > >I think a number of platforms are likely to want this, though I > >couldn't name them apart from POWER. Putting the translated IOVA > >window at some huge address is a pretty obvious approach to making > >an IOMMU which can translate a wide address range without colliding > >with any legacy PCI addresses down low (the IOMMU can check if this > >transaction is for it by just looking at some high bits in the > >address). > > > > 3. There might be multiple translatable ranges. So, on POWER the > >IOMMU can typically translate IOVAs from 0..2GiB, and also from > >2^59..2^59+. The two ranges have completely separate IO > >page tables, with (usually) different layouts. (The low range will > >nearly always be a single-level page table with 4kiB or 64kiB > >entries, the high one will be multiple levels depending on the size > >of the range and pagesize). > > > >This may be less common, but I suspect POWER won't be the only > >platform to do something like this. As above, using a high range > >is a pretty obvious approach, but clearly won't handle older > >devices which can't do 64-bit DMA. So adding a smaller range for > >those devices is again a pretty obvious solution. 
Any platform > >with an "IO hole" can be treated as having two ranges, one below > >the hole and one above it (although in that case they may well not > >have separate page tables > > 1-3 are common on all platforms with fixed reserved ranges. Current > vfio already reports permitted iova ranges to user via VFIO_IOMMU_ > TYPE1_INFO_CAP_IOVA_RANGE and the user is expected to construct > maps only in those ranges. iommufd can follow the same logic for the > baseline uAPI. > > For above cases a [base, max] hint can be provided by the user per > Jason's recommendation. Provided at which stage? > It is a hint as no additional restriction is > imposed, For the qemu type use case, that's not true. In that case we *require* the available mapping ranges to match what the guest platform expects. > since the kernel only cares about no violation on permitted > ranges that it reports to the user. Underlying iommu
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Mon, Oct 11, 2021 at 02:17:48PM -0300, Jason Gunthorpe wrote: > On Mon, Oct 11, 2021 at 04:37:38PM +1100, da...@gibson.dropbear.id.au wrote: > > > PASID support will already require that a device can be multi-bound to > > > many IOAS's, couldn't PPC do the same with the windows? > > > > I don't see how that would make sense. The device has no awareness of > > multiple windows the way it does of PASIDs. It just sends > > transactions over the bus with the IOVAs it's told. If those IOVAs > > lie within one of the windows, the IOMMU picks them up and translates > > them. If they don't, it doesn't. > > To my mind that address centric routing is awareness. I don't really understand that position. A PASID capable device has to be built to be PASID capable, and will generally have registers into which you store PASIDs to use. Any 64-bit DMA capable device can use the POWER IOMMU just fine - it's up to the driver to program it with addresses that will be translated (and in Linux the driver will get those from the DMA subsystem). > If the HW can attach multiple non-overlapping IOAS's to the same > device then the HW is routing to the correct IOAS by using the address > bits. This is not much different from the prior discussion we had > where we were thinking of the PASID as an 80 bit address Ah... that might be a workable approach. And it even helps me get my head around multiple attachment which I was struggling with before. So, the rule would be that you can attach multiple IOASes to a device, as long as none of them overlap. The non-overlapping could be because each IOAS covers a disjoint address range, or it could be because there's some attached information - such as a PASID - to disambiguate. What remains a question is where the disambiguating information comes from in each case: does it come from properties of the IOAS, properties of the device, or from extra parameters supplied at attach time. 
IIUC, the current draft suggests it always comes at attach time for the PASID information. Obviously the more consistency we can have here the better. I can also see an additional problem in implementation, once we start looking at hot-adding devices to existing address spaces. Suppose our software (maybe qemu) wants to set up a single DMA view for a bunch of devices, that has such a split window. It can set up IOASes easily enough for the two windows, then it needs to attach them. Presumably, it attaches them one at a time, which means that each device (or group) goes through an interim state where it's attached to one, but not the other. That can probably be achieved by using an extra IOMMU domain (or the local equivalent) in the hardware for that interim state. However it means we have to repeatedly create and destroy that extra domain for each device after the first we add, rather than simply adding each device to the domain which has both windows. [I think this doesn't arise on POWER when running under PowerVM. That has no concept like IOMMU domains, and instead the mapping is always done per "partitionable endpoint" (PE), essentially a group. That means it's just a question of whether we mirror mappings on both windows into a given PE or just those from one IOAS. It's not an unreasonable extension/combination of existing hardware quirks to consider, though] > The fact the PPC HW actually has multiple page table roots and those > roots even have different page tables layouts while still connected to > the same device suggests this is not even an unnatural modelling > approach... > > Jason > > -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Mon, Oct 11, 2021 at 03:49:14PM -0300, Jason Gunthorpe wrote: > On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote: > > > > This means we cannot define an input that has a magic HW specific > > > value. > > > > I'm not entirely sure what you mean by that. > > I mean if you make a general property 'foo' that userspace must > specify correctly then your API isn't general anymore. Userspace must > know if it is A or B HW to set foo=A or foo=B. I absolutely agree. Which is exactly why I'm advocating that userspace should request from the kernel what it needs (providing a *minimum* of information) and the kernel satisfies that (filling in the missing information as suitable for the platform) or outright fails. I think that is more robust across multiple platforms and usecases than advertising a bunch of capabilities and forcing userspace to interpret those to work out what it can do. > Supported IOVA ranges are easially like that as every IOMMU is > different. So DPDK shouldn't provide such specific or binding > information. Absolutely, DPDK should not provide that. qemu *should* provide that, because the specific IOVAs matter to the guest. That will inevitably mean that the request is more likely to fail, but that's a fundamental tradeoff. > > No, I don't think that needs to be a condition. I think it's > > perfectly reasonable for a constraint to be given, and for the host > > IOMMU to just say "no, I can't do that". But that does mean that each > > of these values has to have an explicit way of userspace specifying "I > > don't care", so that the kernel will select a suitable value for those > > instead - that's what DPDK or other userspace would use nearly all the > > time. > > My feeling is that qemu should be dealing with the host != target > case, not the kernel. > > The kernel's job should be to expose the IOMMU HW it has, with all > features accessible, to userspace. See... to me this is contrary to the point we agreed on above. 
> Qemu's job should be to have a userspace driver for each kernel IOMMU > and the internal infrastructure to make accelerated emulations for all > supported target IOMMUs. This seems the wrong way around to me. I see qemu as providing logic to emulate each target IOMMU. Where that matches the host, there's the potential for an accelerated implementation, but it makes life a lot easier if we can at least have a fallback that will work on any sufficiently capable host IOMMU. > In other words, it is not the kernel's job to provide target IOMMU > emulation. Absolutely not. But it *is* the kernel's job to let qemu do as much as it can with the *host* IOMMU. > The kernel should provide truly generic "works everywhere" interface > that qemu/etc can rely on to implement the least accelerated emulation > path. Right... seems like we're agreeing again. > So when I see proposals to have "generic" interfaces that actually > require very HW specific setup, and cannot be used by a generic qemu > userspace driver, I think it breaks this model. If qemu needs to know > it is on PPC (as it does today with VFIO's PPC specific API) then it > may as well speak PPC specific language and forget about pretending to > be generic. Absolutely, the current situation is a mess. > This approach is grounded in 15 years of trying to build these > user/kernel split HW subsystems (particularly RDMA) where it has > become painfully obvious that the kernel is the worst place to try and > wrangle really divergent HW into a "common" uAPI. > > This is because the kernel/user boundary is fixed. Introducing > anything generic here requires a lot of time, thought, arguing and > risk. Usually it ends up being done wrong (like the PPC specific > ioctls, for instance) Those are certainly wrong, but they came about explicitly by *not* being generic rather than by being too generic. So I'm really confused as to what you're arguing for / against. 
> and when this happens we can't learn and adapt, > we are stuck with stable uABI forever. > > Exposing a device's native programming interface is much simpler. Each > device is fixed, defined and someone can sit down and figure out how > to expose it. Then that is it, it doesn't need revisiting, it doesn't > need harmonizing with a future slightly different device, it just > stays as is. I can certainly see the case for that approach. That seems utterly at odds with what /dev/iommu is trying to do, though. > The cost, is that there must be a userspace driver component for each > HW piece - which we are already paying here! > > > Ideally the host /dev/iommu will say "ok!", since both those ranges > > are within the 0..2^60 translated range of the host IOMMU, and don't > > touch the IO hole. When the guest calls the IO mapping hypercalls, > > qemu translates those into DMA_MAP operations, and since they're all > > within the previously verified windows, they should work fine. > > For instance, we are going to see HW with nested
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Mon, Oct 11, 2021 at 09:49:57AM +0100, Jean-Philippe Brucker wrote: > On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote: > > qemu wants to emulate a PAPR vIOMMU, so it says (via interfaces yet to > > be determined) that it needs an IOAS where things can be mapped in the > > range 0..2GiB (for the 32-bit window) and 2^59..2^59+1TiB (for the > > 64-bit window). > > > > Ideally the host /dev/iommu will say "ok!", since both those ranges > > are within the 0..2^60 translated range of the host IOMMU, and don't > > touch the IO hole. When the guest calls the IO mapping hypercalls, > > qemu translates those into DMA_MAP operations, and since they're all > > within the previously verified windows, they should work fine. > > Seems like we don't need the negotiation part? The host kernel > communicates available IOVA ranges to userspace including holes (patch > 17), and userspace can check that the ranges it needs are within the IOVA > space boundaries. That part is necessary for DPDK as well since it needs > to know about holes in the IOVA space where DMA wouldn't work as expected > (MSI doorbells for example). And there already is a negotiation happening, > when the host kernel rejects MAP ioctl outside the advertised area. The problem with the approach where the kernel advertises and userspace selects based on that, is that it locks us into a specific representation of what's possible. If we get new hardware with new weird constraints that can't be expressed with the representation we chose, we're kind of stuffed. Userspace will have to change to accommodate the new extension to have any chance of working on the new hardware. With the model where userspace requests, and the kernel acks or nacks, we can still support existing userspace if the only things it requests can still be accommodated in the new constraints. 
That's pretty likely if the majority of userspaces request very simple things (say a single IOVA block where it doesn't care about the base address). -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [RFC 13/20] iommu: Extend iommu_at[de]tach_device() for multiple devices group
On Sun, Sep 19, 2021 at 02:38:41PM +0800, Liu Yi L wrote: > From: Lu Baolu > > These two helpers could be used when 1) the iommu group is singleton, > or 2) the upper layer has put the iommu group into the secure state by > calling iommu_device_init_user_dma(). > > As we want the iommufd design to be a device-centric model, we want to > remove any group knowledge in iommufd. Given that we already have > iommu_at[de]tach_device() interface, we could extend it for iommufd > simply by doing below: > > - first device in a group does group attach; > - last device in a group does group detach. > > as long as the group has been put into the secure context. > > The commit <426a273834eae> ("iommu: Limit iommu_attach/detach_device to > device with their own group") deliberately restricts the two interfaces > to single-device group. To avoid the conflict with existing usages, we > keep this policy and put the new extension only when the group has been > marked for user_dma. I still kind of hate this interface because it means an operation that appears to be explicitly on a single device has an implicit effect on other devices. 
> Signed-off-by: Lu Baolu > --- > drivers/iommu/iommu.c | 25 + > 1 file changed, 21 insertions(+), 4 deletions(-) > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c > index bffd84e978fb..b6178997aef1 100644 > --- a/drivers/iommu/iommu.c > +++ b/drivers/iommu/iommu.c > @@ -47,6 +47,7 @@ struct iommu_group { > struct list_head entry; > unsigned long user_dma_owner_id; > refcount_t owner_cnt; > + refcount_t attach_cnt; > }; > > struct group_device { > @@ -1994,7 +1995,7 @@ static int __iommu_attach_device(struct iommu_domain > *domain, > int iommu_attach_device(struct iommu_domain *domain, struct device *dev) > { > struct iommu_group *group; > - int ret; > + int ret = 0; > > group = iommu_group_get(dev); > if (!group) > @@ -2005,11 +2006,23 @@ int iommu_attach_device(struct iommu_domain *domain, > struct device *dev) >* change while we are attaching >*/ > mutex_lock(&group->mutex); > - ret = -EINVAL; > - if (iommu_group_device_count(group) != 1) > + if (group->user_dma_owner_id) { > + if (group->domain) { > + if (group->domain != domain) > + ret = -EBUSY; > + else > + refcount_inc(&group->attach_cnt); > + > + goto out_unlock; > + } > + } else if (iommu_group_device_count(group) != 1) { With this condition in the else, how can you ever attach the first device of a multi-device group? > + ret = -EINVAL; > goto out_unlock; > + } > > ret = __iommu_attach_group(domain, group); > + if (!ret && group->user_dma_owner_id) > + refcount_set(&group->attach_cnt, 1); > > out_unlock: > mutex_unlock(&group->mutex); > @@ -2261,7 +2274,10 @@ void iommu_detach_device(struct iommu_domain *domain, > struct device *dev) > return; > > mutex_lock(&group->mutex); > - if (iommu_group_device_count(group) != 1) { > + if (group->user_dma_owner_id) { > + if (!refcount_dec_and_test(&group->attach_cnt)) > + goto out_unlock; > + } else if (iommu_group_device_count(group) != 1) { Shouldn't this path (detach a thing that's not attached), be a no-op regardless of whether it's a singleton group or not? 
Why does one deserve a warning and one not? > WARN_ON(1); > goto out_unlock; > } > @@ -3368,6 +3384,7 @@ static int iommu_group_init_user_dma(struct iommu_group > *group, > > group->user_dma_owner_id = owner; > refcount_set(&group->owner_cnt, 1); > + refcount_set(&group->attach_cnt, 0); > > /* default domain is unsafe for user-initiated dma */ > if (group->domain == group->default_domain) -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson