Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Fri, Oct 01, 2021 at 09:25:05AM -0300, Jason Gunthorpe wrote: > On Fri, Oct 01, 2021 at 04:19:22PM +1000, da...@gibson.dropbear.id.au wrote: > > On Wed, Sep 22, 2021 at 11:09:11AM -0300, Jason Gunthorpe wrote: > > > On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote: > > > > > From: Jason Gunthorpe > > > > > Sent: Wednesday, September 22, 2021 1:45 AM > > > > > > > > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > > > > > This patch adds IOASID allocation/free interface per iommufd. When > > > > > > allocating an IOASID, userspace is expected to specify the type and > > > > > > format information for the target I/O page table. > > > > > > > > > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > > > > > semantics. For this type the user should specify the addr_width of > > > > > > the I/O address space and whether the I/O page table is created in > > > > > > an iommu enfore_snoop format. enforce_snoop must be true at this > > > > > > point, > > > > > > as the false setting requires additional contract with KVM on > > > > > > handling > > > > > > WBINVD emulation, which can be added later. > > > > > > > > > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) > > > > > > for what formats can be specified when allocating an IOASID. > > > > > > > > > > > > Open: > > > > > > - Devices on PPC platform currently use a different iommu driver in > > > > > > vfio. > > > > > > Per previous discussion they can also use vfio type1v2 as long as > > > > > > there > > > > > > is a way to claim a specific iova range from a system-wide > > > > > > address space. > > > > > > This requirement doesn't sound PPC specific, as addr_width for pci > > > > > devices > > > > > > can be also represented by a range [0, 2^addr_width-1]. This RFC > > > > > > hasn't > > > > > > adopted this design yet. 
We hope to have formal alignment in v1 > > > > > discussion > > > > > > and then decide how to incorporate it in v2. > > > > > > > > > > I think the request was to include a start/end IO address hint when > > > > > creating the ios. When the kernel creates it then it can return the > > > > is the hint single-range or could be multiple-ranges? > > > David explained it here: > > > > > > https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/ > > > > Apparently not well enough. I've attempted again in this thread. > > > qemu needs to be able to choose if it gets the 32 bit range or 64 > > > bit range. > > No. qemu needs to supply *both* the 32-bit and 64-bit range to its > > guest, and therefore needs to request both from the host. > As I understood your remarks each IOAS can only be one of the formats > as they have a different PTE layout. So here I meant that qemu needs to > be able to pick *for each IOAS* which of the two formats it is. No. Both windows are in the same IOAS. A device could do DMA simultaneously to both windows. More realistically, a 64-bit DMA capable and a non-64-bit DMA capable device could be in the same group and be doing DMAs to different windows simultaneously. > > Or rather, it *might* need to supply both. It will supply just the > > 32-bit range by default, but the guest can request the 64-bit range > > and/or remove and resize the 32-bit range via hypercall interfaces. > > Vaguely recent Linux guests certainly will request the 64-bit range in > > addition to the default 32-bit range. > > And this would result in two different IOAS objects There might be two different IOAS objects for setup, but at some point they need to be combined into one IOAS to which the device is actually attached. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_!
http://www.ozlabs.org/~dgibson ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
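The two-window model David describes (a 32-bit and a 64-bit window inside a single IOAS) amounts to the IOAS carrying a list of allowed IOVA ranges rather than the single [0, 2^addr_width-1] span the RFC assumes. A user-space sketch of that bookkeeping — the struct and function names here are illustrative, not part of the RFC:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One allowed IOVA window, e.g. a PPC 32-bit or 64-bit DMA window. */
struct iova_range {
	uint64_t start;
	uint64_t last;		/* inclusive */
};

/* Hypothetical IOAS with several valid windows instead of one span. */
struct ioas {
	const struct iova_range *ranges;
	size_t nr_ranges;
};

/* A map request is valid only if it fits entirely inside one window. */
static bool ioas_iova_ok(const struct ioas *ioas, uint64_t iova, uint64_t len)
{
	uint64_t last = iova + len - 1;
	size_t i;

	if (len == 0 || last < iova)	/* reject zero length and overflow */
		return false;
	for (i = 0; i < ioas->nr_ranges; i++)
		if (iova >= ioas->ranges[i].start &&
		    last <= ioas->ranges[i].last)
			return true;
	return false;
}
```

A device attached to such an IOAS could then DMA to either window, which matches the "both windows in the same IOAS" point above.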
Upstream-Patching for iommu (intel)
Hi all. What is the upstream list to which patches for the Intel iommu driver should be posted? Is it iommu@lists.linux-foundation.org? Thanks and regards, Ajay
Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
On 2021-10-01 4:46 p.m., Jason Gunthorpe wrote: > On Fri, Oct 01, 2021 at 04:22:28PM -0600, Logan Gunthorpe wrote: > >>> It would close this issue, however synchronize_rcu() is very slow >>> (think > 1second) in some cases and thus cannot be inserted here. >> >> It shouldn't be *that* slow, at least not the vast majority of the >> time... it seems a bit unreasonable that a CPU wouldn't schedule for >> more than a second. > > I've seen bug reports on exactly this, it is well known. Loaded > big multi-cpu systems have high delays here, for whatever reason. > >> But these aren't fast paths and synchronize_rcu() already gets >> called in the unbind path for p2pdma a couple of times. I'm sure it >> would also be fine to slow down the vma_close() path as well. > > vma_close is done in a loop destroying vma's and if each synchronize > costs > 1s it can take forever to close a process. We had to kill a > similar use of synchronize_rcu in RDMA because users were complaining > of > 40s process exit times. Ah, fair. This adds a bit of complexity, but we could do a call_rcu() in vma_close to do the page frees. Logan
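The call_rcu() idea here — queue the page frees to run after the grace period instead of blocking vma_close() in synchronize_rcu() — can be modeled outside the kernel. This is a toy, single-threaded model, not kernel code; the model_* names are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

typedef void (*rcu_cb_t)(void *);

struct rcu_deferred {
	rcu_cb_t cb;
	void *arg;
};

#define MAX_DEFERRED 16
static struct rcu_deferred pending[MAX_DEFERRED];
static size_t nr_pending;

/* Model of call_rcu(): record the callback and return immediately,
 * so the caller (here, a vma_close() stand-in) never blocks. */
static void model_call_rcu(rcu_cb_t cb, void *arg)
{
	pending[nr_pending].cb = cb;
	pending[nr_pending].arg = arg;
	nr_pending++;
}

/* Model of the grace period ending: only now run the queued frees,
 * once no reader (e.g. GUP-fast) can still hold a stale PTE. */
static void model_grace_period_end(void)
{
	for (size_t i = 0; i < nr_pending; i++)
		pending[i].cb(pending[i].arg);
	nr_pending = 0;
}

static int freed_pages;

static void free_page_cb(void *arg)
{
	(void)arg;		/* a real callback would free the page */
	freed_pages++;
}
```

The trade-off discussed in the thread is exactly this: synchronize_rcu() makes the close path wait out the grace period, while call_rcu() moves the wait off the caller at the cost of deferred cleanup state.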
Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
On 10/1/21 15:46, Jason Gunthorpe wrote: On Fri, Oct 01, 2021 at 04:22:28PM -0600, Logan Gunthorpe wrote: It would close this issue, however synchronize_rcu() is very slow (think > 1second) in some cases and thus cannot be inserted here. It shouldn't be *that* slow, at least not the vast majority of the time... it seems a bit unreasonable that a CPU wouldn't schedule for more than a second. I've seen bug reports on exactly this, it is well known. Loaded big multi-cpu systems have high delays here, for whatever reason. So have I. One reason is that synchronize_rcu() doesn't merely wait for a context switch on each CPU--it also waits for callbacks (such as those set up by call_rcu(), if I understand correctly) to run. These can really add up to something quite substantial. In fact, I don't think there is an upper limit on the running times, anywhere. thanks, -- John Hubbard NVIDIA
Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
On Fri, Oct 01, 2021 at 04:22:28PM -0600, Logan Gunthorpe wrote: > > It would close this issue, however synchronize_rcu() is very slow > > (think > 1second) in some cases and thus cannot be inserted here. > > It shouldn't be *that* slow, at least not the vast majority of the > time... it seems a bit unreasonable that a CPU wouldn't schedule for > more than a second. I've seen bug reports on exactly this, it is well known. Loaded big multi-cpu systems have high delays here, for whatever reason. > But these aren't fast paths and synchronize_rcu() already gets > called in the unbind path for p2pdma a couple of times. I'm sure it > would also be fine to slow down the vma_close() path as well. vma_close is done in a loop destroying vma's and if each synchronize costs > 1s it can take forever to close a process. We had to kill a similar use of synchronize_rcu in RDMA because users were complaining of > 40s process exit times. The driver unload path is fine to be slow, and is probably done on an unloaded system where synchronize_rcu is not so bad. Anyway, it is not really something for this series to fix, just something we should all be aware of and probably ought to get fixed before we do much more with ZONE_DEVICE pages. Jason
Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
On 2021-10-01 4:14 p.m., Jason Gunthorpe wrote: > On Fri, Oct 01, 2021 at 02:13:14PM -0600, Logan Gunthorpe wrote: >> >> >> On 2021-10-01 11:45 a.m., Jason Gunthorpe wrote: Before the invalidation, an active flag is cleared to ensure no new mappings can be created while the unmap is proceeding. unmap_mapping_range() should sequence itself with the TLB flush and >>> >>> AFAIK unmap_mapping_range() kicks off the TLB flush and then >>> returns. It doesn't always wait for the flush to fully finish. Ie some >>> cases use RCU to lock the page table against GUP fast and so the >>> put_page() doesn't happen until the call_rcu completes - after a grace >>> period. The unmap_mapping_range() does not wait for grace periods. >> >> Admittedly, the tlb flush code isn't the easiest code to understand. >> But, yes it seems at least on some arches the pages are freed by >> call_rcu(). But can't this be fixed easily by adding a synchronize_rcu() >> call after calling unmap_mapping_range()? Certainly after a >> synchronize_rcu(), the TLB has been flushed and it is safe to free those >> pages. > > It would close this issue, however synchronize_rcu() is very slow > (think > 1second) in some cases and thus cannot be inserted here. It shouldn't be *that* slow, at least not the vast majority of the time... it seems a bit unreasonable that a CPU wouldn't schedule for more than a second. But these aren't fast paths and synchronize_rcu() already gets called in the unbind path for p2pdma a couple of times. I'm sure it would also be fine to slow down the vma_close() path as well. > I'm also not completely sure that rcu is the only case, I don't know > how every arch handles its gather structure.. I have a feeling the > general intention was for this to be asynchronous Yeah, this is not clear to me either. > My preferences are to either remove devmap from gup_fast, or fix it to > not use special pages - the latter being obviously better.
Yeah, I rather expect DAX users want the optimization provided by gup_fast. I don't think P2PDMA users would be happy about being stuck with slow gup either. Logan
Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
On Fri, Oct 01, 2021 at 02:13:14PM -0600, Logan Gunthorpe wrote: > > > On 2021-10-01 11:45 a.m., Jason Gunthorpe wrote: > >> Before the invalidation, an active flag is cleared to ensure no new > >> mappings can be created while the unmap is proceeding. > >> unmap_mapping_range() should sequence itself with the TLB flush and > > > > AFAIK unmap_mapping_range() kicks off the TLB flush and then > > returns. It doesn't always wait for the flush to fully finish. Ie some > > cases use RCU to lock the page table against GUP fast and so the > > put_page() doesn't happen until the call_rcu completes - after a grace > > period. The unmap_mapping_range() does not wait for grace periods. > > Admittedly, the tlb flush code isn't the easiest code to understand. > But, yes it seems at least on some arches the pages are freed by > call_rcu(). But can't this be fixed easily by adding a synchronize_rcu() > call after calling unmap_mapping_range()? Certainly after a > synchronize_rcu(), the TLB has been flushed and it is safe to free those > pages. It would close this issue, however synchronize_rcu() is very slow (think > 1second) in some cases and thus cannot be inserted here. I'm also not completely sure that rcu is the only case, I don't know how every arch handles its gather structure.. I have a feeling the general intention was for this to be asynchronous. My preferences are to either remove devmap from gup_fast, or fix it to not use special pages - the latter being obviously better. Jason
Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
On 2021-10-01 11:45 a.m., Jason Gunthorpe wrote: >> Before the invalidation, an active flag is cleared to ensure no new >> mappings can be created while the unmap is proceeding. >> unmap_mapping_range() should sequence itself with the TLB flush and > > AFAIK unmap_mapping_range() kicks off the TLB flush and then > returns. It doesn't always wait for the flush to fully finish. Ie some > cases use RCU to lock the page table against GUP fast and so the > put_page() doesn't happen until the call_rcu completes - after a grace > period. The unmap_mapping_range() does not wait for grace periods. Admittedly, the tlb flush code isn't the easiest code to understand. But, yes it seems at least on some arches the pages are freed by call_rcu(). But can't this be fixed easily by adding a synchronize_rcu() call after calling unmap_mapping_range()? Certainly after a synchronize_rcu(), the TLB has been flushed and it is safe to free those pages. >> P2PDMA follows this pattern, except pages are not mapped linearly and >> are returned to the genalloc when their refcount falls to 1. This only >> happens after a VMA is closed which should imply the PTEs have already >> been unlinked from the pages. > > And here is the problem, since the genalloc is being used we now care > that a page should not continue to be accessed by userspace after it > has been placed back into the genalloc. I suppose fsdax has the same > basic issue too. Ok, similar question. Then if we call synchronize_rcu() in vm_close(), before the put_page() calls which return the pages to the genalloc, would that not guarantee the TLBs have been appropriately flushed? >> Not to say that all this couldn't use a big conceptual cleanup. A >> similar question exists with the single find_special_page() user >> (xen/gntdev) and it's definitely not clear what the differences are >> between the find_special_page() and vmf_insert_mixed() techniques and >> when one should be used over the other.
Or could they both be merged to >> use the same technique? > > Oh that gntdev stuff is just nonsense. IIRC it is trying to delegate > control over a PTE entry itself to the hypervisor. > > /* >* gntdev takes the address of the PTE in find_grant_ptes() and >* passes it to the hypervisor in gntdev_map_grant_pages(). The >* purpose of the notifier is to prevent the hypervisor pointer >* to the PTE from going stale. >* >* Since this vma's mappings can't be touched without the >* mmap_lock, and we are holding it now, there is no need for >* the notifier_range locking pattern. > > I vaguely recall it stuffs in a normal page then has the hypervisor > overwrite the PTE. When it comes time to free the PTE it recovers the > normal page via the 'find_special_page' hack and frees it. Somehow the > hypervisor is also using the normal page for something. > > It is all very strange and one shouldn't think about it :| Found this from an old commit message which seems to be a better explanation, though I still don't fully understand it: In a Xen PV guest, the PTEs contain MFNs so get_user_pages() (for example) must do an MFN to PFN (M2P) lookup before it can get the page. For foreign pages (those owned by another guest) the M2P lookup returns the PFN as seen by the foreign guest (which would be completely the wrong page for the local guest). This cannot be fixed up improving the M2P lookup since one MFN may be mapped onto two or more pages so getting the right page is impossible given just the MFN. Yes, all very strange. Logan
Re: [git pull] IOMMU Fixes for Linux v5.15-rc3
The pull request you sent on Fri, 1 Oct 2021 18:01:38 +0200: > git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git > tags/iommu-fixes-v5.15-rc3 has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/89e503592385fbed872c7ea1fb89931ece3409a5 Thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/prtracker.html
Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
On Fri, Oct 01, 2021 at 11:01:49AM -0600, Logan Gunthorpe wrote: > In device-dax, the refcount is only used to prevent the device, and > therefore the pages, from going away on device unbind. Pages cannot be > recycled, as you say, as they are mapped linearly within the device. The > address space invalidation is done only when the device is unbound. By address space invalidation I mean invalidation of the VMA that is pointing to those pages. device-dax may not have an issue with use-after-VMA-invalidation by its very nature since every PFN always points to the same thing. fsdax and this p2p stuff are different though. > Before the invalidation, an active flag is cleared to ensure no new > mappings can be created while the unmap is proceeding. > unmap_mapping_range() should sequence itself with the TLB flush and AFAIK unmap_mapping_range() kicks off the TLB flush and then returns. It doesn't always wait for the flush to fully finish. Ie some cases use RCU to lock the page table against GUP fast and so the put_page() doesn't happen until the call_rcu completes - after a grace period. The unmap_mapping_range() does not wait for grace periods. This is why for normal memory the put_page is done after the TLB flush completes, not when unmap_mapping_range() finishes. This ensures that before the refcount reaches 0 no concurrent GUP fast can still observe the old PTEs. > GUP-fast using the same mechanism it does for regular pages. As far as I > can see, by the time unmap_mapping_range() returns, we should be > confident that there are no pages left in any mapping (seeing no new > pages could be added since before the call). When viewed under the page table locks this is true, but the 'fast' walkers like gup_fast and hmm_range_fault can continue to be working on old data in the ptes because they don't take the page table locks.
They interact with unmap_mapping_range() via the IPI/rcu (gup fast) or mmu notifier sequence count (hmm_range_fault). > P2PDMA follows this pattern, except pages are not mapped linearly and > are returned to the genalloc when their refcount falls to 1. This only > happens after a VMA is closed which should imply the PTEs have already > been unlinked from the pages. And here is the problem, since the genalloc is being used we now care that a page should not continue to be accessed by userspace after it has been placed back into the genalloc. I suppose fsdax has the same basic issue too. > Not to say that all this couldn't use a big conceptual cleanup. A > similar question exists with the single find_special_page() user > (xen/gntdev) and it's definitely not clear what the differences are > between the find_special_page() and vmf_insert_mixed() techniques and > when one should be used over the other. Or could they both be merged to > use the same technique? Oh that gntdev stuff is just nonsense. IIRC it is trying to delegate control over a PTE entry itself to the hypervisor. /* * gntdev takes the address of the PTE in find_grant_ptes() and * passes it to the hypervisor in gntdev_map_grant_pages(). The * purpose of the notifier is to prevent the hypervisor pointer * to the PTE from going stale. * * Since this vma's mappings can't be touched without the * mmap_lock, and we are holding it now, there is no need for * the notifier_range locking pattern. I vaguely recall it stuffs in a normal page then has the hypervisor overwrite the PTE. When it comes time to free the PTE it recovers the normal page via the 'find_special_page' hack and frees it. Somehow the hypervisor is also using the normal page for something. It is all very strange and one shouldn't think about it :| Jason
Re: [PATCH v2 1/5] [RFC]iommu: Add a IOMMU_DEVONLY protection flag
> The IOMMU_DEVONLY flag allows the caller to flag mappings backed by > device-private buffers. That means other devices or CPUs are not > expected to access the physical memory region pointed to by the mapping, > and the MMU driver can safely restrict the shareability domain to the > device itself. > > Will be used by the ARM MMU driver to flag Mali mappings accessed only > by the GPU as Inner-shareable. > > Signed-off-by: Boris Brezillon > --- > include/linux/iommu.h | 7 +++ > 1 file changed, 7 insertions(+) > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h > index d2f3435e7d17..db14781b522f 100644 > --- a/include/linux/iommu.h > +++ b/include/linux/iommu.h > @@ -31,6 +31,13 @@ > * if the IOMMU page table format is equivalent. > */ > #define IOMMU_PRIV (1 << 5) > +/* > + * Mapping is only accessed by the device behind the iommu. That means other > + * devices or CPUs are not expected to access this physical memory region, > + * and the MMU driver can safely restrict the shareability domain to the > + * device itself. > + */ > +#define IOMMU_DEVONLY (1 << 6) > > struct iommu_ops; > struct iommu_group; This seems totally reasonable to me, but it is well-known that I'm not on good terms with the iommu subsystem. Let's wait for Robin to NAK :-P
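For illustration, the driver-side decision this flag enables might look like the following user-space sketch. Only the flag values come from the patch; the enum and helper names are hypothetical:

```c
#include <assert.h>

/* Flag values as defined in include/linux/iommu.h by the patch. */
#define IOMMU_PRIV	(1 << 5)
#define IOMMU_DEVONLY	(1 << 6)

/* Hypothetical sketch of the choice the ARM MMU driver would make:
 * device-only mappings can be made Inner-Shareable, while anything
 * the CPU or another device might touch stays Outer-Shareable. */
enum shareability { OUTER_SHAREABLE, INNER_SHAREABLE };

static enum shareability pick_shareability(int prot)
{
	return (prot & IOMMU_DEVONLY) ? INNER_SHAREABLE : OUTER_SHAREABLE;
}
```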
Re: [PATCH v2 4/5] drm/panfrost: Add a PANFROST_BO_GPUONLY flag
> > This seems reasonable to me - it matches the kbase > > BASE_MEM_COHERENT_SYSTEM (only backwards obviously) and it worked > > reasonably well for the blob. Oh, is that what that was for? I remember seeing it set on Midgard for varyings. Good to go full circle now. > > But I'm wondering if we need to do anything special to deal with the > > fact we will now have some non-coherent mappings on an otherwise > > coherent device. > > > > There are certainly some oddities around how these buffers will be > > mapped into user space if requested, e.g. panfrost_gem_create_object() > > sets 'map_wc' based on pfdev->coherent which is arguably wrong for > > GPUONLY. So there are two things we could consider: > > > > a) Actually prevent user space mapping GPUONLY flagged buffers. Which > > matches the intention of the name. > > I intended to do that, just forgot to add wrappers around > drm_gem_shmem_{mmap,vmap}() to forbid CPU-mappings on gpuonly buffers. This feels like the cleaner solution to me. > > b) Attempt to provide user space with the tools to safely interact with > > the buffers (this is the kbase approach). This does have the benefit of > > allowing *mostly* GPU access. An example here is the tiler heap where > > the CPU could zero out as necessary but mostly the GPU has ownership and > > the CPU never reads the contents. GPUONLY/DEVONLY might not be the best > > name in that case. > > Uh, right, I forgot we had to zero the tiler heap on Midgard (most of > the time done with a WRITE_VALUE job, but there's an exception on some > old Midgard GPUs IIRC). "Attempt" is the key word here :| We indeed only touch the tiler heap from the CPU on v4, and life's too short to care about new optimizations for v4. Unless the patch is trivial, my vote is for a) preventing the mappings and only setting GPUONLY on the tiler_heap starting with v5.
Re: [PATCH 1/2] iommu: arm-smmu-qcom: Add compatible for qcm2290
On Fri 01 Oct 07:00 PDT 2021, Loic Poulain wrote: > Signed-off-by: Loic Poulain > --- > drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c > b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c > index 55690af..d105186 100644 > --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c > @@ -412,6 +412,7 @@ static const struct of_device_id __maybe_unused > qcom_smmu_impl_of_match[] = { > { .compatible = "qcom,sm8150-smmu-500" }, > { .compatible = "qcom,sm8250-smmu-500" }, > { .compatible = "qcom,sm8350-smmu-500" }, > + { .compatible = "qcom,qcm2290-smmu-500" }, Would be nice if we kept the alphabetical sort order on these and in the binding. With that, please feel free to add my R-b Regards, Bjorn > { } > }; > > -- > 2.7.4 >
Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
On 2021-10-01 7:48 a.m., Jason Gunthorpe wrote: > On Wed, Sep 29, 2021 at 09:36:52PM -0300, Jason Gunthorpe wrote: > >> Why would DAX want to do this in the first place?? This means the >> address space zap is much more important than just speeding up >> destruction, it is essential for correctness since the PTEs are not >> holding refcounts naturally... > > It is not really for this series to fix, but I think the whole thing > is probably racy once you start allowing pte_special pages to be > accessed by GUP. > > If we look at unmapping the PTE relative to GUP fast the important > sequence is how the TLB flushing doesn't decrement the page refcount > until after it knows any concurrent GUP fast is completed. This is > arch specific, eg it could be done async through a call_rcu handler. > > This ensures that pages can't cross back into the free pool and be > reallocated until we know for certain that nobody is walking the PTEs > and could potentially take an additional reference on it. The scheme > cannot rely on the page refcount being 0 because once it goes into the > free pool it could be immediately reallocated back to a non-zero > refcount. > > A DAX user that simply does an address space invalidation doesn't > sequence itself with any of this mechanism. So we can race with a > thread doing GUP fast and another thread re-cycling the page into > another use - creating a leakage of the page from one security context > to another. > > This seems to be made worse for the pgmap stuff due to the wonky > refcount usage - at least if the refcount had dropped to zero gup fast > would be blocked for a time, but even that doesn't happen. > > In short, I think using pg special for anything that can be returned > by gup fast (and maybe even gup!) is racy/wrong. We must have the > normal refcount mechanism work for correctness of the recycling flow. I'm not quite following all of this.
I'm not entirely sure how fs/dax works in this regard, but for device-dax (and similarly p2pdma) it doesn't seem as bad as you say. In device-dax, the refcount is only used to prevent the device, and therefore the pages, from going away on device unbind. Pages cannot be recycled, as you say, as they are mapped linearly within the device. The address space invalidation is done only when the device is unbound. Before the invalidation, an active flag is cleared to ensure no new mappings can be created while the unmap is proceeding. unmap_mapping_range() should sequence itself with the TLB flush and GUP-fast using the same mechanism it does for regular pages. As far as I can see, by the time unmap_mapping_range() returns, we should be confident that there are no pages left in any mapping (seeing no new pages could be added since before the call). Then before finishing the unbind, device-dax decrements the refcount of all pages and then waits for the refcount of all pages to go to zero. Thus, any pages that successfully were got with GUP, during or before unmap_mapping_range should hold a reference and once all those references are returned, unbind can finish. P2PDMA follows this pattern, except pages are not mapped linearly and are returned to the genalloc when their refcount falls to 1. This only happens after a VMA is closed which should imply the PTEs have already been unlinked from the pages. And the same situation occurs on unbind with a flag preventing new mappings from being created before unmap_mapping_range(), etc. Not to say that all this couldn't use a big conceptual cleanup. A similar question exists with the single find_special_page() user (xen/gntdev) and it's definitely not clear what the differences are between the find_special_page() and vmf_insert_mixed() techniques and when one should be used over the other. Or could they both be merged to use the same technique? 
Logan
Re: [PATCH v2 4/5] drm/panfrost: Add a PANFROST_BO_GPUONLY flag
On Fri, 1 Oct 2021 16:13:42 +0100 Steven Price wrote: > On 01/10/2021 15:34, Boris Brezillon wrote: > > This lets the driver reduce shareability domain of the MMU mapping, > > which can in turn reduce access time and save power on cache-coherent > > systems. > > > > Signed-off-by: Boris Brezillon > > This seems reasonable to me - it matches the kbase > BASE_MEM_COHERENT_SYSTEM (only backwards obviously) and it worked > reasonably well for the blob. > > But I'm wondering if we need to do anything special to deal with the > fact we will now have some non-coherent mappings on an otherwise > coherent device. > > There are certainly some oddities around how these buffers will be > mapped into user space if requested, e.g. panfrost_gem_create_object() > sets 'map_wc' based on pfdev->coherent which is arguably wrong for > GPUONLY. So there are two things we could consider: > > a) Actually prevent user space mapping GPUONLY flagged buffers. Which > matches the intention of the name. I intended to do that, just forgot to add wrappers around drm_gem_shmem_{mmap,vmap}() to forbid CPU-mappings on gpuonly buffers. > > b) Attempt to provide user space with the tools to safely interact with > the buffers (this is the kbase approach). This does have the benefit of > allowing *mostly* GPU access. An example here is the tiler heap where > the CPU could zero out as necessary but mostly the GPU has ownership and > the CPU never reads the contents. GPUONLY/DEVONLY might not be the best > name in that case. Uh, right, I forgot we had to zero the tiler heap on Midgard (most of the time done with a WRITE_VALUE job, but there's an exception on some old Midgard GPUs IIRC).
[git pull] IOMMU Fixes for Linux v5.15-rc3
Hi Linus, The following changes since commit 5816b3e6577eaa676ceb00a848f0fd65fe2adc29: Linux 5.15-rc3 (2021-09-26 14:08:19 -0700) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git tags/iommu-fixes-v5.15-rc3 for you to fetch changes up to f0b636804c7c4c564efbca5981e3c56b5c6fe0c5: iommu/dart: Clear sid2group entry when a group is freed (2021-09-28 11:47:24 +0200) IOMMU Fixes for Linux v5.15-rc3 Including: - Two fixes for the new Apple DART driver to fix a kernel panic and a stale data usage issue - Intel VT-d fix for how PCI device ids are printed Bjorn Helgaas (1): iommu/vt-d: Drop "0x" prefix from PCI bus & device addresses Sven Peter (2): iommu/dart: Remove iommu_flush_ops iommu/dart: Clear sid2group entry when a group is freed drivers/iommu/apple-dart.c | 56 +- drivers/iommu/intel/dmar.c | 6 ++--- 2 files changed, 38 insertions(+), 24 deletions(-) Please pull. Thanks, Joerg
Re: [PATCH v2] iommu/arm-smmu-v3: poll cmdq until it has space
On 26/09/2021 19:51, Fernand Sieber wrote: When a thread sends commands to the SMMU, it needs to allocate some space to write its commands in a ring buffer. The allocation algorithm works as follows: until enough free space is available in the queue, repeat the following outer loop. First, try to acquire an exclusive lock to read the consumer index from the SMMU register over MMIO. If that fails, it means that another thread holds the lock (or multiple threads, in shared mode, for sync commands). The current code guarantees that when a thread is holding the lock, the consumer index will be updated from the SMMU register. So then when the acquisition of the exclusive lock fails, we can safely assume that another thread will eventually update the consumer index and enter an inner waiting loop until that happens. You have written a lot in the commit message. I think that this paragraph can be dropped to try to make it more concise and readable. The problem that this patch fixes is that the waiting loop exits as soon as any space is available in the queue, so it is possible that it exits immediately if there's some space available but not enough to write the thread's commands. To me, this seems strange. From my experience, I would expect that if the queue is full, and then some space comes available after updating the cons pointer from HW, then the cons pointer would have advanced so far such that we have more than enough space for the maximum amount of commands in a batch, 64. That means the cmdq allocation queue will fail (outer loop) and the thread will spin attempting to acquire the exclusive lock to update the consumer index from the SMMU register. If a lot of threads are failing to allocate commands, As above, what is special for your HW such that when the queue becomes not full that there is still not enough space?
The cons pointer is only updated in two locations: - in arm_smmu_cmdq_poll_until_not_full() - the last CPU gathered unlocks the shared lock at the function exit So it's not updated very regularly. this can cause heavy contention on the lock, to the point where the system slows down or livelocks. The livelock is possible if some other threads are attempting to execute synchronous commands. What are synchronous commands in this context? Do you mean we're issuing a CMD_SYNC? These threads need to ensure that they control updates to the consumer index so that they can correctly observe when their command is executed; they enforce that by acquiring the lock in shared mode. If there's too much contention, they never succeed in acquiring the lock via the read+cmpxchg mechanism, and their progress stalls. Why is this? Why can they not get the lock? They spin waiting for it. But because they already hold allocated space in the command queue, their stall prevents progress from other threads attempting to allocate space in the cmdq. This causes a livelock. This patch makes the waiting loop exit as soon as enough space is available, rather than as soon as any space is available. This means that when two threads are competing for the exclusive lock when allocating space in the queue, one of them will fail to acquire the lock in exclusive mode and be knocked to the waiting loop, staying there until there's enough free space rather than exiting immediately when any space is available. Threads in the waiting loop do not compete for the lock, reducing contention enough to enable synchronous threads to make progress, when applicable. Note that we cannot afford to have all threads parked in the waiting loop unless there are synchronous threads executing concurrently, nit: I'd say CPU when describing this problem, rather than thread otherwise no thread is observing the SMMU register and updating the consumer index. 
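The difference between the two wait-loop exit conditions can be sketched in a small userspace model (the field names mirror the arm_smmu_ll_queue style, but this is an illustrative simplification, not the driver's actual Q_IDX()/Q_WRP() wrap-bit arithmetic):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the cmdq free-space calculation. */
struct llq_model {
	uint32_t prod;  /* producer index (monotonically increasing) */
	uint32_t cons;  /* consumer index (monotonically increasing) */
	uint32_t nents; /* queue depth */
};

static uint32_t queue_free_space(const struct llq_model *q)
{
	return q->nents - (q->prod - q->cons);
}

/* Old wait-loop exit: leave as soon as ANY slot is free. */
static bool exit_when_not_full(const struct llq_model *q)
{
	return queue_free_space(q) > 0;
}

/* Patched wait-loop exit: leave only once the caller's n commands fit. */
static bool exit_when_has_space(const struct llq_model *q, uint32_t n)
{
	return queue_free_space(q) >= n;
}
```

With one free slot and a two-command batch, the old condition releases the waiter immediately, only for its allocation to fail again and fall back to spinning on the exclusive lock; the new condition keeps it parked until the batch actually fits.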
Thus if we succeed to acquire the lock in exclusive mode, we cannot enter the waiting loop because we could be the last thread observing the SMMU. Similarly, if the producer index is updated, we need to exit the waiting loop because it could mean that the latest thread to observe the SMMU has succeeded to allocate commands and thus has moved on. Signed-off-by: Fernand Sieber --- Changes in v2 - Fix inverted condition to break out the loop when queue_has_space - Replace obsolete comment reference to llq->state->val by llq->val drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 41 +++-- 1 file changed, 29 insertions(+), 12 deletions(-) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index a388e318f86e..a0a04cc9c57d 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -118,12 +118,6 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n) return space >= n; } -static bool queue_full(struct arm_smmu_ll_queue *q) -{ - return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) && - Q_WRP(q, q->prod) !=
Re: [PATCH v2 4/5] drm/panfrost: Add a PANFROST_BO_GPUONLY flag
On 01/10/2021 15:34, Boris Brezillon wrote: > This lets the driver reduce shareability domain of the MMU mapping, > which can in turn reduce access time and save power on cache-coherent > systems. > > Signed-off-by: Boris Brezillon This seems reasonable to me - it matches the kbase BASE_MEM_COHERENT_SYSTEM (only backwards obviously) and it worked reasonably well for the blob. But I'm wondering if we need to do anything special to deal with the fact we will now have some non-coherent mappings on an otherwise coherent device. There are certainly some oddities around how these buffers will be mapped into user space if requested, e.g. panfrost_gem_create_object() sets 'map_wc' based on pfdev->coherent which is arguably wrong for GPUONLY. So there are two things we could consider: a) Actually prevent user space mapping GPUONLY flagged buffers. Which matches the intention of the name. b) Attempt to provide user space with the tools to safely interact with the buffers (this is the kbase approach). This does have the benefit of allowing *mostly* GPU access. An example here is the tiler heap where the CPU could zero out as necessary but mostly the GPU has ownership and the CPU never reads the contents. GPUONLY/DEVONLY might not be the best name in that case. Any thoughts? 
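Option (a) above could look roughly like the following userspace model of the mmap gate (the struct and call flow are stand-ins for the real panfrost/drm_gem_shmem code, purely to illustrate the intended contract, not the actual driver API):

```c
#include <errno.h>
#include <stdbool.h>

/* Minimal model of option (a): refuse CPU mappings of GPUONLY BOs. */
struct bo_model {
	bool gpuonly;
};

/* Returns 0 on success, -EINVAL if the BO is GPU-only. In the real
 * driver this check would sit in the gem object's mmap path before
 * falling through to the shmem mmap helper. */
static int bo_model_mmap(const struct bo_model *bo)
{
	if (bo->gpuonly)
		return -EINVAL; /* CPU access is not part of the GPUONLY contract */
	return 0;
}
```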
Thanks, Steve > --- > drivers/gpu/drm/panfrost/panfrost_drv.c | 3 ++- > drivers/gpu/drm/panfrost/panfrost_gem.c | 1 + > drivers/gpu/drm/panfrost/panfrost_gem.h | 1 + > drivers/gpu/drm/panfrost/panfrost_mmu.c | 3 +++ > include/uapi/drm/panfrost_drm.h | 1 + > 5 files changed, 8 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c > b/drivers/gpu/drm/panfrost/panfrost_drv.c > index b29ac942ae2d..b176921b9392 100644 > --- a/drivers/gpu/drm/panfrost/panfrost_drv.c > +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c > @@ -77,7 +77,8 @@ static int panfrost_ioctl_get_param(struct drm_device > *ddev, void *data, struct > > #define PANFROST_BO_FLAGS \ > (PANFROST_BO_NOEXEC | PANFROST_BO_HEAP | \ > - PANFROST_BO_NOREAD | PANFROST_BO_NOWRITE) > + PANFROST_BO_NOREAD | PANFROST_BO_NOWRITE | \ > + PANFROST_BO_GPUONLY) > > static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data, > struct drm_file *file) > diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c > b/drivers/gpu/drm/panfrost/panfrost_gem.c > index d6c1bb1445f2..4b1f85c0b98f 100644 > --- a/drivers/gpu/drm/panfrost/panfrost_gem.c > +++ b/drivers/gpu/drm/panfrost/panfrost_gem.c > @@ -254,6 +254,7 @@ panfrost_gem_create_with_handle(struct drm_file > *file_priv, > bo->noread = !!(flags & PANFROST_BO_NOREAD); > bo->nowrite = !!(flags & PANFROST_BO_NOWRITE); > bo->is_heap = !!(flags & PANFROST_BO_HEAP); > + bo->gpuonly = !!(flags & PANFROST_BO_GPUONLY); > > /* >* Allocate an id of idr table where the obj is registered > diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.h > b/drivers/gpu/drm/panfrost/panfrost_gem.h > index 6246b5fef446..e332d5a4c24f 100644 > --- a/drivers/gpu/drm/panfrost/panfrost_gem.h > +++ b/drivers/gpu/drm/panfrost/panfrost_gem.h > @@ -40,6 +40,7 @@ struct panfrost_gem_object { > bool noread :1; > bool nowrite:1; > bool is_heap:1; > + bool gpuonly:1; > }; > > struct panfrost_gem_mapping { > diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c > 
b/drivers/gpu/drm/panfrost/panfrost_mmu.c > index 6a5c9d94d6f2..89eee8e80aa5 100644 > --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c > +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c > @@ -321,6 +321,9 @@ int panfrost_mmu_map(struct panfrost_gem_mapping *mapping) > if (!bo->noread) > prot |= IOMMU_READ; > > + if (bo->gpuonly) > + prot |= IOMMU_DEVONLY; > + > sgt = drm_gem_shmem_get_pages_sgt(obj); > if (WARN_ON(IS_ERR(sgt))) > return PTR_ERR(sgt); > diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h > index a2de81225125..538b58b2d095 100644 > --- a/include/uapi/drm/panfrost_drm.h > +++ b/include/uapi/drm/panfrost_drm.h > @@ -88,6 +88,7 @@ struct drm_panfrost_wait_bo { > #define PANFROST_BO_HEAP 2 > #define PANFROST_BO_NOREAD 4 > #define PANFROST_BO_NOWRITE 8 > +#define PANFROST_BO_GPUONLY 0x10 > > /** > * struct drm_panfrost_create_bo - ioctl argument for creating Panfrost BOs. > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH kernel] powerpc/iommu: Report the correct most efficient DMA mask for PCI devices
On Thu, 30 Sep 2021 13:44:54 +1000, Alexey Kardashevskiy wrote: > According to dma-api.rst, the dma_get_required_mask() helper should return > "the mask that the platform requires to operate efficiently". Which in > the case of PPC64 means the bypass mask and not a mask from an IOMMU table > which is shorter and slower to use due to map/unmap operations (especially > expensive on "pseries"). > > However the existing implementation ignores the possibility of bypassing > and returns the IOMMU table mask on the pseries platform which makes some > drivers (mpt3sas is one example) choose 32bit DMA even though bypass is > supported. The powernv platform sort of handles it by having a bigger > default window with a mask >=40 but it only works as drivers choose > 63/64bit if the required mask is >32 which is rather pointless. > > [...] Applied to powerpc/fixes. [1/1] powerpc/iommu: Report the correct most efficient DMA mask for PCI devices https://git.kernel.org/powerpc/c/23c216b335d1fbd716076e8263b54a714ea3cf0e cheers ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v2 3/5] drm/panfrost: Add PANFROST_BO_NO{READ,WRITE} flags
On Fri, 1 Oct 2021 16:34:25 +0200 Boris Brezillon wrote: > So we can create GPU mappings without R/W permissions. Particularly > useful to debug corruptions caused by out-of-bound writes. > > Signed-off-by: Boris Brezillon Oops, forgot: Reviewed-by: Alyssa Rosenzweig Reviewed-by: Steven Price > --- > drivers/gpu/drm/panfrost/panfrost_drv.c | 11 ++- > drivers/gpu/drm/panfrost/panfrost_gem.c | 2 ++ > drivers/gpu/drm/panfrost/panfrost_gem.h | 2 ++ > drivers/gpu/drm/panfrost/panfrost_mmu.c | 8 +++- > include/uapi/drm/panfrost_drm.h | 2 ++ > 5 files changed, 23 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c > b/drivers/gpu/drm/panfrost/panfrost_drv.c > index 82ad9a67f251..b29ac942ae2d 100644 > --- a/drivers/gpu/drm/panfrost/panfrost_drv.c > +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c > @@ -75,6 +75,10 @@ static int panfrost_ioctl_get_param(struct drm_device > *ddev, void *data, struct > return 0; > } > > +#define PANFROST_BO_FLAGS \ > + (PANFROST_BO_NOEXEC | PANFROST_BO_HEAP | \ > + PANFROST_BO_NOREAD | PANFROST_BO_NOWRITE) > + > static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data, > struct drm_file *file) > { > @@ -84,7 +88,7 @@ static int panfrost_ioctl_create_bo(struct drm_device *dev, > void *data, > struct panfrost_gem_mapping *mapping; > > if (!args->size || args->pad || > - (args->flags & ~(PANFROST_BO_NOEXEC | PANFROST_BO_HEAP))) > + (args->flags & ~PANFROST_BO_FLAGS)) > return -EINVAL; > > /* Heaps should never be executable */ > @@ -92,6 +96,11 @@ static int panfrost_ioctl_create_bo(struct drm_device > *dev, void *data, > !(args->flags & PANFROST_BO_NOEXEC)) > return -EINVAL; > > + /* Executable implies readable */ > + if ((args->flags & PANFROST_BO_NOREAD) && > + !(args->flags & PANFROST_BO_NOEXEC)) > + return -EINVAL; > + > bo = panfrost_gem_create_with_handle(file, dev, args->size, args->flags, >&args->handle); > if (IS_ERR(bo)) > diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c > 
b/drivers/gpu/drm/panfrost/panfrost_gem.c > index 23377481f4e3..d6c1bb1445f2 100644 > --- a/drivers/gpu/drm/panfrost/panfrost_gem.c > +++ b/drivers/gpu/drm/panfrost/panfrost_gem.c > @@ -251,6 +251,8 @@ panfrost_gem_create_with_handle(struct drm_file > *file_priv, > > bo = to_panfrost_bo(&shmem->base); > bo->noexec = !!(flags & PANFROST_BO_NOEXEC); > + bo->noread = !!(flags & PANFROST_BO_NOREAD); > + bo->nowrite = !!(flags & PANFROST_BO_NOWRITE); > bo->is_heap = !!(flags & PANFROST_BO_HEAP); > > /* > diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.h > b/drivers/gpu/drm/panfrost/panfrost_gem.h > index 8088d5fd8480..6246b5fef446 100644 > --- a/drivers/gpu/drm/panfrost/panfrost_gem.h > +++ b/drivers/gpu/drm/panfrost/panfrost_gem.h > @@ -37,6 +37,8 @@ struct panfrost_gem_object { > atomic_t gpu_usecount; > > bool noexec :1; > + bool noread :1; > + bool nowrite:1; > bool is_heap:1; > }; > > diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c > b/drivers/gpu/drm/panfrost/panfrost_mmu.c > index f51d3f791a17..6a5c9d94d6f2 100644 > --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c > +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c > @@ -307,7 +307,7 @@ int panfrost_mmu_map(struct panfrost_gem_mapping *mapping) > struct drm_gem_object *obj = &bo->base.base; > struct panfrost_device *pfdev = to_panfrost_device(obj->dev); > struct sg_table *sgt; > - int prot = IOMMU_READ | IOMMU_WRITE; > + int prot = 0; > > if (WARN_ON(mapping->active)) > return 0; > @@ -315,6 +315,12 @@ int panfrost_mmu_map(struct panfrost_gem_mapping > *mapping) > if (bo->noexec) > prot |= IOMMU_NOEXEC; > > + if (!bo->nowrite) > + prot |= IOMMU_WRITE; > + > + if (!bo->noread) > + prot |= IOMMU_READ; > + > sgt = drm_gem_shmem_get_pages_sgt(obj); > if (WARN_ON(IS_ERR(sgt))) > return PTR_ERR(sgt); > diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h > index 061e700dd06c..a2de81225125 100644 > --- a/include/uapi/drm/panfrost_drm.h > +++ b/include/uapi/drm/panfrost_drm.h > @@ -86,6 +86,8 
@@ struct drm_panfrost_wait_bo { > > #define PANFROST_BO_NOEXEC 1 > #define PANFROST_BO_HEAP 2 > +#define PANFROST_BO_NOREAD 4 > +#define PANFROST_BO_NOWRITE 8 > > /** > * struct drm_panfrost_create_bo - ioctl argument for creating Panfrost BOs. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 5/5] drm/panfrost: Bump the driver version to 1.3
Bump the driver version to 1.3 to account for the PANFROST_BO_NO{READ,WRITE,GPUONLY} flags addition. Signed-off-by: Boris Brezillon --- drivers/gpu/drm/panfrost/panfrost_drv.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c index b176921b9392..5afcec5ab4db 100644 --- a/drivers/gpu/drm/panfrost/panfrost_drv.c +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c @@ -530,6 +530,7 @@ DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops); * - 1.0 - initial interface * - 1.1 - adds HEAP and NOEXEC flags for CREATE_BO * - 1.2 - adds AFBC_FEATURES query + * - 1.3 - adds PANFROST_BO_NO{READ,WRITE,GPUONLY} flags */ static const struct drm_driver panfrost_drm_driver = { .driver_features= DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ, @@ -542,7 +543,7 @@ static const struct drm_driver panfrost_drm_driver = { .desc = "panfrost DRM", .date = "20180908", .major = 1, - .minor = 2, + .minor = 3, .gem_create_object = panfrost_gem_create_object, .prime_handle_to_fd = drm_gem_prime_handle_to_fd, -- 2.31.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 2/5] [RFC]iommu/io-pgtable-arm: Take the DEVONLY flag into account on ARM_MALI_LPAE
Restrict the shareability domain when mapping buffers that are GPU-visible only. Signed-off-by: Boris Brezillon --- Flagged RFC because I'm not sure adding a new flag is the right way to convey the 'dev-private buffer' information. --- drivers/iommu/io-pgtable-arm.c | 25 + 1 file changed, 17 insertions(+), 8 deletions(-) diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c index dd9e47189d0d..6ac3defb9ae1 100644 --- a/drivers/iommu/io-pgtable-arm.c +++ b/drivers/iommu/io-pgtable-arm.c @@ -450,16 +450,25 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data, << ARM_LPAE_PTE_ATTRINDX_SHIFT); } - /* -* Also Mali has its own notions of shareability wherein its Inner -* domain covers the cores within the GPU, and its Outer domain is -* "outside the GPU" (i.e. either the Inner or System domain in CPU -* terms, depending on coherency). -*/ - if (prot & IOMMU_CACHE && data->iop.fmt != ARM_MALI_LPAE) + if (data->iop.fmt == ARM_MALI_LPAE) { + /* +* Mali has its own notions of shareability wherein its Inner +* domain covers the cores within the GPU, and its Outer domain +* is "outside the GPU" (i.e. either the Inner or System domain +* in CPU terms, depending on coherency). +* If the mapping is only device-visible, we can use the Inner +* domain, otherwise we need to stick to Outer domain +* shareability. +*/ + if (prot & IOMMU_DEVONLY) + pte |= ARM_LPAE_PTE_SH_IS; + else + pte |= ARM_LPAE_PTE_SH_OS; + } else if (prot & IOMMU_CACHE) { pte |= ARM_LPAE_PTE_SH_IS; - else + } else { pte |= ARM_LPAE_PTE_SH_OS; + } if (prot & IOMMU_NOEXEC) pte |= ARM_LPAE_PTE_XN; -- 2.31.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 3/5] drm/panfrost: Add PANFROST_BO_NO{READ,WRITE} flags
So we can create GPU mappings without R/W permissions. Particularly useful to debug corruptions caused by out-of-bound writes. Signed-off-by: Boris Brezillon --- drivers/gpu/drm/panfrost/panfrost_drv.c | 11 ++- drivers/gpu/drm/panfrost/panfrost_gem.c | 2 ++ drivers/gpu/drm/panfrost/panfrost_gem.h | 2 ++ drivers/gpu/drm/panfrost/panfrost_mmu.c | 8 +++- include/uapi/drm/panfrost_drm.h | 2 ++ 5 files changed, 23 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c index 82ad9a67f251..b29ac942ae2d 100644 --- a/drivers/gpu/drm/panfrost/panfrost_drv.c +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c @@ -75,6 +75,10 @@ static int panfrost_ioctl_get_param(struct drm_device *ddev, void *data, struct return 0; } +#define PANFROST_BO_FLAGS \ + (PANFROST_BO_NOEXEC | PANFROST_BO_HEAP | \ +PANFROST_BO_NOREAD | PANFROST_BO_NOWRITE) + static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data, struct drm_file *file) { @@ -84,7 +88,7 @@ static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data, struct panfrost_gem_mapping *mapping; if (!args->size || args->pad || - (args->flags & ~(PANFROST_BO_NOEXEC | PANFROST_BO_HEAP))) + (args->flags & ~PANFROST_BO_FLAGS)) return -EINVAL; /* Heaps should never be executable */ @@ -92,6 +96,11 @@ static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data, !(args->flags & PANFROST_BO_NOEXEC)) return -EINVAL; + /* Executable implies readable */ + if ((args->flags & PANFROST_BO_NOREAD) && + !(args->flags & PANFROST_BO_NOEXEC)) + return -EINVAL; + bo = panfrost_gem_create_with_handle(file, dev, args->size, args->flags, &args->handle); if (IS_ERR(bo)) diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c b/drivers/gpu/drm/panfrost/panfrost_gem.c index 23377481f4e3..d6c1bb1445f2 100644 --- a/drivers/gpu/drm/panfrost/panfrost_gem.c +++ b/drivers/gpu/drm/panfrost/panfrost_gem.c @@ -251,6 +251,8 @@ panfrost_gem_create_with_handle(struct drm_file 
*file_priv, bo = to_panfrost_bo(&shmem->base); bo->noexec = !!(flags & PANFROST_BO_NOEXEC); + bo->noread = !!(flags & PANFROST_BO_NOREAD); + bo->nowrite = !!(flags & PANFROST_BO_NOWRITE); bo->is_heap = !!(flags & PANFROST_BO_HEAP); /* diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.h b/drivers/gpu/drm/panfrost/panfrost_gem.h index 8088d5fd8480..6246b5fef446 100644 --- a/drivers/gpu/drm/panfrost/panfrost_gem.h +++ b/drivers/gpu/drm/panfrost/panfrost_gem.h @@ -37,6 +37,8 @@ struct panfrost_gem_object { atomic_t gpu_usecount; bool noexec :1; + bool noread :1; + bool nowrite:1; bool is_heap:1; }; diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c index f51d3f791a17..6a5c9d94d6f2 100644 --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c @@ -307,7 +307,7 @@ int panfrost_mmu_map(struct panfrost_gem_mapping *mapping) struct drm_gem_object *obj = &bo->base.base; struct panfrost_device *pfdev = to_panfrost_device(obj->dev); struct sg_table *sgt; - int prot = IOMMU_READ | IOMMU_WRITE; + int prot = 0; if (WARN_ON(mapping->active)) return 0; @@ -315,6 +315,12 @@ int panfrost_mmu_map(struct panfrost_gem_mapping *mapping) if (bo->noexec) prot |= IOMMU_NOEXEC; + if (!bo->nowrite) + prot |= IOMMU_WRITE; + + if (!bo->noread) + prot |= IOMMU_READ; + sgt = drm_gem_shmem_get_pages_sgt(obj); if (WARN_ON(IS_ERR(sgt))) return PTR_ERR(sgt); diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h index 061e700dd06c..a2de81225125 100644 --- a/include/uapi/drm/panfrost_drm.h +++ b/include/uapi/drm/panfrost_drm.h @@ -86,6 +86,8 @@ struct drm_panfrost_wait_bo { #define PANFROST_BO_NOEXEC 1 #define PANFROST_BO_HEAP 2 +#define PANFROST_BO_NOREAD 4 +#define PANFROST_BO_NOWRITE8 /** * struct drm_panfrost_create_bo - ioctl argument for creating Panfrost BOs. -- 2.31.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 4/5] drm/panfrost: Add a PANFROST_BO_GPUONLY flag
This lets the driver reduce shareability domain of the MMU mapping, which can in turn reduce access time and save power on cache-coherent systems. Signed-off-by: Boris Brezillon --- drivers/gpu/drm/panfrost/panfrost_drv.c | 3 ++- drivers/gpu/drm/panfrost/panfrost_gem.c | 1 + drivers/gpu/drm/panfrost/panfrost_gem.h | 1 + drivers/gpu/drm/panfrost/panfrost_mmu.c | 3 +++ include/uapi/drm/panfrost_drm.h | 1 + 5 files changed, 8 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c index b29ac942ae2d..b176921b9392 100644 --- a/drivers/gpu/drm/panfrost/panfrost_drv.c +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c @@ -77,7 +77,8 @@ static int panfrost_ioctl_get_param(struct drm_device *ddev, void *data, struct #define PANFROST_BO_FLAGS \ (PANFROST_BO_NOEXEC | PANFROST_BO_HEAP | \ -PANFROST_BO_NOREAD | PANFROST_BO_NOWRITE) +PANFROST_BO_NOREAD | PANFROST_BO_NOWRITE | \ +PANFROST_BO_GPUONLY) static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data, struct drm_file *file) diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c b/drivers/gpu/drm/panfrost/panfrost_gem.c index d6c1bb1445f2..4b1f85c0b98f 100644 --- a/drivers/gpu/drm/panfrost/panfrost_gem.c +++ b/drivers/gpu/drm/panfrost/panfrost_gem.c @@ -254,6 +254,7 @@ panfrost_gem_create_with_handle(struct drm_file *file_priv, bo->noread = !!(flags & PANFROST_BO_NOREAD); bo->nowrite = !!(flags & PANFROST_BO_NOWRITE); bo->is_heap = !!(flags & PANFROST_BO_HEAP); + bo->gpuonly = !!(flags & PANFROST_BO_GPUONLY); /* * Allocate an id of idr table where the obj is registered diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.h b/drivers/gpu/drm/panfrost/panfrost_gem.h index 6246b5fef446..e332d5a4c24f 100644 --- a/drivers/gpu/drm/panfrost/panfrost_gem.h +++ b/drivers/gpu/drm/panfrost/panfrost_gem.h @@ -40,6 +40,7 @@ struct panfrost_gem_object { bool noread :1; bool nowrite:1; bool is_heap:1; + bool gpuonly:1; }; struct panfrost_gem_mapping { diff --git 
a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c index 6a5c9d94d6f2..89eee8e80aa5 100644 --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c @@ -321,6 +321,9 @@ int panfrost_mmu_map(struct panfrost_gem_mapping *mapping) if (!bo->noread) prot |= IOMMU_READ; + if (bo->gpuonly) + prot |= IOMMU_DEVONLY; + sgt = drm_gem_shmem_get_pages_sgt(obj); if (WARN_ON(IS_ERR(sgt))) return PTR_ERR(sgt); diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h index a2de81225125..538b58b2d095 100644 --- a/include/uapi/drm/panfrost_drm.h +++ b/include/uapi/drm/panfrost_drm.h @@ -88,6 +88,7 @@ struct drm_panfrost_wait_bo { #define PANFROST_BO_HEAP 2 #define PANFROST_BO_NOREAD 4 #define PANFROST_BO_NOWRITE8 +#define PANFROST_BO_GPUONLY0x10 /** * struct drm_panfrost_create_bo - ioctl argument for creating Panfrost BOs. -- 2.31.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 0/5] drm/panfrost: Add extra GPU-usage flags
Hello, This is a follow-up of [1], which was adding the read/write restrictions on GPU buffers. Robin and Steven suggested that I add a flag to restrict the shareability domain on GPU-private buffers, so here it is. As you can see, the first patch is flagged RFC, since I'm not sure adding a new IOMMU_ flag is the right solution, but IOMMU_CACHE doesn't feel like a good fit either. Please let me know if you have better ideas. Regards, Boris [1]https://patchwork.kernel.org/project/dri-devel/patch/20210930184723.1482426-1-boris.brezil...@collabora.com/ Boris Brezillon (5): [RFC]iommu: Add a IOMMU_DEVONLY protection flag [RFC]iommu/io-pgtable-arm: Take the DEVONLY flag into account on ARM_MALI_LPAE drm/panfrost: Add PANFROST_BO_NO{READ,WRITE} flags drm/panfrost: Add a PANFROST_BO_GPUONLY flag drm/panfrost: Bump the driver version to 1.3 drivers/gpu/drm/panfrost/panfrost_drv.c | 15 +-- drivers/gpu/drm/panfrost/panfrost_gem.c | 3 +++ drivers/gpu/drm/panfrost/panfrost_gem.h | 3 +++ drivers/gpu/drm/panfrost/panfrost_mmu.c | 11 ++- drivers/iommu/io-pgtable-arm.c | 25 + include/linux/iommu.h | 7 +++ include/uapi/drm/panfrost_drm.h | 3 +++ 7 files changed, 56 insertions(+), 11 deletions(-) -- 2.31.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2 1/5] [RFC]iommu: Add a IOMMU_DEVONLY protection flag
The IOMMU_DEVONLY flag allows the caller to flag a mappings backed by device-private buffers. That means other devices or CPUs are not expected to access the physical memory region pointed by the mapping, and the MMU driver can safely restrict the shareability domain to the device itself. Will be used by the ARM MMU driver to flag Mali mappings accessed only by the GPU as Inner-shareable. Signed-off-by: Boris Brezillon --- include/linux/iommu.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/include/linux/iommu.h b/include/linux/iommu.h index d2f3435e7d17..db14781b522f 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -31,6 +31,13 @@ * if the IOMMU page table format is equivalent. */ #define IOMMU_PRIV (1 << 5) +/* + * Mapping is only accessed by the device behind the iommu. That means other + * devices or CPUs are not expected to access this physical memory region, + * and the MMU driver can safely restrict the shareability domain to the + * device itself. + */ +#define IOMMU_DEVONLY (1 << 6) struct iommu_ops; struct iommu_group; -- 2.31.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 1/2] iommu: arm-smmu-qcom: Add compatible for qcm2290
Signed-off-by: Loic Poulain --- drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c index 55690af..d105186 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c @@ -412,6 +412,7 @@ static const struct of_device_id __maybe_unused qcom_smmu_impl_of_match[] = { { .compatible = "qcom,sm8150-smmu-500" }, { .compatible = "qcom,sm8250-smmu-500" }, { .compatible = "qcom,sm8350-smmu-500" }, + { .compatible = "qcom,qcm2290-smmu-500" }, { } }; -- 2.7.4 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA
On Wed, Sep 22, 2021 at 5:14 PM Jacob Pan wrote: > > Hi Joerg/Jason/Christoph et al., > > The current in-kernel supervisor PASID support is based on the SVM/SVA > machinery in sva-lib. Kernel SVA is achieved by extending a special flag > to indicate that the binding of the device and a page table should be performed > on init_mm instead of the mm of the current process. Page requests and other > differences between user and kernel SVA are handled as special cases. > > This unrestricted binding with the kernel page table is being challenged > for security and the convention that in-kernel DMA must be compatible with > DMA APIs. > (https://lore.kernel.org/linux-iommu/20210511194726.gp1002...@nvidia.com/) > There is also the lack of IOTLB synchronization upon kernel page table > updates. > > This patchset is trying to address these concerns by having an explicit DMA > API compatible model while continuing to support in-kernel use of DMA requests > with PASID. Specifically, the following DMA-IOMMU APIs are introduced: > > int iommu_dma_pasid_enable/disable(struct device *dev, >struct iommu_domain **domain, >enum iommu_dma_pasid_mode mode); > int iommu_map/unmap_kva(struct iommu_domain *domain, > void *cpu_addr,size_t size, int prot); > > The following three addressing modes are supported, with example API usages > by device drivers. > > 1. Physical address (bypass) mode. Similar to DMA direct where trusted devices > can DMA pass through IOMMU on a per PASID basis. > Example: > pasid = iommu_dma_pasid_enable(dev, NULL, IOMMU_DMA_PASID_BYPASS); > /* Use the returned PASID and PA for work submission */ > > 2. IOVA mode. DMA API compatible. 
Map a supervisor PASID the same way as the > PCI requester ID (RID) > Example: > pasid = iommu_dma_pasid_enable(dev, NULL, IOMMU_DMA_PASID_IOVA); > /* Use the PASID and DMA API allocated IOVA for work submission */ Hi Jacob, might be a stupid question: what is the performance benefit of this IOVA mode compared with the current dma_map/unmap_single/sg APIs which have enabled IOMMU like drivers/iommu/arm/arm-smmu-v3? Do we still need to flush the IOTLB by sending commands to the IOMMU each time while doing dma_unmap? > > 3. KVA mode. New kva map/unmap APIs. Support fast and strict sub-modes > transparently based on device trustfulness. > Example: > pasid = iommu_dma_pasid_enable(dev, &domain, IOMMU_DMA_PASID_KVA); > iommu_map_kva(domain, &buf, size, prot); > /* Use the returned PASID and KVA to submit work */ > Where: > Fast mode: Shared CPU page tables for trusted devices only > Strict mode: IOMMU domain returned for the untrusted device to > replicate KVA-PA mapping in IOMMU page tables. A huge bottleneck of the IOMMU we have seen before is that dma_unmap requires an IOTLB flush; for example, in arm_smmu_cmdq_issue_cmdlist(), we see serious contention on acquiring the lock and delays waiting for IOTLB flush completion in arm_smmu_cmdq_poll_until_sync() while multiple threads run. I assume KVA mode can avoid this IOTLB flush as the device is using the page table of the kernel and sharing the whole kernel space. But will users be glad to accept this mode? It seems users are enduring the performance decrease of IOVA mapping and unmapping because it has better security: DMA operations can only run on the specific DMA buffers which have been mapped in the current dma-map/unmap with IOMMU backend. 
Some drivers are using a bounce buffer to overcome the performance loss of dma_map/unmap, as copying is faster than unmapping: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=907676b130711fd1f BTW, we have been debugging dma_map/unmap performance with this benchmark: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/dma/map_benchmark.c you might be able to use it for your benchmarking as well :-) > > On a per device basis, DMA address and performance modes are enabled by the > device drivers. Platform information such as trustability, user command line > input (not included in this set) could also be taken into consideration (not > implemented in this RFC). > > This RFC is intended to communicate the API directions. Little testing is done > outside IDXD and DMA engine tests. > > For PA and IOVA modes, the implementation is straightforward and tested with > the Intel IDXD driver. But several opens remain in KVA fast mode, thus it is not tested: > 1. Lack of IOTLB synchronization, kernel direct map alias can be updated as a > result of module loading/eBPF load. Adding a kernel mmu notifier? > 2. The use of the auxiliary domain for KVA map, will aux domain stay in the > long term? Is there another way to represent sub-device granu isolation? > 3. Is limiting the KVA sharing to the direct map range reasonable and > practical for all architectures? > > > Many thanks to Ashok Raj, Kevin
Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA
On Sat, Oct 2, 2021 at 1:36 AM Jason Gunthorpe wrote: > > On Sat, Oct 02, 2021 at 01:24:54AM +1300, Barry Song wrote: > > > I assume KVA mode can avoid this iotlb flush as the device is using > > the page table of the kernel and sharing the whole kernel space. But > > will users be glad to accept this mode? > > You can avoid the lock by identity mapping the physical address space > of the kernel and making map/unmap a NOP. > > KVA is just a different way to achieve this identity map with slightly > different security properties than the normal way, but it doesn't > reach the same security level as proper map/unmap. > > I'm not sure anyone who cares about DMA security would see value in > the slight difference between KVA and a normal identity map. Yes. This is an important question. If users want a high security level, KVA might not be their choice; if users don't want the security, they are using iommu passthrough. So when will users choose KVA? > > > which have been mapped in the current dma-map/unmap with IOMMU backend. > > some drivers are using a bounce buffer to overcome the performance loss of > > dma_map/unmap as copying is faster than unmapping: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=907676b130711fd1f > > It is pretty unfortunate that drivers are hard coding behaviors based > on assumptions of what the portable API is doing under the covers. Not really, when there is a tx_copybreak which can be set by ethtool or similar userspace tools. If users are using iommu passthrough, copying won't happen with the default tx_copybreak. If users are using strict iommu mode, socket buffers are copied into the buffers allocated and mapped in the driver, so this won't require mapping and unmapping socket buffers frequently. > > Jason Thanks barry ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
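The tx_copybreak heuristic referred to here can be sketched as a trivial userspace model (the threshold value and names are illustrative; real drivers expose the tunable via ethtool rather than a global):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of the tx_copybreak heuristic: packets at or below the
 * threshold are memcpy'd into a pre-mapped bounce buffer, so no
 * per-packet dma_map/unmap (and hence no IOTLB flush) is needed;
 * larger packets are mapped individually. 256 bytes is an assumed
 * default for illustration. */
static size_t tx_copybreak = 256;

static bool use_bounce_buffer(size_t pkt_len)
{
	return pkt_len <= tx_copybreak;
}
```

The trade-off is the same one discussed in the thread: copying small packets costs cycles, but avoids the map/unmap path entirely, which is a win when unmap implies a synchronous IOTLB invalidation.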
[PATCH 2/2] dt-bindings: arm-smmu: Add qcm2290 compatible strings
Add compatible strings for qcm2290 iommu to documentation.

Signed-off-by: Loic Poulain
---
 Documentation/devicetree/bindings/iommu/arm,smmu.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/iommu/arm,smmu.yaml b/Documentation/devicetree/bindings/iommu/arm,smmu.yaml
index 03f2b2d..e66f20d 100644
--- a/Documentation/devicetree/bindings/iommu/arm,smmu.yaml
+++ b/Documentation/devicetree/bindings/iommu/arm,smmu.yaml
@@ -40,6 +40,7 @@ properties:
           - qcom,sm8150-smmu-500
           - qcom,sm8250-smmu-500
           - qcom,sm8350-smmu-500
+          - qcom,qcm2290-smmu-500
       - const: arm,mmu-500
   - description: Qcom Adreno GPUs implementing "arm,smmu-v2"
     items:
--
2.7.4
Re: [PATCH 0/2] arm64: retry dropping HAVE_ARCH_PFN_VALID
On Thu, 30 Sep 2021 04:30:37 +0300, Mike Rapoport wrote:
> From: Mike Rapoport
>
> Hi,
>
> This is a new attempt to drop HAVE_ARCH_PFN_VALID on arm64 and start using
> the generic implementation of pfn_valid().
>
> [...]

Applied to arm64 (for-next/pfn-valid), thanks!

[1/2] dma-mapping: remove bogus test for pfn_valid from dma_map_resource
      https://git.kernel.org/arm64/c/a9c38c5d267c
[2/2] arm64/mm: drop HAVE_ARCH_PFN_VALID
      https://git.kernel.org/arm64/c/3de360c3fdb3

Cheers,
--
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev
Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()
On Wed, Sep 29, 2021 at 09:36:52PM -0300, Jason Gunthorpe wrote:
> Why would DAX want to do this in the first place?? This means the
> address space zap is much more important than just speeding up
> destruction, it is essential for correctness since the PTEs are not
> holding refcounts naturally...

It is not really for this series to fix, but I think the whole thing is probably racy once you start allowing pte_special pages to be accessed by GUP.

If we look at unmapping the PTE relative to GUP fast, the important sequence is how the TLB flushing doesn't decrement the page refcount until after it knows any concurrent GUP fast is completed. This is arch specific; e.g. it could be done async through a call_rcu handler.

This ensures that pages can't cross back into the free pool and be reallocated until we know for certain that nobody is walking the PTEs and could potentially take an additional reference on it. The scheme cannot rely on the page refcount being 0 because once it goes into the free pool it could be immediately reallocated back to a non-zero refcount.

A DAX user that simply does an address space invalidation doesn't sequence itself with any of this mechanism. So we can race with a thread doing GUP fast and another thread recycling the page into another use - creating a leakage of the page from one security context to another.

This seems to be made worse for the pgmap stuff due to the wonky refcount usage - at least if the refcount had dropped to zero gup fast would be blocked for a time, but even that doesn't happen.

In short, I think using pte special for anything that can be returned by gup fast (and maybe even gup!) is racy/wrong. We must have the normal refcount mechanism work for correctness of the recycling flow.

I don't know why DAX did this; I think we should be talking about undoing all of it, not just the wonky refcounting Alistair and Felix are working on, but also the use of MIXEDMAP and pte special for struct page backed memory.
Jason
Re: DPAA2 triggers, [PATCH] dma debug: report -EEXIST errors in add_dma_entry
On Thu, 30 Sep 2021 15:37:33 +0200 Karsten Graul wrote:
> On 14/09/2021 17:45, Ioana Ciornei wrote:
> > On Wed, Sep 08, 2021 at 10:33:26PM -0500, Jeremy Linton wrote:
> >> +DPAA2, netdev maintainers
> >> Hi,
> >>
> >> On 5/18/21 7:54 AM, Hamza Mahfooz wrote:
> >>> Since, overlapping mappings are not supported by the DMA API we should
> >>> report an error if active_cacheline_insert returns -EEXIST.
> >>
> >> It seems this patch found a victim. I was trying to run iperf3 on a
> >> honeycomb (5.14.0, fedora 35) and the console is blasting this error message
> >> at 100% cpu. So, I changed it to a WARN_ONCE() to get the call trace, which
> >> is attached below.
> >
> > These frags are allocated by the stack, transformed into a scatterlist
> > by skb_to_sgvec and then DMA mapped with dma_map_sg. It was not the
> > dpaa2-eth's decision to use two fragments from the same page (that will
> > also end up in the same cacheline) in two different in-flight skbs.
> >
> > Is this behavior normal?
>
> We see the same problem here and it started with 5.15-rc2 in our nightly CI runs.
> The CI has panic_on_warn enabled so we see the panic every day now.

Adding a WARN for a case that can be a false positive seems not acceptable, exactly for this reason (kernel panic on unaffected systems). So I guess it boils down to the question of whether the behavior that Ioana described is legit behavior on a system that is dma coherent. We are apparently hitting the same scenario, although it could not yet be reproduced with debug printks for some reason.

If the answer is yes, then please remove at least the WARN, so that it will not crash systems that behave validly and have panic_on_warn set. Even a normal printk feels wrong to me in that case; it really sounds rather like you want to fix / better refine the overlap check, if you want to report anything here.
BTW, there is already a WARN in the add_dma_entry() path, related to cacheline overlap and -EEXIST:

add_dma_entry() -> active_cacheline_insert() -> -EEXIST -> active_cacheline_inc_overlap()

That will only trigger when "overlap > ACTIVE_CACHELINE_MAX_OVERLAP". Not familiar with that code, but it seems that there are now two warnings for more or less the same thing, and the new warning is much more prone to false positives. How do these 2 warnings relate, are they both really necessary?

I think the new warning was only introduced because of some old TODO comment in add_dma_entry(), see commit 2b4bbc6231d78 ("dma-debug: report -EEXIST errors in add_dma_entry"). That comment was initially added by Dan a long time ago, and he added several fix-ups for overlap detection after that, including the "overlap > ACTIVE_CACHELINE_MAX_OVERLAP" stuff in active_cacheline_inc_overlap(). So could it be that the TODO comment was simply not valid any more, and would better be removed instead of adding new / double warnings that also generate false positives and kernel crashes?
Re: [RFC 07/20] iommu/iommufd: Add iommufd_[un]bind_device()
On Thu, Sep 30, 2021 at 01:10:29PM +1000, David Gibson wrote:
> On Wed, Sep 29, 2021 at 09:24:57AM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 29, 2021 at 03:25:54PM +1000, David Gibson wrote:
> > > > > > +struct iommufd_device {
> > > > + unsigned int id;
> > > > + struct iommufd_ctx *ictx;
> > > > + struct device *dev; /* always be the physical device */
> > > > + u64 dev_cookie;
> > >
> > > Why do you need both an 'id' and a 'dev_cookie'? Since they're both
> > > unique, couldn't you just use the cookie directly as the index into
> > > the xarray?
> >
> > ID is the kernel value in the xarray - xarray is much more efficient &
> > safe with small kernel controlled values.
> >
> > dev_cookie is a user assigned value that may not be unique. Its
> > purpose is to allow userspace to receive an event and get back to its
> > structure. Most likely userspace will store a pointer here, but it is
> > also possible userspace could not use it.
> >
> > It is a pretty normal pattern
>
> Hm, ok. Could you point me at an example?

For instance user_data vs fd in io_uring. RDMA has many similar examples.

More or less any time you want to allow the kernel to asynchronously return some information, providing a 64-bit user_data lets userspace have an easier time dealing with it.

Jason
Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA
On Sat, Oct 02, 2021 at 01:24:54AM +1300, Barry Song wrote:
> I assume KVA mode can avoid this iotlb flush as the device is using
> the page table of the kernel and sharing the whole kernel space. But
> will users be glad to accept this mode?

You can avoid the lock by identity mapping the physical address space of the kernel and making map/unmap a NOP.

KVA is just a different way to achieve this identity map with slightly different security properties than the normal way, but it doesn't reach the same security level as proper map/unmap.

I'm not sure anyone who cares about DMA security would see value in the slight difference between KVA and a normal identity map.

> which have been mapped in the current dma-map/unmap with IOMMU backend.
> some drivers are using bounce buffers to overcome the performance loss of
> dma_map/unmap as copying is faster than unmapping:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=907676b130711fd1f

It is pretty unfortunate that drivers are hard coding behaviors based on assumptions of what the portable API is doing under the covers.

Jason
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Fri, Oct 01, 2021 at 04:19:22PM +1000, da...@gibson.dropbear.id.au wrote: > On Wed, Sep 22, 2021 at 11:09:11AM -0300, Jason Gunthorpe wrote: > > On Wed, Sep 22, 2021 at 03:40:25AM +, Tian, Kevin wrote: > > > > From: Jason Gunthorpe > > > > Sent: Wednesday, September 22, 2021 1:45 AM > > > > > > > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > > > > This patch adds IOASID allocation/free interface per iommufd. When > > > > > allocating an IOASID, userspace is expected to specify the type and > > > > > format information for the target I/O page table. > > > > > > > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > > > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > > > > semantics. For this type the user should specify the addr_width of > > > > > the I/O address space and whether the I/O page table is created in > > > > > an iommu enforce_snoop format. enforce_snoop must be true at this > > > > > point, > > > > > as the false setting requires additional contract with KVM on handling > > > > > WBINVD emulation, which can be added later. > > > > > > > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) > > > > > for what formats can be specified when allocating an IOASID. > > > > > > > > > > Open: > > > > > - Devices on PPC platform currently use a different iommu driver in > > > > > vfio. > > > > > Per previous discussion they can also use vfio type1v2 as long as > > > > > there > > > > > is a way to claim a specific iova range from a system-wide address > > > > > space. > > > > > This requirement doesn't sound PPC specific, as addr_width for pci > > > > devices > > > > > can be also represented by a range [0, 2^addr_width-1]. This RFC > > > > > hasn't > > > > > adopted this design yet. We hope to have formal alignment in v1 > > > > discussion > > > > > and then decide how to incorporate it in v2. 
> > > >
> > > > I think the request was to include a start/end IO address hint when
> > > > creating the ios. When the kernel creates it then it can return the
> > >
> > > is the hint single-range or could be multiple-ranges?
> >
> > David explained it here:
> >
> > https://lore.kernel.org/kvm/YMrKksUeNW%2FPEGPM@yekko/
>
> Apparently not well enough. I've attempted again in this thread.
>
> > qemu needs to be able to choose if it gets the 32 bit range or 64
> > bit range.
>
> No. qemu needs to supply *both* the 32-bit and 64-bit range to its
> guest, and therefore needs to request both from the host.

As I understood your remarks, each IOAS can only be one of the formats as they have a different PTE layout. So here I meant that qemu needs to be able to pick *for each IOAS* which of the two formats it is.

> Or rather, it *might* need to supply both. It will supply just the
> 32-bit range by default, but the guest can request the 64-bit range
> and/or remove and resize the 32-bit range via hypercall interfaces.
> Vaguely recent Linux guests certainly will request the 64-bit range in
> addition to the default 32-bit range.

And this would result in two different IOAS objects.

Jason
Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE
On Fri, Oct 01, 2021 at 04:13:58PM +1000, David Gibson wrote: > On Tue, Sep 21, 2021 at 02:44:38PM -0300, Jason Gunthorpe wrote: > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote: > > > This patch adds IOASID allocation/free interface per iommufd. When > > > allocating an IOASID, userspace is expected to specify the type and > > > format information for the target I/O page table. > > > > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2), > > > implying a kernel-managed I/O page table with vfio type1v2 mapping > > > semantics. For this type the user should specify the addr_width of > > > the I/O address space and whether the I/O page table is created in > > > an iommu enforce_snoop format. enforce_snoop must be true at this point, > > > as the false setting requires additional contract with KVM on handling > > > WBINVD emulation, which can be added later. > > > > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch) > > > for what formats can be specified when allocating an IOASID. > > > > > > Open: > > > - Devices on PPC platform currently use a different iommu driver in vfio. > > > Per previous discussion they can also use vfio type1v2 as long as there > > > is a way to claim a specific iova range from a system-wide address > > > space. > > > This requirement doesn't sound PPC specific, as addr_width for pci > > > devices > > > can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't > > > adopted this design yet. We hope to have formal alignment in v1 > > > discussion > > > and then decide how to incorporate it in v2. > > > > I think the request was to include a start/end IO address hint when > > creating the ios. When the kernel creates it then it can return the > > actual geometry including any holes via a query. > > So part of the point of specifying start/end addresses is that > explicitly querying holes shouldn't be necessary: if the requested > range crosses a hole, it should fail. 
If you didn't really need all
> that range, you shouldn't have asked for it.
>
> Which means these aren't really "hints" but optionally supplied
> constraints.

We have to be very careful here, there are two very different use cases.

When we are talking about the generic API I am mostly interested to see that applications like DPDK can use this API and be portable to any IOMMU HW the kernel supports. I view the fact that there is VFIO PPC specific code in DPDK as a failing of the kernel to provide a HW abstraction.

This means we cannot define an input that has a magic HW specific value. DPDK can never provide that portably. Thus all these kinds of inputs in the generic API need to be hints, if they exist at all. As an 'address space size hint'/'address space start hint' is generic, useful, and providable by DPDK, I think it is OK. PPC can use it to pick which of the two page table formats to use for this IOAS if it wants.

The second use case is when we have a userspace driver for a specific HW IOMMU. Eg a vIOMMU in qemu doing specific PPC/ARM/x86 acceleration. We can look here for things to make general, but I would expect a fairly high bar. Instead, I would rather see the userspace driver communicate with the kernel driver in its own private language, so that the entire functionality of the unique HW can be used.

So, when it comes to providing exact ranges as an input parameter we have to decide if that is done as some additional general data, or if it should be part of a IOAS_FORMAT_KERNEL_PPC. In this case I suggest the guiding factor should be whether every single IOMMU implementation can be updated to support the value.

Jason
Re: [PATCH v2 6/9] PCI: Add pci_find_dvsec_capability to find designated VSEC
On Thu, 23 Sep 2021 10:26:44 -0700 Ben Widawsky wrote:
> Add pci_find_dvsec_capability to locate a Designated Vendor-Specific
> Extended Capability with the specified DVSEC ID.
>
> The Designated Vendor-Specific Extended Capability (DVSEC) allows one or
> more vendor specific capabilities that aren't tied to the vendor ID of
> the PCI component.
>
> DVSEC is critical for both the Compute Express Link (CXL) driver as well
> as the driver for OpenCAPI coherent accelerator (OCXL).
>
> Cc: David E. Box
> Cc: Jonathan Cameron
> Cc: Bjorn Helgaas
> Cc: Dan Williams
> Cc: linux-...@vger.kernel.org
> Cc: linuxppc-...@lists.ozlabs.org
> Cc: Andrew Donnellan
> Cc: Lu Baolu
> Reviewed-by: Frederic Barrat
> Signed-off-by: Ben Widawsky

Great to see this cleaned up.

Reviewed-by: Jonathan Cameron

> ---
>  drivers/pci/pci.c   | 32
>  include/linux/pci.h |  1 +
>  2 files changed, 33 insertions(+)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index ce2ab62b64cf..94ac86ff28b0 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -732,6 +732,38 @@ u16 pci_find_vsec_capability(struct pci_dev *dev, u16 vendor, int cap)
>  }
>  EXPORT_SYMBOL_GPL(pci_find_vsec_capability);
>
> +/**
> + * pci_find_dvsec_capability - Find DVSEC for vendor
> + * @dev: PCI device to query
> + * @vendor: Vendor ID to match for the DVSEC
> + * @dvsec: Designated Vendor-specific capability ID
> + *
> + * If DVSEC has Vendor ID @vendor and DVSEC ID @dvsec return the capability
> + * offset in config space; otherwise return 0.
> + */
> +u16 pci_find_dvsec_capability(struct pci_dev *dev, u16 vendor, u16 dvsec)
> +{
> +	int pos;
> +
> +	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DVSEC);
> +	if (!pos)
> +		return 0;
> +
> +	while (pos) {
> +		u16 v, id;
> +
> +		pci_read_config_word(dev, pos + PCI_DVSEC_HEADER1, &v);
> +		pci_read_config_word(dev, pos + PCI_DVSEC_HEADER2, &id);
> +		if (vendor == v && dvsec == id)
> +			return pos;
> +
> +		pos = pci_find_next_ext_capability(dev, pos, PCI_EXT_CAP_ID_DVSEC);
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_find_dvsec_capability);
> +
>  /**
>   * pci_find_parent_resource - return resource region of parent bus of given
>   * region
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index cd8aa6fce204..c93ccfa4571b 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1130,6 +1130,7 @@ u16 pci_find_ext_capability(struct pci_dev *dev, int cap);
>  u16 pci_find_next_ext_capability(struct pci_dev *dev, u16 pos, int cap);
>  struct pci_bus *pci_find_next_bus(const struct pci_bus *from);
>  u16 pci_find_vsec_capability(struct pci_dev *dev, u16 vendor, int cap);
> +u16 pci_find_dvsec_capability(struct pci_dev *dev, u16 vendor, u16 dvsec);
>
>  u64 pci_get_dsn(struct pci_dev *dev);
Re: DPAA2 triggers, [PATCH] dma debug: report -EEXIST errors in add_dma_entry
On Fri, Oct 01, 2021 at 06:19:59AM +0200, Christoph Hellwig wrote:
> On Tue, Sep 14, 2021 at 03:45:06PM +, Ioana Ciornei wrote:
> > [ 245.927020] fsl_dpaa2_eth dpni.3: scather-gather idx 0 P=20a732
> > N=20a7320 D=20a732 L=30 DMA_BIDIRECTIONAL dma map error check not
> > applicable
> > [ 245.927048] fsl_dpaa2_eth dpni.3: scather-gather idx 1 P=20a7320030
> > N=20a7320 D=20a7320030 L=5a8 DMA_BIDIRECTIONAL dma map error check not
> > applicable
> > [ 245.927062] DMA-API: cacheline tracking EEXIST, overlapping mappings
> > aren't supported
> >
> > The first line is the dump of the dma_debug_entry which is already present
> > in the radix tree and the second one is the entry which just triggered
> > the EEXIST.
> >
> > As we can see, they are not actually overlapping, at least from my
> > understanding. The first one starts at 0x20a732 with a size 0x30
> > and the second one at 0x20a7320030.
>
> They overlap the cache lines. Which means if you use this driver
> on a system that is not dma coherent you will corrupt data.

This is a driver of an integrated ethernet controller which is DMA coherent.
I added a print just to make sure of this:

--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -567,6 +567,7 @@ static void add_dma_entry(struct dma_debug_entry *entry)
 		pr_err("cacheline tracking ENOMEM, dma-debug disabled\n");
 		global_disable = true;
 	} else if (rc == -EEXIST) {
+		pr_err("dev_is_dma_coherent(%s) = %d\n", dev_name(entry->dev), dev_is_dma_coherent(entry->dev));
 		err_printk(entry->dev, entry, "cacheline tracking EEXIST, overlapping mappings aren't supported\n");
 	}

[   85.852218] DMA-API: dev_is_dma_coherent(dpni.3) = 1
[   85.858891] [ cut here ]
[   85.858893] DMA-API: fsl_dpaa2_eth dpni.3: cacheline tracking EEXIST, overlapping mappings aren't supported
[   85.858901] WARNING: CPU: 13 PID: 1046 at kernel/dma/debug.c:571 add_dma_entry+0x330/0x390
[   85.858911] Modules linked in:
[   85.858915] CPU: 13 PID: 1046 Comm: iperf3 Not tainted 5.15.0-rc2-00478-g34286ba6a164-dirty #1275
[   85.858919] Hardware name: NXP Layerscape LX2160ARDB (DT)

Shouldn't this case not generate this kind of warning?

Ioana