Re: [PATCH v2] iommu: intel: do deep dma-unmapping, to avoid kernel-flooding.
Ping.. Any updates on this, please? It would be great to have the fix
upstreamed (properly, of course). Right now, the patch contains the
change as suggested: explicitly/properly clearing out the dma-mappings
when unmap is called.

Please let me know in whatever way I can help, including
testing/debugging other approaches if required. Many thanks to Alex and
Lu for their continued support on the issue.

P.S.: I might have missed mentioning the information about the device
that causes the flooding. Please find it below:

## sudo lspci -vvv

0a:00.0 SD Host controller: O2 Micro, Inc. OZ600FJ0/OZ900FJ0/OZ600FJS SD/MMC Card Reader Controller (rev 05) (prog-if 01)
	Subsystem: Dell OZ600FJ0/OZ900FJ0/OZ600FJS SD/MMC Card Reader Controller
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-

 wrote:
>
> Origins at :
> https://lists.linuxfoundation.org/pipermail/iommu/2021-October/thread.html
>
> === Changes from v1 => v2 ===
>
> a) Improved patch-description.
>
> b) A more root-level fix, as suggested by
>
>    1. Alex Williamson
>    2. Lu Baolu
>
>
> === Issue ===
>
> Kernel-flooding is seen when an x86_64 L1 guest (Ubuntu-21) is booted
> in qemu/kvm on an x86_64 host (Ubuntu-21), with a host-pci-device
> attached.
>
> The following kind of logs, along with the stacktraces, cause the flood:
>
> ..
> DMAR: ERROR: DMA PTE for vPFN 0x428ec already set (to 3f6ec003 not 3f6ec003)
> DMAR: ERROR: DMA PTE for vPFN 0x428ed already set (to 3f6ed003 not 3f6ed003)
> DMAR: ERROR: DMA PTE for vPFN 0x428ee already set (to 3f6ee003 not 3f6ee003)
> DMAR: ERROR: DMA PTE for vPFN 0x428ef already set (to 3f6ef003 not 3f6ef003)
> DMAR: ERROR: DMA PTE for vPFN 0x428f0 already set (to 3f6f0003 not 3f6f0003)
> ..
>
>
> === Current Behaviour, leading to the issue ===
>
> Currently, when we do a dma-unmapping, we unmap/unlink the mappings,
> but the pte-entries are not cleared.
> Thus, the following sequencing floods the kernel-logs:
>
> i)   A dma-unmapping makes the real/leaf-level pte-slot invalid, but
>      the pte-content itself is not cleared.
>
> ii)  Later, during some dma-mapping procedure, as the pte-slot is about
>      to hold a new pte-value, the intel-iommu checks whether a prior
>      pte-entry exists in the pte-slot. If it exists, it logs a
>      kernel-error, along with a corresponding stacktrace.
>
> iii) Step ii) runs in abundance, and the kernel-logs run insane.
>
>
> === Fix ===
>
> We ensure that, as part of a dma-unmapping, each (unmapped) pte-slot
> is also cleared of its value/content (at the leaf-level, where the
> real iova => pfn mapping is stored).
>
> This completes a "deep" dma-unmapping.
>
>
> Signed-off-by: Ajay Garg
> ---
>  drivers/iommu/intel/iommu.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index d75f59ae28e6..485a8ea71394 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -5090,6 +5090,8 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
>  	gather->freelist = domain_unmap(dmar_domain, start_pfn,
>  					last_pfn, gather->freelist);
>
> +	dma_pte_clear_range(dmar_domain, start_pfn, last_pfn);
> +
>  	if (dmar_domain->max_addr == iova + size)
>  		dmar_domain->max_addr = iova;
>
> --
> 2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 13/33] iommu/mediatek: Remove the power status checking in tlb flush all
Hi

On 23.09.21 13:58, Yong Wu wrote:
> To simplify the code, remove the power status checking in the
> tlb_flush_all; remove this:
>
>	if (pm_runtime_get_if_in_use(data->dev) <= 0)
>		continue;
>
> After this patch, the mtk_iommu_tlb_flush_all will be called from:
> a) isr
> b) pm runtime resume callback
> c) tlb flush range fail case
> d) iommu_create_device_direct_mappings -> iommu_flush_iotlb_all
>
> In the first three cases, the power and clock always are enabled; d) is
> direct

Regarding case "c) tlb flush range fail case", I found out that this
often happens when the iommu is used while it is runtime suspended. For
example, the mtk-vcodec encoder driver calls "pm_runtime_resume_and_get"
only when it starts streaming, but buffer allocation is done in
'v4l2_reqbufs' before "pm_runtime_resume_and_get" is called, and then I
see the warning "Partial TLB flush timed out, falling back to full
flush".

I am not sure how to fix that issue, but it seems that case 'c)' might
indicate that power and clock are actually not enabled.

> mapping, the tlb flush is unnecessary since we already have
> tlb_flush_all in the pm_runtime_resume callback. When the iommu's power
> status is changed to active, the tlb always is clean.
>
> In addition, there still are 2 reasons not to add PM status checking in
> the tlb flush all:
> a) Writing the tlb flush all register also is ok even though the HW has
>    no power and clocks; the write is ignored.
> b) pm_runtime_get_if_in_use(m4udev) is 0 when the tlb_flush_all is
>    called from the pm_runtime_resume cb. From this point, we can not
>    add the code above in this tlb_flush_all.
> Signed-off-by: Yong Wu
> ---
>  drivers/iommu/mtk_iommu.c | 20 +++-
>  1 file changed, 7 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
> index e9e94944ed91..4a33b6c6b1db 100644
> --- a/drivers/iommu/mtk_iommu.c
> +++ b/drivers/iommu/mtk_iommu.c
> @@ -204,10 +204,14 @@ static struct mtk_iommu_domain *to_mtk_domain(struct iommu_domain *dom)
>  	return container_of(dom, struct mtk_iommu_domain, domain);
>  }
>
> -static void mtk_iommu_tlb_do_flush_all(struct mtk_iommu_data *data)
> +static void mtk_iommu_tlb_flush_all(struct mtk_iommu_data *data)
>  {
>  	unsigned long flags;
>
> +	/*
> +	 * No need get power status since the HW PM status nearly is active
> +	 * when entering here.
> +	 */
>  	spin_lock_irqsave(&data->tlb_lock, flags);
>  	writel_relaxed(F_INVLD_EN1 | F_INVLD_EN0,
>  		       data->base + data->plat_data->inv_sel_reg);
> @@ -216,16 +220,6 @@ static void mtk_iommu_tlb_do_flush_all(struct mtk_iommu_data *data)
>  	spin_unlock_irqrestore(&data->tlb_lock, flags);
>  }
>
> -static void mtk_iommu_tlb_flush_all(struct mtk_iommu_data *data)
> -{
> -	if (pm_runtime_get_if_in_use(data->dev) <= 0)
> -		return;
> -
> -	mtk_iommu_tlb_do_flush_all(data);
> -
> -	pm_runtime_put(data->dev);
> -}
> -
>  static void mtk_iommu_tlb_flush_range_sync(unsigned long iova, size_t size,
>  					   struct mtk_iommu_data *data)
>  {
> @@ -263,7 +257,7 @@ static void mtk_iommu_tlb_flush_range_sync(unsigned long iova, size_t size,
>  	if (ret) {
>  		dev_warn(data->dev,
>  			 "Partial TLB flush timed out, falling back to full flush\n");
> -		mtk_iommu_tlb_do_flush_all(data);
> +		mtk_iommu_tlb_flush_all(data);
>  	}
>
>  	if (has_pm)
> @@ -993,7 +987,7 @@ static int __maybe_unused mtk_iommu_runtime_resume(struct device *dev)
>  	 *
>  	 * Thus, Make sure the tlb always is clean after each PM resume.
>  	 */
> -	mtk_iommu_tlb_do_flush_all(data);
> +	mtk_iommu_tlb_flush_all(data);
>
>  	/*
>  	 * Uppon first resume, only enable the clk and return, since the values of the
Re: [PATCH v3 4/6] iommu: Move IOMMU pagesize check to attach_device
On 2021-10-22 09:06, Marc Zyngier wrote:
> On Fri, 22 Oct 2021 03:52:38 +0100, Lu Baolu wrote:
>> On 10/21/21 4:10 PM, Marc Zyngier wrote:
>>> On Thu, 21 Oct 2021 03:22:30 +0100, Lu Baolu wrote:
>>>> On 10/20/21 10:22 PM, Marc Zyngier wrote:
>>>>> On Wed, 20 Oct 2021 06:21:44 +0100, Lu Baolu wrote:
>>>>>> On 2021/10/20 0:37, Sven Peter via iommu wrote:
>>>>>>> +	/*
>>>>>>> +	 * Check that CPU pages can be represented by the IOVA granularity.
>>>>>>> +	 * This has to be done after ops->attach_dev since many IOMMU drivers
>>>>>>> +	 * only limit domain->pgsize_bitmap after having attached the first
>>>>>>> +	 * device.
>>>>>>> +	 */
>>>>>>> +	ret = iommu_check_page_size(domain);
>>>>>>> +	if (ret) {
>>>>>>> +		__iommu_detach_device(domain, dev);
>>>>>>> +		return ret;
>>>>>>> +	}
>>>>>>
>>>>>> It looks odd. __iommu_attach_device() attaches an I/O page table for
>>>>>> a device. How does it relate to CPU pages? Why is it a failure case
>>>>>> if the CPU page size is not covered?
>>>>>
>>>>> If you allocate a CPU PAGE_SIZE'd region, and point it at a device
>>>>> that can now DMA to more than what you have allocated because the
>>>>> IOMMU's own page size is larger, the device now has access to data it
>>>>> shouldn't see. In my book, that's a pretty bad thing.
>>>>
>>>> But even if you enforce the CPU page size check here, this problem
>>>> still exists unless all DMA buffers are PAGE_SIZE aligned and sized,
>>>> right?
>>>
>>> Let me take a CPU analogy: you have a page that contains some user
>>> data *and* a kernel secret. How do you map this page into userspace
>>> without leaking the kernel secret?
>>>
>>> PAGE_SIZE allocations are the unit of isolation, and this applies to
>>> both the CPU and the IOMMU. If you have allocated a DMA buffer that is
>>> less than a page, you then have to resort to bounce buffering, or
>>> accept that your data isn't safe.
>>
>> I can understand the problems when the IOMMU page size is larger than
>> the CPU page size. But the code itself is not clean. The vendor iommu
>> drivers know more hardware details than the iommu core. It looks odd
>> that the vendor iommu says "okay, I can attach this I/O page table to
>> the device", but the iommu core says "no, you can't" and rolls
>> everything back.
> If your IOMMU driver can do things behind the core's back and
> contradict the view that the core has, then it is probably time to fix
> your IOMMU driver and make the core aware of what is going on.
> Supported page sizes is one of these things.
>
> In general, keeping the IOMMU driver as dumb as possible is a worthy
> design goal, and this is why we have these abstractions.

In this case it's the abstractions that are the problem, though. Any
driver which supports heterogeneous IOMMU instances with potentially
differing page sizes currently has no choice but to do horrible bodges
to make the bus-based iommu_domain_alloc() paradigm work *at all*.
Fixing that from the fundamental API level upwards has been on the
to-do list for some time now, but won't be straightforward.

Robin.
Re: [PATCH 0/5] iommu/virtio: Add identity domains
On Fri, Oct 22, 2021 at 06:16:27AM -0400, Michael S. Tsirkin wrote:
> On Wed, Oct 13, 2021 at 01:10:48PM +0100, Jean-Philippe Brucker wrote:
> > Support identity domains, allowing to only enable IOMMU protection for a
> > subset of endpoints (those assigned to userspace, for example). Users
> > may enable identity domains at compile time
> > (CONFIG_IOMMU_DEFAULT_PASSTHROUGH), boot time (iommu.passthrough=1) or
> > runtime (/sys/kernel/iommu_groups/*/type = identity).
>
> I put this in my branch so it can get testing under linux-next,
> but please note that if the ballot does not conclude in time
> for the merge window I won't send it to Linus.

Makes sense, thank you. I sent a new version of the spec change with
clarifications:
https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07969.html

Thanks,
Jean
Re: [PATCH 0/5] iommu/virtio: Add identity domains
On Wed, Oct 13, 2021 at 01:10:48PM +0100, Jean-Philippe Brucker wrote:
> Support identity domains, allowing to only enable IOMMU protection for a
> subset of endpoints (those assigned to userspace, for example). Users
> may enable identity domains at compile time
> (CONFIG_IOMMU_DEFAULT_PASSTHROUGH), boot time (iommu.passthrough=1) or
> runtime (/sys/kernel/iommu_groups/*/type = identity).

I put this in my branch so it can get testing under linux-next,
but please note that if the ballot does not conclude in time
for the merge window I won't send it to Linus.

> Patches 1-2 support identity domains using the optional
> VIRTIO_IOMMU_F_BYPASS_CONFIG feature. The feature bit is not yet in the
> spec; see [1] for the latest proposal.
>
> Patches 3-5 add a fallback to identity mappings, when the feature is not
> supported.
>
> Note that this series doesn't touch the global bypass bit added by
> VIRTIO_IOMMU_F_BYPASS_CONFIG. All endpoints managed by the IOMMU should
> be attached to a domain, so global bypass isn't in use after endpoints
> are probed. Before that, the global bypass policy is decided by the
> hypervisor and firmware. So I don't think Linux needs to touch the
> global bypass bit, but there are some patches available on my
> virtio-iommu/bypass branch [2] to test it.
>
> QEMU patches are on my virtio-iommu/bypass branch [3] (and the list).
>
> [1] https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07898.html
> [2] https://jpbrucker.net/git/linux/log/?h=virtio-iommu/bypass
> [3] https://jpbrucker.net/git/qemu/log/?h=virtio-iommu/bypass
>
> Jean-Philippe Brucker (5):
>   iommu/virtio: Add definitions for VIRTIO_IOMMU_F_BYPASS_CONFIG
>   iommu/virtio: Support bypass domains
>   iommu/virtio: Sort reserved regions
>   iommu/virtio: Pass end address to viommu_add_mapping()
>   iommu/virtio: Support identity-mapped domains
>
>  include/uapi/linux/virtio_iommu.h |   8 ++-
>  drivers/iommu/virtio-iommu.c      | 113 +-
>  2 files changed, 101 insertions(+), 20 deletions(-)
>
> --
> 2.33.0
Re: [PATCH v3 4/6] iommu: Move IOMMU pagesize check to attach_device
On Fri, 22 Oct 2021 03:52:38 +0100, Lu Baolu wrote:
>
> On 10/21/21 4:10 PM, Marc Zyngier wrote:
> > On Thu, 21 Oct 2021 03:22:30 +0100, Lu Baolu wrote:
> >>
> >> On 10/20/21 10:22 PM, Marc Zyngier wrote:
> >>> On Wed, 20 Oct 2021 06:21:44 +0100, Lu Baolu wrote:
> >>>>
> >>>> On 2021/10/20 0:37, Sven Peter via iommu wrote:
> >>>>> +	/*
> >>>>> +	 * Check that CPU pages can be represented by the IOVA granularity.
> >>>>> +	 * This has to be done after ops->attach_dev since many IOMMU drivers
> >>>>> +	 * only limit domain->pgsize_bitmap after having attached the first
> >>>>> +	 * device.
> >>>>> +	 */
> >>>>> +	ret = iommu_check_page_size(domain);
> >>>>> +	if (ret) {
> >>>>> +		__iommu_detach_device(domain, dev);
> >>>>> +		return ret;
> >>>>> +	}
> >>>>
> >>>> It looks odd. __iommu_attach_device() attaches an I/O page table for a
> >>>> device. How does it relate to CPU pages? Why is it a failure case if
> >>>> the CPU page size is not covered?
> >>>
> >>> If you allocate a CPU PAGE_SIZE'd region, and point it at a device
> >>> that can now DMA to more than what you have allocated because the
> >>> IOMMU's own page size is larger, the device now has access to data it
> >>> shouldn't see. In my book, that's a pretty bad thing.
> >>
> >> But even if you enforce the CPU page size check here, this problem
> >> still exists unless all DMA buffers are PAGE_SIZE aligned and sized,
> >> right?
> >
> > Let me take a CPU analogy: you have a page that contains some user
> > data *and* a kernel secret. How do you map this page into userspace
> > without leaking the kernel secret?
> >
> > PAGE_SIZE allocations are the unit of isolation, and this applies to
> > both the CPU and the IOMMU. If you have allocated a DMA buffer that is
> > less than a page, you then have to resort to bounce buffering, or
> > accept that your data isn't safe.
>
> I can understand the problems when the IOMMU page size is larger than
> the CPU page size. But the code itself is not clean. The vendor iommu
> drivers know more hardware details than the iommu core.
> It looks odd that the vendor iommu says "okay, I can attach this I/O
> page table to the device", but the iommu core says "no, you can't" and
> rolls everything back.

If your IOMMU driver can do things behind the core's back and
contradict the view that the core has, then it is probably time to fix
your IOMMU driver and make the core aware of what is going on.
Supported page sizes is one of these things.

In general, keeping the IOMMU driver as dumb as possible is a worthy
design goal, and this is why we have these abstractions.

	M.

--
Without deviation from the norm, progress is not possible.
Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
On Thu, Oct 21, 2021 at 08:22:23PM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 21, 2021 at 03:58:02PM +0100, Jean-Philippe Brucker wrote:
> > On Thu, Oct 21, 2021 at 02:26:00AM +, Tian, Kevin wrote:
> > > > I'll leave it to Jean to confirm. If only coherent DMA can be used in
> > > > the guest on other platforms, suppose VFIO should not blindly set
> > > > IOMMU_CACHE and in concept it should deny assigning a non-coherent
> > > > device since no co-ordination with guest exists today.
> > >
> > > Jean, what's your opinion?
> >
> > Yes, a sanity check to prevent assigning non-coherent devices would be
> > good, though I'm not particularly worried about non-coherent devices.
> > PCIe on Arm should be coherent (according to the Base System
> > Architecture). So vfio-pci devices should be coherent, but
> > vfio-platform and mdev are case-by-case (hopefully all coherent since
> > it concerns newer platforms).
> >
> > More worrying, I thought we disabled No-Snoop for VFIO, but I was
> > wrong; it's left enabled. On Arm I don't think userspace can perform
> > the right cache maintenance operations to maintain coherency with a
> > device that issues No-Snoop writes. Userspace can issue
> > clean+invalidate but not invalidate alone, so there is no equivalent
> > to arch_sync_dma_for_cpu().
>
> So what happens in a VM? Does a VM know that arch_sync_dma_for_cpu()
> is not available?

This would only affect userspace drivers; it's only host or guest
userspace that cannot issue the maintenance operations. The VM can do
arch_sync_dma_for_cpu().

Thanks,
Jean

> And how does this work with the nested IOMMU translation? I thought I
> read in the SMMU spec that the io page table entries could control
> cachability, including in nesting cases?
>
> > I think the worst that can happen is the device owner shooting itself
> > in the foot by using No-Snoop, but would it hurt to disable it?
>
> No, the worst is the same as Intel: a driver running in the guest VM
> assumes it can use arch_sync_dma_for_cpu() and acts accordingly,
> resulting in a broken VM.
>
> Jason