Re: [PATCH v2] iommu: intel: do deep dma-unmapping, to avoid kernel-flooding.

2021-10-22 Thread Ajay Garg
Ping ..

Any updates on this, please?

It will be great to have the fix upstreamed (properly of course).

Right now, the patch contains the suggested change of
explicitly/properly clearing out the dma-mappings when unmap is called.
Please let me know how I can help, including
testing/debugging other approaches if required.


Many thanks to Alex and Lu for their continued support on the issue.



P.S. :

I might have missed mentioning the information about the device that
causes the flooding.
Please find it below:

##
sudo lspci -vvv

0a:00.0 SD Host controller: O2 Micro, Inc. OZ600FJ0/OZ900FJ0/OZ600FJS
SD/MMC Card Reader Controller (rev 05) (prog-if 01)
Subsystem: Dell OZ600FJ0/OZ900FJ0/OZ600FJS SD/MMC Card Reader Controller
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- [remainder of the lspci output lost in the archive]
##

Ajay Garg wrote:
>
> Origins at :
> https://lists.linuxfoundation.org/pipermail/iommu/2021-October/thread.html
>
> === Changes from v1 => v2 ===
>
> a)
> Improved patch-description.
>
> b)
> A more root-level fix, as suggested by
>
> 1.
> Alex Williamson 
>
> 2.
> Lu Baolu 
>
>
>
> === Issue ===
>
> Kernel-flooding is seen when an x86_64 L1 guest (Ubuntu 21) is booted in qemu/kvm
> on an x86_64 host (Ubuntu 21), with a host PCI device attached.
>
> The following logs, along with the corresponding stacktraces, cause the flood:
>
> ..
>  DMAR: ERROR: DMA PTE for vPFN 0x428ec already set (to 3f6ec003 not 3f6ec003)
>  DMAR: ERROR: DMA PTE for vPFN 0x428ed already set (to 3f6ed003 not 3f6ed003)
>  DMAR: ERROR: DMA PTE for vPFN 0x428ee already set (to 3f6ee003 not 3f6ee003)
>  DMAR: ERROR: DMA PTE for vPFN 0x428ef already set (to 3f6ef003 not 3f6ef003)
>  DMAR: ERROR: DMA PTE for vPFN 0x428f0 already set (to 3f6f0003 not 3f6f0003)
> ..
>
>
>
> === Current Behaviour, leading to the issue ===
>
> Currently, when we do a dma-unmapping, we unmap/unlink the mappings, but
> the pte-entries are not cleared.
>
> Thus, the following sequence floods the kernel logs:
>
> i)
> A dma-unmapping makes the real/leaf-level pte-slot invalid, but the
> pte-content itself is not cleared.
>
> ii)
> Now, during some later dma-mapping procedure, as the pte-slot is about
> to hold a new pte-value, the intel-iommu checks if a prior
> pte-entry exists in the pte-slot. If it exists, it logs a kernel-error,
> along with a corresponding stacktrace.
>
> iii)
> Step ii) repeats over and over, and the kernel logs are flooded.
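For illustration, the check referred to in step ii) is the duplicate-PTE test in the
intel-iommu mapping path. A rough, paraphrased sketch of its shape follows (the helper
name and arguments are abridged for illustration; this is not a verbatim copy of
drivers/iommu/intel/iommu.c):

static void sketch_install_leaf_pte(struct dma_pte *pte, u64 pteval,
				    unsigned long vpfn)
{
	u64 old;

	/* The mapping path expects the leaf slot to be empty ... */
	old = cmpxchg64_local(&pte->val, 0ULL, pteval);
	if (old) {
		/* ... and logs an error (plus a stacktrace) when a stale
		 * value from an earlier mapping is still present.
		 */
		pr_crit("ERROR: DMA PTE for vPFN 0x%lx already set (to %llx not %llx)\n",
			vpfn, (unsigned long long)old,
			(unsigned long long)pteval);
	}
}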
>
>
>
> === Fix ===
>
> We ensure that, as part of a dma-unmapping, each (unmapped) pte-slot
> is also cleared of its value/content (at the leaf level, where the
> real iova => pfn mapping is stored).
>
> This completes a "deep" dma-unmapping.
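As a simplified illustration of what "deep" unmapping means here (zeroing the leaf
PTE slots for the unmapped IOVA range), consider the following sketch; it is
illustrative only, not the body of dma_pte_clear_range():

static void sketch_clear_leaf_range(struct dma_pte *first_pte,
				    struct dma_pte *last_pte)
{
	struct dma_pte *pte;

	/* Zero every leaf slot so a later mapping finds it empty. */
	for (pte = first_pte; pte <= last_pte; pte++)
		pte->val = 0;
}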
>
>
>
> Signed-off-by: Ajay Garg 
> ---
>  drivers/iommu/intel/iommu.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index d75f59ae28e6..485a8ea71394 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -5090,6 +5090,8 @@ static size_t intel_iommu_unmap(struct iommu_domain *domain,
>  	gather->freelist = domain_unmap(dmar_domain, start_pfn,
>  					last_pfn, gather->freelist);
> 
> +	dma_pte_clear_range(dmar_domain, start_pfn, last_pfn);
> +
>  	if (dmar_domain->max_addr == iova + size)
>  		dmar_domain->max_addr = iova;
> --
> 2.30.2
>


Re: [PATCH v3 13/33] iommu/mediatek: Remove the power status checking in tlb flush all

2021-10-22 Thread Dafna Hirschfeld

Hi


On 23.09.21 13:58, Yong Wu wrote:

To simplify the code, remove the power status checking in the
tlb_flush_all; that is, remove this:
	if (pm_runtime_get_if_in_use(data->dev) <= 0)
		continue;

After this patch, the mtk_iommu_tlb_flush_all will be called from
a) isr
b) pm runtime resume callback
c) tlb flush range fail case
d) iommu_create_device_direct_mappings
-> iommu_flush_iotlb_all
In the first three cases, the power and clock are always enabled; d) is direct


Regarding case "c) tlb flush range fail case", I found out that this often happens
when the iommu is used while it is runtime suspended. For example, the mtk-vcodec
encoder driver calls "pm_runtime_resume_and_get" only when it starts streaming, but
buffer allocation is done in 'v4l2_reqbufs' before "pm_runtime_resume_and_get" is
called, and then I see the warning "Partial TLB flush timed out, falling back to
full flush".
I am not sure how to fix that issue, but it seems that case 'c)' might indicate that
power and clock are actually not enabled.


mapping, where the tlb flush is unnecessary since we already have tlb_flush_all
in the pm_runtime_resume callback. When the iommu's power status is
changed to active, the tlb is always clean.

In addition, there are still two reasons not to add a PM status check
in the tlb flush all path:
a) Writing the tlb-flush-all register is fine even when the HW has no
power and clocks; the write is simply ignored.
b) pm_runtime_get_if_in_use(m4udev) is 0 when tlb_flush_all is called
from the pm_runtime_resume callback, so the check above cannot be added
to this tlb_flush_all.

Signed-off-by: Yong Wu 
---
 drivers/iommu/mtk_iommu.c | 20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index e9e94944ed91..4a33b6c6b1db 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -204,10 +204,14 @@ static struct mtk_iommu_domain *to_mtk_domain(struct iommu_domain *dom)
 	return container_of(dom, struct mtk_iommu_domain, domain);
 }
 
-static void mtk_iommu_tlb_do_flush_all(struct mtk_iommu_data *data)
+static void mtk_iommu_tlb_flush_all(struct mtk_iommu_data *data)
 {
 	unsigned long flags;
 
+	/*
+	 * No need get power status since the HW PM status nearly is active
+	 * when entering here.
+	 */
 	spin_lock_irqsave(&data->tlb_lock, flags);
 	writel_relaxed(F_INVLD_EN1 | F_INVLD_EN0,
 		       data->base + data->plat_data->inv_sel_reg);
@@ -216,16 +220,6 @@ static void mtk_iommu_tlb_do_flush_all(struct mtk_iommu_data *data)
 	spin_unlock_irqrestore(&data->tlb_lock, flags);
 }
 
-static void mtk_iommu_tlb_flush_all(struct mtk_iommu_data *data)
-{
-	if (pm_runtime_get_if_in_use(data->dev) <= 0)
-		return;
-
-	mtk_iommu_tlb_do_flush_all(data);
-
-	pm_runtime_put(data->dev);
-}
-
 static void mtk_iommu_tlb_flush_range_sync(unsigned long iova, size_t size,
 					   struct mtk_iommu_data *data)
 {
@@ -263,7 +257,7 @@ static void mtk_iommu_tlb_flush_range_sync(unsigned long iova, size_t size,
 		if (ret) {
 			dev_warn(data->dev,
 				 "Partial TLB flush timed out, falling back to full flush\n");
-			mtk_iommu_tlb_do_flush_all(data);
+			mtk_iommu_tlb_flush_all(data);
 		}
 
 		if (has_pm)
@@ -993,7 +987,7 @@ static int __maybe_unused mtk_iommu_runtime_resume(struct device *dev)
 	 *
 	 * Thus, Make sure the tlb always is clean after each PM resume.
 	 */
-	mtk_iommu_tlb_do_flush_all(data);
+	mtk_iommu_tlb_flush_all(data);
 
 	/*
 	 * Uppon first resume, only enable the clk and return, since the values of the
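As an aside on reason b) above: pm_runtime_get_if_in_use() only takes a reference
when the device is already RPM_ACTIVE, and while the ->runtime_resume() callback is
still running the status is RPM_RESUMING, so the helper returns 0. The sketch below
(illustrative only, mirroring the wrapper removed by this patch) shows why a flush
guarded that way would silently be skipped on the resume path:

static void guarded_flush_would_be_skipped(struct mtk_iommu_data *data)
{
	/* Returns 0 when called from the runtime_resume callback,
	 * because the device is not yet marked RPM_ACTIVE ...
	 */
	if (pm_runtime_get_if_in_use(data->dev) <= 0)
		return;	/* ... so the TLB would never be flushed here. */

	mtk_iommu_tlb_flush_all(data);
	pm_runtime_put(data->dev);
}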




Re: [PATCH v3 4/6] iommu: Move IOMMU pagesize check to attach_device

2021-10-22 Thread Robin Murphy

On 2021-10-22 09:06, Marc Zyngier wrote:

On Fri, 22 Oct 2021 03:52:38 +0100,
Lu Baolu  wrote:


On 10/21/21 4:10 PM, Marc Zyngier wrote:

On Thu, 21 Oct 2021 03:22:30 +0100,
Lu Baolu  wrote:


On 10/20/21 10:22 PM, Marc Zyngier wrote:

On Wed, 20 Oct 2021 06:21:44 +0100,
Lu Baolu  wrote:


On 2021/10/20 0:37, Sven Peter via iommu wrote:

+	/*
+	 * Check that CPU pages can be represented by the IOVA granularity.
+	 * This has to be done after ops->attach_dev since many IOMMU drivers
+	 * only limit domain->pgsize_bitmap after having attached the first
+	 * device.
+	 */
+	ret = iommu_check_page_size(domain);
+	if (ret) {
+		__iommu_detach_device(domain, dev);
+		return ret;
+	}
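iommu_check_page_size() itself is not shown in this excerpt. A minimal sketch of
what such a helper might do is below, assuming it simply verifies that the domain
supports a page size no larger than the CPU's PAGE_SIZE; the actual helper in
Sven's series may differ:

static int sketch_iommu_check_page_size(struct iommu_domain *domain)
{
	/*
	 * pgsize_bitmap has one bit set per supported IOMMU page size.
	 * If no supported size is <= PAGE_SIZE, a single CPU page cannot
	 * be isolated in the IOVA space, so refuse the attach.
	 */
	if (!(domain->pgsize_bitmap & (PAGE_SIZE | (PAGE_SIZE - 1))))
		return -EFAULT;

	return 0;
}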


It looks odd. __iommu_attach_device() attaches an I/O page table for a
device. How does it relate to CPU pages? Why is it a failure case if CPU
page size is not covered?


If you allocate a CPU PAGE_SIZE'd region, and point it at a device
that now can DMA to more than what you have allocated because the
IOMMU's own page size is larger, the device now has access to data it
shouldn't see. In my book, that's a pretty bad thing.


But even if you enforce the CPU page size check here, this problem still
exists unless all DMA buffers are PAGE_SIZE aligned and sized, right?


Let me take a CPU analogy: you have a page that contains some user
data *and* a kernel secret. How do you map this page into userspace
without leaking the kernel secret?

PAGE_SIZE allocations are the unit of isolation, and this applies to
both CPU and IOMMU. If you have allocated a DMA buffer that is less
than a page, you then have to resort to bounce buffering, or accept
that your data isn't safe.


I can understand the problems when the IOMMU page size is larger than the CPU
page size. But the code itself is not clean. The vendor iommu drivers
know more hardware details than the iommu core. It looks odd that the
vendor iommu says "okay, I can attach this I/O page table to the
device", but the iommu core says "no, you can't" and rolls everything
back.


If your IOMMU driver can do things behind the core's back and
contradict the view that the core has, then it is probably time to fix
your IOMMU driver and make the core aware of what is going on.
Supported page sizes is one of these things.

In general, keeping the IOMMU driver as dumb as possible is a worthy
design goal, and this is why we have these abstractions.


In this case it's the abstractions that are the problem, though. Any 
driver which supports heterogeneous IOMMU instances with potentially 
differing page sizes currently has no choice but to do horrible bodges 
to make the bus-based iommu_domain_alloc() paradigm work *at all*. 
Fixing that from the fundamental API level upwards has been on the to-do 
list for some time now, but won't be straightforward.


Robin.


Re: [PATCH 0/5] iommu/virtio: Add identity domains

2021-10-22 Thread Jean-Philippe Brucker
On Fri, Oct 22, 2021 at 06:16:27AM -0400, Michael S. Tsirkin wrote:
> On Wed, Oct 13, 2021 at 01:10:48PM +0100, Jean-Philippe Brucker wrote:
> > Support identity domains, allowing to only enable IOMMU protection for a
> > subset of endpoints (those assigned to userspace, for example). Users
> > may enable identity domains at compile time
> > (CONFIG_IOMMU_DEFAULT_PASSTHROUGH), boot time (iommu.passthrough=1) or
> > runtime (/sys/kernel/iommu_groups/*/type = identity).
> 
> 
> I put this in my branch so it can get testing under linux-next,
> but pls notice if the ballot does not conclude in time
> for the merge window I won't send it to Linus.

Makes sense, thank you. I sent a new version of the spec change with
clarifications
https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07969.html

Thanks,
Jean



Re: [PATCH 0/5] iommu/virtio: Add identity domains

2021-10-22 Thread Michael S. Tsirkin
On Wed, Oct 13, 2021 at 01:10:48PM +0100, Jean-Philippe Brucker wrote:
> Support identity domains, allowing to only enable IOMMU protection for a
> subset of endpoints (those assigned to userspace, for example). Users
> may enable identity domains at compile time
> (CONFIG_IOMMU_DEFAULT_PASSTHROUGH), boot time (iommu.passthrough=1) or
> runtime (/sys/kernel/iommu_groups/*/type = identity).


I put this in my branch so it can get testing under linux-next,
but pls notice if the ballot does not conclude in time
for the merge window I won't send it to Linus.

> Patches 1-2 support identity domains using the optional
> VIRTIO_IOMMU_F_BYPASS_CONFIG feature. The feature bit is not yet in the
> spec, see [1] for the latest proposal.
> 
> Patches 3-5 add a fallback to identity mappings, when the feature is not
> supported.
> 
> Note that this series doesn't touch the global bypass bit added by
> VIRTIO_IOMMU_F_BYPASS_CONFIG. All endpoints managed by the IOMMU should
> be attached to a domain, so global bypass isn't in use after endpoints
> are probed. Before that, the global bypass policy is decided by the
> hypervisor and firmware. So I don't think Linux needs to touch the
> global bypass bit, but there are some patches available on my
> virtio-iommu/bypass branch [2] to test it.
> 
> QEMU patches are on my virtio-iommu/bypass branch [3] (and the list)
> 
> [1] https://www.mail-archive.com/virtio-dev@lists.oasis-open.org/msg07898.html
> [2] https://jpbrucker.net/git/linux/log/?h=virtio-iommu/bypass
> [3] https://jpbrucker.net/git/qemu/log/?h=virtio-iommu/bypass
> 
> Jean-Philippe Brucker (5):
>   iommu/virtio: Add definitions for VIRTIO_IOMMU_F_BYPASS_CONFIG
>   iommu/virtio: Support bypass domains
>   iommu/virtio: Sort reserved regions
>   iommu/virtio: Pass end address to viommu_add_mapping()
>   iommu/virtio: Support identity-mapped domains
> 
>  include/uapi/linux/virtio_iommu.h |   8 ++-
>  drivers/iommu/virtio-iommu.c  | 113 +-
>  2 files changed, 101 insertions(+), 20 deletions(-)
> 
> -- 
> 2.33.0



Re: [PATCH v3 4/6] iommu: Move IOMMU pagesize check to attach_device

2021-10-22 Thread Marc Zyngier
On Fri, 22 Oct 2021 03:52:38 +0100,
Lu Baolu  wrote:
> 
> On 10/21/21 4:10 PM, Marc Zyngier wrote:
> > On Thu, 21 Oct 2021 03:22:30 +0100,
> > Lu Baolu  wrote:
> >> 
> >> On 10/20/21 10:22 PM, Marc Zyngier wrote:
> >>> On Wed, 20 Oct 2021 06:21:44 +0100,
> >>> Lu Baolu  wrote:
>  
>  On 2021/10/20 0:37, Sven Peter via iommu wrote:
> > +	/*
> > +	 * Check that CPU pages can be represented by the IOVA granularity.
> > +	 * This has to be done after ops->attach_dev since many IOMMU drivers
> > +	 * only limit domain->pgsize_bitmap after having attached the first
> > +	 * device.
> > +	 */
> > +	ret = iommu_check_page_size(domain);
> > +	if (ret) {
> > +		__iommu_detach_device(domain, dev);
> > +		return ret;
> > +	}
>  
>  It looks odd. __iommu_attach_device() attaches an I/O page table for a
>  device. How does it relate to CPU pages? Why is it a failure case if CPU
>  page size is not covered?
> >>> 
> >>> If you allocate a CPU PAGE_SIZE'd region, and point it at a device
> >>> that now can DMA to more than what you have allocated because the
> >>> IOMMU's own page size is larger, the device now has access to data it
> >>> shouldn't see. In my book, that's a pretty bad thing.
> >> 
> >> But even if you enforce the CPU page size check here, this problem still
> >> exists unless all DMA buffers are PAGE_SIZE aligned and sized, right?
> > 
> > Let me take a CPU analogy: you have a page that contains some user
> > data *and* a kernel secret. How do you map this page into userspace
> > without leaking the kernel secret?
> > 
> > PAGE_SIZE allocations are the unit of isolation, and this applies to
> > both CPU and IOMMU. If you have allocated a DMA buffer that is less
> > than a page, you then have to resort to bounce buffering, or accept
> > that your data isn't safe.
> 
> I can understand the problems when the IOMMU page size is larger than the CPU
> page size. But the code itself is not clean. The vendor iommu drivers
> know more hardware details than the iommu core. It looks odd that the
> vendor iommu says "okay, I can attach this I/O page table to the
> device", but the iommu core says "no, you can't" and rolls everything
> back.

If your IOMMU driver can do things behind the core's back and
contradict the view that the core has, then it is probably time to fix
your IOMMU driver and make the core aware of what is going on.
Supported page sizes is one of these things.

In general, keeping the IOMMU driver as dumb as possible is a worthy
design goal, and this is why we have these abstractions.

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

2021-10-22 Thread Jean-Philippe Brucker
On Thu, Oct 21, 2021 at 08:22:23PM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 21, 2021 at 03:58:02PM +0100, Jean-Philippe Brucker wrote:
> > On Thu, Oct 21, 2021 at 02:26:00AM +, Tian, Kevin wrote:
> > > > I'll leave it to Jean to confirm. If only coherent DMA can be used in
> > > > the guest on other platforms, I suppose VFIO should not blindly set
> > > > IOMMU_CACHE, and in concept it should deny assigning a non-coherent
> > > > device since no coordination with the guest exists today.
> > > 
> > > Jean, what's your opinion?
> > 
> > Yes a sanity check to prevent assigning non-coherent devices would be
> > good, though I'm not particularly worried about non-coherent devices. PCIe
> > on Arm should be coherent (according to the Base System Architecture). So
> > vfio-pci devices should be coherent, but vfio-platform and mdev are
> > case-by-case (hopefully all coherent since it concerns newer platforms).
> > 
> > More worrying, I thought we disabled No-Snoop for VFIO but I was wrong,
> > it's left enabled. On Arm I don't think userspace can perform the right
> > cache maintenance operations to maintain coherency with a device that
> > issues No-Snoop writes. Userspace can issue clean+invalidate but not
> > invalidate alone, so there is no equivalent to
> > arch_sync_dma_for_cpu().
> 
> So what happens in a VM? Does a VM know that arch_sync_dma_for_cpu()
> is not available?

This would only affect userspace drivers; it's only host or guest
userspace that cannot issue the maintenance operations. The VM can do
arch_sync_dma_for_cpu().
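For a kernel-side illustration of the maintenance operations in question (the
function name below is a placeholder, not code from any driver discussed here):
on a non-coherent system the streaming DMA API ends up in
arch_sync_dma_for_cpu()/arch_sync_dma_for_device(), which is exactly the hook
that plain userspace lacks.

#include <linux/dma-mapping.h>

/* Placeholder example: consume a buffer the device has just written. */
static void example_consume_dma_buffer(struct device *dev, dma_addr_t dma,
				       size_t len)
{
	/* Invalidate stale cache lines before the CPU reads the data;
	 * on non-coherent systems this reaches arch_sync_dma_for_cpu().
	 */
	dma_sync_single_for_cpu(dev, dma, len, DMA_FROM_DEVICE);

	/* ... the CPU reads the kernel-mapped buffer here ... */

	/* Hand the buffer back to the device for the next transfer. */
	dma_sync_single_for_device(dev, dma, len, DMA_FROM_DEVICE);
}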

Thanks,
Jean

> 
> And how does this work with the nested IOMMU translation? I thought I
> read in the SMMU spec that the io page table entries could control
> cachability including in nesting cases?
> 
> > I think the worst that can happen is the device owner shooting itself in
> > the foot by using No-Snoop, but would it hurt to disable it?
> 
> No, the worst is the same as Intel - a driver running in the guest VM
> assumes it can use arch_sync_dma_for_cpu() and acts accordingly,
> resulting in a broken VM.
> 
> Jason