RE: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-03-24 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, March 25, 2022 7:12 AM
> 
> On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
> > That's essentially what I'm trying to reconcile, we're racing both
> > to round out the compatibility interface to fully support QEMU, while
> > also updating QEMU to use iommufd directly so it won't need that full
> > support.  It's a confusing message.  Thanks,
> 
> The long term purpose of compatibility is to provide a config option
> to allow type 1 to be turned off and continue to support old user
> space (eg in containers) that is running old qemu/dpdk/spdk/etc.
> 
> This shows that we have a plan/path to allow a distro to support only
> one iommu interface in their kernel should they choose without having
> to sacrifice uABI compatibility.
> 
> As for racing, my intention is to leave the compat interface alone for
> awhile - the more urgent things on my personal list are the RFC
> for dirty tracking, mlx5 support for dirty tracking, and VFIO preparation
> for iommufd support.
> 
> Eric and Yi are focusing on userspace page tables and qemu updates.
> 
> Joao is working on implementing iommu driver dirty tracking
> 
> Lu and Jacob are working on getting PASID support infrastructure
> together.
> 
> There is a lot going on!
> 
> A question to consider is what would you consider the minimum bar for
> merging?
> 

My two cents. 

IMHO, making the compat work a task that proceeds in parallel with the
other work listed above is the most efficient way to move forward. In
concept the two are not mutually dependent, since they use different sets
of uAPIs (vfio compat vs. iommufd native). Otherwise, given the list of
TODOs, the compat work becomes a single big task gating all the other work.

If agreed, this suggests we may want to prioritize Yi's vfio device uAPI [1]
for integrating vfio with iommufd, to get this series merged. IIRC there are
fewer open issues remaining from the v1 discussion than in the list for the
compat interface. Of course it needs the QEMU change to use iommufd directly
to be ready, but that is necessary to unblock the other tasks anyway.

[1] 
https://github.com/luxis1999/iommufd/commit/2d9278d4ecad7953b3787c98cdb650764af8a1a1

Thanks
Kevin

Re: [git pull] IOMMU Updates for Linux v5.18

2022-03-24 Thread pr-tracker-bot
The pull request you sent on Thu, 24 Mar 2022 11:04:20 +0100:

> git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git 
> tags/iommu-updates-v5.18

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/34af78c4e616c359ed428d79fe4758a35d2c5473

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


RE: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-24 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, March 24, 2022 9:46 PM
> 
> On Thu, Mar 24, 2022 at 07:25:03AM +, Tian, Kevin wrote:
> 
> > Based on that here is a quick tweak of the force-snoop part (not compiled).
> 
> I liked your previous idea better - IOMMU_CAP_CACHE_COHERENCY
> started out OK but got weird. So let's fix it back to the way it was.
> 
> How about this:
> 
> https://github.com/jgunthorpe/linux/commits/intel_no_snoop
> 
> b11c19a4b34c2a iommu: Move the Intel no-snoop control off of
> IOMMU_CACHE
> 5263947f9d5f36 vfio: Require that device support DMA cache coherence
> eab4b381c64a30 iommu: Restore IOMMU_CAP_CACHE_COHERENCY to its
> original meaning
> 2752e12bed48f6 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY
> with dev_is_dma_coherent()
> 
> If you like it could you take it from here?
> 

This looks good to me, except that the 2nd patch (eab4b381) should be the
last one, otherwise it breaks bisect. And in that case its subject would
simply be about removing the capability instead of restoring it...

Let me find a box to verify it.

Thanks
Kevin


[merged] mm-enforce-pageblock_order-max_order.patch removed from -mm tree

2022-03-24 Thread Andrew Morton


The patch titled
 Subject: mm: enforce pageblock_order < MAX_ORDER
has been removed from the -mm tree.  Its filename was
 mm-enforce-pageblock_order-max_order.patch

This patch was dropped because it was merged into mainline or a subsystem tree

--
From: David Hildenbrand 
Subject: mm: enforce pageblock_order < MAX_ORDER

Some places in the kernel don't really expect pageblock_order >=
MAX_ORDER, and it looks like this is only possible in corner cases:

1) With CONFIG_DEFERRED_STRUCT_PAGE_INIT we'll end up freeing pageblock_order
   pages via __free_pages_core(), which cannot possibly work.

2) find_zone_movable_pfns_for_nodes() will round up the ZONE_MOVABLE
   start PFN to MAX_ORDER_NR_PAGES. Consequently with a bigger
   pageblock_order, we could have a single pageblock partially managed by
   two zones.

3) compaction code runs into __fragmentation_index() with order
   >= MAX_ORDER, when checking WARN_ON_ONCE(order >= MAX_ORDER). [1]

4) mm/page_reporting.c won't be reporting any pages with default
   page_reporting_order == pageblock_order, as we'll be skipping the
   reporting loop inside page_reporting_process_zone().

5) __rmqueue_fallback() will never be able to steal with
   ALLOC_NOFRAGMENT.

pageblock_order >= MAX_ORDER is weird either way: it's a pure optimization
for making alloc_contig_range(), as used for allocation of gigantic pages,
a little more reliable to succeed.  However, if there is demand for
somewhat reliable allocation of gigantic pages, affected setups should be
using CMA or boottime allocations instead.

So let's make sure that pageblock_order < MAX_ORDER and simplify.

[1] https://lkml.kernel.org/r/87r189a2ks@linux.ibm.com
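To make the effect concrete, here is a small standalone C sketch of the clamp
this patch adds to pageblock-flags.h, using numbers assumed from the ppc64
report in [1] (64k base pages, FORCE_MAX_ZONEORDER=8, 16MB default hugepage);
the values are illustrative assumptions, not taken from a real config:

#include <stdio.h>

/* Assumed for illustration: 16MB hugepages on 64K base pages give
 * HUGETLB_PAGE_ORDER = log2(16MB / 64KB) = 8, and FORCE_MAX_ZONEORDER=8
 * gives MAX_ORDER = 8, so pageblock_order would reach MAX_ORDER. */
#define MAX_ORDER 8
#define HUGETLB_PAGE_ORDER 8

#define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* The clamp introduced below for !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
#define pageblock_order min_t(unsigned int, HUGETLB_PAGE_ORDER, MAX_ORDER - 1)

int main(void)
{
	/* Prints 7: previously this would have been 8 (== MAX_ORDER),
	 * which the corner cases listed above cannot handle. */
	printf("pageblock_order = %u\n", pageblock_order);
	return 0;
}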

Link: https://lkml.kernel.org/r/20220214174132.219303-3-da...@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Zi Yan 
Cc: Aneesh Kumar K.V 
Cc: Benjamin Herrenschmidt 
Cc: Christoph Hellwig 
Cc: Frank Rowand 
Cc: John Garry via iommu 
Cc: Marek Szyprowski 
Cc: Michael Ellerman 
Cc: Michael S. Tsirkin 
Cc: Minchan Kim 
Cc: Paul Mackerras 
Cc: Rob Herring 
Cc: Robin Murphy 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
---

 drivers/virtio/virtio_mem.c     |    9 ++--
 include/linux/cma.h             |    3 --
 include/linux/pageblock-flags.h |    7 --
 mm/Kconfig                      |    3 ++
 mm/page_alloc.c                 |   32 +++---
 5 files changed, 20 insertions(+), 34 deletions(-)

--- a/drivers/virtio/virtio_mem.c~mm-enforce-pageblock_order-max_order
+++ a/drivers/virtio/virtio_mem.c
@@ -2476,13 +2476,10 @@ static int virtio_mem_init_hotplug(struc
  VIRTIO_MEM_DEFAULT_OFFLINE_THRESHOLD);
 
/*
-* We want subblocks to span at least MAX_ORDER_NR_PAGES and
-* pageblock_nr_pages pages. This:
-* - Is required for now for alloc_contig_range() to work reliably -
-*   it doesn't properly handle smaller granularity on ZONE_NORMAL.
+* TODO: once alloc_contig_range() works reliably with pageblock
+* granularity on ZONE_NORMAL, use pageblock_nr_pages instead.
 */
-   sb_size = max_t(uint64_t, MAX_ORDER_NR_PAGES,
-   pageblock_nr_pages) * PAGE_SIZE;
+   sb_size = PAGE_SIZE * MAX_ORDER_NR_PAGES;
sb_size = max_t(uint64_t, vm->device_block_size, sb_size);
 
if (sb_size < memory_block_size_bytes() && !force_bbm) {
--- a/include/linux/cma.h~mm-enforce-pageblock_order-max_order
+++ a/include/linux/cma.h
@@ -25,8 +25,7 @@
  * -- can deal with only some pageblocks of a higher-order page being
  *  MIGRATE_CMA, we can use pageblock_nr_pages.
  */
-#define CMA_MIN_ALIGNMENT_PAGES max_t(phys_addr_t, MAX_ORDER_NR_PAGES, \
- pageblock_nr_pages)
+#define CMA_MIN_ALIGNMENT_PAGES MAX_ORDER_NR_PAGES
 #define CMA_MIN_ALIGNMENT_BYTES (PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES)
 
 struct cma;
--- a/include/linux/pageblock-flags.h~mm-enforce-pageblock_order-max_order
+++ a/include/linux/pageblock-flags.h
@@ -37,8 +37,11 @@ extern unsigned int pageblock_order;
 
 #else /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
 
-/* Huge pages are a constant size */
-#define pageblock_order	HUGETLB_PAGE_ORDER
+/*
+ * Huge pages are a constant size, but don't exceed the maximum allocation
+ * granularity.
+ */
+#define pageblock_order	min_t(unsigned int, HUGETLB_PAGE_ORDER, MAX_ORDER - 1)
 
 #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
 
--- a/mm/Kconfig~mm-enforce-pageblock_order-max_order
+++ a/mm/Kconfig
@@ -262,6 +262,9 @@ config HUGETLB_PAGE_SIZE_VARIABLE
	  HUGETLB_PAGE_ORDER when there are multiple HugeTLB page sizes available
	  on a platform.
 
+ Note that the pageblock_order cannot exceed MAX_ORDER - 1 and will be
+ clamped down to MAX_ORDER - 1.
+
 config CONTIG_ALLOC
def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
 
--- 

[merged] cma-factor-out-minimum-alignment-requirement.patch removed from -mm tree

2022-03-24 Thread Andrew Morton


The patch titled
 Subject: cma: factor out minimum alignment requirement
has been removed from the -mm tree.  Its filename was
 cma-factor-out-minimum-alignment-requirement.patch

This patch was dropped because it was merged into mainline or a subsystem tree

--
From: David Hildenbrand 
Subject: cma: factor out minimum alignment requirement

Patch series "mm: enforce pageblock_order < MAX_ORDER".

Having pageblock_order >= MAX_ORDER seems to be able to happen in corner
cases and some parts of the kernel are not prepared for it.

For example, Aneesh has shown [1] that such kernels can be compiled on
ppc64 with 64k base pages by setting FORCE_MAX_ZONEORDER=8, which will run
into a WARN_ON_ONCE(order >= MAX_ORDER) in compaction code right during
boot.

We can get pageblock_order >= MAX_ORDER when the default hugetlb size is
bigger than the maximum allocation granularity of the buddy, in which case
we are no longer talking about huge pages but instead gigantic pages.

Having pageblock_order >= MAX_ORDER can only make alloc_contig_range() of
such gigantic pages more likely to succeed.

Reliable use of gigantic pages either requires boot time allocation or
CMA, no need to overcomplicate some places in the kernel to optimize for
corner cases that are broken in other areas of the kernel.


This patch (of 2):

Let's enforce pageblock_order < MAX_ORDER and simplify.

Especially patch #1 can be regarded as a cleanup before:
[PATCH v5 0/6] Use pageblock_order for cma and alloc_contig_range
alignment. [2]

[1] https://lkml.kernel.org/r/87r189a2ks@linux.ibm.com
[2] https://lkml.kernel.org/r/20220211164135.1803616-1-zi@sent.com

Link: https://lkml.kernel.org/r/20220214174132.219303-2-da...@redhat.com
Signed-off-by: David Hildenbrand 
Reviewed-by: Zi Yan 
Acked-by: Rob Herring 
Cc: Aneesh Kumar K.V 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Frank Rowand 
Cc: Michael S. Tsirkin 
Cc: Christoph Hellwig 
Cc: Marek Szyprowski 
Cc: Robin Murphy 
Cc: Minchan Kim 
Cc: Vlastimil Babka 
Cc: John Garry via iommu 

Signed-off-by: Andrew Morton 
---

 arch/powerpc/include/asm/fadump-internal.h |    5 
 arch/powerpc/kernel/fadump.c               |    2 -
 drivers/of/of_reserved_mem.c               |    9 ++--
 include/linux/cma.h                        |    9 
 kernel/dma/contiguous.c                    |    4 ---
 mm/cma.c                                   |   20 ---
 6 files changed, 19 insertions(+), 30 deletions(-)

--- 
a/arch/powerpc/include/asm/fadump-internal.h~cma-factor-out-minimum-alignment-requirement
+++ a/arch/powerpc/include/asm/fadump-internal.h
@@ -19,11 +19,6 @@
 
 #define memblock_num_regions(memblock_type)	(memblock.memblock_type.cnt)
 
-/* Alignment per CMA requirement. */
-#define FADUMP_CMA_ALIGNMENT   (PAGE_SIZE <<   \
-max_t(unsigned long, MAX_ORDER - 1,\
-pageblock_order))
-
 /* FAD commands */
 #define FADUMP_REGISTER1
 #define FADUMP_UNREGISTER  2
--- a/arch/powerpc/kernel/fadump.c~cma-factor-out-minimum-alignment-requirement
+++ a/arch/powerpc/kernel/fadump.c
@@ -544,7 +544,7 @@ int __init fadump_reserve_mem(void)
if (!fw_dump.nocma) {
fw_dump.boot_memory_size =
ALIGN(fw_dump.boot_memory_size,
- FADUMP_CMA_ALIGNMENT);
+ CMA_MIN_ALIGNMENT_BYTES);
}
 #endif
 
--- a/drivers/of/of_reserved_mem.c~cma-factor-out-minimum-alignment-requirement
+++ a/drivers/of/of_reserved_mem.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include <linux/cma.h>
 
 #include "of_private.h"
 
@@ -116,12 +117,8 @@ static int __init __reserved_mem_alloc_s
if (IS_ENABLED(CONFIG_CMA)
&& of_flat_dt_is_compatible(node, "shared-dma-pool")
&& of_get_flat_dt_prop(node, "reusable", NULL)
-   && !nomap) {
-   unsigned long order =
-   max_t(unsigned long, MAX_ORDER - 1, pageblock_order);
-
-   align = max(align, (phys_addr_t)PAGE_SIZE << order);
-   }
+   && !nomap)
+   align = max_t(phys_addr_t, align, CMA_MIN_ALIGNMENT_BYTES);
 
	prop = of_get_flat_dt_prop(node, "alloc-ranges", &len);
if (prop) {
--- a/include/linux/cma.h~cma-factor-out-minimum-alignment-requirement
+++ a/include/linux/cma.h
@@ -20,6 +20,15 @@
 
 #define CMA_MAX_NAME 64
 
+/*
+ * TODO: once the buddy -- especially pageblock merging and 
alloc_contig_range()
+ * -- can deal with only some pageblocks of a higher-order page being
+ *  MIGRATE_CMA, we can use pageblock_nr_pages.
+ */
+#define CMA_MIN_ALIGNMENT_PAGES max_t(phys_addr_t, MAX_ORDER_NR_PAGES, \
+				 pageblock_nr_pages)
+#define CMA_MIN_ALIGNMENT_BYTES (PAGE_SIZE * CMA_MIN_ALIGNMENT_PAGES)

Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-03-24 Thread Jason Gunthorpe via iommu
On Thu, Mar 24, 2022 at 04:04:03PM -0600, Alex Williamson wrote:
> On Wed, 23 Mar 2022 21:33:42 -0300
> Jason Gunthorpe  wrote:
> 
> > On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
> > 
> > > My overall question here would be whether we can actually achieve a
> > > compatibility interface that has sufficient feature transparency that we
> > > can dump vfio code in favor of this interface, or will there be enough
> > > niche use cases that we need to keep type1 and vfio containers around
> > > through a deprecation process?  
> > 
> > Other than SPAPR, I think we can.
> 
> Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure
> for POWER support?

Certainly initially - I have no ability to do better than that.

I'm hoping someone from IBM will be willing to work on this in the
long run and we can do better.

> > I don't think this is compatibility. No kernel today triggers qemu to
> > use this feature as no kernel supports live migration. No existing
> > qemu will trigger this feature with new kernels that support live
> > migration v2. Therefore we can adjust qemu's dirty tracking at the
> > same time we enable migration v2 in qemu.
> 
> I guess I was assuming that enabling v2 migration in QEMU was dependent
> on the existing type1 dirty tracking because it's the only means we
> have to tell QEMU that all memory is perpetually dirty when we have a
> DMA device.  Is that not correct?

I haven't looked closely at this part in qemu, but IMHO, if qemu sees
that it has VFIO migration support but does not have any DMA dirty
tracking capability, it should not do precopy flows.

If there is a bug here we should certainly fix it before progressing
the v2 patches. I'll ask Yishai & Co to take a look.

> > Intel no-snoop is simple enough, just needs some Intel cleanup parts.

Patches for this exist now
 
> > mdev will come along with the final VFIO integration, all the really
> > hard parts are done already. The VFIO integration is a medium sized
> > task overall.
> > 
> > So, I'm not ready to give up yet :)
> 
> Ok, that's a more promising outlook than I was inferring from the long
> list of missing features.

Yeah, it is just long, but they are not scary things, just priorities
and patch planning.

> > I think we can get there pretty quickly, or at least I haven't got
> > anything that is scaring me alot (beyond SPAPR of course)
> > 
> > For the dpdk/etcs of the world I think we are already there.
> 
> That's essentially what I'm trying to reconcile, we're racing both
> to round out the compatibility interface to fully support QEMU, while
> also updating QEMU to use iommufd directly so it won't need that full
> support.  It's a confusing message.  Thanks,

The long term purpose of compatibility is to provide a config option
to allow type 1 to be turned off and continue to support old user
space (eg in containers) that is running old qemu/dpdk/spdk/etc.

This shows that we have a plan/path to allow a distro to support only
one iommu interface in their kernel should they choose without having
to sacrifice uABI compatibility.

As for racing, my intention is to leave the compat interface alone for
awhile - the more urgent things on my personal list are the RFC
for dirty tracking, mlx5 support for dirty tracking, and VFIO preparation
for iommufd support.

Eric and Yi are focusing on userspace page tables and qemu updates.

Joao is working on implementing iommu driver dirty tracking

Lu and Jacob are working on getting PASID support infrastructure
together.

There is a lot going on!

A question to consider is what would you consider the minimum bar for
merging?

Thanks,
Jason


Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-24 Thread Alex Williamson
On Thu, 24 Mar 2022 19:27:39 -0300
Jason Gunthorpe  wrote:

> On Thu, Mar 24, 2022 at 02:40:15PM -0600, Alex Williamson wrote:
> > On Tue, 22 Mar 2022 13:15:21 -0300
> > Jason Gunthorpe via iommu  wrote:
> >   
> > > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > >   
> > > > I'm still picking my way through the series, but the later compat
> > > > interface doesn't mention this difference as an outstanding issue.
> > > > Doesn't this difference need to be accounted in how libvirt manages VM
> > > > resource limits?  
> > > 
> > > AFAICT, no, but it should be checked.
> > >   
> > > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > > memory limits.
> > > 
> > > Yes, and ulimit does work fully. prlimit adjusts the value:
> > > 
> > > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > >   struct rlimit *new_rlim, struct rlimit *old_rlim)
> > > {
> > >   rlim = tsk->signal->rlim + resource;
> > > [..]
> > >   if (new_rlim)
> > >   *rlim = *new_rlim;
> > > 
> > > Which vfio reads back here:
> > > 
> > > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = 
> > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = 
> > > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > 
> > > And iommufd does the same read back:
> > > 
> > >   lock_limit =
> > >   task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > >   npages = pages->npinned - pages->last_npinned;
> > >   do {
> > >   cur_pages = atomic_long_read(&pages->source_user->locked_vm);
> > >   new_pages = cur_pages + npages;
> > >   if (new_pages > lock_limit)
> > >   return -ENOMEM;
> > >   } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, cur_pages,
> > >new_pages) != cur_pages);
> > > 
> > > So it does work essentially the same.  
> > 
> > Well, except for the part about vfio updating mm->locked_vm and iommufd
> > updating user->locked_vm, a per-process counter versus a per-user
> > counter.  prlimit specifically sets process resource limits, which get
> > reflected in task_rlimit.  
> 
> Indeed, but that is not how the majority of other things seem to
> operate it.
> 
> > For example, let's say a user has two 4GB VMs and they're hot-adding
> > vfio devices to each of them, so libvirt needs to dynamically modify
> > the locked memory limit for each VM.  AIUI, libvirt would look at the
> > VM size and call prlimit to set that value.  If libvirt does this to
> > both VMs, then each has a task_rlimit of 4GB.  In vfio we add pinned
> > pages to mm->locked_vm, so this works well.  In the iommufd loop above,
> > we're comparing a per-task/process limit to a per-user counter.  So I'm
> > a bit lost how both VMs can pin their pages here.  
> 
> I don't know anything about libvirt - it seems strange to use a
> securityish feature like ulimit but not securely isolate processes
> with real users.
> 
> But if it really does this then it really does this.
> 
> So at the very least VFIO container has to keep working this way.
> 
> The next question is if we want iommufd's own device node to work this
> way and try to change libvirt somehow. It seems libvirt will have to
> deal with this at some point as io_uring will trigger the same problem.
> 
> > > This whole area is a bit peculiar (eg mlock itself works differently),
> > > IMHO, but with most of the places doing pins voting to use
> > > user->locked_vm as the charge it seems the right path in today's
> > > kernel.  
> > 
> > The philosophy of whether it's ultimately a better choice for the
> > kernel aside, if userspace breaks because we're accounting in a
> > per-user pool rather than a per-process pool, then our compatibility
> > layer ain't so transparent.  
> 
> Sure, if it doesn't work it doesn't work. Let's be sure to clearly
> document what the compatibility issue is, and then we have to keep it
> per-process.
> 
> And the same reasoning likely means I can't change RDMA either as qemu
> will break just as well when qemu uses rdma mode.
> 
> Which is pretty sucky, but it is what it is..

I added Daniel Berrangé to the cc list for my previous reply, hopefully
he can comment whether libvirt has the sort of user security model you
allude to above that maybe makes this a non-issue for this use case.
Unfortunately it's extremely difficult to prove that there are no such
use cases out there even if libvirt is ok.  Thanks,

Alex


Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-24 Thread Jason Gunthorpe via iommu
On Thu, Mar 24, 2022 at 02:40:15PM -0600, Alex Williamson wrote:
> On Tue, 22 Mar 2022 13:15:21 -0300
> Jason Gunthorpe via iommu  wrote:
> 
> > On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> > 
> > > I'm still picking my way through the series, but the later compat
> > > interface doesn't mention this difference as an outstanding issue.
> > > Doesn't this difference need to be accounted in how libvirt manages VM
> > > resource limits?
> > 
> > AFAICT, no, but it should be checked.
> > 
> > > AIUI libvirt uses some form of prlimit(2) to set process locked
> > > memory limits.  
> > 
> > Yes, and ulimit does work fully. prlimit adjusts the value:
> > 
> > int do_prlimit(struct task_struct *tsk, unsigned int resource,
> > struct rlimit *new_rlim, struct rlimit *old_rlim)
> > {
> > rlim = tsk->signal->rlim + resource;
> > [..]
> > if (new_rlim)
> > *rlim = *new_rlim;
> > 
> > Which vfio reads back here:
> > 
> > drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = 
> > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > drivers/vfio/vfio_iommu_type1.c:unsigned long limit = 
> > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > 
> > And iommufd does the same read back:
> > 
> > lock_limit =
> > task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > npages = pages->npinned - pages->last_npinned;
> > do {
> > cur_pages = atomic_long_read(&pages->source_user->locked_vm);
> > new_pages = cur_pages + npages;
> > if (new_pages > lock_limit)
> > return -ENOMEM;
> > } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, cur_pages,
> >  new_pages) != cur_pages);
> > 
> > So it does work essentially the same.
> 
> Well, except for the part about vfio updating mm->locked_vm and iommufd
> updating user->locked_vm, a per-process counter versus a per-user
> counter.  prlimit specifically sets process resource limits, which get
> reflected in task_rlimit.

Indeed, but that is not how the majority of other things seem to
operate it.

> For example, let's say a user has two 4GB VMs and they're hot-adding
> vfio devices to each of them, so libvirt needs to dynamically modify
> the locked memory limit for each VM.  AIUI, libvirt would look at the
> VM size and call prlimit to set that value.  If libvirt does this to
> both VMs, then each has a task_rlimit of 4GB.  In vfio we add pinned
> pages to mm->locked_vm, so this works well.  In the iommufd loop above,
> we're comparing a per-task/process limit to a per-user counter.  So I'm
> a bit lost how both VMs can pin their pages here.

I don't know anything about libvirt - it seems strange to use a
securityish feature like ulimit but not securely isolate processes
with real users.

But if it really does this then it really does this.

So at the very least VFIO container has to keep working this way.

The next question is if we want iommufd's own device node to work this
way and try to change libvirt somehow. It seems libvirt will have to
deal with this at some point as io_uring will trigger the same problem.

> > This whole area is a bit peculiar (eg mlock itself works differently),
> > IMHO, but with most of the places doing pins voting to use
> > user->locked_vm as the charge it seems the right path in today's
> > kernel.
> 
> The philosophy of whether it's ultimately a better choice for the
> kernel aside, if userspace breaks because we're accounting in a
> per-user pool rather than a per-process pool, then our compatibility
> layer ain't so transparent.

Sure, if it doesn't work it doesn't work. Let's be sure to clearly
document what the compatibility issue is, and then we have to keep it
per-process.

And the same reasoning likely means I can't change RDMA either as qemu
will break just as well when qemu uses rdma mode.

Which is pretty sucky, but it is what it is..

Jason


Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-03-24 Thread Alex Williamson
On Wed, 23 Mar 2022 21:33:42 -0300
Jason Gunthorpe  wrote:

> On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
> 
> > My overall question here would be whether we can actually achieve a
> > compatibility interface that has sufficient feature transparency that we
> > can dump vfio code in favor of this interface, or will there be enough
> > niche use cases that we need to keep type1 and vfio containers around
> > through a deprecation process?  
> 
> Other than SPAPR, I think we can.

Does this mean #ifdef CONFIG_PPC in vfio core to retain infrastructure
for POWER support?

> > The locked memory differences for one seem like something that
> > libvirt wouldn't want hidden  
> 
> I'm first interested in understanding how this change becomes
> a real problem in practice that requires libvirt to do something
> different for vfio or iommufd. We can discuss in the other thread
> 
> If this is the make or break point then I think we can deal with it
> either by going back to what vfio does now or perhaps some other
> friendly compat approach..
> 
> > and we have questions regarding support for vaddr hijacking  
> 
> I'm not sure what vaddr hijacking is? Do you mean
> VFIO_DMA_MAP_FLAG_VADDR ? There is a comment that outlines my plan to
> implement it in a functionally compatible way without the deadlock
> problem. I estimate this as a small project.
> 
> > and different ideas how to implement dirty page tracking,   
> 
> I don't think this is compatibility. No kernel today triggers qemu to
> use this feature as no kernel supports live migration. No existing
> qemu will trigger this feature with new kernels that support live
> migration v2. Therefore we can adjust qemu's dirty tracking at the
> same time we enable migration v2 in qemu.

I guess I was assuming that enabling v2 migration in QEMU was dependent
on the existing type1 dirty tracking because it's the only means we
have to tell QEMU that all memory is perpetually dirty when we have a
DMA device.  Is that not correct?  If we don't intend to carry type1
dirty tracking into iommufd compatibility and we need it for this
purpose, then our window for being able to rip it out entirely closes
when QEMU gains v2 migration support.

> With Joao's work we are close to having a solid RFC to come with
> something that can be fully implemented.
> 
> Hopefully we can agree to this soon enough that qemu can come with a
> full package of migration v2 support including the dirty tracking
> solution.
> 
> > not to mention the missing features that are currently well used,
> > like p2p mappings, coherency tracking, mdev, etc.  
> 
> I consider these all mandatory things, they won't be left out.
> 
> The reason they are not in the RFC is mostly because supporting them
> requires work outside just this iommufd area, and I'd like this series
> to remain self-contained.
> 
> I've already got a draft to add DMABUF support to VFIO PCI which
> nicely solves the follow_pfn security problem, we want to do this for
> another reason already. I'm waiting for some testing feedback before
> posting it. I need some help from Daniel to make the DMABUF revoke semantic
> he and I have been talking about. In the worst case we can copy the
> follow_pfn approach.
> 
> Intel no-snoop is simple enough, just needs some Intel cleanup parts.
> 
> mdev will come along with the final VFIO integration, all the really
> hard parts are done already. The VFIO integration is a medium sized
> task overall.
> 
> So, I'm not ready to give up yet :)

Ok, that's a more promising outlook than I was inferring from the long
list of missing features.

> > Where do we focus attention?  Is symlinking device files our proposal
> > to userspace and is that something achievable, or do we want to use
> > this compatibility interface as a means to test the interface and
> > allow userspace to make use of it for transition, if their use cases
> > allow it, perhaps eventually performing the symlink after deprecation
> > and eventual removal of the vfio container and type1 code?  Thanks,  
> 
> symlinking device files is definitely just a suggested way to expedite
> testing.
> 
> Things like qemu that are learning to use iommufd-only features should
> learn to directly open iommufd instead of vfio container to activate
> those features.

Which is kind of the basis for my question: QEMU is racing toward native
support, and Eric and Yi are already working on this, so some of these
compatibility interfaces might only have short-term usefulness.

> Looking long down the road I don't think we want to have type 1 and
> iommufd code forever.

Agreed.

> So, I would like to make an option to compile
> out vfio container support entirely and have that option arrange for
> iommufd to provide the container device node itself.
> 
> I think we can get there pretty quickly, or at least I haven't got
> anything that is scaring me alot (beyond SPAPR of course)
> 
> For the dpdk/etcs of the world I think we are already there.


Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-24 Thread Toke Høiland-Jørgensen via iommu
Linus Torvalds  writes:

> On Thu, Mar 24, 2022 at 10:07 AM Toke Høiland-Jørgensen  wrote:
>>
>> Right, but is that sync_for_device call really needed?
>
> Well, imagine that you have a non-cache-coherent DMA (not bounce
> buffers - just bad hardware)...
>
> So the driver first does that dma_sync_single_for_cpu() for the CPU to
> see the current state (for the non-cache-coherent case it would just
> invalidate caches).
>
> The driver then examines the command buffer state, sees that it's
> still in progress, and does that return -EINPROGRESS.
>
> It's actually very natural in that situation to flush the caches from
> the CPU side again. And so dma_sync_single_for_device() is a fairly
> reasonable thing to do in that situation.
>
> But it doesn't seem *required*, no. The CPU caches only have a copy of
> the data in them, no writeback needed (and writeback wouldn't work
> since DMA from the device may be in progress).
>
> So I don't think the dma_sync_single_for_device() is *wrong* per se,
> because the CPU didn't actually do any modifications.
>
> But yes, I think it's unnecessary - because any later CPU accesses
> would need that dma_sync_single_for_cpu() anyway, which should
> invalidate any stale caches.

OK, the above was basically how I understood it. Thank you for
confirming!

> And it clearly doesn't work in a bounce-buffer situation, but honestly
> I don't think a "CPU modified buffers concurrently with DMA" can
> *ever* work in that situation, so I think it's wrong for a bounce
> buffer model to ever do anything in the dma_sync_single_for_device()
> situation.

Right.

> Does removing that dma_sync_single_for_device() actually fix the
> problem for the ath driver?

I am hoping Oleksandr can help answer that since my own ath9k hardware
is currently on strike :(

-Toke

Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-24 Thread Alex Williamson
On Tue, 22 Mar 2022 13:15:21 -0300
Jason Gunthorpe via iommu  wrote:

> On Tue, Mar 22, 2022 at 09:29:23AM -0600, Alex Williamson wrote:
> 
> > I'm still picking my way through the series, but the later compat
> > interface doesn't mention this difference as an outstanding issue.
> > Doesn't this difference need to be accounted in how libvirt manages VM
> > resource limits?
> 
> AFAICT, no, but it should be checked.
> 
> > AIUI libvirt uses some form of prlimit(2) to set process locked
> > memory limits.  
> 
> Yes, and ulimit does work fully. prlimit adjusts the value:
> 
> int do_prlimit(struct task_struct *tsk, unsigned int resource,
>   struct rlimit *new_rlim, struct rlimit *old_rlim)
> {
>   rlim = tsk->signal->rlim + resource;
> [..]
>   if (new_rlim)
>   *rlim = *new_rlim;
> 
> Which vfio reads back here:
> 
> drivers/vfio/vfio_iommu_type1.c:unsigned long pfn, limit = 
> rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> drivers/vfio/vfio_iommu_type1.c:unsigned long limit = 
> rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> 
> And iommufd does the same read back:
> 
>   lock_limit =
>   task_rlimit(pages->source_task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>   npages = pages->npinned - pages->last_npinned;
>   do {
>   cur_pages = atomic_long_read(&pages->source_user->locked_vm);
>   new_pages = cur_pages + npages;
>   if (new_pages > lock_limit)
>   return -ENOMEM;
>   } while (atomic_long_cmpxchg(&pages->source_user->locked_vm, cur_pages,
>new_pages) != cur_pages);
> 
> So it does work essentially the same.

Well, except for the part about vfio updating mm->locked_vm and iommufd
updating user->locked_vm, a per-process counter versus a per-user
counter.  prlimit specifically sets process resource limits, which get
reflected in task_rlimit.

For example, let's say a user has two 4GB VMs and they're hot-adding
vfio devices to each of them, so libvirt needs to dynamically modify
the locked memory limit for each VM.  AIUI, libvirt would look at the
VM size and call prlimit to set that value.  If libvirt does this to
both VMs, then each has a task_rlimit of 4GB.  In vfio we add pinned
pages to mm->locked_vm, so this works well.  In the iommufd loop above,
we're comparing a per-task/process limit to a per-user counter.  So I'm
a bit lost how both VMs can pin their pages here.

Am I missing some assumption about how libvirt uses prlimit or
sandboxes users?
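A minimal userspace C sketch of that divergence (the structs are hypothetical
stand-ins for task_rlimit, mm->locked_vm and user->locked_vm, not kernel
code): two 4GB VMs run as the same user, each granted a 4GB RLIMIT_MEMLOCK
via prlimit(). Per-process accounting admits both, while a shared per-user
counter rejects the second:

#include <stdio.h>

#define GiB (1024UL * 1024 * 1024)

/* Hypothetical stand-ins for the kernel's counters */
struct task { unsigned long rlim_memlock; unsigned long mm_locked_vm; };
struct user { unsigned long user_locked_vm; };

/* vfio type1 style: charge the pinning process's own counter */
static int pin_per_process(struct task *tsk, unsigned long bytes)
{
	if (tsk->mm_locked_vm + bytes > tsk->rlim_memlock)
		return -1;		/* would exceed this task's rlimit */
	tsk->mm_locked_vm += bytes;
	return 0;
}

/* iommufd/io_uring style: charge a counter shared by all of the user's
 * processes, still tested against the calling task's rlimit */
static int pin_per_user(struct task *tsk, struct user *u, unsigned long bytes)
{
	if (u->user_locked_vm + bytes > tsk->rlim_memlock)
		return -1;
	u->user_locked_vm += bytes;
	return 0;
}

int main(void)
{
	struct task vm1 = { .rlim_memlock = 4 * GiB };
	struct task vm2 = { .rlim_memlock = 4 * GiB };
	struct user qemu_user = { 0 };

	int a = pin_per_process(&vm1, 4 * GiB);	/* 0: fits vm1's limit */
	int b = pin_per_process(&vm2, 4 * GiB);	/* 0: fits vm2's limit */
	int c = pin_per_user(&vm1, &qemu_user, 4 * GiB); /* 0 */
	int d = pin_per_user(&vm2, &qemu_user, 4 * GiB); /* -1: vm1's pins
				already consumed the shared per-user budget */

	printf("per-process: %d %d, per-user: %d %d\n", a, b, c, d);
	return 0;
}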

> The difference is more subtle: io_uring etc. put the charge on the user,
> so it is additive with things like io_uring and additively spans all
> the user's processes.
> 
> However vfio is accounting only per-process and only for itself - no
> other subsystem uses locked as the charge variable for DMA pins.
> 
> The user visible difference will be that a limit X that worked with
> VFIO may start to fail after a kernel upgrade as the charge accounting
> is now cross user and additive with things like iommufd.

And that's exactly the concern.
 
> This whole area is a bit peculiar (eg mlock itself works differently),
> IMHO, but with most of the places doing pins voting to use
> user->locked_vm as the charge it seems the right path in today's
> kernel.

The philosophy of whether it's ultimately a better choice for the
kernel aside, if userspace breaks because we're accounting in a
per-user pool rather than a per-process pool, then our compatibility
layer ain't so transparent.

> Certainly having qemu concurrently using three different subsystems
> (vfio, rdma, io_uring) issuing FOLL_LONGTERM and all accounting for
> RLIMIT_MEMLOCK differently cannot be sane or correct.

I think everyone would agree with that, but it also seems there are
real differences between task_rlimits and per-user vs per-process
accounting buckets and I'm confused how that's not a blocker for trying
to implement transparent compatibility.  Thanks,

Alex



Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-24 Thread Linus Torvalds
On Thu, Mar 24, 2022 at 10:07 AM Toke Høiland-Jørgensen  wrote:
>
> Right, but is that sync_for_device call really needed?

Well, imagine that you have a non-cache-coherent DMA (not bounce
buffers - just bad hardware)...

So the driver first does that dma_sync_single_for_cpu() for the CPU to
see the current state (for the non-cache-coherent case it would just
invalidate caches).

The driver then examines the command buffer state, sees that it's
still in progress, and does that return -EINPROGRESS.

It's actually very natural in that situation to flush the caches from
the CPU side again. And so dma_sync_single_for_device() is a fairly
reasonable thing to do in that situation.

But it doesn't seem *required*, no. The CPU caches only have a copy of
the data in them, no writeback needed (and writeback wouldn't work
since DMA from the device may be in progress).

So I don't think the dma_sync_single_for_device() is *wrong* per se,
because the CPU didn't actually do any modifications.

But yes, I think it's unnecessary - because any later CPU accesses
would need that dma_sync_single_for_cpu() anyway, which should
invalidate any stale caches.

And it clearly doesn't work in a bounce-buffer situation, but honestly
I don't think a "CPU modified buffers concurrently with DMA" can
*ever* work in that situation, so I think it's wrong for a bounce
buffer model to ever do anything in the dma_sync_single_for_device()
situation.

Does removing that dma_sync_single_for_device() actually fix the
problem for the ath driver?

There's a fair number of those dma_sync_single_for_device() things all
over. Could we find mis-uses and warn about them some way? It seems to
be a very natural thing to do in this context, but bounce buffering
does make them very fragile.

 Linus
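For reference, the driver polling idiom Linus describes looks roughly like
this sketch (a hypothetical fragment, not the actual ath9k code; desc_done()
is an invented stand-in for the status check that
ath9k_hw_process_rxdesc_edma() performs):

#include <linux/dma-mapping.h>
#include <linux/errno.h>

static bool desc_done(const void *desc);	/* invented status check */

static int poll_rx_desc(struct device *dev, dma_addr_t addr,
			void *desc, size_t len)
{
	/* Transfer ownership to the CPU: on non-coherent hardware this
	 * invalidates stale CPU cache lines; with SWIOTLB it copies the
	 * bounce buffer back over the original buffer. */
	dma_sync_single_for_cpu(dev, addr, len, DMA_FROM_DEVICE);

	if (!desc_done(desc)) {
		/* Still owned by the device. Handing it back like this is
		 * natural (and harmless) for non-coherent caches, but with
		 * SWIOTLB it re-copies the stale CPU view over the bounce
		 * buffer, wiping concurrent device writes. */
		dma_sync_single_for_device(dev, addr, len, DMA_FROM_DEVICE);
		return -EINPROGRESS;
	}
	return 0;
}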

Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-24 Thread Halil Pasic
On Thu, 24 Mar 2022 16:52:31 +
Robin Murphy  wrote:

> > Creating a new mapping for the same buffer before unmapping the
> > previous one does look rather bogus.  But it does not fit the
> > pattern where reverting the sync_single changes makes the driver
> > work again.
> 
> OK, you made me look :)
> 
> Now that it's obvious what to look for, I can only conclude that during 
> the stanza in ath_edma_get_buffers(), the device is still writing to the 
> buffer while ownership has been transferred to the CPU, and whatever got 
> written while ath9k_hw_process_rxdesc_edma() was running then gets wiped 
> out by the subsequent sync_for_device, which currently resets the 
> SWIOTLB slot to the state that sync_for_cpu copied out. By the letter of 
> the DMA API that's not allowed, but on the other hand I'm not sure if we 
> even have a good idiom for "I can't tell if the device has finished with 
> this buffer or not unless I look at it" :/

I agree with your analysis. Especially with the latter part (where you
state that we don't have a good idiom for that use case).

I believe a stronger statement is also true: it is fundamentally
impossible to accommodate use cases where the device and the CPU need
concurrent access to a DMA buffer, if the DMA buffer isn't in DMA
coherent memory.

If the dma buffer is in dma coherent memory, and we don't need swiotlb,
then we don't need the dma_sync functionality. 

Specifically for swiotlb: if the swiotlb buffer is in DMA coherent
memory, the driver could peek at the swiotlb buffer, but I have no idea
if we can provide a remotely sane API for that. The point is that the
driver would have to peek not via a pointer to the original buffer, but
via a suitable pointer to the bounce buffer, which would probably be
considered valid as long as the mapping is valid. Honestly, IMHO quite
ugly, but I see no other way.

Regards,
Halil


Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-24 Thread Halil Pasic
On Wed, 23 Mar 2022 20:54:08 +
Robin Murphy  wrote:

> On 2022-03-23 19:16, Linus Torvalds wrote:
> > On Wed, Mar 23, 2022 at 12:06 PM Robin Murphy  wrote: 
> >  
> >>
> >> On 2022-03-23 17:27, Linus Torvalds wrote:  
> >>>
> >>> I'm assuming that the ath9k issue is that it gives DMA mapping a big
> >>> enough area to handle any possible packet size, and just expects -
> >>> quite reasonably - smaller packets to only fill the part they need.
> >>>
> >>> Which that "info leak" patch obviously breaks entirely.  
> >>
> >> Except that's the exact case which the new patch is addressing  
> > 
> > Not "addressing". Breaking.
> > 
> > Which is why it will almost certainly get reverted.
> > 
> > Not doing DMA to the whole area seems to be quite the sane thing to do
> > for things like network packets, and overwriting the part that didn't
> > get DMA'd with zeroes seems to be exactly the wrong thing here.
> > 
> > So the SG_IO - and other random untrusted block command sources - data
> > leak will almost certainly have to be addressed differently. Possibly
> > by simply allocating the area with GFP_ZERO to begin with.  
> 
> Er, the point of the block layer case is that whole area *is* zeroed to 
> begin with, and a latent memory corruption problem in SWIOTLB itself 
> replaces those zeros with random other kernel data unexpectedly. Let me 
> try illustrating some sequences for clarity...
> 
> Expected behaviour/without SWIOTLB:
>                              Memory
>                              --------
> start                        12345678
> dma_map(DMA_FROM_DEVICE)     no-op
> device writes partial data   12ABC678 <- ABC
> dma_unmap(DMA_FROM_DEVICE)   12ABC678
> 
> 
> SWIOTLB previously:
>                              Memory      Bounce buffer
>                              --------    -------------
> start                        12345678
> dma_map(DMA_FROM_DEVICE)     no-op
> device writes partial data   12345678    xxABCxxx <- ABC
> dma_unmap(DMA_FROM_DEVICE)   xxABCxxx <- xxABCxxx
> 
> 
> SWIOTLB Now:
>                              Memory      Bounce buffer
>                              --------    -------------
> start                        12345678
> dma_map(DMA_FROM_DEVICE)     12345678 -> 12345678
> device writes partial data   12345678    12ABC678 <- ABC
> dma_unmap(DMA_FROM_DEVICE)   12ABC678 <- 12ABC678
> 
> 
> Now, sure we can prevent any actual information leakage by initialising 
> the bounce buffer slot with zeros, but then we're just corrupting the 
> not-written-to parts of the mapping with zeros instead of anyone else's 
> old data. That's still fundamentally not OK. The only thing SWIOTLB can 
> do to be correct is treat DMA_FROM_DEVICE as a read-modify-write of the 
> entire mapping, because it has no way to know how much of it is actually 
> going to be modified.
> 

Very nice explanation! Thanks!

> I'll admit I still never quite grasped the reason for also adding the 
> override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I 
> think by that point we were increasingly tired and confused and starting 
> to second-guess ourselves (well, I was, at least).

I raised the question of whether we need to do the same for
swiotlb_sync_single_for_device(), based on my understanding of the
DMA API documentation. I had the following scenario in mind:

SWIOTLB without the sync_single:
                                  Memory      Bounce buffer    Owner
---------------------------------------------------------------------
start                             12345678                     C
dma_map(DMA_FROM_DEVICE)          12345678 -> 12345678         C->D
device writes partial data        12345678    12ABC678 <- ABC  D
sync_for_cpu(DMA_FROM_DEVICE)     12ABC678 <- 12ABC678         D->C
cpu modifies buffer               62ABC678                     C
sync_for_device(DMA_FROM_DEVICE)  62ABC678                     C->D
device writes partial data        62ABC678    1EFGC678 <- EFG  D
dma_unmap(DMA_FROM_DEVICE)        1EFGC678 <- 1EFGC678         D->C

Legend: in the Owner column, C stands for cpu and D for device.

Without swiotlb, I believe we should have arrived at 6EFGC678. To get the
same result, IMHO, we need to do a sync in sync_for_device().
And aa6f8dcbab47 is an imperfect solution to that (because of size).
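A tiny standalone C model of that scenario (illustration only; memcpy()
stands in for SWIOTLB's bounce-buffer copies, and the byte values follow
the table above):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char mem[9]    = "12345678";	/* original buffer */
	char bounce[9] = "12345678";	/* filled by dma_map(DMA_FROM_DEVICE) */

	memcpy(bounce + 2, "ABC", 3);	/* device writes partial data */
	memcpy(mem, bounce, 8);		/* sync_for_cpu: bounce -> mem */
	mem[0] = '6';			/* cpu modifies the buffer */
	memcpy(bounce, mem, 8);		/* the sync in sync_for_device
					 * (aa6f8dcbab47); dropping this line
					 * reproduces the lost update,
					 * 1EFGC678, from the table */
	memcpy(bounce + 1, "EFG", 3);	/* device writes partial data again */
	memcpy(mem, bounce, 8);		/* dma_unmap: bounce -> mem */

	printf("%s\n", mem);		/* 6EFGC678, matching the
					 * no-swiotlb result */
	return 0;
}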


> I don't think it's 
> wrong per se, but as I said I do think it can bite anyone who's been 
> doing dma_sync_*() wrong but getting away with it until now. 

I fully agree.

> If 
> ddbd89deb7d3 alone turns out to work OK then I'd be inclined to try a 
> partial revert of just that one hunk.
>

I'm not against being pragmatic and doing the partial revert. But as
explained above, I do believe for correctness of swiotlb we ultimately
do need that change. So if the revert is the short term solution,
what should be our mid-term road-map?

Regards,
Halil
 
> Thanks,
> Robin.


Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-24 Thread Toke Høiland-Jørgensen via iommu
Robin Murphy  writes:

> On 2022-03-24 16:31, Christoph Hellwig wrote:
>> On Thu, Mar 24, 2022 at 05:29:12PM +0100, Maxime Bizon wrote:
>>>> I'm looking into this; but in the interest of a speedy resolution of
>>>> the regression I would be in favour of merging that partial revert
>>>> and reinstating it if/when we identify (and fix) any bugs in ath9k :)
>>>
>>> This looks fishy:
>>>
>>> ath9k/recv.c
>>>
>>>  /* We will now give hardware our shiny new allocated skb */
>>>  new_buf_addr = dma_map_single(sc->dev, requeue_skb->data,
>>>common->rx_bufsize, 
>>> dma_type);
>>>  if (unlikely(dma_mapping_error(sc->dev, new_buf_addr))) {
>>>  dev_kfree_skb_any(requeue_skb);
>>>  goto requeue_drop_frag;
>>>  }
>>>
>>>  /* Unmap the frame */
>>>  dma_unmap_single(sc->dev, bf->bf_buf_addr,
>>>   common->rx_bufsize, dma_type);
>>>
>>>  bf->bf_mpdu = requeue_skb;
>>>  bf->bf_buf_addr = new_buf_addr;
>> 
>> Creating a new mapping for the same buffer before unmapping the
>> previous one does look rather bogus.  But it does not fit the
>> pattern where reverting the sync_single changes makes the driver
>> work again.
>
> OK, you made me look :)
>
> Now that it's obvious what to look for, I can only conclude that during 
> the stanza in ath_edma_get_buffers(), the device is still writing to the 
> buffer while ownership has been transferred to the CPU, and whatever got 
> written while ath9k_hw_process_rxdesc_edma() was running then gets wiped 
> out by the subsequent sync_for_device, which currently resets the 
> SWIOTLB slot to the state that sync_for_cpu copied out. By the letter of 
> the DMA API that's not allowed, but on the other hand I'm not sure if we 
> even have a good idiom for "I can't tell if the device has finished with 
> this buffer or not unless I look at it" :/

Right, but is that sync_for_device call really needed? AFAICT, that
ath9k_hw_process_rxdesc_edma() invocation doesn't actually modify any of
the data when it returns EINPROGRESS, so could we just skip it? Like
the patch below? Or am I misunderstanding the semantics here?

-Toke


diff --git a/drivers/net/wireless/ath/ath9k/recv.c b/drivers/net/wireless/ath/ath9k/recv.c
index 0c0624a3b40d..19244d4c0ada 100644
--- a/drivers/net/wireless/ath/ath9k/recv.c
+++ b/drivers/net/wireless/ath/ath9k/recv.c
@@ -647,12 +647,8 @@ static bool ath_edma_get_buffers(struct ath_softc *sc,
common->rx_bufsize, DMA_FROM_DEVICE);
 
ret = ath9k_hw_process_rxdesc_edma(ah, rs, skb->data);
-   if (ret == -EINPROGRESS) {
-   /*let device gain the buffer again*/
-   dma_sync_single_for_device(sc->dev, bf->bf_buf_addr,
-   common->rx_bufsize, DMA_FROM_DEVICE);
+   if (ret == -EINPROGRESS)
return false;
-   }
 
	__skb_unlink(skb, &rx_edma->rx_fifo);
if (ret == -EINVAL) {


Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-24 Thread Robin Murphy

On 2022-03-24 16:31, Christoph Hellwig wrote:

> On Thu, Mar 24, 2022 at 05:29:12PM +0100, Maxime Bizon wrote:
>>> I'm looking into this; but in the interest of a speedy resolution of
>>> the regression I would be in favour of merging that partial revert
>>> and reinstating it if/when we identify (and fix) any bugs in ath9k :)
>>
>> This looks fishy:
>>
>> ath9k/recv.c
>>
>>         /* We will now give hardware our shiny new allocated skb */
>>         new_buf_addr = dma_map_single(sc->dev, requeue_skb->data,
>>                                       common->rx_bufsize, dma_type);
>>         if (unlikely(dma_mapping_error(sc->dev, new_buf_addr))) {
>>                 dev_kfree_skb_any(requeue_skb);
>>                 goto requeue_drop_frag;
>>         }
>>
>>         /* Unmap the frame */
>>         dma_unmap_single(sc->dev, bf->bf_buf_addr,
>>                          common->rx_bufsize, dma_type);
>>
>>         bf->bf_mpdu = requeue_skb;
>>         bf->bf_buf_addr = new_buf_addr;
>
> Creating a new mapping for the same buffer before unmapping the
> previous one does look rather bogus.  But it does not fit the
> pattern where reverting the sync_single changes makes the driver
> work again.


OK, you made me look :)

Now that it's obvious what to look for, I can only conclude that during 
the stanza in ath_edma_get_buffers(), the device is still writing to the 
buffer while ownership has been transferred to the CPU, and whatever got 
written while ath9k_hw_process_rxdesc_edma() was running then gets wiped 
out by the subsequent sync_for_device, which currently resets the 
SWIOTLB slot to the state that sync_for_cpu copied out. By the letter of 
the DMA API that's not allowed, but on the other hand I'm not sure if we 
even have a good idiom for "I can't tell if the device has finished with 
this buffer or not unless I look at it" :/


Robin.


Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-24 Thread Maxime Bizon

On Thu, 2022-03-24 at 15:27 +0100, Toke Høiland-Jørgensen wrote:

> 
> I'm looking into this; but in the interest of a speedy resolution of
> the regression I would be in favour of merging that partial revert
> and reinstating it if/when we identify (and fix) any bugs in ath9k :)

This looks fishy:

ath9k/recv.c

/* We will now give hardware our shiny new allocated skb */
new_buf_addr = dma_map_single(sc->dev, requeue_skb->data,
  common->rx_bufsize, dma_type);
if (unlikely(dma_mapping_error(sc->dev, new_buf_addr))) {
dev_kfree_skb_any(requeue_skb);
goto requeue_drop_frag;
}

/* Unmap the frame */
dma_unmap_single(sc->dev, bf->bf_buf_addr,
 common->rx_bufsize, dma_type);

bf->bf_mpdu = requeue_skb;
bf->bf_buf_addr = new_buf_addr;

-- 
Maxime




Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-24 Thread Christoph Hellwig
On Thu, Mar 24, 2022 at 05:29:12PM +0100, Maxime Bizon wrote:
> > I'm looking into this; but in the interest of a speedy resolution of
> > the regression I would be in favour of merging that partial revert
> > and reinstating it if/when we identify (and fix) any bugs in ath9k :)
> 
> This looks fishy:
> 
> ath9k/recv.c
> 
> /* We will now give hardware our shiny new allocated skb */
> new_buf_addr = dma_map_single(sc->dev, requeue_skb->data,
>   common->rx_bufsize, dma_type);
> if (unlikely(dma_mapping_error(sc->dev, new_buf_addr))) {
> dev_kfree_skb_any(requeue_skb);
> goto requeue_drop_frag;
> }
> 
> /* Unmap the frame */
> dma_unmap_single(sc->dev, bf->bf_buf_addr,
>  common->rx_bufsize, dma_type);
> 
> bf->bf_mpdu = requeue_skb;
> bf->bf_buf_addr = new_buf_addr;

Creating a new mapping for the same buffer before unmapping the
previous one does look rather bogus.  But it does not fit the
pattern where reverting the sync_single changes makes the driver
work again.


[PATCH v3 2/2] PCI: hv: Propagate coherence from VMbus device to PCI device

2022-03-24 Thread Michael Kelley via iommu
PCI pass-thru devices in a Hyper-V VM are represented as a VMBus
device and as a PCI device.  The coherence of the VMbus device is
set based on the VMbus node in ACPI, but the PCI device has no
ACPI node and defaults to not hardware coherent.  This results
in extra software coherence management overhead on ARM64 when
devices are hardware coherent.

Fix this by setting up the PCI host bus so that normal
PCI mechanisms will propagate the coherence of the VMbus
device to the PCI device. There's no effect on x86/x64 where
devices are always hardware coherent.

Signed-off-by: Michael Kelley 
Acked-by: Boqun Feng 
Acked-by: Robin Murphy 
---
 drivers/pci/controller/pci-hyperv.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index ae0bc2f..88b3b56 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3404,6 +3404,15 @@ static int hv_pci_probe(struct hv_device *hdev,
hbus->bridge->domain_nr = dom;
 #ifdef CONFIG_X86
hbus->sysdata.domain = dom;
+#elif defined(CONFIG_ARM64)
+   /*
+* Set the PCI bus parent to be the corresponding VMbus
+* device. Then the VMbus device will be assigned as the
+* ACPI companion in pcibios_root_bridge_prepare() and
+* pci_dma_configure() will propagate device coherence
+* information to devices created on the bus.
+*/
+   hbus->sysdata.parent = hdev->device.parent;
 #endif
 
hbus->hdev = hdev;
-- 
1.8.3.1



[PATCH v3 1/2] Drivers: hv: vmbus: Propagate VMbus coherence to each VMbus device

2022-03-24 Thread Michael Kelley via iommu
VMbus synthetic devices are not represented in the ACPI DSDT -- only
the top level VMbus device is represented. As a result, on ARM64
coherence information in the _CCA method is not specified for
synthetic devices, so they default to not hardware coherent.
Drivers for some of these synthetic devices have been recently
updated to use the standard DMA APIs, and they are incurring extra
overhead of unneeded software coherence management.

Fix this by propagating coherence information from the VMbus node
in ACPI to the individual synthetic devices. There's no effect on
x86/x64 where devices are always hardware coherent.

Signed-off-by: Michael Kelley 
Acked-by: Robin Murphy 
---
 drivers/hv/hv_common.c | 11 +++
 drivers/hv/vmbus_drv.c | 31 +++
 include/asm-generic/mshyperv.h |  1 +
 3 files changed, 43 insertions(+)

diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 181d16b..820e814 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include <linux/dma-map-ops.h>
 #include 
 #include 
 
@@ -216,6 +217,16 @@ bool hv_query_ext_cap(u64 cap_query)
 }
 EXPORT_SYMBOL_GPL(hv_query_ext_cap);
 
+void hv_setup_dma_ops(struct device *dev, bool coherent)
+{
+   /*
+* Hyper-V does not offer a vIOMMU in the guest
+* VM, so pass 0/NULL for the IOMMU settings
+*/
+   arch_setup_dma_ops(dev, 0, 0, NULL, coherent);
+}
+EXPORT_SYMBOL_GPL(hv_setup_dma_ops);
+
 bool hv_is_hibernation_supported(void)
 {
return !hv_root_partition && acpi_sleep_state_supported(ACPI_STATE_S4);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 12a2b37..5c3b29a 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -921,6 +921,21 @@ static int vmbus_probe(struct device *child_device)
 }
 
 /*
+ * vmbus_dma_configure -- Configure DMA coherence for VMbus device
+ */
+static int vmbus_dma_configure(struct device *child_device)
+{
+   /*
+* On ARM64, propagate the DMA coherence setting from the top level
+* VMbus ACPI device to the child VMbus device being added here.
+* On x86/x64 coherence is assumed and these calls have no effect.
+*/
+   hv_setup_dma_ops(child_device,
+   device_get_dma_attr(&hv_acpi_dev->dev) == DEV_DMA_COHERENT);
+   return 0;
+}
+
+/*
  * vmbus_remove - Remove a vmbus device
  */
 static void vmbus_remove(struct device *child_device)
@@ -1040,6 +1055,7 @@ static void vmbus_device_release(struct device *device)
.remove =   vmbus_remove,
.probe =vmbus_probe,
.uevent =   vmbus_uevent,
+   .dma_configure =vmbus_dma_configure,
.dev_groups =   vmbus_dev_groups,
.drv_groups =   vmbus_drv_groups,
.bus_groups =   vmbus_bus_groups,
@@ -2428,6 +2444,21 @@ static int vmbus_acpi_add(struct acpi_device *device)
 
hv_acpi_dev = device;
 
+   /*
+* Older versions of Hyper-V for ARM64 fail to include the _CCA
+* method on the top level VMbus device in the DSDT. But devices
+* are hardware coherent in all current Hyper-V use cases, so fix
+* up the ACPI device to behave as if _CCA is present and indicates
+* hardware coherence.
+*/
+   ACPI_COMPANION_SET(&device->dev, device);
+   if (IS_ENABLED(CONFIG_ACPI_CCA_REQUIRED) &&
+   device_get_dma_attr(&device->dev) == DEV_DMA_NOT_SUPPORTED) {
+   pr_info("No ACPI _CCA found; assuming coherent device I/O\n");
+   device->flags.cca_seen = true;
+   device->flags.coherent_dma = true;
+   }
+
result = acpi_walk_resources(device->handle, METHOD_NAME__CRS,
vmbus_walk_resources, NULL);
 
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index c08758b..c05d2ce 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -269,6 +269,7 @@ static inline int cpumask_to_vpset_noself(struct hv_vpset 
*vpset,
 u64 hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size);
 void hyperv_cleanup(void);
 bool hv_query_ext_cap(u64 cap_query);
+void hv_setup_dma_ops(struct device *dev, bool coherent);
 void *hv_map_memory(void *addr, unsigned long size);
 void hv_unmap_memory(void *addr);
 #else /* CONFIG_HYPERV */
-- 
1.8.3.1



[PATCH v3 0/2] Fix coherence for VMbus and PCI pass-thru devices in Hyper-V VM

2022-03-24 Thread Michael Kelley via iommu
Hyper-V VMs have VMbus synthetic devices and PCI pass-thru devices that
are added dynamically via the VMbus protocol and are not represented in
the ACPI DSDT. Only the top level VMbus node exists in the DSDT. As such,
on ARM64 these devices don't pick up coherence information and default to
not hardware coherent. This results in extra software coherence
management overhead since the synthetic devices are always hardware
coherent. PCI pass-thru devices are also hardware coherent in all current
usage scenarios.

Fix this by propagating coherence information from the top level VMbus node in
the DSDT to all VMbus synthetic devices and PCI pass-thru devices. While
finer-grained control would be better, basing this on the VMbus node in
the DSDT gives an escape path if a future scenario arises with devices
that are not hardware coherent.

Changes since v2:
* Move coherence propagation for VMbus synthetic devices to a separate
  .dma_configure function instead of the .probe function [Robin Murphy]

Changes since v1:
* Use device_get_dma_attr() instead of acpi_get_dma_attr(), eliminating the
  need to export acpi_get_dma_attr() [Robin Murphy]
* Use arch_setup_dma_ops() to set device coherence [Robin Murphy]
* Move handling of missing _CCA to vmbus_acpi_add() so it is only done once
* Rework handling of PCI devices so existing code in pci_dma_configure()
  just works

Michael Kelley (2):
  Drivers: hv: vmbus: Propagate VMbus coherence to each VMbus device
  PCI: hv: Propagate coherence from VMbus device to PCI device

 drivers/hv/hv_common.c  | 11 +++
 drivers/hv/vmbus_drv.c  | 31 +++
 drivers/pci/controller/pci-hyperv.c |  9 +
 include/asm-generic/mshyperv.h  |  1 +
 4 files changed, 52 insertions(+)

-- 
1.8.3.1



Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-24 Thread Toke Høiland-Jørgensen via iommu
Robin Murphy  writes:

> On 2022-03-24 10:25, Oleksandr Natalenko wrote:
>> Hello.
>> 
>> On Thursday 24 March 2022 6:57:32 CET Christoph Hellwig wrote:
>>> On Wed, Mar 23, 2022 at 08:54:08PM +0000, Robin Murphy wrote:
 I'll admit I still never quite grasped the reason for also adding the
 override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I think
 by that point we were increasingly tired and confused and starting to
 second-guess ourselves (well, I was, at least). I don't think it's wrong
 per se, but as I said I do think it can bite anyone who's been doing
 dma_sync_*() wrong but getting away with it until now. If ddbd89deb7d3
 alone turns out to work OK then I'd be inclined to try a partial revert of
 just that one hunk.
>>>
>>> Agreed.  Let's try that first.
>>>
>>> Oleksandr, can you try the patch below:
>>>
>>>
>>> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
>>> index 6db1c475ec827..6c350555e5a1c 100644
>>> --- a/kernel/dma/swiotlb.c
>>> +++ b/kernel/dma/swiotlb.c
>>> @@ -701,13 +701,10 @@ void swiotlb_tbl_unmap_single(struct device *dev, 
>>> phys_addr_t tlb_addr,
>>>   void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t 
>>> tlb_addr,
>>> size_t size, enum dma_data_direction dir)
>>>   {
>>> -   /*
>>> -* Unconditional bounce is necessary to avoid corruption on
>>> -* sync_*_for_cpu or dma_ummap_* when the device didn't overwrite
>>> -* the whole lengt of the bounce buffer.
>>> -*/
>>> -   swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
>>> -   BUG_ON(!valid_dma_direction(dir));
>>> +   if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
>>> +   swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
>>> +   else
>>> +   BUG_ON(dir != DMA_FROM_DEVICE);
>>>   }
>>>   
>>>   void swiotlb_sync_single_for_cpu(struct device *dev, phys_addr_t tlb_addr,
>>>
>> 
>> With this patch the AP works for me.
>
> Cool, thanks for confirming. So I think ath9k probably is doing 
> something dodgy with dma_sync_*(), but if Linus prefers to make the 
> above change rather than wait for that to get figured out, I believe 
> that should be fine.

I'm looking into this; but in the interest of a speedy resolution of the
regression I would be in favour of merging that partial revert and
reinstating it if/when we identify (and fix) any bugs in ath9k :)

-Toke

Re: [PATCH v2 1/2] Drivers: hv: vmbus: Propagate VMbus coherence to each VMbus device

2022-03-24 Thread Robin Murphy

On 2022-03-24 13:18, Michael Kelley (LINUX) wrote:

From: Robin Murphy  Sent: Thursday, March 24, 2022 4:59 AM


On 2022-03-23 20:31, Michael Kelley wrote:

VMbus synthetic devices are not represented in the ACPI DSDT -- only
the top level VMbus device is represented. As a result, on ARM64
coherence information in the _CCA method is not specified for
synthetic devices, so they default to not hardware coherent.
Drivers for some of these synthetic devices have been recently
updated to use the standard DMA APIs, and they are incurring extra
overhead of unneeded software coherence management.

Fix this by propagating coherence information from the VMbus node
in ACPI to the individual synthetic devices. There's no effect on
x86/x64 where devices are always hardware coherent.

Signed-off-by: Michael Kelley 
---
   drivers/hv/hv_common.c | 11 +++
   drivers/hv/vmbus_drv.c | 23 +++
   include/asm-generic/mshyperv.h |  1 +
   3 files changed, 35 insertions(+)

diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 181d16b..820e814 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -20,6 +20,7 @@
   #include 
   #include 
   #include 
+#include 
   #include 
   #include 

@@ -216,6 +217,16 @@ bool hv_query_ext_cap(u64 cap_query)
   }
   EXPORT_SYMBOL_GPL(hv_query_ext_cap);

+void hv_setup_dma_ops(struct device *dev, bool coherent)
+{
+   /*
+* Hyper-V does not offer a vIOMMU in the guest
+* VM, so pass 0/NULL for the IOMMU settings
+*/
+   arch_setup_dma_ops(dev, 0, 0, NULL, coherent);
+}
+EXPORT_SYMBOL_GPL(hv_setup_dma_ops);
+
   bool hv_is_hibernation_supported(void)
   {
return !hv_root_partition && acpi_sleep_state_supported(ACPI_STATE_S4);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 12a2b37..2d2c54c 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -905,6 +905,14 @@ static int vmbus_probe(struct device *child_device)
struct hv_device *dev = device_to_hv_device(child_device);
const struct hv_vmbus_device_id *dev_id;

+   /*
+* On ARM64, propagate the DMA coherence setting from the top level
+* VMbus ACPI device to the child VMbus device being added here.
+* On x86/x64 coherence is assumed and these calls have no effect.
+*/
+   hv_setup_dma_ops(child_device,
+   device_get_dma_attr(&hv_acpi_dev->dev) == DEV_DMA_COHERENT);


Would you mind hooking up the hv_bus.dma_configure method to do this?
Feel free to fold hv_setup_dma_ops entirely into that if you're not
likely to need to call it from anywhere else.


I'm pretty sure using hv_bus.dma_configure() is doable.  A separate
hv_setup_dma_ops() is still needed because arch_setup_dma_ops() isn't
exported and this VMbus driver can be built as a module.


Ah, right you are, I keep forgetting that.


+
dev_id = hv_vmbus_get_id(drv, dev);
if (drv->probe) {
ret = drv->probe(dev, dev_id);
@@ -2428,6 +2436,21 @@ static int vmbus_acpi_add(struct acpi_device *device)

hv_acpi_dev = device;

+   /*
+* Older versions of Hyper-V for ARM64 fail to include the _CCA
+* method on the top level VMbus device in the DSDT. But devices
+* are hardware coherent in all current Hyper-V use cases, so fix
+* up the ACPI device to behave as if _CCA is present and indicates
+* hardware coherence.
+*/
+   ACPI_COMPANION_SET(&device->dev, device);
+   if (IS_ENABLED(CONFIG_ACPI_CCA_REQUIRED) &&
+   device_get_dma_attr(&device->dev) == DEV_DMA_NOT_SUPPORTED) {
+   pr_info("No ACPI _CCA found; assuming coherent device I/O\n");
+   device->flags.cca_seen = true;
+   device->flags.coherent_dma = true;
+   }


I'm not the biggest fan of this, especially since I'm not convinced that
there are any out-of-support deployments of ARM64 Hyper-V that can't be
updated. However I suppose it's not "real" firmware, and one Hyper-V
component is at liberty to hack another Hyper-V component's data if it
really wants to...


Agreed, it's a hack.  But Hyper-V instances are out there as part of
Windows 10/11 on ARM64 PCs, and they run ARM64 VMs for the
Windows Subsystem for Linux.  Microsoft gets pilloried for breaking
stuff, and this removes the potential for that happening if someone
runs a new Linux kernel version in that VM.


And actually that one's on me as well - for some reason I was thinking 
that this had never worked, and therefore you could likely get a Hyper-V 
update pushed out long before users get this patch through distros, but 
of course it only becomes an issue now because previously there was no 
connection to any ACPI node at all. As I said, personally I'm happy to 
consider this a Hyper-V internal workaround, but if anyone else objects 
to poking at the ACPI flags, I suppose you've also got the fallback 
option of flipping it around and making the 

Re: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-24 Thread Jason Gunthorpe via iommu
On Thu, Mar 24, 2022 at 07:25:03AM +0000, Tian, Kevin wrote:

> Based on that here is a quick tweak of the force-snoop part (not compiled).

I liked your previous idea better, that IOMMU_CAP_CACHE_COHERENCY
started out OK but got weird. So let's fix it back to the way it was.

How about this:

https://github.com/jgunthorpe/linux/commits/intel_no_snoop

b11c19a4b34c2a iommu: Move the Intel no-snoop control off of IOMMU_CACHE
5263947f9d5f36 vfio: Require that device support DMA cache coherence
eab4b381c64a30 iommu: Restore IOMMU_CAP_CACHE_COHERENCY to its original meaning
2752e12bed48f6 iommu: Replace uses of IOMMU_CAP_CACHE_COHERENCY with 
dev_is_dma_coherent()

If you like it could you take it from here?

Thanks,
Jason


RE: [PATCH v2 1/2] Drivers: hv: vmbus: Propagate VMbus coherence to each VMbus device

2022-03-24 Thread Michael Kelley (LINUX) via iommu
From: Robin Murphy  Sent: Thursday, March 24, 2022 4:59 AM
> 
> On 2022-03-23 20:31, Michael Kelley wrote:
> > VMbus synthetic devices are not represented in the ACPI DSDT -- only
> > the top level VMbus device is represented. As a result, on ARM64
> > coherence information in the _CCA method is not specified for
> > synthetic devices, so they default to not hardware coherent.
> > Drivers for some of these synthetic devices have been recently
> > updated to use the standard DMA APIs, and they are incurring extra
> > overhead of unneeded software coherence management.
> >
> > Fix this by propagating coherence information from the VMbus node
> > in ACPI to the individual synthetic devices. There's no effect on
> > x86/x64 where devices are always hardware coherent.
> >
> > Signed-off-by: Michael Kelley 
> > ---
> >   drivers/hv/hv_common.c | 11 +++
> >   drivers/hv/vmbus_drv.c | 23 +++
> >   include/asm-generic/mshyperv.h |  1 +
> >   3 files changed, 35 insertions(+)
> >
> > diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> > index 181d16b..820e814 100644
> > --- a/drivers/hv/hv_common.c
> > +++ b/drivers/hv/hv_common.c
> > @@ -20,6 +20,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   #include 
> >   #include 
> >
> > @@ -216,6 +217,16 @@ bool hv_query_ext_cap(u64 cap_query)
> >   }
> >   EXPORT_SYMBOL_GPL(hv_query_ext_cap);
> >
> > +void hv_setup_dma_ops(struct device *dev, bool coherent)
> > +{
> > +   /*
> > +* Hyper-V does not offer a vIOMMU in the guest
> > +* VM, so pass 0/NULL for the IOMMU settings
> > +*/
> > +   arch_setup_dma_ops(dev, 0, 0, NULL, coherent);
> > +}
> > +EXPORT_SYMBOL_GPL(hv_setup_dma_ops);
> > +
> >   bool hv_is_hibernation_supported(void)
> >   {
> > return !hv_root_partition && acpi_sleep_state_supported(ACPI_STATE_S4);
> > diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> > index 12a2b37..2d2c54c 100644
> > --- a/drivers/hv/vmbus_drv.c
> > +++ b/drivers/hv/vmbus_drv.c
> > @@ -905,6 +905,14 @@ static int vmbus_probe(struct device *child_device)
> > struct hv_device *dev = device_to_hv_device(child_device);
> > const struct hv_vmbus_device_id *dev_id;
> >
> > +   /*
> > +* On ARM64, propagate the DMA coherence setting from the top level
> > +* VMbus ACPI device to the child VMbus device being added here.
> > +* On x86/x64 coherence is assumed and these calls have no effect.
> > +*/
> > +   hv_setup_dma_ops(child_device,
> > +   device_get_dma_attr(&hv_acpi_dev->dev) == DEV_DMA_COHERENT);
> 
> Would you mind hooking up the hv_bus.dma_configure method to do this?
> Feel free to fold hv_setup_dma_ops entirely into that if you're not
> likely to need to call it from anywhere else.

I'm pretty sure using hv_bus.dma_configure() is doable.  A separate
hv_setup_dma_ops() is still needed because arch_setup_dma_ops() isn't
exported and this VMbus driver can be built as a module.

> 
> > +
> > dev_id = hv_vmbus_get_id(drv, dev);
> > if (drv->probe) {
> > ret = drv->probe(dev, dev_id);
> > @@ -2428,6 +2436,21 @@ static int vmbus_acpi_add(struct acpi_device *device)
> >
> > hv_acpi_dev = device;
> >
> > +   /*
> > +* Older versions of Hyper-V for ARM64 fail to include the _CCA
> > +* method on the top level VMbus device in the DSDT. But devices
> > +* are hardware coherent in all current Hyper-V use cases, so fix
> > +* up the ACPI device to behave as if _CCA is present and indicates
> > +* hardware coherence.
> > +*/
> > +   ACPI_COMPANION_SET(&device->dev, device);
> > +   if (IS_ENABLED(CONFIG_ACPI_CCA_REQUIRED) &&
> > +   device_get_dma_attr(&device->dev) == DEV_DMA_NOT_SUPPORTED) {
> > +   pr_info("No ACPI _CCA found; assuming coherent device I/O\n");
> > +   device->flags.cca_seen = true;
> > +   device->flags.coherent_dma = true;
> > +   }
> 
> I'm not the biggest fan of this, especially since I'm not convinced that
> there are any out-of-support deployments of ARM64 Hyper-V that can't be
> updated. However I suppose it's not "real" firmware, and one Hyper-V
> component is at liberty to hack another Hyper-V component's data if it
> really wants to...

Agreed, it's a hack.  But Hyper-V instances are out there as part of
Windows 10/11 on ARM64 PCs, and they run ARM64 VMs for the
Windows Subsystem for Linux.  Microsoft gets pilloried for breaking
stuff, and this removes the potential for that happening if someone
runs a new Linux kernel version in that VM.

Michael

> 
> If you can hook up .dma_configure, or clarify if it wouldn't work,
> 
> Acked-by: Robin Murphy 
> 
> Cheers,
> Robin.
> 
> > +
> > result = acpi_walk_resources(device->handle, METHOD_NAME__CRS,
> > vmbus_walk_resources, NULL);
> >
> > diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> > index c08758b..c05d2ce 100644
> > --- a/include/asm-generic/mshyperv.h
> 

Re: [PATCH RFC 07/12] iommufd: Data structure to provide IOVA to PFN mapping

2022-03-24 Thread Jason Gunthorpe via iommu
On Thu, Mar 24, 2022 at 03:09:46AM +0000, Tian, Kevin wrote:
> + /*
> +  * Can't cross areas that are not aligned to the system page
> +  * size with this API.
> +  */
> + if (cur_iova % PAGE_SIZE) {
> + rc = -EINVAL;
> + goto out_remove;
> + }
> 
> Currently it's done after iopt_pages_add_user() but before cur_iova 
> is adjusted, which implies the last add_user() will not be reverted in
> case of failed check here.

Oh, yes that's right too..

The above is even wrong as-is; it didn't get fixed when page_offset was
added.

So more like this:

diff --git a/drivers/iommu/iommufd/io_pagetable.c 
b/drivers/iommu/iommufd/io_pagetable.c
index 1c08ae9b848fcf..9505f119df982e 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -23,7 +23,7 @@ static unsigned long iopt_area_iova_to_index(struct iopt_area 
*area,
if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
WARN_ON(iova < iopt_area_iova(area) ||
iova > iopt_area_last_iova(area));
-   return (iova - (iopt_area_iova(area) & PAGE_MASK)) / PAGE_SIZE;
+   return (iova - (iopt_area_iova(area) - area->page_offset)) / PAGE_SIZE;
 }
 
 static struct iopt_area *iopt_find_exact_area(struct io_pagetable *iopt,
@@ -436,31 +436,45 @@ int iopt_access_pages(struct io_pagetable *iopt, unsigned 
long iova,
unsigned long index;
 
/* Need contiguous areas in the access */
-   if (iopt_area_iova(area) < cur_iova || !area->pages) {
+   if (iopt_area_iova(area) > cur_iova || !area->pages) {
rc = -EINVAL;
goto out_remove;
}
 
index = iopt_area_iova_to_index(area, cur_iova);
last_index = iopt_area_iova_to_index(area, last);
+
+   /*
+* The API can only return aligned pages, so the starting point
+* must be at a page boundary.
+*/
+   if ((cur_iova - (iopt_area_iova(area) - area->page_offset)) %
+   PAGE_SIZE) {
+   rc = -EINVAL;
+   goto out_remove;
+   }
+
+   /*
+* and an interior ending point must be at a page boundary
+*/
+   if (last != last_iova &&
+   (iopt_area_last_iova(area) - cur_iova + 1) % PAGE_SIZE) {
+   rc = -EINVAL;
+   goto out_remove;
+   }
+
rc = iopt_pages_add_user(area->pages, index, last_index,
 out_pages, write);
if (rc)
goto out_remove;
if (last == last_iova)
break;
-   /*
-* Can't cross areas that are not aligned to the system page
-* size with this API.
-*/
-   if (cur_iova % PAGE_SIZE) {
-   rc = -EINVAL;
-   goto out_remove;
-   }
cur_iova = last + 1;
out_pages += last_index - index;
atomic_inc(&area->num_users);
}
+   if (cur_iova != last_iova) {
+   rc = -EINVAL;
+   goto out_remove;
+   }
 
up_read(&iopt->iova_rwsem);
return 0;
diff --git a/tools/testing/selftests/iommu/iommufd.c 
b/tools/testing/selftests/iommu/iommufd.c
index 5c47d706ed9449..a46e0f0ae82553 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -1221,5 +1221,6 @@ TEST_F(vfio_compat_mock_domain, get_info)
 /* FIXME test VFIO_IOMMU_MAP_DMA */
 /* FIXME test VFIO_IOMMU_UNMAP_DMA */
 /* FIXME test 2k iova alignment */
+/* FIXME cover boundary cases for iopt_access_pages()  */
 
 TEST_HARNESS_MAIN
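
For reference, a tiny worked example of the alignment rule the two new
checks enforce (hypothetical numbers, assuming PAGE_SIZE == 4096; not
part of the patch):

  area: iopt_area_iova() == 0x1200, page_offset == 0x200
-> the backing pages start at IOVA 0x1000

  access at cur_iova == 0x2000:
(0x2000 - 0x1000) % 4096 == 0  -> starts on a page boundary, allowed
  access at cur_iova == 0x2200:
(0x2200 - 0x1000) % 4096 == 0x200  -> misaligned, -EINVAL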


Re: [PATCH v2 2/2] PCI: hv: Propagate coherence from VMbus device to PCI device

2022-03-24 Thread Robin Murphy

On 2022-03-24 12:23, Robin Murphy wrote:

On 2022-03-23 20:31, Michael Kelley wrote:

PCI pass-thru devices in a Hyper-V VM are represented as a VMBus
device and as a PCI device.  The coherence of the VMbus device is
set based on the VMbus node in ACPI, but the PCI device has no
ACPI node and defaults to not hardware coherent.  This results
in extra software coherence management overhead on ARM64 when
devices are hardware coherent.

Fix this by setting up the PCI host bus so that normal
PCI mechanisms will propagate the coherence of the VMbus
device to the PCI device. There's no effect on x86/x64 where
devices are always hardware coherent.


Honestly, I don't hate this :)

It seems conceptually accurate, as far as I understand, and in 
functional terms I'm starting to think it might even be the most correct 
approach anyway. In the physical world we might be surprised to find the 
PCI side of a host bridge


And of course by "the PCI side of a host bridge" I think I actually mean 
"a PCI root bus", because in my sloppy terminology I'm thinking about 
hardware bridging from PCI(e) to some SoC-internal protocol, which does 
not have to imply an actual PCI-visible Host Bridge device...


Robin.

behind anything other than some platform/ACPI 
device representing the other side of a physical host bridge or root 
complex, but who's to say that a paravirtual world can't present a more 
abstract topology? Either way, a one-line way of tying in to the 
standard flow is hard to turn down.


Acked-by: Robin Murphy 


Signed-off-by: Michael Kelley 
---
  drivers/pci/controller/pci-hyperv.c | 9 +
  1 file changed, 9 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c 
b/drivers/pci/controller/pci-hyperv.c

index ae0bc2f..88b3b56 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3404,6 +3404,15 @@ static int hv_pci_probe(struct hv_device *hdev,
  hbus->bridge->domain_nr = dom;
  #ifdef CONFIG_X86
  hbus->sysdata.domain = dom;
+#elif defined(CONFIG_ARM64)
+    /*
+ * Set the PCI bus parent to be the corresponding VMbus
+ * device. Then the VMbus device will be assigned as the
+ * ACPI companion in pcibios_root_bridge_prepare() and
+ * pci_dma_configure() will propagate device coherence
+ * information to devices created on the bus.
+ */
+    hbus->sysdata.parent = hdev->device.parent;
  #endif
  hbus->hdev = hdev;


Re: [PATCH v2 2/2] PCI: hv: Propagate coherence from VMbus device to PCI device

2022-03-24 Thread Robin Murphy

On 2022-03-23 20:31, Michael Kelley wrote:

PCI pass-thru devices in a Hyper-V VM are represented as a VMBus
device and as a PCI device.  The coherence of the VMbus device is
set based on the VMbus node in ACPI, but the PCI device has no
ACPI node and defaults to not hardware coherent.  This results
in extra software coherence management overhead on ARM64 when
devices are hardware coherent.

Fix this by setting up the PCI host bus so that normal
PCI mechanisms will propagate the coherence of the VMbus
device to the PCI device. There's no effect on x86/x64 where
devices are always hardware coherent.


Honestly, I don't hate this :)

It seems conceptually accurate, as far as I understand, and in 
functional terms I'm starting to think it might even be the most correct 
approach anyway. In the physical world we might be surprised to find the 
PCI side of a host bridge behind anything other than some platform/ACPI 
device representing the other side of a physical host bridge or root 
complex, but who's to say that a paravirtual world can't present a more 
abstract topology? Either way, a one-line way of tying in to the 
standard flow is hard to turn down.


Acked-by: Robin Murphy 


Signed-off-by: Michael Kelley 
---
  drivers/pci/controller/pci-hyperv.c | 9 +
  1 file changed, 9 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c 
b/drivers/pci/controller/pci-hyperv.c
index ae0bc2f..88b3b56 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3404,6 +3404,15 @@ static int hv_pci_probe(struct hv_device *hdev,
hbus->bridge->domain_nr = dom;
  #ifdef CONFIG_X86
hbus->sysdata.domain = dom;
+#elif defined(CONFIG_ARM64)
+   /*
+* Set the PCI bus parent to be the corresponding VMbus
+* device. Then the VMbus device will be assigned as the
+* ACPI companion in pcibios_root_bridge_prepare() and
+* pci_dma_configure() will propagate device coherence
+* information to devices created on the bus.
+*/
+   hbus->sysdata.parent = hdev->device.parent;
  #endif
  
  	hbus->hdev = hdev;



Re: [PATCH v2 1/2] Drivers: hv: vmbus: Propagate VMbus coherence to each VMbus device

2022-03-24 Thread Robin Murphy

On 2022-03-23 20:31, Michael Kelley wrote:

VMbus synthetic devices are not represented in the ACPI DSDT -- only
the top level VMbus device is represented. As a result, on ARM64
coherence information in the _CCA method is not specified for
synthetic devices, so they default to not hardware coherent.
Drivers for some of these synthetic devices have been recently
updated to use the standard DMA APIs, and they are incurring extra
overhead of unneeded software coherence management.

Fix this by propagating coherence information from the VMbus node
in ACPI to the individual synthetic devices. There's no effect on
x86/x64 where devices are always hardware coherent.

Signed-off-by: Michael Kelley 
---
  drivers/hv/hv_common.c | 11 +++
  drivers/hv/vmbus_drv.c | 23 +++
  include/asm-generic/mshyperv.h |  1 +
  3 files changed, 35 insertions(+)

diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 181d16b..820e814 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -20,6 +20,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  
@@ -216,6 +217,16 @@ bool hv_query_ext_cap(u64 cap_query)

  }
  EXPORT_SYMBOL_GPL(hv_query_ext_cap);
  
+void hv_setup_dma_ops(struct device *dev, bool coherent)

+{
+   /*
+* Hyper-V does not offer a vIOMMU in the guest
+* VM, so pass 0/NULL for the IOMMU settings
+*/
+   arch_setup_dma_ops(dev, 0, 0, NULL, coherent);
+}
+EXPORT_SYMBOL_GPL(hv_setup_dma_ops);
+
  bool hv_is_hibernation_supported(void)
  {
return !hv_root_partition && acpi_sleep_state_supported(ACPI_STATE_S4);
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 12a2b37..2d2c54c 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -905,6 +905,14 @@ static int vmbus_probe(struct device *child_device)
struct hv_device *dev = device_to_hv_device(child_device);
const struct hv_vmbus_device_id *dev_id;
  
+	/*

+* On ARM64, propagate the DMA coherence setting from the top level
+* VMbus ACPI device to the child VMbus device being added here.
+* On x86/x64 coherence is assumed and these calls have no effect.
+*/
+   hv_setup_dma_ops(child_device,
+   device_get_dma_attr(&hv_acpi_dev->dev) == DEV_DMA_COHERENT);


Would you mind hooking up the hv_bus.dma_configure method to do this? 
Feel free to fold hv_setup_dma_ops entirely into that if you're not 
likely to need to call it from anywhere else.



+
dev_id = hv_vmbus_get_id(drv, dev);
if (drv->probe) {
ret = drv->probe(dev, dev_id);
@@ -2428,6 +2436,21 @@ static int vmbus_acpi_add(struct acpi_device *device)
  
  	hv_acpi_dev = device;
  
+	/*

+* Older versions of Hyper-V for ARM64 fail to include the _CCA
+* method on the top level VMbus device in the DSDT. But devices
+* are hardware coherent in all current Hyper-V use cases, so fix
+* up the ACPI device to behave as if _CCA is present and indicates
+* hardware coherence.
+*/
+   ACPI_COMPANION_SET(&device->dev, device);
+   if (IS_ENABLED(CONFIG_ACPI_CCA_REQUIRED) &&
+   device_get_dma_attr(&device->dev) == DEV_DMA_NOT_SUPPORTED) {
+   pr_info("No ACPI _CCA found; assuming coherent device I/O\n");
+   device->flags.cca_seen = true;
+   device->flags.coherent_dma = true;
+   }


I'm not the biggest fan of this, especially since I'm not convinced that 
there are any out-of-support deployments of ARM64 Hyper-V that can't be 
updated. However I suppose it's not "real" firmware, and one Hyper-V 
component is at liberty to hack another Hyper-V component's data if it 
really wants to...


If you can hook up .dma_configure, or clarify if it wouldn't work,

Acked-by: Robin Murphy 

Cheers,
Robin.


+
result = acpi_walk_resources(device->handle, METHOD_NAME__CRS,
vmbus_walk_resources, NULL);
  
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h

index c08758b..c05d2ce 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -269,6 +269,7 @@ static inline int cpumask_to_vpset_noself(struct hv_vpset 
*vpset,
  u64 hv_ghcb_hypercall(u64 control, void *input, void *output, u32 input_size);
  void hyperv_cleanup(void);
  bool hv_query_ext_cap(u64 cap_query);
+void hv_setup_dma_ops(struct device *dev, bool coherent);
  void *hv_map_memory(void *addr, unsigned long size);
  void hv_unmap_memory(void *addr);
  #else /* CONFIG_HYPERV */



Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

2022-03-24 Thread Jason Gunthorpe via iommu
On Thu, Mar 24, 2022 at 11:50:47AM +0800, Jason Wang wrote:

> It's simply because we don't want to break existing userspace. [1]

I'm still waiting to hear what exactly breaks in real systems.

As I explained this is not a significant change, but it could break
something in a few special scenarios.

Also the one place we do have ABI breaks is security, and ulimit is a
security mechanism that isn't working right. So we do clearly need to
understand *exactly* what real thing breaks - if anything.

Jason


Re: [PATCH 03/25] dma-direct: take dma-ranges/offsets into account in resource mapping

2022-03-24 Thread Robin Murphy

On 2022-03-24 01:48, Serge Semin wrote:

A basic device-specific linear memory mapping was introduced back in
commit 25f1e1887088 ("dma: Take into account dma_pfn_offset") as a
single-valued offset preserved in the device.dma_pfn_offset field, which
was initialized for instance by means of the "dma-ranges" DT property.
Afterwards the functionality was extended to support more than one
device-specific region defined in the device.dma_range_map list of maps.
But all of these improvements concerned only the single-pointer, page and
sg DMA-mapping methods, while the system resource mapping function turned
out to miss the corresponding modification. Thus the
dma_direct_map_resource() method now just casts the CPU physical address
to the device DMA address with no dma-ranges-based mapping taken into
account, which is obviously wrong. Let's fix it by using the
phys_to_dma_direct() method to get the device-specific bus address from
the passed memory resource for the case of directly mapped DMA.


It may not have been well-documented at the time, but this was largely 
intentional. The assumption based on known systems was that where 
dma_pfn_offset existed, it would *not* apply to peer MMIO addresses.


For instance, DTs for TI Keystone 2 platforms only describe an offset 
for RAM:


dma-ranges = <0x80000000 0x8 0x00000000 0x80000000>;

but a DMA controller might also want to access something in the MMIO 
range 0x0-0x7fffffff, of which it still has an identical non-offset
view. If a driver was previously using dma_map_resource() for that, it 
would now start getting DMA_MAPPING_ERROR because the dma_range_map 
exists but doesn't describe the MMIO region. I agree that in hindsight 
it's not an ideal situation, but it's how things have ended up, so at 
this point I'm wary of making potentially-breaking changes.
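
For illustration, the per-range lookup that phys_to_dma_direct() ends up
applying is roughly the following (paraphrased from translate_phys_to_dma()
in kernel/dma/direct.h; not the literal source):

static inline dma_addr_t translate_phys_to_dma(struct device *dev,
		phys_addr_t paddr)
{
	const struct bus_dma_region *m;

	/* walk the device's "dma-ranges"-derived map */
	for (m = dev->dma_range_map; m->size; m++)
		if (paddr >= m->cpu_start && paddr - m->cpu_start < m->size)
			return (dma_addr_t)paddr - m->cpu_start + m->dma_start;

	/* a peer MMIO address not described by the map would fail here */
	return DMA_MAPPING_ERROR;
}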


May I ask what exactly your setup looks like, if you have a DMA 
controller with an offset view of its "own" MMIO space?


Thanks,
Robin.


Fixes: 25f1e1887088 ("dma: Take into account dma_pfn_offset")
Signed-off-by: Serge Semin 
---
  kernel/dma/direct.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 50f48e9e4598..9ce8192b29ab 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -497,7 +497,7 @@ int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl, int nents,
  dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
size_t size, enum dma_data_direction dir, unsigned long attrs)
  {
-   dma_addr_t dma_addr = paddr;
+   dma_addr_t dma_addr = phys_to_dma_direct(dev, paddr);
  
  	if (unlikely(!dma_capable(dev, dma_addr, size, false))) {

dev_err_once(dev,



Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-24 Thread Robin Murphy

On 2022-03-24 10:25, Oleksandr Natalenko wrote:

Hello.

On Thursday 24 March 2022 6:57:32 CET Christoph Hellwig wrote:

On Wed, Mar 23, 2022 at 08:54:08PM +0000, Robin Murphy wrote:

I'll admit I still never quite grasped the reason for also adding the
override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I think
by that point we were increasingly tired and confused and starting to
second-guess ourselves (well, I was, at least). I don't think it's wrong
per se, but as I said I do think it can bite anyone who's been doing
dma_sync_*() wrong but getting away with it until now. If ddbd89deb7d3
alone turns out to work OK then I'd be inclined to try a partial revert of
just that one hunk.


Agreed.  Let's try that first.

Oleksandr, can you try the patch below:


diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 6db1c475ec827..6c350555e5a1c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -701,13 +701,10 @@ void swiotlb_tbl_unmap_single(struct device *dev, 
phys_addr_t tlb_addr,
  void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
size_t size, enum dma_data_direction dir)
  {
-   /*
-* Unconditional bounce is necessary to avoid corruption on
-* sync_*_for_cpu or dma_ummap_* when the device didn't overwrite
-* the whole lengt of the bounce buffer.
-*/
-   swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
-   BUG_ON(!valid_dma_direction(dir));
+   if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
+   swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
+   else
+   BUG_ON(dir != DMA_FROM_DEVICE);
  }
  
  void swiotlb_sync_single_for_cpu(struct device *dev, phys_addr_t tlb_addr,




With this patch the AP works for me.


Cool, thanks for confirming. So I think ath9k probably is doing 
something dodgy with dma_sync_*(), but if Linus prefers to make the 
above change rather than wait for that to get figured out, I believe 
that should be fine.


The crucial part of the "rework" patch is that we'll unconditionally 
initialise the SWIOTLB bounce slot as it's allocated in 
swiotlb_tbl_map_single(), regardless of DMA_ATTR_SKIP_CPU_SYNC. As long 
as that happens, we're safe in terms of leaking data from previous 
mappings, and any possibility for incorrect sync usage to lose 
newly-written DMA data is at least no worse than it always has been. The 
most confusion was around how the proposed DMA_ATTR_OVERWRITE attribute 
would need to interact with DMA_ATTR_SKIP_CPU_SYNC to remain safe but 
still have any useful advantage, so unless and until anyone wants to 
revisit that, this should remain comparatively simple to reason about.
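
As a rough sketch of that invariant (simplified pseudo-code, not the
actual function; alloc_bounce_slot() is a made-up stand-in for the real
slot allocation):

phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
		size_t mapping_size, ...)
{
	phys_addr_t tlb_addr = alloc_bounce_slot(dev, orig_addr, mapping_size);

	/*
	 * Copy the original buffer into the bounce slot at allocation
	 * time, even for DMA_FROM_DEVICE and DMA_ATTR_SKIP_CPU_SYNC, so
	 * a partial device write can never expose stale slot contents
	 * from a previous mapping.
	 */
	swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_TO_DEVICE);
	return tlb_addr;
}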


Cheers,
Robin.

Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-24 Thread Oleksandr Natalenko via iommu
Hello.

On Thursday 24 March 2022 6:57:32 CET Christoph Hellwig wrote:
> On Wed, Mar 23, 2022 at 08:54:08PM +0000, Robin Murphy wrote:
> > I'll admit I still never quite grasped the reason for also adding the 
> > override to swiotlb_sync_single_for_device() in aa6f8dcbab47, but I think 
> > by that point we were increasingly tired and confused and starting to 
> > second-guess ourselves (well, I was, at least). I don't think it's wrong 
> > per se, but as I said I do think it can bite anyone who's been doing 
> > dma_sync_*() wrong but getting away with it until now. If ddbd89deb7d3 
> > alone turns out to work OK then I'd be inclined to try a partial revert of 
> > just that one hunk.
> 
> Agreed.  Let's try that first.
> 
> Oleksandr, can you try the patch below:
> 
> 
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 6db1c475ec827..6c350555e5a1c 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -701,13 +701,10 @@ void swiotlb_tbl_unmap_single(struct device *dev, 
> phys_addr_t tlb_addr,
>  void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
>   size_t size, enum dma_data_direction dir)
>  {
> - /*
> -  * Unconditional bounce is necessary to avoid corruption on
> -  * sync_*_for_cpu or dma_ummap_* when the device didn't overwrite
> -  * the whole lengt of the bounce buffer.
> -  */
> - swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
> - BUG_ON(!valid_dma_direction(dir));
> + if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
> + swiotlb_bounce(dev, tlb_addr, size, DMA_TO_DEVICE);
> + else
> + BUG_ON(dir != DMA_FROM_DEVICE);
>  }
>  
>  void swiotlb_sync_single_for_cpu(struct device *dev, phys_addr_t tlb_addr,
> 

With this patch the AP works for me.

Thanks.

-- 
Oleksandr Natalenko (post-factum)



Re: [PATCH] iommu/dma: Explicitly sort PCI DMA windows

2022-03-24 Thread Robin Murphy

On 2022-03-24 00:56, Rob Herring wrote:

On Tue, Mar 22, 2022 at 12:27 PM Robin Murphy  wrote:


Originally, creating the dma_ranges resource list in pre-sorted fashion
was the simplest and most efficient way to enforce the order required by
iova_reserve_pci_windows(). However since then at least one PCI host
driver is now re-sorting the list for its own probe-time processing,
which doesn't seem entirely unreasonable, so that basic assumption no
longer holds. Make iommu-dma robust and get the sort order it needs by
explicitly sorting, which means we can also save the effort at creation
time and just build the list in whatever natural order the DT had.

Signed-off-by: Robin Murphy 
---

Looking at this area off the back of the XGene thread[1] made me realise
that we need to do it anyway, regardless of whether it might also happen
to restore the previous XGene behaviour or not. Presumably nobody's
tried to use pcie-cadence-host behind an IOMMU yet...

Boot-tested on Juno to make sure I hadn't got the sort comparison
backwards.

Robin.

[1] https://lore.kernel.org/linux-pci/20220321104843.949645-1-...@kernel.org/

  drivers/iommu/dma-iommu.c | 13 -
  drivers/pci/of.c  |  7 +--
  2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index b22034975301..91d134c0c9b1 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -20,6 +20,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -414,6 +415,15 @@ static int cookie_init_hw_msi_region(struct 
iommu_dma_cookie *cookie,
 return 0;
  }

+static int iommu_dma_ranges_sort(void *priv, const struct list_head *a,
+   const struct list_head *b)
+{
+   struct resource_entry *res_a = list_entry(a, typeof(*res_a), node);
+   struct resource_entry *res_b = list_entry(b, typeof(*res_b), node);
+
+   return res_a->res->start > res_b->res->start;
+}
+
  static int iova_reserve_pci_windows(struct pci_dev *dev,
 struct iova_domain *iovad)
  {
@@ -432,6 +442,7 @@ static int iova_reserve_pci_windows(struct pci_dev *dev,
 }

 /* Get reserved DMA windows from host bridge */
+   list_sort(NULL, &bridge->dma_ranges, iommu_dma_ranges_sort);
 resource_list_for_each_entry(window, &bridge->dma_ranges) {
 end = window->res->start - window->offset;
  resv_iova:
@@ -440,7 +451,7 @@ static int iova_reserve_pci_windows(struct pci_dev *dev,
 hi = iova_pfn(iovad, end);
 reserve_iova(iovad, lo, hi);
 } else if (end < start) {
-   /* dma_ranges list should be sorted */
+   /* DMA ranges should be non-overlapping */
 dev_err(&dev->dev,
 "Failed to reserve IOVA [%pa-%pa]\n",
 &start, &end);
diff --git a/drivers/pci/of.c b/drivers/pci/of.c
index cb2e8351c2cc..d176b4bc6193 100644
--- a/drivers/pci/of.c
+++ b/drivers/pci/of.c
@@ -393,12 +393,7 @@ static int devm_of_pci_get_host_bridge_resources(struct 
device *dev,
 goto failed;
 }

-   /* Keep the resource list sorted */
-   resource_list_for_each_entry(entry, ib_resources)
-   if (entry->res->start > res->start)
-   break;
-
-   pci_add_resource_offset(&entry->node, res,


entry is now unused and causes a warning.


Sigh, seems the problem with CONFIG_WERROR is that once you think it's 
enabled, you then stop paying much attention to the build log...


Thanks for the catch,
Robin.




+   pci_add_resource_offset(ib_resources, res,
 res->start - range.pci_addr);
 }

--
2.28.0.dirty




[git pull] IOMMU Updates for Linux v5.18

2022-03-24 Thread Joerg Roedel
Hi Linus,

there is a conflict this time when merging the IOMMU tree updates. It is
in drivers/iommu/intel/iommu.c and comes from the fact that the tip-tree
patched functions in that file which get removed with these updates. So
the merge resolution is to use the changes from the IOMMU tree. With
that in mind, here are the IOMMU changes for v5.18:

The following changes since commit ffb217a13a2eaf6d5bd974fc83036a53ca69f1e2:

  Linux 5.17-rc7 (2022-03-06 14:28:31 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git 
tags/iommu-updates-v5.18

for you to fetch changes up to e17c6debd4b2d2d474074f83946f8c6522587566:

  Merge branches 'arm/mediatek', 'arm/msm', 'arm/renesas', 'arm/rockchip', 
'arm/smmu', 'x86/vt-d' and 'x86/amd' into next (2022-03-08 12:21:31 +0100)


IOMMU Updates for Linux v5.18

Including:

- IOMMU Core changes:
  - Removal of aux domain related code as it is basically dead
and will be replaced by iommu-fd framework
  - Split of iommu_ops to carry domain-specific call-backs
separatly
  - Cleanup to remove useless ops->capable implementations
  - Improve 32-bit free space estimate in iova allocator

- Intel VT-d updates:
  - Various cleanups of the driver
  - Support for ATS of SoC-integrated devices listed in
ACPI/SATC table

- ARM SMMU updates:
  - Fix SMMUv3 soft lockup during continuous stream of events
  - Fix error path for Qualcomm SMMU probe()
  - Rework SMMU IRQ setup to prepare the ground for PMU support
  - Minor cleanups and refactoring

- AMD IOMMU driver:
  - Some minor cleanups and error-handling fixes

- Rockchip IOMMU driver:
  - Use standard driver registration

- MSM IOMMU driver:
  - Minor cleanup and change to standard driver registration

- Mediatek IOMMU driver:
  - Fixes for IOTLB flushing logic


Andy Shevchenko (2):
  perf/smmuv3: Don't cast parameter in bit operations
  iommu/vt-d: Move intel_iommu_ops to header file

Christophe JAILLET (2):
  iommu/arm-smmu-v3: Avoid open coded arithmetic in memory allocation
  iommu/arm-smmu-v3: Simplify memory allocation

David Heidelberg (1):
  iommu/msm: Simplify with dev_err_probe()

Jiasheng Jiang (1):
  iommu/ipmmu-vmsa: Check for error num after setting mask

Joerg Roedel (3):
  Merge branch 'core' into x86/vt-d
  Merge tag 'arm-smmu-updates' of 
git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into arm/smmu
  Merge branches 'arm/mediatek', 'arm/msm', 'arm/renesas', 'arm/rockchip', 
'arm/smmu', 'x86/vt-d' and 'x86/amd' into next

John Garry (1):
  iommu/iova: Separate out rcache init

Lu Baolu (17):
  iommu/vt-d: Remove guest pasid related callbacks
  iommu: Remove guest pasid related interfaces and definitions
  iommu/vt-d: Remove aux-domain related callbacks
  iommu: Remove aux-domain related interfaces and iommu_ops
  iommu: Remove apply_resv_region
  drm/nouveau/device: Get right pgsize_bitmap of iommu_domain
  iommu: Use right way to retrieve iommu_ops
  iommu: Remove unused argument in is_attach_deferred
  iommu: Split struct iommu_ops
  iommu/vt-d: Remove intel_iommu::domains
  iommu/vt-d: Remove finding domain in dmar_insert_one_dev_info()
  iommu/vt-d: Remove iova_cache_get/put()
  iommu/vt-d: Remove domain and devinfo mempool
  iommu/vt-d: Remove DEFER_DEVICE_DOMAIN_INFO
  iommu/vt-d: Remove unnecessary includes
  iommu/vt-d: Remove unnecessary prototypes
  iommu/vt-d: Fix indentation of goto labels

Marco Bonelli (1):
  iommu/vt-d: Add missing "__init" for rmrr_sanity_check()

Miaoqian Lin (1):
  iommu/arm-smmu: Add missing pm_runtime_disable() in 
qcom_iommu_device_probe

Rafael J. Wysocki (1):
  iommu/vtd: Replace acpi_bus_get_device()

Robin Murphy (5):
  iommu: Remove trivial ops->capable implementations
  iommu/rockchip: : Use standard driver registration
  iommu/msm: Use standard driver registration
  iommu/iova: Improve 32-bit free space estimate
  iommu/arm-smmu: Account for PMU interrupts

Sebastian Reichel (1):
  iommu/mediatek: Always check runtime PM status in tlb flush range callback

Suravee Suthikulpanit (2):
  iommu/amd: Improve error handling for amd_iommu_init_pci
  iommu/amd: Improve amd_iommu_v2_exit()

Vasant Hegde (3):
  iommu/amd: Call memunmap in error path
  iommu/amd: Clean up function declarations
  iommu/amd: Remove unused struct fault.devid

Yian Chen (1):
  iommu/vt-d: Enable ATS for the devices in SATC table

Yong Wu (4):
  iommu/mediatek: Remove for_each_m4u in tlb_sync_all
  iommu/mediatek: Remove the power status 

[PATCH 03/25] dma-direct: take dma-ranges/offsets into account in resource mapping

2022-03-24 Thread Serge Semin
A basic device-specific linear memory mapping was introduced back in
commit 25f1e1887088 ("dma: Take into account dma_pfn_offset") as a
single-valued offset preserved in the device.dma_pfn_offset field, which
was initialized for instance by means of the "dma-ranges" DT property.
Afterwards the functionality was extended to support more than one
device-specific region defined in the device.dma_range_map list of maps.
But all of these improvements concerned only the single-pointer, page and
sg DMA-mapping methods, while the system resource mapping function turned
out to miss the corresponding modification. Thus the
dma_direct_map_resource() method now just casts the CPU physical address
to the device DMA address with no dma-ranges-based mapping taken into
account, which is obviously wrong. Let's fix it by using the
phys_to_dma_direct() method to get the device-specific bus address from
the passed memory resource for the case of directly mapped DMA.

Fixes: 25f1e1887088 ("dma: Take into account dma_pfn_offset")
Signed-off-by: Serge Semin 
---
 kernel/dma/direct.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 50f48e9e4598..9ce8192b29ab 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -497,7 +497,7 @@ int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl, int nents,
 dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
size_t size, enum dma_data_direction dir, unsigned long attrs)
 {
-   dma_addr_t dma_addr = paddr;
+   dma_addr_t dma_addr = phys_to_dma_direct(dev, paddr);
 
if (unlikely(!dma_capable(dev, dma_addr, size, false))) {
dev_err_once(dev,
-- 
2.35.1



Re: [REGRESSION] Recent swiotlb DMA_FROM_DEVICE fixes break ath9k-based AP

2022-03-24 Thread Oleksandr Natalenko via iommu
Hello.

On Wednesday 23 March 2022 18:27:21 CET Linus Torvalds wrote:
> On Wed, Mar 23, 2022 at 12:19 AM Oleksandr Natalenko
>  wrote:
> > These commits appeared in v5.17 and v5.16.15, and both kernels are
> > broken for me. I'm pretty confident these commits make the difference
> > since I've built both v5.17 and v5.16.15 without them, and it fixed
> > the issue.
> 
> Can you double-check (or just explicitly confirm if you already did
> that test) that you need to revert *both* of those commits, and it's
> the later "rework" fix that triggers it?

I can confirm that if I revert aa6f8dcbab47 only, but leave ddbd89deb7d3 in 
place, AP works. So, it seems that the latest "rework" triggers the issue for 
me.

Thanks.

-- 
Oleksandr Natalenko (post-factum)



Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-03-24 Thread Eric Auger
Hi,

On 3/24/22 1:33 AM, Jason Gunthorpe wrote:
> On Wed, Mar 23, 2022 at 04:51:25PM -0600, Alex Williamson wrote:
>
>> My overall question here would be whether we can actually achieve a
>> compatibility interface that has sufficient feature transparency that we
>> can dump vfio code in favor of this interface, or will there be enough
>> niche use cases that we need to keep type1 and vfio containers around
>> through a deprecation process?
> Other than SPAPR, I think we can.
>
>> The locked memory differences for one seem like something that
>> libvirt wouldn't want hidden
> I'm first interested to have an understanding how this change becomes
> a real problem in practice that requires libvirt to do something
> different for vfio or iommufd. We can discuss in the other thread
>
> If this is the make or break point then I think we can deal with it
> either by going back to what vfio does now or perhaps some other
> friendly compat approach..
>
>> and we have questions regarding support for vaddr hijacking
> I'm not sure what vaddr hijacking is? Do you mean
> VFIO_DMA_MAP_FLAG_VADDR ? There is a comment that outlines my plan to
> implement it in a functionally compatible way without the deadlock
> problem. I estimate this as a small project.
>
>> and different ideas how to implement dirty page tracking, 
> I don't think this is compatibility. No kernel today triggers qemu to
> use this feature as no kernel supports live migration. No existing
> qemu will trigger this feature with new kernels that support live
> migration v2. Therefore we can adjust qemu's dirty tracking at the
> same time we enable migration v2 in qemu.
>
> With Joao's work we are close to having a solid RFC to come with
> something that can be fully implemented.
>
> Hopefully we can agree to this soon enough that qemu can come with a
> full package of migration v2 support including the dirty tracking
> solution.
>
>> not to mention the missing features that are currently well used,
>> like p2p mappings, coherency tracking, mdev, etc.
> I consider these all mandatory things, they won't be left out.
>
> The reason they are not in the RFC is mostly because supporting them
> requires work outside just this iommufd area, and I'd like this series
> to remain self-contained.
>
> I've already got a draft to add DMABUF support to VFIO PCI which
> nicely solves the follow_pfn security problem, we want to do this for
> another reason already. I'm waiting for some testing feedback before
> posting it. I need some help from Daniel to make the DMABUF revoke
> semantics he and I have been talking about work. In the worst case we
> can copy the follow_pfn approach.
>
> Intel no-snoop is simple enough, just needs some Intel cleanup parts.
>
> mdev will come along with the final VFIO integration, all the really
> hard parts are done already. The VFIO integration is a medium sized
> task overall.
>
> So, I'm not ready to give up yet :)
>
>> Where do we focus attention?  Is symlinking device files our proposal
>> to userspace and is that something achievable, or do we want to use
>> this compatibility interface as a means to test the interface and
>> allow userspace to make use of it for transition, if their use cases
>> allow it, perhaps eventually performing the symlink after deprecation
>> and eventual removal of the vfio container and type1 code?  Thanks,
> symlinking device files is definitely just a suggested way to expedite
> testing.
>
> Things like qemu that are learning to use iommufd-only features should
> learn to directly open iommufd instead of vfio container to activate
> those features.
>
> Looking long down the road I don't think we want to have type 1 and
> iommufd code forever. So, I would like to make an option to compile
> out vfio container support entirely and have that option arrange for
> iommufd to provide the container device node itself.
I am currently working on migrating the QEMU VFIO device onto the new
API because, following our discussions, the compat mode cannot be used
to implement nesting anyway. I hope I will be able to present
something next week.

Thanks

Eric
>
> I think we can get there pretty quickly, or at least I haven't got
> anything that is scaring me alot (beyond SPAPR of course)
>
> For the dpdk/etcs of the world I think we are already there.
>
> Jason
>



RE: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-24 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, March 24, 2022 6:55 AM
> 
> On Wed, Mar 23, 2022 at 05:34:18PM -0300, Jason Gunthorpe wrote:
> 
> > Stated another way, any platform that wires dev_is_dma_coherent() to
> > true, like all x86 does, must support IOMMU_CACHE and report
> > IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform
> > supports. The platform obviously declares it support this in order to
> > support the in-kernel DMA API.
> 
> That gives me a nice simple idea:
> 
> diff --git a/drivers/iommu/iommufd/device.c
> b/drivers/iommu/iommufd/device.c
> index 3c6b95ad026829..8366884df4a030 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -8,6 +8,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #include "iommufd_private.h"
> 
> @@ -61,6 +62,10 @@ struct iommufd_device
> *iommufd_bind_pci_device(int fd, struct pci_dev *pdev,
>   struct iommu_group *group;
>   int rc;
> 
> + /* iommufd always uses IOMMU_CACHE */
> + if (!dev_is_dma_coherent(&pdev->dev))
> + return ERR_PTR(-EINVAL);
> +
>   ictx = iommufd_fget(fd);
>   if (!ictx)
>   return ERR_PTR(-EINVAL);
> diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
> index 48149988c84bbc..3d6df1ffbf93e6 100644
> --- a/drivers/iommu/iommufd/ioas.c
> +++ b/drivers/iommu/iommufd/ioas.c
> @@ -129,7 +129,8 @@ static int conv_iommu_prot(u32 map_flags)
>* We provide no manual cache coherency ioctls to userspace and
> most
>* architectures make the CPU ops for cache flushing privileged.
>* Therefore we require the underlying IOMMU to support CPU
> coherent
> -  * operation.
> +  * operation. Support for IOMMU_CACHE is enforced by the
> +  * dev_is_dma_coherent() test during bind.
>*/
>   iommu_prot = IOMMU_CACHE;
>   if (map_flags & IOMMU_IOAS_MAP_WRITEABLE)
> 
> Looking at it I would say all the places that test
> IOMMU_CAP_CACHE_COHERENCY can be replaced with
> dev_is_dma_coherent()
> except for the one call in VFIO that is supporting the Intel no-snoop
> behavior.
> 
> Then we can rename IOMMU_CAP_CACHE_COHERENCY to something like
> IOMMU_CAP_ENFORCE_CACHE_COHERENCY and just create a
> IOMMU_ENFORCE_CACHE prot flag for Intel IOMMU to use instead of
> abusing IOMMU_CACHE.
> 

Based on that here is a quick tweak of the force-snoop part (not compiled).

Several notes:

- IOMMU_CAP_CACHE_COHERENCY is kept as it's checked in vfio's
  group attach interface. Removing it may require a group_is_dma_coherent();

- vdpa is not changed, as force-snoop today is only for integrated GPUs,
  which are not managed by vdpa. But adding snoop support is easy if
  necessary;

- vfio type1 reports the force-snoop fact to KVM via VFIO_DMA_CC_IOMMU.
  For iommufd the compat layer may leverage that interface, but for
  non-compat usage more thought is required on how it can be reused, or
  whether a new interface is needed between iommufd and KVM. Per earlier
  discussions Paolo prefers to reuse.

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 5b196cf..06cca04 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5110,7 +5110,8 @@ static int intel_iommu_map(struct iommu_domain *domain,
prot |= DMA_PTE_READ;
if (iommu_prot & IOMMU_WRITE)
prot |= DMA_PTE_WRITE;
-   if ((iommu_prot & IOMMU_CACHE) && dmar_domain->iommu_snooping)
+   /* nothing to do for IOMMU_CACHE */
+   if ((iommu_prot & IOMMU_SNOOP) && dmar_domain->iommu_snooping)
prot |= DMA_PTE_SNP;
 
max_addr = iova + size;
@@ -5236,6 +5237,8 @@ static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,
 static bool intel_iommu_capable(enum iommu_cap cap)
 {
if (cap == IOMMU_CAP_CACHE_COHERENCY)
+   return true;
+   if (cap == IOMMU_CAP_FORCE_SNOOP)
return domain_update_iommu_snooping(NULL);
if (cap == IOMMU_CAP_INTR_REMAP)
return irq_remapping_enabled == 1;
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 9394aa9..abc4cfe 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2270,6 +2270,9 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
domain->prot |= IOMMU_CACHE;
 
+   if (iommu_capable(bus, IOMMU_CAP_FORCE_SNOOP))
+   domain->prot |= IOMMU_SNOOP;
+
/*
 * Try to match an existing compatible domain.  We don't want to
 * preclude an IOMMU driver supporting multiple bus_types and being
@@ -2611,14 +2614,14 @@ static void vfio_iommu_type1_release(void *iommu_data)
kfree(iommu);
 }
 
-static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu)
+static int vfio_domains_have_iommu_snoop(struct vfio_iommu *iommu)
 {
struct vfio_domain *domain;
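
(The archived hunk is truncated here. Modeled on the existing
vfio_domains_have_iommu_cache() in type1, the renamed helper would
plausibly continue as below -- a sketch of the likely shape, not the
actual remainder of the patch:)

static int vfio_domains_have_iommu_snoop(struct vfio_iommu *iommu)
{
	struct vfio_domain *domain;
	int ret = 1;

	mutex_lock(&iommu->lock);
	list_for_each_entry(domain, &iommu->domain_list, next) {
		/* one non-snooped domain means DMA may be non-coherent */
		if (!(domain->prot & IOMMU_SNOOP)) {
			ret = 0;
			break;
		}
	}
	mutex_unlock(&iommu->lock);

	return ret;
}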
  

RE: [PATCH RFC 08/12] iommufd: IOCTLs for the io_pagetable

2022-03-24 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, March 24, 2022 4:34 AM
> 
> On Wed, Mar 23, 2022 at 02:04:46PM -0600, Alex Williamson wrote:
> > On Wed, 23 Mar 2022 16:34:39 -0300
> > Jason Gunthorpe  wrote:
> >
> > > On Wed, Mar 23, 2022 at 01:10:38PM -0600, Alex Williamson wrote:
> > > > On Fri, 18 Mar 2022 14:27:33 -0300
> > > > Jason Gunthorpe  wrote:
> > > >
> > > > > +static int conv_iommu_prot(u32 map_flags)
> > > > > +{
> > > > > + int iommu_prot;
> > > > > +
> > > > > + /*
> > > > > +  * We provide no manual cache coherency ioctls to userspace and most
> > > > > +  * architectures make the CPU ops for cache flushing privileged.
> > > > > +  * Therefore we require the underlying IOMMU to support CPU coherent
> > > > > +  * operation.
> > > > > +  */
> > > > > + iommu_prot = IOMMU_CACHE;
> > > >
> > > > Where is this requirement enforced?  AIUI we'd need to test
> > > > IOMMU_CAP_CACHE_COHERENCY somewhere since functions like
> > > > intel_iommu_map() simply drop the flag when not supported by HW.
> > >
> > > You are right, the correct thing to do is to fail device
> > > binding/attach entirely if IOMMU_CAP_CACHE_COHERENCY is not there,
> > > however we can't do that because Intel abuses the meaning of
> > > IOMMU_CAP_CACHE_COHERENCY to mean their special no-snoop behavior is
> > > supported.
> > >
> > > I want Intel to split out their special no-snoop from IOMMU_CACHE and
> > > IOMMU_CAP_CACHE_COHERENCY so these things have a consistent meaning in
> > > all iommu drivers. Once this is done vfio and iommufd should both
> > > always set IOMMU_CACHE and refuse to work without
> > > IOMMU_CAP_CACHE_COHERENCY. (unless someone knows of an !IOMMU_CACHE
> > > arch that does in fact work today with vfio, somehow, but I don't..)
> >
> > IIRC, the DMAR on Intel CPUs dedicated to IGD was where we'd often see
> > lack of snoop-control support, causing us to have mixed coherent and
> > non-coherent domains.  I don't recall if you go back far enough in VT-d
> > history if the primary IOMMU might have lacked this support.  So I
> > think there are systems we care about with IOMMUs that can't enforce
> > DMA coherency.
> >
> > As it is today, if the IOMMU reports IOMMU_CAP_CACHE_COHERENCY and all
> > mappings make use of IOMMU_CACHE, then all DMA is coherent.  Are you
> > suggesting IOMMU_CAP_CACHE_COHERENCY should indicate that all mappings
> > are coherent regardless of mapping protection flags?  What's the point
> > of IOMMU_CACHE at that point?
> 
> IOMMU_CAP_CACHE_COHERENCY should return to what it was before Intel's
> change.

One nit (as I explained in the previous v1 discussion): it is not that Intel
abuses this capability, since it was first introduced precisely for Intel's
force-snoop requirement. It is just that when its meaning was later changed
to match what you describe below, the original Intel usage was not caught
and fixed accordingly.

> 
> It only means normal DMAs issued in a normal way are coherent with the
> CPU and do not require special cache flushing instructions. ie DMA
> issued by a kernel driver using the DMA API.
> 
> It does not mean that non-coherence DMA does not exist, or that
> platform or device specific ways to trigger non-coherence are blocked.
> 
> Stated another way, any platform that wires dev_is_dma_coherent() to
> true, like all x86 does, must support IOMMU_CACHE and report
> IOMMU_CAP_CACHE_COHERENCY for every iommu_domain the platform
> supports. The platform obviously declares it support this in order to
> support the in-kernel DMA API.

This is a good explanation of IOMMU_CACHE. Given that, the intel-iommu
driver should always report this capability and treat IOMMU_CACHE as a
no-op, since coherency is already guaranteed by the arch. This is exactly
what the AMD iommu driver does today.
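
For reference, the AMD driver's capable() callback already behaves this
way; lightly abridged (so treat it as a sketch rather than a verbatim
quote), it reads:

static bool amd_iommu_capable(enum iommu_cap cap)
{
	switch (cap) {
	case IOMMU_CAP_CACHE_COHERENCY:
		/* no snoop control; DMA API traffic is always coherent */
		return true;
	case IOMMU_CAP_INTR_REMAP:
		return (irq_remapping_enabled == 1);
	default:
		return false;
	}
}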

Then we'll introduce a new cap/prot specifically for enforcing snoop,
as you suggested below.
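
To make that concrete, the split could look roughly like this (names as
floated elsewhere in this thread; illustrative only, not merged code):

/* include/linux/iommu.h */
enum iommu_cap {
	IOMMU_CAP_CACHE_COHERENCY,	/* DMA API traffic is cache coherent */
	IOMMU_CAP_INTR_REMAP,		/* IOMMU supports interrupt isolation */
	IOMMU_CAP_NOEXEC,		/* IOMMU_NOEXEC flag supported */
	IOMMU_CAP_ENFORCE_CACHE_COHERENCY, /* new: can block PCIe no-snoop */
};

/* existing prot bits end at IOMMU_PRIV (1 << 5) */
#define IOMMU_ENFORCE_CACHE	(1 << 6)	/* new: force-snoop this mapping */

With that, vfio/iommufd would always set IOMMU_CACHE, and only a driver
reporting the new cap would honor IOMMU_ENFORCE_CACHE, e.g. by setting
DMA_PTE_SNP on Intel.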

> 
> Thus, a new cap indicating that 'all dma is coherent' or 'no-snoop
> blocking' should be created to cover Intel's special need. From what I
> know it is only implemented in the Intel driver and apparently only
> for some IOMMUs connected to IGD.
> 
> > > Yes, it was missed in the notes for vfio compat that Intel no-snoop is
> > > not working currently, I fixed it.
> >
> > Right, I see it in the comments relative to extensions, but missed in
> > the commit log.  Thanks,
> 
> Oh good, I remembered it was someplace..
> 

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu