RE: [RFC 06/20] iommu: Add iommu_device_init[exit]_user_dma interfaces

2021-10-14 Thread Tian, Kevin
Hi, Jason,

> From: Jason Gunthorpe 
> Sent: Wednesday, September 29, 2021 8:59 PM
> 
> On Wed, Sep 29, 2021 at 12:38:35AM +, Tian, Kevin wrote:
> 
> > /* If set the driver must call iommu_XX as the first action in probe() or
> >   * before it attempts to do DMA
> >   */
> >  bool suppress_dma_owner:1;
> 
> It is not "attempts to do DMA" but more "operates the physical device
> in any way"
> 
> Not having ownership means another entity could be using user space
> DMA to manipulate the device state and attack the integrity of the
> kernel's programming of the device.
> 

Does suppress_kernel_dma sound better than suppress_dma_owner?
We found the latter caused some confusion during internal code
review. This flag really means "don't claim kernel DMA ownership
during driver binding", but suppress_dma_owner sounds as if the
entire ownership mechanism is disabled...

Another thing is about DMA_OWNER_SHARED, which is set to indicate
no DMA at all. Thinking about it more, we feel this flag is meaningless.
Its sole purpose is to express compatibility with any USER/KERNEL
ownership, which is essentially the same semantics as a device that is
not bound to any driver. So we plan to remove it; pci-stub then just
needs a one-line change to set the suppress flag (see the sketch below).
But we want to check with you first in case we have overlooked something.
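
To be concrete, this is the kind of one-line change we mean for pci-stub.
It is only a sketch, assuming the flag stays a bitfield in struct
device_driver and takes the suppress_kernel_dma name - both are still
open:

static struct pci_driver stub_driver = {
	.name		= "pci-stub",
	.id_table	= NULL,	/* only dynamic IDs */
	.probe		= pci_stub_probe,
	.driver		= {
		/* don't claim kernel DMA ownership at driver binding */
		.suppress_kernel_dma	= true,
	},
};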

Thanks
Kevin


RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

2021-10-14 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, October 14, 2021 11:43 PM
> 
> > > > I think the key is whether other archs allow driver to decide DMA
> > > > coherency and indirectly the underlying I/O page table format.
> > > > If yes, then I don't see a reason why such decision should not be
> > > > given to userspace for passthrough case.
> > >
> > > The choice all comes down to if the other arches have cache
> > > maintenance instructions in the VM that *don't work*
> >
> > It looks like vfio always sets IOMMU_CACHE on all platforms as long as
> > the iommu supports it (true on all platforms except the Intel IOMMU
> > instances dedicated to integrated GPUs):
> >
> > vfio_iommu_type1_attach_group()
> > {
> > ...
> > if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
> > domain->prot |= IOMMU_CACHE;
> > ...
> > }
> >
> > Should the above be set according to whether the device is coherent?
> 
> For IOMMU_CACHE there are two questions related to the overloaded
> meaning:
> 
>  - Should VFIO ask the IOMMU to use non-coherent DMA (ARM meaning)
>This depends on how the VFIO user expects to operate the DMA.
>If the VFIO user can issue cache maintenance ops then IOMMU_CACHE
>should be controlled by the user. I have no idea what platforms
>support user space cache maintenance ops.

But just like you said for the Intel meaning below, even if those ops are
privileged, a uAPI could be provided to support such usage if necessary.

> 
>  - Should VFIO ask the IOMMU to suppress no-snoop (Intel meaning)
>This depends if the VFIO user has access to wbinvd or not.
> 
>wbinvd is a privileged instruction so normally userspace will not
>be able to access it.
> 
>Per Paolo recommendation there should be a uAPI someplace that
>allows userspace to issue wbinvd - basically the suppress no-snoop
>is also user controllable.
> 
> The two things are very similar and ultimately are a choice userspace
> should be making.

yes

> 
> From something like a qemu perspective things are more murky - eg on
> ARM qemu needs to co-ordinate with the guest. Whatever IOMMU_CACHE
> mode VFIO is using must match the device coherent flag in the Linux
> guest. I'm guessing all Linux guest VMs only use coherent DMA for all
> devices today. I don't know if the cache maintenance ops are even
> permitted in an ARM VM.
> 

I'll leave it to Jean to confirm. If only coherent DMA can be used in
the guest on other platforms, I suppose VFIO should not blindly set
IOMMU_CACHE, and conceptually it should refuse to assign a non-coherent
device since no co-ordination with the guest exists today.

So the bottom line is that we'll keep this no-snoop thing Intel-specific.
For the basic skeleton we'll not support no-snoop, so the user needs to
set the enforce-snoop flag when creating an IOAS, as this RFC v1 does.
We also need to introduce a new flag instead of abusing IOMMU_CACHE in
the kernel. For other platforms a fix may be needed to deny non-coherent
devices (based on the open above) for now.
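
Roughly what we have in mind for the allocation side - purely illustrative,
the structure, field and flag names below are invented and nothing is final:

#include <linux/types.h>

struct iommu_ioasid_alloc {
	__u32	argsz;
	__u32	flags;
/* hypothetical flag: snooping is enforced by the I/O page table format */
#define IOMMU_IOASID_ENFORCE_SNOOP	(1 << 0)
	__u32	type;		/* e.g. IOMMU_IOASID_TYPE_KERNEL_TYPE1V2 */
	__u32	addr_width;	/* width of the I/O address space */
};

The point is simply that enforce-snoop becomes an explicit property of the
IOAS chosen by userspace, rather than something the kernel derives from
IOMMU_CACHE.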

Thanks
Kevin


Re: [RFC] iommu: Use put_pages_list

2021-10-14 Thread Matthew Wilcox
On Thu, Oct 14, 2021 at 05:17:18PM +0100, Robin Murphy wrote:
> On 2021-10-14 12:52, John Garry wrote:
> > On 14/10/2021 12:20, Matthew Wilcox wrote:
> > > I'm going to keep pinging this patch weekly.
> > > 
> > > On Thu, Oct 07, 2021 at 07:17:02PM +0100, Matthew Wilcox wrote:
> > > > ping?
> > 
> > Robin, were you checking this? You mentioned "I got
> > side-tracked trying to make io-pgtable use that freelist properly" in
> > another thread, which seems related.
> 
> Ooh, thanks for the heads-up John - I'm still only just starting to catch up
> on my mailing list folders since I got back off holiday.
> 
> Indeed I already started untangling the freelist handling in the flush queue
> code (to make the move into iommu-dma smaller). Once I'd figured out how it
> worked I did wonder whether there was any more "standard" field to borrow,
> since page->freelist did seem very much in the minority. If page->lru is it
> then great! From a quick skim of the patch I think I'd only have a few
> trivial review comments to make - certainly no objection to the fundamental
> change itself (indeed I hit a point in io-pgtable-arm where adding to the
> pointer chain got rather awkward, so having proper lists to splice would be
> lovely).

Great to hear!

> Matthew - is this something getting in the way of mm development, or just a
> nice cleanup? I'd be happy either to pursue merging it on its own, or to
> pick it up and work it into a series with my stuff.

This is probably going to get in the way of MM development in ~6 months'
time.  I'm happy for you to pick it up and put it in a series of your own!
BTW, the optimisation of the implementation of put_pages_list() is sitting
in akpm's tree, so if you see a performance problem, please give that
a try.
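
For anyone catching up on the thread, the pattern the patch switches to
is simply this (a minimal sketch, not code lifted from the patch itself):

#include <linux/mm.h>
#include <linux/list.h>

/* collect page-table pages on a list_head, borrowing page->lru */
static void example_queue_free(struct page *p, struct list_head *freelist)
{
	list_add_tail(&p->lru, freelist);
}

/* free everything chained on the list in one call */
static void example_free_all(struct list_head *freelist)
{
	put_pages_list(freelist);
}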


Re: [RFC] iommu: Use put_pages_list

2021-10-14 Thread Robin Murphy

On 2021-10-14 12:52, John Garry wrote:
> On 14/10/2021 12:20, Matthew Wilcox wrote:
> > I'm going to keep pinging this patch weekly.
> > 
> > On Thu, Oct 07, 2021 at 07:17:02PM +0100, Matthew Wilcox wrote:
> > > ping?
> 
> Robin, were you checking this? You mentioned "I got
> side-tracked trying to make io-pgtable use that freelist properly" in
> another thread, which seems related.

Ooh, thanks for the heads-up John - I'm still only just starting to
catch up on my mailing list folders since I got back off holiday.

Indeed I already started untangling the freelist handling in the flush
queue code (to make the move into iommu-dma smaller). Once I'd figured
out how it worked I did wonder whether there was any more "standard"
field to borrow, since page->freelist did seem very much in the
minority. If page->lru is it then great! From a quick skim of the patch
I think I'd only have a few trivial review comments to make - certainly
no objection to the fundamental change itself (indeed I hit a point in
io-pgtable-arm where adding to the pointer chain got rather awkward, so
having proper lists to splice would be lovely).

Matthew - is this something getting in the way of mm development, or
just a nice cleanup? I'd be happy either to pursue merging it on its
own, or to pick it up and work it into a series with my stuff.

Cheers,
Robin.

> Thanks,
> John
> 
> > > On Thu, Sep 30, 2021 at 05:20:42PM +0100, Matthew Wilcox (Oracle) wrote:
> > > > page->freelist is for the use of slab.  We already have the ability
> > > > to free a list of pages in the core mm, but it requires the use of a
> > > > list_head and for the pages to be chained together through page->lru.
> > > > Switch the iommu code over to using put_pages_list().
> > > >
> > > > Signed-off-by: Matthew Wilcox (Oracle) 
> > > > ---
> > > >   drivers/iommu/amd/io_pgtable.c | 99 +++---
> > > >   drivers/iommu/dma-iommu.c  | 11 +---
> > > >   drivers/iommu/intel/iommu.c    | 89 +++---
> > > >   include/linux/iommu.h  |  3 +-
> > > >   4 files changed, 77 insertions(+), 125 deletions(-)



Re: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

2021-10-14 Thread Jason Gunthorpe via iommu
On Thu, Oct 14, 2021 at 09:11:58AM +, Tian, Kevin wrote:

> But in both cases cache maintenance instructions are available from 
> guest p.o.v and no coherency semantics would be violated.

You've described how Intel's solution papers over the problem.

In part, wbinvd is defined to restore CPU cache coherence after a
no-snoop DMA. Having wbinvd be a NOP breaks this contract.

To counteract the broken wbinvd the IOMMU completely prevents the use
of no-snoop DMA - it converts those transactions to snooped ones instead.

The driver thinks it has no-snoop. The platform appears to support
no-snoop. The driver issues wbinvd - but all of it does nothing.

I don't think any of this is even remotely related to what ARM is doing
here. ARM has neither the broken VM cache ops, nor the IOMMU ability
to suppress no-snoop.
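
For reference, the snoop-forcing described above boils down to a bit in
the IOPTE. This is paraphrased from drivers/iommu/intel/iommu.c of this
era, not a verbatim copy, so treat the details as approximate:

#include <linux/iommu.h>
#include <linux/intel-iommu.h>

static u64 example_prot_to_pte_bits(int iommu_prot, bool iommu_snooping)
{
	u64 prot = 0;

	if (iommu_prot & IOMMU_READ)
		prot |= DMA_PTE_READ;
	if (iommu_prot & IOMMU_WRITE)
		prot |= DMA_PTE_WRITE;
	/* when the IOMMU supports snoop control, even no-snoop TLPs
	 * get treated as snooped */
	if ((iommu_prot & IOMMU_CACHE) && iommu_snooping)
		prot |= DMA_PTE_SNP;

	return prot;
}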

> > > I think the key is whether other archs allow driver to decide DMA
> > > coherency and indirectly the underlying I/O page table format.
> > > If yes, then I don't see a reason why such decision should not be
> > > given to userspace for passthrough case.
> > 
> > The choice all comes down to if the other arches have cache
> > maintenance instructions in the VM that *don't work*
>
> It looks like vfio always sets IOMMU_CACHE on all platforms as long as
> the iommu supports it (true on all platforms except the Intel IOMMU
> instances dedicated to integrated GPUs):
> 
> vfio_iommu_type1_attach_group()
> {
>   ...
>   if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
>   domain->prot |= IOMMU_CACHE;
>   ...
> }
> 
> Should the above be set according to whether the device is coherent?

For IOMMU_CACHE there are two questions related to the overloaded
meaning:

 - Should VFIO ask the IOMMU to use non-coherent DMA (ARM meaning)
   This depends on how the VFIO user expects to operate the DMA.
   If the VFIO user can issue cache maintenance ops then IOMMU_CACHE
   should be controlled by the user. I have no idea what platforms
   support user space cache maintenance ops.

 - Should VFIO ask the IOMMU to suppress no-snoop (Intel meaning)
   This depends if the VFIO user has access to wbinvd or not.

   wbinvd is a privileged instruction so normally userspace will not
   be able to access it.

   Per Paolo recommendation there should be a uAPI someplace that
   allows userspace to issue wbinvd - basically the suppress no-snoop
   is also user controllable.

The two things are very similar and ultimately are a choice userspace
should be making.

From something like a qemu perspective things are more murky - eg on
ARM qemu needs to co-ordinate with the guest. Whatever IOMMU_CACHE
mode VFIO is using must match the device coherent flag in the Linux
guest. I'm guessing all Linux guest VMs only use coherent DMA for all
devices today. I don't know if the cache maintenance ops are even
permitted in an ARM VM.

Jason


Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-14 Thread Jason Gunthorpe via iommu
On Thu, Oct 14, 2021 at 03:33:21PM +1100, da...@gibson.dropbear.id.au wrote:

> > If the HW can attach multiple non-overlapping IOAS's to the same
> > device then the HW is routing to the correct IOAS by using the address
> > bits. This is not much different from the prior discussion we had
> > where we were thinking of the PASID as an 80 bit address
> 
> Ah... that might be a workable approach.  And it even helps me get my
> head around multiple attachment which I was struggling with before.
> 
> So, the rule would be that you can attach multiple IOASes to a device,
> as long as none of them overlap.  The non-overlapping could be because
> each IOAS covers a disjoint address range, or it could be because
> there's some attached information - such as a PASID - to disambiguate.

Right exactly - it is very parallel to PASID

And obviously HW support is required to have multiple page table
pointers per RID - which sounds like PPC does (high/low pointer?)
 
> What remains a question is where the disambiguating information comes
> from in each case: does it come from properties of the IOAS,
> properties of the device, or from extra parameters supplied at attach
> time.  IIUC, the current draft suggests it always comes at attach time
> for the PASID information.  Obviously the more consistency we can have
> here the better.

From a generic viewpoint I'd say all are fair game. It is up to the
IOMMU driver to take the requested set of IOAS's and the "at attachment"
information (like PASID) and decide what to do, or fail.
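
Purely to illustrate - every name below is invented and this is not a
proposal - the "at attachment" information would just ride along with the
attach call while the IOAS object itself stays generic:

struct iommufd_device_attach_ioas {		/* hypothetical */
	__u32	argsz;
	__u32	flags;
#define IOMMUFD_ATTACH_FLAG_PASID	(1 << 0)
	__u32	ioasid;		/* which IOAS to attach */
	__u32	pasid;		/* valid only with IOMMUFD_ATTACH_FLAG_PASID */
};

The IOMMU driver sees the full set of IOAS's already attached plus this
per-attach information and either programs the HW or fails the attach.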

> I can also see an additional problem in implementation, once we start
> looking at hot-adding devices to existing address spaces.  

I won't pretend to guess how to implement this :) Just from a modeling
perspective it is something that works logically. If the kernel
implementation is too hard then PPC should do one of the other ideas.

Personally I'd probably try for a nice multi-domain attachment model
like PASID and not try to create/destroy domains.

As I said in my last email I think it is up to each IOMMU HW driver to
make these decisions, the iommufd framework just provides a
standardized API toward the attaching driver that the IOMMU HW must
fit into.

Jason


Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-14 Thread Jason Gunthorpe via iommu
On Thu, Oct 14, 2021 at 03:53:33PM +1100, David Gibson wrote:

> > My feeling is that qemu should be dealing with the host != target
> > case, not the kernel.
> > 
> > The kernel's job should be to expose the IOMMU HW it has, with all
> > features accessible, to userspace.
> 
> See... to me this is contrary to the point we agreed on above.

I'm not thinking of these as exclusive ideas.

The IOCTL interface in iommufd can quite happily expose all of the
following IOCTL commands together:
 - Create IOAS generically
 - Manipulate IOAS generically
 - Create IOAS with IOMMU driver specific attributes
 - HW specific IOAS manipulation

So long as everything is focused on a generic in-kernel IOAS object it
is fine to have multiple ways in the uAPI to create and manipulate the
objects.

When I speak about a generic interface I mean "Create IOAS
generically" - ie a set of IOCTLs that work on most IOMMU HW and can
be relied upon by things like DPDK/etc to always work and be portable.
This is why I like "hints" to provide some limited widely applicable
micro-optimization.

When I said "expose the IOMMU HW it has with all features accessible"
I mean also providing "Create IOAS with IOMMU driver specific
attributes".

These other IOCTLs would allow the IOMMU driver to expose every
configuration knob its HW has, in a natural HW centric language.
There is no pretense of genericness here, no crazy foo=A, foo=B hidden
device specific interface.

Think of it as a high level/low level interface to the same thing.
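
As a purely illustrative sketch (every name below is invented, none of
this is a proposal), the command set could look like:

enum {
	/* generic: portable userspace like DPDK relies only on these */
	IOMMUFD_CMD_IOAS_ALLOC,
	IOMMUFD_CMD_IOAS_MAP,
	IOMMUFD_CMD_IOAS_UNMAP,

	/* HW specific: used by a userspace driver that knows the exact HW */
	IOMMUFD_CMD_IOAS_ALLOC_DRIVER,	/* alloc with driver specific attrs */
	IOMMUFD_CMD_IOAS_DRIVER_OP,	/* e.g. invalidate a user page table */
};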

> Those are certainly wrong, but they came about explicitly by *not*
> being generic rather than by being too generic.  So I'm really
> confused aso to what you're arguing for / against.

IMHO it is not having a PPC specific interface that was the problem,
it was making the PPC specific interface exclusive, in place of the type 1
interface. If type 1 had continued to work on PPC then DPDK/etc would
never have needed to learn PPC specific code.

For iommufd with the high/low interface each IOMMU HW should ask basic
questions:

 - What should the generic high level interface do on this HW?
   For instance what should 'Create IOAS generically' do for PPC?
   It should not fail; it should create *something*.
   What is the best thing for DPDK?
   I guess the 64 bit window is most broadly useful.

 - How to accurately describe the HW in terms of standard IOAS objects
   and where to put HW specific structs to support this.

   This is where PPC would decide how best to expose a control over
   its low/high window (eg 1,2,3 IOAS). Whatever the IOMMU driver
   wants, so long as it fits into the kernel IOAS model facing the
   connected device driver.

QEMU would have IOMMU userspace drivers. One would be the "generic
driver" using only the high level generic interface. It should work as
best it can on all HW devices. This is the fallback path you talked
of.

QEMU would also have HW specific IOMMU userspace drivers that know how
to operate the exact HW. eg these drivers would know how to use
userspace page tables, how to form IOPTEs and how to access the
special features.

This is how QEMU could use an optimized path with nested page tables,
for instance.

Jason


Re: [PATCH RFC] virtio: wrap config->reset calls

2021-10-14 Thread Anton Yakovlev

On 13.10.2021 12:55, Michael S. Tsirkin wrote:
> This will enable cleanups down the road.
> The idea is to disable cbs, then add "flush_queued_cbs" callback
> as a parameter, this way drivers can flush any work
> queued after callbacks have been disabled.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>   sound/virtio/virtio_card.c | 4 ++--

Reviewed-by: Anton Yakovlev 

-- 
Anton Yakovlev
Senior Software Engineer

OpenSynergy GmbH
Rotherstr. 20, 10245 Berlin


Re: [RFC] iommu: Use put_pages_list

2021-10-14 Thread John Garry

On 14/10/2021 12:20, Matthew Wilcox wrote:
> I'm going to keep pinging this patch weekly.
> 
> On Thu, Oct 07, 2021 at 07:17:02PM +0100, Matthew Wilcox wrote:
> > ping?

Robin, were you checking this? You mentioned "I got
side-tracked trying to make io-pgtable use that freelist properly" in
another thread, which seems related.

Thanks,
John

> > On Thu, Sep 30, 2021 at 05:20:42PM +0100, Matthew Wilcox (Oracle) wrote:
> > > page->freelist is for the use of slab.  We already have the ability
> > > to free a list of pages in the core mm, but it requires the use of a
> > > list_head and for the pages to be chained together through page->lru.
> > > Switch the iommu code over to using put_pages_list().
> > > 
> > > Signed-off-by: Matthew Wilcox (Oracle) 
> > > ---
> > >   drivers/iommu/amd/io_pgtable.c | 99 +++---
> > >   drivers/iommu/dma-iommu.c  | 11 +---
> > >   drivers/iommu/intel/iommu.c| 89 +++---
> > >   include/linux/iommu.h  |  3 +-
> > >   4 files changed, 77 insertions(+), 125 deletions(-)




Re: [RFC] iommu: Use put_pages_list

2021-10-14 Thread Matthew Wilcox
I'm going to keep pinging this patch weekly.

On Thu, Oct 07, 2021 at 07:17:02PM +0100, Matthew Wilcox wrote:
> ping?
> 
> On Thu, Sep 30, 2021 at 05:20:42PM +0100, Matthew Wilcox (Oracle) wrote:
> > page->freelist is for the use of slab.  We already have the ability
> > to free a list of pages in the core mm, but it requires the use of a
> > list_head and for the pages to be chained together through page->lru.
> > Switch the iommu code over to using put_pages_list().
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) 
> > ---
> >  drivers/iommu/amd/io_pgtable.c | 99 +++---
> >  drivers/iommu/dma-iommu.c  | 11 +---
> >  drivers/iommu/intel/iommu.c| 89 +++---
> >  include/linux/iommu.h  |  3 +-
> >  4 files changed, 77 insertions(+), 125 deletions(-)
> > 
> > diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c
> > index 182c93a43efd..8dfa6ee58b76 100644
> > --- a/drivers/iommu/amd/io_pgtable.c
> > +++ b/drivers/iommu/amd/io_pgtable.c
> > @@ -74,49 +74,37 @@ static u64 *first_pte_l7(u64 *pte, unsigned long *page_size,
> >   *
> >   ****************************************************************************/
> >  
> > -static void free_page_list(struct page *freelist)
> > -{
> > -	while (freelist != NULL) {
> > -		unsigned long p = (unsigned long)page_address(freelist);
> > -
> > -		freelist = freelist->freelist;
> > -		free_page(p);
> > -	}
> > -}
> > -
> > -static struct page *free_pt_page(unsigned long pt, struct page *freelist)
> > +static void free_pt_page(unsigned long pt, struct list_head *list)
> >  {
> >  	struct page *p = virt_to_page((void *)pt);
> >  
> > -	p->freelist = freelist;
> > -
> > -	return p;
> > +	list_add_tail(&p->lru, list);
> >  }
> >  
> >  #define DEFINE_FREE_PT_FN(LVL, FN)						\
> > -static struct page *free_pt_##LVL (unsigned long __pt, struct page *freelist)	\
> > -{										\
> > -	unsigned long p;							\
> > -	u64 *pt;								\
> > -	int i;									\
> > -										\
> > -	pt = (u64 *)__pt;							\
> > -										\
> > -	for (i = 0; i < 512; ++i) {						\
> > -		/* PTE present? */						\
> > -		if (!IOMMU_PTE_PRESENT(pt[i]))					\
> > -			continue;						\
> > -										\
> > -		/* Large PTE? */						\
> > -		if (PM_PTE_LEVEL(pt[i]) == 0 ||					\
> > -		    PM_PTE_LEVEL(pt[i]) == 7)					\
> > -			continue;						\
> > -										\
> > -		p = (unsigned long)IOMMU_PTE_PAGE(pt[i]);			\
> > -		freelist = FN(p, freelist);					\
> > -	}									\
> > -										\
> > -	return free_pt_page((unsigned long)pt, freelist);			\
> > +static void free_pt_##LVL (unsigned long __pt, struct list_head *list)	\
> > +{										\
> > +	unsigned long p;							\
> > +	u64 *pt;								\
> > +	int i;									\
> > +										\
> > +	pt = (u64 *)__pt;							\
> > +										\
> > +	for (i = 0; i < 512; ++i) {						\
> > +		/* PTE present? */						\
> > +		if (!IOMMU_PTE_PRESENT(pt[i]))					\
> > +			continue;						\
> > +										\
> > +		/* Large PTE? */						\
> > +		if 

RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO

2021-10-14 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, October 1, 2021 6:24 AM
> 
> On Thu, Sep 30, 2021 at 09:35:45AM +, Tian, Kevin wrote:
> 
> > > The Intel functional issue is that Intel blocks the cache maintenance
> > > ops from the VM and the VM has no way to self-discover that the cache
> > > maintenance ops don't work.
> >
> > the VM doesn't need to know whether the maintenance ops
> > actually work.
> 
> Which is the whole problem.
> 
> Intel has a design where the device driver tells the device to issue
> non-cachable TLPs.
> 
> The driver is supposed to know if it can issue the cache maintenance
> instructions - if it can then it should ask the device to issue
> no-snoop TLPs.
> 
> For instance the same PCI driver on non-x86 should never ask the
> device to issue no-snoop TLPs because it has no idea how to restore
> cache coherence on eg ARM.
> 
> Do you see the issue? This configuration where the hypervisor silently
> makes wbinvd a NOP breaks the x86 architecture because the guest has no
> idea it can no longer use no-snoop features.

Thanks for the explanation. But I still have one puzzle about the 'break'
part.

If the hypervisor makes wbinvd a NOP then it will also set the enforce-snoop
bit in the PTE to convert non-snoop packets to snoop. No function in the
guest is broken; only performance may lag.

If performance matters then the hypervisor configures the IOMMU to allow
non-snoop packets and then emulates wbinvd properly.

The contract between vfio and kvm is to convey the above requirement on
how wbinvd is handled (sketched below).

But in both cases cache maintenance instructions are available from 
guest p.o.v and no coherency semantics would be violated.
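
To be concrete about that vfio/kvm contract: today it is conveyed through
the kvm-vfio device. The sketch below is a paraphrase of the existing logic
rather than exact kernel code - when a group whose IOMMU domain does not
enforce snooping is added, KVM is told there is non-coherent DMA and starts
emulating wbinvd in the guest instead of treating it as a NOP:

#include <linux/kvm_host.h>

static void kvm_vfio_update_coherency_sketch(struct kvm *kvm, bool snoop_enforced)
{
	if (!snoop_enforced)
		kvm_arch_register_noncoherent_dma(kvm);		/* emulate wbinvd */
	else
		kvm_arch_unregister_noncoherent_dma(kvm);	/* wbinvd can stay a NOP */
}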

> 
> Using the IOMMU to forcibly prevent the device from issuing no-snoop
> makes this whole issue of the broken wbinvd moot.

It doesn't prevent issuing them. Instead, the IOMMU converts non-snoop
requests to snoop.

> 
> It is important to be really clear on what this is about - this is not
> some idealized nice iommu feature - it is working around a lot of
> backwards compatibility baggage that is probably completely unique to
> x86.
> 
> > > Other arches don't seem to have this specific problem...
> >
> > I think the key is whether other archs allow driver to decide DMA
> > coherency and indirectly the underlying I/O page table format.
> > If yes, then I don't see a reason why such decision should not be
> > given to userspace for passthrough case.
> 
> The choice all comes down to if the other arches have cache
> maintenance instructions in the VM that *don't work*
> 

It looks like vfio always sets IOMMU_CACHE on all platforms as long as
the iommu supports it (true on all platforms except the Intel IOMMU
instances dedicated to integrated GPUs):

vfio_iommu_type1_attach_group()
{
...
if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
domain->prot |= IOMMU_CACHE;
...
}

Should the above be set according to whether the device is coherent?

Thanks
Kevin


Re: [PATCH RFC] virtio: wrap config->reset calls

2021-10-14 Thread Jean-Philippe Brucker
On Wed, Oct 13, 2021 at 06:55:31AM -0400, Michael S. Tsirkin wrote:
> This will enable cleanups down the road.
> The idea is to disable cbs, then add "flush_queued_cbs" callback
> as a parameter, this way drivers can flush any work
> queued after callbacks have been disabled.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---

>  drivers/iommu/virtio-iommu.c   | 2 +-

Reviewed-by: Jean-Philippe Brucker 


Re: [PATCH] iommu/arm: fix ARM_SMMU_QCOM compilation

2021-10-14 Thread Will Deacon
On Wed, Oct 13, 2021 at 09:31:40PM +0200, Arnd Bergmann wrote:
> On Wed, Oct 13, 2021 at 6:20 PM Will Deacon  wrote:
> > On Wed, Oct 13, 2021 at 10:33:55AM +0200, Arnd Bergmann wrote:
> > > On Wed, Oct 13, 2021 at 9:58 AM Will Deacon  wrote:
> > > > On Tue, Oct 12, 2021 at 05:18:00PM +0200, Arnd Bergmann wrote:
> 
> > > I was hoping you and Joerg could just pick your preferred patch
> > > into the iommu fixes tree for v5.15.
> > >
> > > I currently have nothing else pending for my asm-generic tree that
> > > introduced the regression, but I can take it through there if that helps
> > > you.
> >
> > I also don't have any fixes pending, and I don't see any in Joerg's tree so
> > it's probably quickest if you send it on yourself. Is that ok?
> 
> Sure, no problem. I ended up adding it to the arm/fixes branch of the
> soc tree, as I just merged some other fixes there, and it seems as good
> as any of the other trees.

Thanks, Arnd!

Will


RE: [RFC 13/20] iommu: Extend iommu_at[de]tach_device() for multiple devices group

2021-10-14 Thread Tian, Kevin
> From: David Gibson 
> Sent: Thursday, October 14, 2021 1:24 PM
> 
> On Sun, Sep 19, 2021 at 02:38:41PM +0800, Liu Yi L wrote:
> > From: Lu Baolu 
> >
> > These two helpers could be used when 1) the iommu group is singleton,
> > or 2) the upper layer has put the iommu group into the secure state by
> > calling iommu_device_init_user_dma().
> >
> > As we want the iommufd design to be a device-centric model, we want to
> > remove any group knowledge in iommufd. Given that we already have
> > iommu_at[de]tach_device() interface, we could extend it for iommufd
> > simply by doing below:
> >
> >  - first device in a group does group attach;
> >  - last device in a group does group detach.
> >
> > as long as the group has been put into the secure context.
> >
> > The commit <426a273834eae> ("iommu: Limit iommu_attach/detach_device to
> > device with their own group") deliberately restricts the two interfaces
> > to single-device group. To avoid the conflict with existing usages, we
> > keep this policy and put the new extension only when the group has been
> > marked for user_dma.
> 
> I still kind of hate this interface because it means an operation that
> appears to be explicitly on a single device has an implicit effect on
> other devices.
> 

I still don't get your concern about why it's such a big deal. With this
proposal the group restriction will be explicitly documented in the attach
uAPI comment and in the sample flow in iommufd.rst. A sane user should read
all that information to understand how this new sub-system works and
follow whatever constraints are stated there. In the end the user has to
maintain the same group knowledge regardless of whether it uses an
explicit group uAPI or a device uAPI that carries a group constraint...

Thanks,
Kevin


RE: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-14 Thread Tian, Kevin
> From: David Gibson 
> Sent: Thursday, October 14, 2021 1:00 PM
> 
> On Wed, Oct 13, 2021 at 07:00:58AM +, Tian, Kevin wrote:
> > > From: David Gibson
> > > Sent: Friday, October 1, 2021 2:11 PM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > > This patch adds IOASID allocation/free interface per iommufd. When
> > > > allocating an IOASID, userspace is expected to specify the type and
> > > > format information for the target I/O page table.
> > > >
> > > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > > semantics. For this type the user should specify the addr_width of
> > > > the I/O address space and whether the I/O page table is created in
> > > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > > as the false setting requires additional contract with KVM on handling
> > > > WBINVD emulation, which can be added later.
> > > >
> > > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > > for what formats can be specified when allocating an IOASID.
> > > >
> > > > Open:
> > > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > > >   Per previous discussion they can also use vfio type1v2 as long as there
> > > >   is a way to claim a specific iova range from a system-wide address space.
> > > >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> > > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > > >   adopted this design yet. We hope to have formal alignment in v1 discussion
> > > >   and then decide how to incorporate it in v2.
> > >
> > > Ok, there are several things we need for ppc.  None of which are
> > > inherently ppc specific and some of which will I think be useful for
> > > most platforms.  So, starting from most general to most specific
> > > here's basically what's needed:
> > >
> > > 1. We need to represent the fact that the IOMMU can only translate
> > >*some* IOVAs, not a full 64-bit range.  You have the addr_width
> > >already, but I'm not entirely sure if the translatable range on ppc
> > >(or other platforms) is always a power-of-2 size.  It usually will
> > >be, of course, but I'm not sure that's a hard requirement.  So
> > >using a size/max rather than just a number of bits might be safer.
> > >
> > >I think basically every platform will need this.  Most platforms
> > >don't actually implement full 64-bit translation in any case, but
> > >rather some smaller number of bits that fits their page table
> > >format.
> > >
> > > 2. The translatable range of IOVAs may not begin at 0.  So we need to
> > >advertise to userspace what the base address is, as well as the
> > >size.  POWER's main IOVA range begins at 2^59 (at least on the
> > >models I know about).
> > >
> > >I think a number of platforms are likely to want this, though I
> > >couldn't name them apart from POWER.  Putting the translated IOVA
> > >window at some huge address is a pretty obvious approach to making
> > >an IOMMU which can translate a wide address range without colliding
> > >with any legacy PCI addresses down low (the IOMMU can check if this
> > >transaction is for it by just looking at some high bits in the
> > >address).
> > >
> > > 3. There might be multiple translatable ranges.  So, on POWER the
> > >IOMMU can typically translate IOVAs from 0..2GiB, and also from
> > >2^59..2^59+.  The two ranges have completely separate IO
> > >page tables, with (usually) different layouts.  (The low range will
> > >nearly always be a single-level page table with 4kiB or 64kiB
> > >entries, the high one will be multiple levels depending on the size
> > >of the range and pagesize).
> > >
> > >This may be less common, but I suspect POWER won't be the only
> > >platform to do something like this.  As above, using a high range
> > >is a pretty obvious approach, but clearly won't handle older
> > >devices which can't do 64-bit DMA.  So adding a smaller range for
> > >those devices is again a pretty obvious solution.  Any platform
> > >with an "IO hole" can be treated as having two ranges, one below
> > >the hole and one above it (although in that case they may well not
> > >have separate page tables
> >
> > 1-3 are common on all platforms with fixed reserved ranges. Current
> > vfio already reports permitted iova ranges to user via VFIO_IOMMU_
> > TYPE1_INFO_CAP_IOVA_RANGE and the user is expected to construct
> > maps only in those ranges. iommufd can follow the same logic for the
> > baseline uAPI.
> >
> > For above cases a [base, max] hint can be provided by the user per
> > Jason's recommendation.
> 
> Provided at which stage?

IOMMU_IOASID_ALLOC

> 
> > It is a hint as no additional 

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-14 Thread David Gibson
On Wed, Oct 13, 2021 at 07:00:58AM +, Tian, Kevin wrote:
> > From: David Gibson
> > Sent: Friday, October 1, 2021 2:11 PM
> > 
> > On Sun, Sep 19, 2021 at 02:38:39PM +0800, Liu Yi L wrote:
> > > This patch adds IOASID allocation/free interface per iommufd. When
> > > allocating an IOASID, userspace is expected to specify the type and
> > > format information for the target I/O page table.
> > >
> > > This RFC supports only one type (IOMMU_IOASID_TYPE_KERNEL_TYPE1V2),
> > > implying a kernel-managed I/O page table with vfio type1v2 mapping
> > > semantics. For this type the user should specify the addr_width of
> > > the I/O address space and whether the I/O page table is created in
> > > an iommu enforce_snoop format. enforce_snoop must be true at this point,
> > > as the false setting requires additional contract with KVM on handling
> > > WBINVD emulation, which can be added later.
> > >
> > > Userspace is expected to call IOMMU_CHECK_EXTENSION (see next patch)
> > > for what formats can be specified when allocating an IOASID.
> > >
> > > Open:
> > > - Devices on PPC platform currently use a different iommu driver in vfio.
> > >   Per previous discussion they can also use vfio type1v2 as long as there
> > >   is a way to claim a specific iova range from a system-wide address space.
> > >   This requirement doesn't sound PPC specific, as addr_width for pci devices
> > >   can be also represented by a range [0, 2^addr_width-1]. This RFC hasn't
> > >   adopted this design yet. We hope to have formal alignment in v1 discussion
> > >   and then decide how to incorporate it in v2.
> > 
> > Ok, there are several things we need for ppc.  None of which are
> > inherently ppc specific and some of which will I think be useful for
> > most platforms.  So, starting from most general to most specific
> > here's basically what's needed:
> > 
> > 1. We need to represent the fact that the IOMMU can only translate
> >*some* IOVAs, not a full 64-bit range.  You have the addr_width
> >already, but I'm not entirely sure if the translatable range on ppc
> >(or other platforms) is always a power-of-2 size.  It usually will
> >be, of course, but I'm not sure that's a hard requirement.  So
> >using a size/max rather than just a number of bits might be safer.
> > 
> >I think basically every platform will need this.  Most platforms
> >don't actually implement full 64-bit translation in any case, but
> >rather some smaller number of bits that fits their page table
> >format.
> > 
> > 2. The translatable range of IOVAs may not begin at 0.  So we need to
> >advertise to userspace what the base address is, as well as the
> >size.  POWER's main IOVA range begins at 2^59 (at least on the
> >models I know about).
> > 
> >I think a number of platforms are likely to want this, though I
> >couldn't name them apart from POWER.  Putting the translated IOVA
> >window at some huge address is a pretty obvious approach to making
> >an IOMMU which can translate a wide address range without colliding
> >with any legacy PCI addresses down low (the IOMMU can check if this
> >transaction is for it by just looking at some high bits in the
> >address).
> > 
> > 3. There might be multiple translatable ranges.  So, on POWER the
> >IOMMU can typically translate IOVAs from 0..2GiB, and also from
> >2^59..2^59+.  The two ranges have completely separate IO
> >page tables, with (usually) different layouts.  (The low range will
> >nearly always be a single-level page table with 4kiB or 64kiB
> >entries, the high one will be multiple levels depending on the size
> >of the range and pagesize).
> > 
> >This may be less common, but I suspect POWER won't be the only
> >platform to do something like this.  As above, using a high range
> >is a pretty obvious approach, but clearly won't handle older
> >devices which can't do 64-bit DMA.  So adding a smaller range for
> >those devices is again a pretty obvious solution.  Any platform
> >with an "IO hole" can be treated as having two ranges, one below
> >the hole and one above it (although in that case they may well not
> >have separate page tables
> 
> 1-3 are common on all platforms with fixed reserved ranges. Current
> vfio already reports permitted iova ranges to user via VFIO_IOMMU_
> TYPE1_INFO_CAP_IOVA_RANGE and the user is expected to construct
> maps only in those ranges. iommufd can follow the same logic for the
> baseline uAPI.
> 
> For above cases a [base, max] hint can be provided by the user per
> Jason's recommendation.

Provided at which stage?

> It is a hint as no additional restriction is
> imposed,

For the qemu type use case, that's not true.  In that case we
*require* the available mapping ranges to match what the guest
platform expects.

> since the kernel only cares about no violation on permitted
> ranges that it reports to the user. Underlying iommu 

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-14 Thread da...@gibson.dropbear.id.au
On Mon, Oct 11, 2021 at 02:17:48PM -0300, Jason Gunthorpe wrote:
> On Mon, Oct 11, 2021 at 04:37:38PM +1100, da...@gibson.dropbear.id.au wrote:
> > > PASID support will already require that a device can be multi-bound to
> > > many IOAS's, couldn't PPC do the same with the windows?
> > 
> > I don't see how that would make sense.  The device has no awareness of
> > multiple windows the way it does of PASIDs.  It just sends
> > transactions over the bus with the IOVAs it's told.  If those IOVAs
> > lie within one of the windows, the IOMMU picks them up and translates
> > them.  If they don't, it doesn't.
> 
> To my mind that address centric routing is awareness.

I don't really understand that position.  A PASID capable device has
to be built to be PASID capable, and will generally have registers
into which you store PASIDs to use.

Any 64-bit DMA capable device can use the POWER IOMMU just fine - it's
up to the driver to program it with addresses that will be translated
(and in Linux the driver will get those from the DMA subsystem).
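
To make that concrete, a driver-side sketch (not taken from any particular
driver): the driver never picks bus addresses itself, the DMA API hands back
whatever IOVA the platform IOMMU - TCE tables on POWER - will translate:

#include <linux/dma-mapping.h>

static int example_start_dma(struct device *dev, void *buf, size_t len)
{
	dma_addr_t iova = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

	if (dma_mapping_error(dev, iova))
		return -ENOMEM;

	/* program 'iova' (not the CPU address of 'buf') into the device */
	return 0;
}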

> If the HW can attach multiple non-overlapping IOAS's to the same
> device then the HW is routing to the correct IOAS by using the address
> bits. This is not much different from the prior discussion we had
> where we were thinking of the PASID as an 80 bit address

Ah... that might be a workable approach.  And it even helps me get my
head around multiple attachment which I was struggling with before.

So, the rule would be that you can attach multiple IOASes to a device,
as long as none of them overlap.  The non-overlapping could be because
each IOAS covers a disjoint address range, or it could be because
there's some attached information - such as a PASID - to disambiguate.

What remains a question is where the disambiguating information comes
from in each case: does it come from properties of the IOAS,
properties of the device, or from extra parameters supplied at attach
time.  IIUC, the current draft suggests it always comes at attach time
for the PASID information.  Obviously the more consistency we can have
here the better.


I can also see an additional problem in implementation, once we start
looking at hot-adding devices to existing address spaces.  Suppose our
software (maybe qemu) wants to set up a single DMA view for a bunch of
devices, that has such a split window.  It can set up IOASes easily
enough for the two windows, then it needs to attach them.  Presumably,
it attaches them one at a time, which means that each device (or
group) goes through an interim state where it's attached to one, but
not the other.  That can probably be achieved by using an extra IOMMU
domain (or the local equivalent) in the hardware for that interim
state.  However it means we have to repeatedly create and destroy that
extra domain for each device after the first we add, rather than
simply adding each device to the domain which has both windows.

[I think this doesn't arise on POWER when running under PowerVM.  That
 has no concept like IOMMU domains, and instead the mapping is always
 done per "partitionable endpoint" (PE), essentially a group.  That
 means it's just a question of whether we mirror mappings on both
 windows into a given PE or just those from one IOAS.  It's not an
 unreasonable extension/combination of existing hardware quirks to
 consider, though]

> The fact the PPC HW actually has multiple page table roots and those
> roots even have different page table layouts while still connected to
> the same device suggests this is not even an unnatural modelling
> approach...
> 
> Jason  
> 
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-14 Thread David Gibson
On Mon, Oct 11, 2021 at 03:49:14PM -0300, Jason Gunthorpe wrote:
> On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:
> 
> > > This means we cannot define an input that has a magic HW specific
> > > value.
> > 
> > I'm not entirely sure what you mean by that.
> 
> I mean if you make a general property 'foo' that userspace must
> specify correctly then your API isn't general anymore. Userspace must
> know if it is A or B HW to set foo=A or foo=B.

I absolutely agree.  Which is exactly why I'm advocating that
userspace should request from the kernel what it needs (providing a
*minimum* of information) and the kernel satisfies that (filling in
the missing information as suitable for the platform) or outright
fails.

I think that is more robust across multiple platforms and usecases
than advertising a bunch of capabilities and forcing userspace to
interpret those to work out what it can do.

> Supported IOVA ranges are easially like that as every IOMMU is
> different. So DPDK shouldn't provide such specific or binding
> information.

Absolutely, DPDK should not provide that.  qemu *should* provide that,
because the specific IOVAs matter to the guest.  That will inevitably
mean that the request is more likely to fail, but that's a fundamental
tradeoff.

> > No, I don't think that needs to be a condition.  I think it's
> > perfectly reasonable for a constraint to be given, and for the host
> > IOMMU to just say "no, I can't do that".  But that does mean that each
> > of these values has to have an explicit way of userspace specifying "I
> > don't care", so that the kernel will select a suitable value for those
> > instead - that's what DPDK or other userspace would use nearly all the
> > time.
> 
> My feeling is that qemu should be dealing with the host != target
> case, not the kernel.
> 
> The kernel's job should be to expose the IOMMU HW it has, with all
> features accessible, to userspace.

See... to me this is contrary to the point we agreed on above.

> Qemu's job should be to have a userspace driver for each kernel IOMMU
> and the internal infrastructure to make accelerated emulations for all
> supported target IOMMUs.

This seems the wrong way around to me.  I see qemu as providing logic
to emulate each target IOMMU.  Where that matches the host, there's
the potential for an accelerated implementation, but it makes life a
lot easier if we can at least have a fallback that will work on any
sufficiently capable host IOMMU.

> In other words, it is not the kernel's job to provide target IOMMU
> emulation.

Absolutely not.  But it *is* the kernel's job to let qemu do as much
as it can with the *host* IOMMU.

> The kernel should provide truely generic "works everywhere" interface
> that qemu/etc can rely on to implement the least accelerated emulation
> path.

Right... seems like we're agreeing again.

> So when I see proposals to have "generic" interfaces that actually
> require very HW specific setup, and cannot be used by a generic qemu
> userspace driver, I think it breaks this model. If qemu needs to know
> it is on PPC (as it does today with VFIO's PPC specific API) then it
> may as well speak PPC specific language and forget about pretending to
> be generic.

Absolutely, the current situation is a mess.

> This approach is grounded in 15 years of trying to build these
> user/kernel split HW subsystems (particularly RDMA) where it has
> become painfully obvious that the kernel is the worst place to try and
> wrangle really divergent HW into a "common" uAPI.
> 
> This is because the kernel/user boundary is fixed. Introducing
> anything generic here requires a lot of time, thought, arguing and
> risk. Usually it ends up being done wrong (like the PPC specific
> ioctls, for instance)

Those are certainly wrong, but they came about explicitly by *not*
being generic rather than by being too generic.  So I'm really
confused as to what you're arguing for / against.

> and when this happens we can't learn and adapt,
> we are stuck with stable uABI forever.
> 
> Exposing a device's native programming interface is much simpler. Each
> device is fixed, defined and someone can sit down and figure out how
> to expose it. Then that is it, it doesn't need revisiting, it doesn't
> need harmonizing with a future slightly different device, it just
> stays as is.

I can certainly see the case for that approach.  That seems utterly at
odds with what /dev/iommu is trying to do, though.

> The cost, is that there must be a userspace driver component for each
> HW piece - which we are already paying here!
> 
> > Ideally the host /dev/iommu will say "ok!", since both those ranges
> > are within the 0..2^60 translated range of the host IOMMU, and don't
> > touch the IO hole.  When the guest calls the IO mapping hypercalls,
> > qemu translates those into DMA_MAP operations, and since they're all
> > within the previously verified windows, they should work fine.
> 
> For instance, we are going to see HW with nested 

Re: [RFC 11/20] iommu/iommufd: Add IOMMU_IOASID_ALLOC/FREE

2021-10-14 Thread David Gibson
On Mon, Oct 11, 2021 at 09:49:57AM +0100, Jean-Philippe Brucker wrote:
> On Mon, Oct 11, 2021 at 05:02:01PM +1100, David Gibson wrote:
> > qemu wants to emulate a PAPR vIOMMU, so it says (via interfaces yet to
> > be determined) that it needs an IOAS where things can be mapped in the
> > range 0..2GiB (for the 32-bit window) and 2^59..2^59+1TiB (for the
> > 64-bit window).
> > 
> > Ideally the host /dev/iommu will say "ok!", since both those ranges
> > are within the 0..2^60 translated range of the host IOMMU, and don't
> > touch the IO hole.  When the guest calls the IO mapping hypercalls,
> > qemu translates those into DMA_MAP operations, and since they're all
> > within the previously verified windows, they should work fine.
> 
> Seems like we don't need the negotiation part?  The host kernel
> communicates available IOVA ranges to userspace including holes (patch
> 17), and userspace can check that the ranges it needs are within the IOVA
> space boundaries. That part is necessary for DPDK as well since it needs
> to know about holes in the IOVA space where DMA wouldn't work as expected
> (MSI doorbells for example). And there already is a negotiation happening,
> when the host kernel rejects MAP ioctl outside the advertised area.

The problem with the approach where the kernel advertises and
userspace selects based on that is that it locks us into a specific
representation of what's possible.  If we get new hardware with new
weird constraints that can't be expressed with the representation we
chose, we're kind of stuffed.  Userspace will have to change to
accommodate the new extension to have any chance of working on the new
hardware.

With the model where userspace requests, and the kernel acks or nacks,
we can still support existing userspace if the only things it requests
can still be accommodated in the new constraints.  That's pretty likely
if the majority of userspaces request very simple things (say a single
IOVA block where it doesn't care about the base address).
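
For context, the advertisement-style interface Jean-Philippe refers to is
already there in vfio type1. Roughly how userspace reads it today - the
ioctl, structure and capability names are from include/uapi/linux/vfio.h
of this era, with error handling trimmed:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static void print_iova_ranges(int container_fd)
{
	struct vfio_iommu_type1_info probe = { .argsz = sizeof(probe) };
	struct vfio_iommu_type1_info *info;
	struct vfio_info_cap_header *cap;

	/* first call only reports how large the full info (with caps) is */
	ioctl(container_fd, VFIO_IOMMU_GET_INFO, &probe);

	info = calloc(1, probe.argsz);
	info->argsz = probe.argsz;
	ioctl(container_fd, VFIO_IOMMU_GET_INFO, info);

	if (!(info->flags & VFIO_IOMMU_INFO_CAPS) || !info->cap_offset)
		goto out;

	/* walk the capability chain; offsets are from the start of info */
	for (cap = (void *)info + info->cap_offset; ;
	     cap = (void *)info + cap->next) {
		if (cap->id == VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE) {
			struct vfio_iommu_type1_info_cap_iova_range *r = (void *)cap;
			__u32 i;

			for (i = 0; i < r->nr_iovas; i++)
				printf("usable IOVA: 0x%llx..0x%llx\n",
				       (unsigned long long)r->iova_ranges[i].start,
				       (unsigned long long)r->iova_ranges[i].end);
		}
		if (!cap->next)
			break;
	}
out:
	free(info);
}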

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC 13/20] iommu: Extend iommu_at[de]tach_device() for multiple devices group

2021-10-14 Thread David Gibson
On Sun, Sep 19, 2021 at 02:38:41PM +0800, Liu Yi L wrote:
> From: Lu Baolu 
> 
> These two helpers could be used when 1) the iommu group is singleton,
> or 2) the upper layer has put the iommu group into the secure state by
> calling iommu_device_init_user_dma().
> 
> As we want the iommufd design to be a device-centric model, we want to
> remove any group knowledge in iommufd. Given that we already have
> iommu_at[de]tach_device() interface, we could extend it for iommufd
> simply by doing below:
> 
>  - first device in a group does group attach;
>  - last device in a group does group detach.
> 
> as long as the group has been put into the secure context.
> 
> The commit <426a273834eae> ("iommu: Limit iommu_attach/detach_device to
> device with their own group") deliberately restricts the two interfaces
> to single-device group. To avoid the conflict with existing usages, we
> keep this policy and put the new extension only when the group has been
> marked for user_dma.

I still kind of hate this interface because it means an operation that
appears to be explicitly on a single device has an implicit effect on
other devices.

> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/iommu.c | 25 +
>  1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index bffd84e978fb..b6178997aef1 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -47,6 +47,7 @@ struct iommu_group {
>   struct list_head entry;
>   unsigned long user_dma_owner_id;
>   refcount_t owner_cnt;
> + refcount_t attach_cnt;
>  };
>  
>  struct group_device {
> @@ -1994,7 +1995,7 @@ static int __iommu_attach_device(struct iommu_domain *domain,
>  int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
>  {
>   struct iommu_group *group;
> - int ret;
> + int ret = 0;
>  
>   group = iommu_group_get(dev);
>   if (!group)
> @@ -2005,11 +2006,23 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
>* change while we are attaching
>*/
>   mutex_lock(&group->mutex);
> - ret = -EINVAL;
> - if (iommu_group_device_count(group) != 1)
> + if (group->user_dma_owner_id) {
> + if (group->domain) {
> + if (group->domain != domain)
> + ret = -EBUSY;
> + else
> + refcount_inc(&group->attach_cnt);
> +
> + goto out_unlock;
> + }
> + } else if (iommu_group_device_count(group) != 1) {

With this condition in the else, how can you ever attach the first
device of a multi-device group?

> + ret = -EINVAL;
>   goto out_unlock;
> + }
>  
>   ret = __iommu_attach_group(domain, group);
> + if (!ret && group->user_dma_owner_id)
> + refcount_set(&group->attach_cnt, 1);
>  
>  out_unlock:
>   mutex_unlock(&group->mutex);
> @@ -2261,7 +2274,10 @@ void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
>   return;
>  
>   mutex_lock(&group->mutex);
> - if (iommu_group_device_count(group) != 1) {
> + if (group->user_dma_owner_id) {
> + if (!refcount_dec_and_test(&group->attach_cnt))
> + goto out_unlock;
> + } else if (iommu_group_device_count(group) != 1) {

Shouldn't this path (detaching a thing that's not attached) be a no-op
regardless of whether it's a singleton group or not?  Why does one
deserve a warning and the other not?

>   WARN_ON(1);
>   goto out_unlock;
>   }
> @@ -3368,6 +3384,7 @@ static int iommu_group_init_user_dma(struct iommu_group *group,
>  
>   group->user_dma_owner_id = owner;
>   refcount_set(&group->owner_cnt, 1);
> + refcount_set(&group->attach_cnt, 0);
>  
>   /* default domain is unsafe for user-initiated dma */
>   if (group->domain == group->default_domain)

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

