Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-16 Thread Jason Gunthorpe
On Fri, Apr 16, 2021 at 10:23:32AM -0700, Jacob Pan wrote:

> Perhaps similar to cgroup v1 vs v2, it took a long time and with known
> limitations in v1.

cgroup v2 is still having transition problems, if anything it is a
cautionary tale to think really hard about uAPI because transitioning
can be really hard.

It might be very wise to make /dev/ioasid and /dev/vfio ioctl
compatible in some way so existing software has a smoother upgrade
path.

For instance by defining a default IOASID

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-16 Thread Jacob Pan
Hi Alex,

On Fri, 16 Apr 2021 09:45:47 -0600, Alex Williamson
 wrote:

> On Fri, 16 Apr 2021 06:12:58 -0700
> Jacob Pan  wrote:
> 
> > Hi Jason,
> > 
> > On Thu, 15 Apr 2021 20:07:32 -0300, Jason Gunthorpe 
> > wrote: 
> > > On Thu, Apr 15, 2021 at 03:11:19PM +0200, Auger Eric wrote:
> > > > Hi Jason,
> > > > 
> > > > On 4/1/21 6:03 PM, Jason Gunthorpe wrote:  
> > > > > On Thu, Apr 01, 2021 at 02:08:17PM +, Liu, Yi L wrote:
> > > > >   
> > > > >> DMA page faults are delivered to root-complex via page request
> > > > >> message and it is per-device according to PCIe spec. Page request
> > > > >> handling flow is:
> > > > >>
> > > > >> 1) iommu driver receives a page request from device
> > > > >> 2) iommu driver parses the page request message. Get the
> > > > >> RID,PASID, faulted page and requested permissions etc.
> > > > >> 3) iommu driver triggers fault handler registered by device
> > > > >> driver with iommu_report_device_fault()  
> > > > > 
> > > > > This seems confused.
> > > > > 
> > > > > The PASID should define how to handle the page fault, not the
> > > > > driver. 
> > > > 
> > > > In my series I don't use PASID at all. I am just enabling nested
> > > > stage and the guest uses a single context. I don't allocate any
> > > > user PASID at any point.
> > > > 
> > > > When there is a fault at physical level (a stage 1 fault that
> > > > concerns the guest), this latter needs to be reported and injected
> > > > into the guest. The vfio pci driver registers a fault handler to
> > > > the iommu layer and in that fault handler it fills a circ bugger
> > > > and triggers an eventfd that is listened to by the VFIO-PCI QEMU
> > > > device. this latter retrives the faault from the mmapped circ
> > > > buffer, it knowns which vIOMMU it is attached to, and passes the
> > > > fault to the vIOMMU. Then the vIOMMU triggers and IRQ in the guest.
> > > > 
> > > > We are reusing the existing concepts from VFIO, region, IRQ to do
> > > > that.
> > > > 
> > > > For that use case, would you also use /dev/ioasid?  
> > > 
> > > /dev/ioasid could do all the things you described vfio-pci as doing,
> > > it can even do them the same way you just described.
> > > 
> > > Stated another way, do you plan to duplicate all of this code someday
> > > for vfio-cxl? What about for vfio-platform? ARM SMMU can be hooked to
> > > platform devices, right?
> > > 
> > > I feel what you guys are struggling with is some choice in the iommu
> > > kernel APIs that cause the events to be delivered to the pci_device
> > > owner, not the PASID owner.
> > > 
> > > That feels solvable.
> > > 
> > Perhaps more of a philosophical question for you and Alex. There is no
> > doubt that the direction you guided for /dev/ioasid is a much cleaner
> > one, especially after VDPA emerged as another IOMMU backed framework.  
> 
> I think this statement answers all your remaining questions ;)
> 
> > The question is what do we do with the nested translation features that
> > have been targeting the existing VFIO-IOMMU for the last three years?
> > That predates VDPA. Shall we put a stop marker *after* nested support
> > and say no more extensions for VFIO-IOMMU, new features must be built
> > on this new interface?
> >
> > If we were to close a checkout line for some unforeseen reasons, should
> > we honor the customers already in line for a long time?
> > 
> > This is not a tactic or excuse for not working on the new /dev/ioasid
> > interface. In fact, I believe we can benefit from the lessons learned
> > while completing the existing. This will give confidence to the new
> > interface. Thoughts?  
> 
> I understand a big part of Jason's argument is that we shouldn't be in
> the habit of creating duplicate interfaces, we should create one, well
> designed interfaces to share among multiple subsystems.  As new users
> have emerged, our solution needs to change to a common one rather than
> a VFIO specific one.  The IOMMU uAPI provides an abstraction, but at
> the wrong level, requiring userspace interfaces for each subsystem.
> 
> Luckily the IOMMU uAPI is not really exposed as an actual uAPI, but
> that changes if we proceed to enable the interfaces to tunnel it
> through VFIO.
> 
> The logical answer would therefore be that we don't make that
> commitment to the IOMMU uAPI if we believe now that it's fundamentally
> flawed.
> 
I agree the uAPI data tunneling is definitely flawed in terms of
scalability.

I was just thinking it is still a small part of the overall
picture. Considering there are other parts such as fault reporting, user
space deployment, performance, and security. By completing the support on
the existing VFIO framework, it would at least offer a clear landscape where
the new /dev/ioasid can improve upon.

Perhaps similar to cgroup v1 vs v2, it took a long time and with known
limitations in v1.

Anyway, I am glad we have a clear direction now.

Thanks,

Jacob

> Ideally this new /dev/ioasid interface, and 

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-16 Thread Alex Williamson
On Fri, 16 Apr 2021 06:12:58 -0700
Jacob Pan  wrote:

> Hi Jason,
> 
> On Thu, 15 Apr 2021 20:07:32 -0300, Jason Gunthorpe  wrote:
> 
> > On Thu, Apr 15, 2021 at 03:11:19PM +0200, Auger Eric wrote:  
> > > Hi Jason,
> > > 
> > > On 4/1/21 6:03 PM, Jason Gunthorpe wrote:
> > > > On Thu, Apr 01, 2021 at 02:08:17PM +, Liu, Yi L wrote:
> > > > 
> > > >> DMA page faults are delivered to root-complex via page request
> > > >> message and it is per-device according to PCIe spec. Page request
> > > >> handling flow is:
> > > >>
> > > >> 1) iommu driver receives a page request from device
> > > >> 2) iommu driver parses the page request message. Get the RID,PASID,
> > > >> faulted page and requested permissions etc.
> > > >> 3) iommu driver triggers fault handler registered by device driver
> > > >> with iommu_report_device_fault()
> > > > 
> > > > This seems confused.
> > > > 
> > > > The PASID should define how to handle the page fault, not the driver.
> > > >
> > > 
> > > In my series I don't use PASID at all. I am just enabling nested stage
> > > and the guest uses a single context. I don't allocate any user PASID at
> > > any point.
> > > 
> > > When there is a fault at physical level (a stage 1 fault that concerns
> > > the guest), this latter needs to be reported and injected into the
> > > guest. The vfio pci driver registers a fault handler to the iommu layer
> > > and in that fault handler it fills a circ bugger and triggers an eventfd
> > > that is listened to by the VFIO-PCI QEMU device. this latter retrives
> > > the faault from the mmapped circ buffer, it knowns which vIOMMU it is
> > > attached to, and passes the fault to the vIOMMU.
> > > Then the vIOMMU triggers and IRQ in the guest.
> > > 
> > > We are reusing the existing concepts from VFIO, region, IRQ to do that.
> > > 
> > > For that use case, would you also use /dev/ioasid?
> > 
> > /dev/ioasid could do all the things you described vfio-pci as doing,
> > it can even do them the same way you just described.
> > 
> > Stated another way, do you plan to duplicate all of this code someday
> > for vfio-cxl? What about for vfio-platform? ARM SMMU can be hooked to
> > platform devices, right?
> > 
> > I feel what you guys are struggling with is some choice in the iommu
> > kernel APIs that cause the events to be delivered to the pci_device
> > owner, not the PASID owner.
> > 
> > That feels solvable.
> >   
> Perhaps more of a philosophical question for you and Alex. There is no
> doubt that the direction you guided for /dev/ioasid is a much cleaner one,
> especially after VDPA emerged as another IOMMU backed framework.

I think this statement answers all your remaining questions ;)

> The question is what do we do with the nested translation features that have
> been targeting the existing VFIO-IOMMU for the last three years? That
> predates VDPA. Shall we put a stop marker *after* nested support and say no
> more extensions for VFIO-IOMMU, new features must be built on this new
> interface?
>
> If we were to close a checkout line for some unforeseen reasons, should we
> honor the customers already in line for a long time?
> 
> This is not a tactic or excuse for not working on the new /dev/ioasid
> interface. In fact, I believe we can benefit from the lessons learned while
> completing the existing. This will give confidence to the new
> interface. Thoughts?

I understand a big part of Jason's argument is that we shouldn't be in
the habit of creating duplicate interfaces, we should create one, well
designed interfaces to share among multiple subsystems.  As new users
have emerged, our solution needs to change to a common one rather than
a VFIO specific one.  The IOMMU uAPI provides an abstraction, but at
the wrong level, requiring userspace interfaces for each subsystem.

Luckily the IOMMU uAPI is not really exposed as an actual uAPI, but
that changes if we proceed to enable the interfaces to tunnel it
through VFIO.

The logical answer would therefore be that we don't make that
commitment to the IOMMU uAPI if we believe now that it's fundamentally
flawed.

Ideally this new /dev/ioasid interface, and making use of it as a VFIO
IOMMU backend, should replace type1.  Type1 will live on until that
interface gets to parity, at which point we may deprecate type1, but it
wouldn't make sense to continue to expand type1 in the same direction
as we intend /dev/ioasid to take over in the meantime, especially if it
means maintaining an otherwise dead uAPI.  Thanks,

Alex



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-16 Thread Auger Eric
Hi Jason,

On 4/16/21 4:34 PM, Jason Gunthorpe wrote:
> On Fri, Apr 16, 2021 at 04:26:19PM +0200, Auger Eric wrote:
> 
>> This was largely done during several confs including plumber, KVM forum,
>> for several years. Also API docs were shared on the ML. I don't remember
>> any voice was raised at those moments.
> 
> I don't think anyone objects to the high level ideas, but
> implementation does matter. I don't think anyone presented "hey we
> will tunnel an uAPI through VFIO to the IOMMU subsystem" - did they?

At minimum
https://events19.linuxfoundation.cn/wp-content/uploads/2017/11/Shared-Virtual-Memory-in-KVM_Yi-Liu.pdf

But most obviously everything is documented in
Documentation/userspace-api/iommu.rst where the VFIO tunneling is
clearly stated ;-)

But well let's work together to design a better and more elegant
solution then.

Thanks

Eric
> 
> Look at the fairly simple IMS situation, for example. This was
> presented at plumbers too, and the slides were great - but the
> implementation was too hacky. It required a major rework of the x86
> interrupt handling before it was OK.
> 
> Jason
> 



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-16 Thread Jason Gunthorpe
On Fri, Apr 16, 2021 at 04:26:19PM +0200, Auger Eric wrote:

> This was largely done during several confs including plumber, KVM forum,
> for several years. Also API docs were shared on the ML. I don't remember
> any voice was raised at those moments.

I don't think anyone objects to the high level ideas, but
implementation does matter. I don't think anyone presented "hey we
will tunnel an uAPI through VFIO to the IOMMU subsystem" - did they?

Look at the fairly simple IMS situation, for example. This was
presented at plumbers too, and the slides were great - but the
implementation was too hacky. It required a major rework of the x86
interrupt handling before it was OK.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-16 Thread Auger Eric
Hi,
On 4/16/21 4:05 PM, Jason Gunthorpe wrote:
> On Fri, Apr 16, 2021 at 03:38:02PM +0200, Auger Eric wrote:
> 
>> The redesign requirement came pretty late in the development process.
>> The iommu user API is upstream for a while, the VFIO interfaces have
>> been submitted a long time ago and under review for a bunch of time.
>> Redesigning everything with a different API, undefined at this point, is
>> a major setback for our work and will have a large impact on the
>> introduction of features companies are looking forward, hence our
>> frustration.
> 
> I will answer both you and Jacob at once.
> 
> This is uAPI, once it is set it can never be changed.
> 
> The kernel process and philosophy is to invest heavily in uAPI
> development and review to converge on the best uAPI possible.
> 
> Many past submissions have take a long time to get this right, there
> are several high profile uAPI examples.
> 
> Do you think this case is so special, or the concerns so minor, that it
> should get to bypass all of the normal process?

That's not my intent to bypass any process. I am just trying to
understand what needs to be re-designed and for what use case.
> 
> Ask yourself, is anyone advocating for the current direction on
> technical merits alone?
> 
> Certainly the patches I last saw where completely disgusting from a
> uAPI design perspective.
> 
> It was against the development process to organize this work the way
> it was done. Merging a wack of dead code to the kernel to support a
> uAPI vision that was never clearly articulated was a big mistake.
> 
> Start from the beginning. Invest heavily in defining a high quality
> uAPI. Clearly describe the uAPI to all stake holders.
This was largely done during several confs including plumber, KVM forum,
for several years. Also API docs were shared on the ML. I don't remember
any voice was raised at those moments.

 Break up the
> implementation into patch series without dead code. Make the
> patches. Remove the dead code this group has already added.
> 
> None of this should be a surprise. The VDPA discussion and related
> "what is a mdev" over a year ago made it pretty clear VFIO is not the
> exclusive user of "IOMMU in userspace" and that places limits on what
> kind of uAPIs expansion it should experience going forward.
Maybe clear for you but most probably not for many other stakeholders.

Anyway I do not intend to further argue and I will be happy to learn
from you and work with you, Jacob, Liu and all other stakeholders to
define a better integration.

Thanks

Eric
> 
> Jason
> 



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-16 Thread Jason Gunthorpe
On Fri, Apr 16, 2021 at 03:38:02PM +0200, Auger Eric wrote:

> The redesign requirement came pretty late in the development process.
> The iommu user API is upstream for a while, the VFIO interfaces have
> been submitted a long time ago and under review for a bunch of time.
> Redesigning everything with a different API, undefined at this point, is
> a major setback for our work and will have a large impact on the
> introduction of features companies are looking forward, hence our
> frustration.

I will answer both you and Jacob at once.

This is uAPI, once it is set it can never be changed.

The kernel process and philosophy is to invest heavily in uAPI
development and review to converge on the best uAPI possible.

Many past submissions have take a long time to get this right, there
are several high profile uAPI examples.

Do you think this case is so special, or the concerns so minor, that it
should get to bypass all of the normal process?

Ask yourself, is anyone advocating for the current direction on
technical merits alone?

Certainly the patches I last saw where completely disgusting from a
uAPI design perspective.

It was against the development process to organize this work the way
it was done. Merging a wack of dead code to the kernel to support a
uAPI vision that was never clearly articulated was a big mistake.

Start from the beginning. Invest heavily in defining a high quality
uAPI. Clearly describe the uAPI to all stake holders. Break up the
implementation into patch series without dead code. Make the
patches. Remove the dead code this group has already added.

None of this should be a surprise. The VDPA discussion and related
"what is a mdev" over a year ago made it pretty clear VFIO is not the
exclusive user of "IOMMU in userspace" and that places limits on what
kind of uAPIs expansion it should experience going forward.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-16 Thread Auger Eric
Hi Jason,

On 4/16/21 1:07 AM, Jason Gunthorpe wrote:
> On Thu, Apr 15, 2021 at 03:11:19PM +0200, Auger Eric wrote:
>> Hi Jason,
>>
>> On 4/1/21 6:03 PM, Jason Gunthorpe wrote:
>>> On Thu, Apr 01, 2021 at 02:08:17PM +, Liu, Yi L wrote:
>>>
 DMA page faults are delivered to root-complex via page request message and
 it is per-device according to PCIe spec. Page request handling flow is:

 1) iommu driver receives a page request from device
 2) iommu driver parses the page request message. Get the RID,PASID, faulted
page and requested permissions etc.
 3) iommu driver triggers fault handler registered by device driver with
iommu_report_device_fault()
>>>
>>> This seems confused.
>>>
>>> The PASID should define how to handle the page fault, not the driver.
>>
>> In my series I don't use PASID at all. I am just enabling nested stage
>> and the guest uses a single context. I don't allocate any user PASID at
>> any point.
>>
>> When there is a fault at physical level (a stage 1 fault that concerns
>> the guest), this latter needs to be reported and injected into the
>> guest. The vfio pci driver registers a fault handler to the iommu layer
>> and in that fault handler it fills a circ bugger and triggers an eventfd
>> that is listened to by the VFIO-PCI QEMU device. this latter retrives
>> the faault from the mmapped circ buffer, it knowns which vIOMMU it is
>> attached to, and passes the fault to the vIOMMU.
>> Then the vIOMMU triggers and IRQ in the guest.
>>
>> We are reusing the existing concepts from VFIO, region, IRQ to do that.
>>
>> For that use case, would you also use /dev/ioasid?
> 
> /dev/ioasid could do all the things you described vfio-pci as doing,
> it can even do them the same way you just described.
> 
> Stated another way, do you plan to duplicate all of this code someday
> for vfio-cxl? What about for vfio-platform? ARM SMMU can be hooked to
> platform devices, right?
vfio regions and IRQ related APIs are common user interfaces exposed by
all vfio drivers, including platform. Then the actual circular buffer
implementation details can be put in a common lib.

as for the thin vfio iommu wrappers, the ones you don't like, they are
implemented in type1 code.

Maybe the need for /dev/ioasid is more crying for PASID management but
for the nested use case, that's not obvious to me and in your different
replies, it was not crystal clear where the use case belongs to.

The redesign requirement came pretty late in the development process.
The iommu user API is upstream for a while, the VFIO interfaces have
been submitted a long time ago and under review for a bunch of time.
Redesigning everything with a different API, undefined at this point, is
a major setback for our work and will have a large impact on the
introduction of features companies are looking forward, hence our
frustration.

Thanks

Eric


> 
> I feel what you guys are struggling with is some choice in the iommu
> kernel APIs that cause the events to be delivered to the pci_device
> owner, not the PASID owner.
> 
> That feels solvable.
> 
> Jason
> 



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-16 Thread Jacob Pan
Hi Jason,

On Thu, 15 Apr 2021 20:07:32 -0300, Jason Gunthorpe  wrote:

> On Thu, Apr 15, 2021 at 03:11:19PM +0200, Auger Eric wrote:
> > Hi Jason,
> > 
> > On 4/1/21 6:03 PM, Jason Gunthorpe wrote:  
> > > On Thu, Apr 01, 2021 at 02:08:17PM +, Liu, Yi L wrote:
> > >   
> > >> DMA page faults are delivered to root-complex via page request
> > >> message and it is per-device according to PCIe spec. Page request
> > >> handling flow is:
> > >>
> > >> 1) iommu driver receives a page request from device
> > >> 2) iommu driver parses the page request message. Get the RID,PASID,
> > >> faulted page and requested permissions etc.
> > >> 3) iommu driver triggers fault handler registered by device driver
> > >> with iommu_report_device_fault()  
> > > 
> > > This seems confused.
> > > 
> > > The PASID should define how to handle the page fault, not the driver.
> > >  
> > 
> > In my series I don't use PASID at all. I am just enabling nested stage
> > and the guest uses a single context. I don't allocate any user PASID at
> > any point.
> > 
> > When there is a fault at physical level (a stage 1 fault that concerns
> > the guest), this latter needs to be reported and injected into the
> > guest. The vfio pci driver registers a fault handler to the iommu layer
> > and in that fault handler it fills a circ bugger and triggers an eventfd
> > that is listened to by the VFIO-PCI QEMU device. this latter retrives
> > the faault from the mmapped circ buffer, it knowns which vIOMMU it is
> > attached to, and passes the fault to the vIOMMU.
> > Then the vIOMMU triggers and IRQ in the guest.
> > 
> > We are reusing the existing concepts from VFIO, region, IRQ to do that.
> > 
> > For that use case, would you also use /dev/ioasid?  
> 
> /dev/ioasid could do all the things you described vfio-pci as doing,
> it can even do them the same way you just described.
> 
> Stated another way, do you plan to duplicate all of this code someday
> for vfio-cxl? What about for vfio-platform? ARM SMMU can be hooked to
> platform devices, right?
> 
> I feel what you guys are struggling with is some choice in the iommu
> kernel APIs that cause the events to be delivered to the pci_device
> owner, not the PASID owner.
> 
> That feels solvable.
> 
Perhaps more of a philosophical question for you and Alex. There is no
doubt that the direction you guided for /dev/ioasid is a much cleaner one,
especially after VDPA emerged as another IOMMU backed framework.

The question is what do we do with the nested translation features that have
been targeting the existing VFIO-IOMMU for the last three years? That
predates VDPA. Shall we put a stop marker *after* nested support and say no
more extensions for VFIO-IOMMU, new features must be built on this new
interface?

If we were to close a checkout line for some unforeseen reasons, should we
honor the customers already in line for a long time?

This is not a tactic or excuse for not working on the new /dev/ioasid
interface. In fact, I believe we can benefit from the lessons learned while
completing the existing. This will give confidence to the new
interface. Thoughts?

> Jason


Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-15 Thread Jason Gunthorpe
On Thu, Apr 15, 2021 at 03:11:19PM +0200, Auger Eric wrote:
> Hi Jason,
> 
> On 4/1/21 6:03 PM, Jason Gunthorpe wrote:
> > On Thu, Apr 01, 2021 at 02:08:17PM +, Liu, Yi L wrote:
> > 
> >> DMA page faults are delivered to root-complex via page request message and
> >> it is per-device according to PCIe spec. Page request handling flow is:
> >>
> >> 1) iommu driver receives a page request from device
> >> 2) iommu driver parses the page request message. Get the RID,PASID, faulted
> >>page and requested permissions etc.
> >> 3) iommu driver triggers fault handler registered by device driver with
> >>iommu_report_device_fault()
> > 
> > This seems confused.
> > 
> > The PASID should define how to handle the page fault, not the driver.
> 
> In my series I don't use PASID at all. I am just enabling nested stage
> and the guest uses a single context. I don't allocate any user PASID at
> any point.
> 
> When there is a fault at physical level (a stage 1 fault that concerns
> the guest), this latter needs to be reported and injected into the
> guest. The vfio pci driver registers a fault handler to the iommu layer
> and in that fault handler it fills a circ bugger and triggers an eventfd
> that is listened to by the VFIO-PCI QEMU device. this latter retrives
> the faault from the mmapped circ buffer, it knowns which vIOMMU it is
> attached to, and passes the fault to the vIOMMU.
> Then the vIOMMU triggers and IRQ in the guest.
> 
> We are reusing the existing concepts from VFIO, region, IRQ to do that.
> 
> For that use case, would you also use /dev/ioasid?

/dev/ioasid could do all the things you described vfio-pci as doing,
it can even do them the same way you just described.

Stated another way, do you plan to duplicate all of this code someday
for vfio-cxl? What about for vfio-platform? ARM SMMU can be hooked to
platform devices, right?

I feel what you guys are struggling with is some choice in the iommu
kernel APIs that cause the events to be delivered to the pci_device
owner, not the PASID owner.

That feels solvable.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-15 Thread Auger Eric
Hi Jason,

On 4/1/21 6:03 PM, Jason Gunthorpe wrote:
> On Thu, Apr 01, 2021 at 02:08:17PM +, Liu, Yi L wrote:
> 
>> DMA page faults are delivered to root-complex via page request message and
>> it is per-device according to PCIe spec. Page request handling flow is:
>>
>> 1) iommu driver receives a page request from device
>> 2) iommu driver parses the page request message. Get the RID,PASID, faulted
>>page and requested permissions etc.
>> 3) iommu driver triggers fault handler registered by device driver with
>>iommu_report_device_fault()
> 
> This seems confused.
> 
> The PASID should define how to handle the page fault, not the driver.

In my series I don't use PASID at all. I am just enabling nested stage
and the guest uses a single context. I don't allocate any user PASID at
any point.

When there is a fault at physical level (a stage 1 fault that concerns
the guest), this latter needs to be reported and injected into the
guest. The vfio pci driver registers a fault handler to the iommu layer
and in that fault handler it fills a circ bugger and triggers an eventfd
that is listened to by the VFIO-PCI QEMU device. this latter retrives
the faault from the mmapped circ buffer, it knowns which vIOMMU it is
attached to, and passes the fault to the vIOMMU.
Then the vIOMMU triggers and IRQ in the guest.

We are reusing the existing concepts from VFIO, region, IRQ to do that.

For that use case, would you also use /dev/ioasid?

Thanks

Eric
> 
> I don't remember any device specific actions in ATS, so what is the
> driver supposed to do?
> 
>> 4) device driver's fault handler signals an event FD to notify userspace to
>>fetch the information about the page fault. If it's VM case, inject the
>>page fault to VM and let guest to solve it.
> 
> If the PASID is set to 'report page fault to userspace' then some
> event should come out of /dev/ioasid, or be reported to a linked
> eventfd, or whatever.
> 
> If the PASID is set to 'SVM' then the fault should be passed to
> handle_mm_fault
> 
> And so on.
> 
> Userspace chooses what happens based on how they configure the PASID
> through /dev/ioasid.
> 
> Why would a device driver get involved here?
> 
>> Eric has sent below series for the page fault reporting for VM with passthru
>> device.
>> https://lore.kernel.org/kvm/20210223210625.604517-5-eric.au...@redhat.com/
> 
> It certainly should not be in vfio pci. Everything using a PASID needs
> this infrastructure, VDPA, mdev, PCI, CXL, etc.
> 
> Jason
> 



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-08 Thread Jason Gunthorpe
On Wed, Apr 07, 2021 at 11:50:02PM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, April 7, 2021 8:21 PM
> > 
> > On Wed, Apr 07, 2021 at 02:08:33AM +, Tian, Kevin wrote:
> > 
> > > > Because if you don't then we enter insane world where a PASID is being
> > > > created under /dev/ioasid but its translation path flows through setup
> > > > done by VFIO and the whole user API becomes an incomprehensible
> > mess.
> > > >
> > > > How will you even associate the PASID with the other translation??
> > >
> > > PASID is attached to a specific iommu domain (created by VFIO/VDPA),
> > which
> > > has GPA->HPA mappings already configured. If we view that mapping as an
> > > attribute of the iommu domain, it's reasonable to have the userspace-
> > bound
> > > pgtable through /dev/ioasid to nest on it.
> > 
> > A user controlled page table should absolutely not be an attribute of
> > a hidden kernel object, nor should two parts of the kernel silently
> > connect to each other via a hidden internal objects like this.
> > 
> > Security is important - the kind of connection must use some explicit
> > FD authorization to access shared objects, not be made implicit!
> > 
> > IMHO this direction is a dead end for this reason.
> > 
> 
> Could you elaborate what exact security problem is brought with this 
> approach? Isn't ALLOW_PASID the authorization interface for the
> connection?

If the kernel objects don't come out of FDs then no.

> Is it really the only practice in Linux that any new feature has to be
> blocked as long as a refactoring work is identified? 

The practice is to define uAPIs that make sense and have a good chance
to be supported over a long time period, as the software evolves, not
to hacky hacky a gaint uAPI mess just to get some feature out the
door. 

This proposal as it was oringial shown is exactly the kind of hacky
hacky uapi nobody wants to see. Tunneling an IOMMU uapi through a
whole bunch of other FDs is completely nutz.

Intel should basically be investing most of its time building a robust
and well designed uAPI here, and don't complain that the community is
not doing Intel's job for free.

> Don't people accept any balance between enabling new features and
> completing refactoring work through a staging approach, as long as
> we don't introduce an uAPI specifically for the staging purpose? ☹

Since this is all uapi I don't see it as applicable here.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-08 Thread Jean-Philippe Brucker
On Wed, Apr 07, 2021 at 04:36:54PM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 07, 2021 at 08:43:50PM +0200, Jean-Philippe Brucker wrote:
> 
> > * Get a container handle out of /dev/ioasid (or /dev/iommu, really.)
> >   No operation available since we don't know what the device and IOMMU
> >   capabilities are.
> >
> > * Attach the handle to a VF. With VFIO that would be
> >   VFIO_GROUP_SET_CONTAINER. That causes the kernel to associate an IOMMU
> >   with the handle, and decide which operations are available.
> 
> Right, this is basically the point, - the VFIO container (/dev/vfio)
> and the /dev/ioasid we are talking about have a core of
> similarity. ioasid is the generalized, modernized, and cross-subsystem
> version of the same idea. Instead of calling it "vfio container" we
> call it something that evokes the idea of controlling the iommu.
> 
> The issue is to seperate /dev/vfio generic functionality from vfio and
> share it with every subsystem.
> 
> It may be that /dev/vfio and /dev/ioasid end up sharing a lot of code,
> with a different IOCTL interface around it. The vfio_iommu_driver_ops
> is not particularly VFIOy.
> 
> Creating /dev/ioasid may primarily start as a code reorganization
> exercise.
> 
> > * With a map/unmap vIOMMU (or shadow mappings), a single translation level
> >   is supported. With a nesting vIOMMU, we're populating the level-2
> >   translation (some day maybe by binding the KVM page tables, but
> >   currently with map/unmap ioctl).
> > 
> >   Single-level translation needs single VF per container. 
> 
> Really? Why?

The vIOMMU is started in bypass, so the device can do DMA to the GPA space
until the guest configures the vIOMMU, at which point each VF is either
kept in bypass or gets new DMA mappings, which requires the host to tear
down the bypass mappings and set up the guest mappings on a per-VF basis
(I'm not considering nesting translation in the host kernel for this,
because it's not supported by all pIOMMUs and is expensive in terms of TLB
and pinned memory). So keeping a single VF per container is simpler, but
there are certainly other programming models possible.

Thanks,
Jean



RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-07 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Wednesday, April 7, 2021 8:21 PM
> 
> On Wed, Apr 07, 2021 at 02:08:33AM +, Tian, Kevin wrote:
> 
> > > Because if you don't then we enter insane world where a PASID is being
> > > created under /dev/ioasid but its translation path flows through setup
> > > done by VFIO and the whole user API becomes an incomprehensible
> mess.
> > >
> > > How will you even associate the PASID with the other translation??
> >
> > PASID is attached to a specific iommu domain (created by VFIO/VDPA),
> which
> > has GPA->HPA mappings already configured. If we view that mapping as an
> > attribute of the iommu domain, it's reasonable to have the userspace-
> bound
> > pgtable through /dev/ioasid to nest on it.
> 
> A user controlled page table should absolutely not be an attribute of
> a hidden kernel object, nor should two parts of the kernel silently
> connect to each other via a hidden internal objects like this.
> 
> Security is important - the kind of connection must use some explicit
> FD authorization to access shared objects, not be made implicit!
> 
> IMHO this direction is a dead end for this reason.
> 

Could you elaborate what exact security problem is brought with this 
approach? Isn't ALLOW_PASID the authorization interface for the 
connection?

Based on all your replies now I see what you actually want is generalizing
all IOMMU related stuff through /dev/ioasid (sort of /dev/iommu), which
requires factoring out the vfio_iommu_type1 into the general part. This is
a huge work.

Is it really the only practice in Linux that any new feature has to be
blocked as long as a refactoring work is identified? Don't people accept
any balance between enabling new features and completing refactoring
work through a staging approach, as long as we don't introduce an uAPI
specifically for the staging purpose? ☹

Thanks
Kevin


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-07 Thread Jason Gunthorpe
On Wed, Apr 07, 2021 at 08:43:50PM +0200, Jean-Philippe Brucker wrote:

> * Get a container handle out of /dev/ioasid (or /dev/iommu, really.)
>   No operation available since we don't know what the device and IOMMU
>   capabilities are.
>
> * Attach the handle to a VF. With VFIO that would be
>   VFIO_GROUP_SET_CONTAINER. That causes the kernel to associate an IOMMU
>   with the handle, and decide which operations are available.

Right, this is basically the point, - the VFIO container (/dev/vfio)
and the /dev/ioasid we are talking about have a core of
similarity. ioasid is the generalized, modernized, and cross-subsystem
version of the same idea. Instead of calling it "vfio container" we
call it something that evokes the idea of controlling the iommu.

The issue is to seperate /dev/vfio generic functionality from vfio and
share it with every subsystem.

It may be that /dev/vfio and /dev/ioasid end up sharing a lot of code,
with a different IOCTL interface around it. The vfio_iommu_driver_ops
is not particularly VFIOy.

Creating /dev/ioasid may primarily start as a code reorganization
exercise.

> * With a map/unmap vIOMMU (or shadow mappings), a single translation level
>   is supported. With a nesting vIOMMU, we're populating the level-2
>   translation (some day maybe by binding the KVM page tables, but
>   currently with map/unmap ioctl).
> 
>   Single-level translation needs single VF per container. 

Really? Why?

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-07 Thread Jean-Philippe Brucker
On Wed, Apr 07, 2021 at 08:17:50AM +, Tian, Kevin wrote:
> btw this discussion was raised when discussing the I/O page fault handling
> process. Currently the IOMMU layer implements a per-device fault reporting
> mechanism, which requires VFIO to register a handler to receive all faults 
> on its device and then forwards to ioasid if it's due to 1st-level. Possibly 
> it 
> makes more sense to convert it into a per-pgtable reporting scheme, and 
> then the owner of each pgtable should register its own handler.

Maybe, but you do need device information in there, since that's how the
fault is reported to the guest and how the response is routed back to the
faulting device (only PASID+PRGI would cause aliasing). And we need to
report non-recoverable faults, as well as recoverable ones without PASID,
once we hand control of level-1 page tables to guests.

> It means
> for 1) VFIO will register a 2nd-level pgtable handler while /dev/ioasid
> will register a 1st-level pgtable handler, while for 3) /dev/ioasid will 
> register 
> handlers for both 1st-level and 2nd-level pgtable. Jean? also want to know 
> your thoughts...  

Moving all IOMMU controls to /dev/ioasid rather that splitting them is
probably better. Hopefully the implementation can reuse most of
vfio_iommu_type1.

I'm trying to sketch what may work for Arm, if we have to reuse
/dev/ioasid to avoid duplication of fault and inval queues:

* Get a container handle out of /dev/ioasid (or /dev/iommu, really.)
  No operation available since we don't know what the device and IOMMU
  capabilities are.

* Attach the handle to a VF. With VFIO that would be
  VFIO_GROUP_SET_CONTAINER. That causes the kernel to associate an IOMMU
  with the handle, and decide which operations are available.

* With a map/unmap vIOMMU (or shadow mappings), a single translation level
  is supported. With a nesting vIOMMU, we're populating the level-2
  translation (some day maybe by binding the KVM page tables, but
  currently with map/unmap ioctl).

  Single-level translation needs single VF per container. Two level would
  allow sharing stage-2 between multiple VFs, though it's a pain to define
  and implement.

* Without a vIOMMU or if the vIOMMU starts in bypass, populate the
  container page tables.

Start the guest.

* With a map/unmap vIOMMU, guest creates mappings, userspace populates the
  page tables with map/unmap ioctl.

  It would be possible to add a PASID mode there: guest requests an
  address space with a specific PASID, userspace derives an IOASID handle
  from the container handle and populate that address space with map/unmap
  ioctl. That would enable PASID on sub-VF assignment, which requires the
  host to control which PASID is programmed into the VF (with
  DEVICE_ALLOW_IOASID, I guess). And either the host allocates the PASID
  in this case (which isn't supported by a vSMMU) or we have to do a
  vPASID -> pPASID. I don't know if it's worth the effort.

Or
* With a nesting vIOMMU, the guest attaches a PASID table to a VF,
  userspace issues a SET_PASID_TABLE ioctl on the container handle. If
  we support multiple VFs per container, we first need to derive a child
  container from the main one and the device, then attach the PASID table.

  Guest programs the PASID table, sends invalidations when removing
  mappings which are relayed to the host on the child container. Page
  faults and response queue would be per container, so if multiple VF per
  container, we could have one queue for the parent (level-2 faults) and
  one for each child (level-1 faults).

Thanks,
Jean


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-07 Thread Jason Gunthorpe
On Wed, Apr 07, 2021 at 02:08:33AM +, Tian, Kevin wrote:

> > Because if you don't then we enter insane world where a PASID is being
> > created under /dev/ioasid but its translation path flows through setup
> > done by VFIO and the whole user API becomes an incomprehensible mess.
> > 
> > How will you even associate the PASID with the other translation??
> 
> PASID is attached to a specific iommu domain (created by VFIO/VDPA), which
> has GPA->HPA mappings already configured. If we view that mapping as an
> attribute of the iommu domain, it's reasonable to have the userspace-bound
> pgtable through /dev/ioasid to nest on it.

A user controlled page table should absolutely not be an attribute of
a hidden kernel object, nor should two parts of the kernel silently
connect to each other via a hidden internal objects like this.

Security is important - the kind of connection must use some explicit
FD authorization to access shared objects, not be made implicit!

IMHO this direction is a dead end for this reason.

> > The entire translation path for any ioasid or PASID should be defined
> > only by /dev/ioasid. Everything else is a legacy API.
> > 
> > > If following your suggestion then VFIO must deny VFIO MAP operations
> > > on sva1 (assume userspace should not mix sva1 and sva2 in the same
> > > container and instead use /dev/ioasid to map for sva1)?
> > 
> > No, userspace creates an iosaid for the guest physical mapping and
> > passes this ioasid to VFIO PCI which will assign it as the first layer
> > mapping on the RID
> 
> Is it an dummy ioasid just for providing GPA mappings for nesting purpose
> of other IOASIDs? Then we waste one per VM?

Generic ioasid's are "free" they are just software constructs in the
kernel.

> > When PASIDs are allocated the uAPI will be told to logically nested
> > under the first ioasid. When VFIO authorizes a PASID for a RID it
> > checks that all the HW rules are being followed.
> 
> As I explained above, why cannot we just use iommu domain to connect 
> the dots? 

Security.

> Every passthrough framework needs to create an iommu domain
> first. and It needs to support both devices w/ PASID and devices w/o
> PASID.  For devices w/o PASID it needs to invent its own MAP
> interface anyway.  

No, it should consume a ioasid from /dev/ioasid, use a common ioasid
map interface and assign that ioasid to a RID.

Don't get so fixated on PASID as a special case

> Then why do we bother creating another MAP interface through
> /dev/ioasid which not only duplicates but also creating transition
> burden between two set of MAP interfaces when the guest turns on/off
> the pasid capability on the device?

Don't transition. Always use the new interface. qemu detects the
kernel supports /dev/ioasid and *all iommu page table configuration*
goes through there. VFIO and VDPA APIs become unused for iommu
configuration.

> 'universally' upon from which angle you look at this problem. From IOASID
> p.o.v possibly yes, but from device passthrough p.o.v. it's the opposite
> since the passthrough framework needs to handle devices w/o PASID anyway
> (or even for device w/ PASID it could send traffic w/o PASID) thus 
> 'universally'
> makes more sense if the passthrough framework can use one interface of its
> own to manage GPA mappings for all consumers (apply to the case when a
> PASID is allowed/authorized).

You correctly named it /dev/ioasid, it is a generic way to allocate,
manage and assign IOMMU page tables, which when generalized, only some
of which may consume a limited PASID.

RID and RID,PASID are the same thing, just a small difference in how
they match TLPs.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-07 Thread Jason Gunthorpe
On Wed, Apr 07, 2021 at 08:17:50AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Tuesday, April 6, 2021 8:43 PM
> > 
> > On Tue, Apr 06, 2021 at 09:35:17AM +0800, Jason Wang wrote:
> > 
> > > > VFIO and VDPA has no buisness having map/unmap interfaces once we
> > have
> > > > /dev/ioasid. That all belongs in the iosaid side.
> > > >
> > > > I know they have those interfaces today, but that doesn't mean we have
> > > > to keep using them for PASID use cases, they should be replaced with a
> > > > 'do dma from this pasid on /dev/ioasid' interface certainly not a
> > > > 'here is a pasid from /dev/ioasid, go ahead and configure it youself'
> > > > interface
> > >
> > > So it looks like the PASID was bound to SVA in this design. I think it's 
> > > not
> > > necessairly the case:
> > 
> > No, I wish people would stop talking about SVA.
> > 
> > SVA and vSVA are a very special narrow configuration of a PASID. There
> > are lots of other PASID configurations! That is the whole point, a
> > PASID is complicated, there are many configuration scenarios, they
> > need to be in one place with a very clearly defined uAPI
> > 
> 
> I feel it also makes sense to allow a subsystem to specify which 
> configurations
> are permitted when allowing a PASID on its device

huh? why?

> e.g. excluding things like
> GPA mappings that existing subsystems (VFIO/VDPA) already handle well:

They don't "handle well", they have some historical baggage that is no
longer suitable for the complexity this area has in the modern world.

Forget about the existing APIs and replace them in /dev/ioasid.

> - Share GPA mappings between multiple devices (w/ or w/o PASID) for 
> better IOTLB efficiency;
>
> - Share GPA mappings between transactions w/ PASID and transactions w/o
> PASID from the same device (e.g. GPU) for better IOTLB efficiency;
> 
> - Use the same page table for GPA mappings before and after the guest 
> turns on/off the PASID capability;

All of these are cases you need to design the /dev/ioasid to handle.

It is pretty clear to me that you'll need non-PASID IOASID's as
well.

Ideally a generic IOASID would just be a page table and it doesn't
crystalize into a RID or RID,PASID routing until devices are attached
to it.

Since IOASID can be nested the only thing that makes any sense is for
each level of the nest to be visible under /dev/ioasid. 

What a complete mess it would be if vfio-pci owns the GPA table,
/dev/ioasid has a nested PASID, and vfio-mdev is running a mdev on top
of that PASID.

> All above are given as long as we continue to let VFIO/VDPA manage the
> iommu domain and associated GPA mappings for PASID.

So don't do that. Don't I keep saying this weird split is making a
horrible mess?

You can't reasonably build the complex PASID scenarios you talk about
above unless the entire translation path is owned by one entity:
/dev/ioasid.

You need to focus on figuring out what that looks like then figure out
how to move VDPA and VFIO to consume /dev/ioasid for all of their
translation instead of open-coding half-baked internal versions.

Jason


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-07 Thread Tian, Kevin
> From: Jason Gunthorpe
> Sent: Tuesday, April 6, 2021 8:43 PM
> 
> On Tue, Apr 06, 2021 at 09:35:17AM +0800, Jason Wang wrote:
> 
> > > VFIO and VDPA has no buisness having map/unmap interfaces once we
> have
> > > /dev/ioasid. That all belongs in the iosaid side.
> > >
> > > I know they have those interfaces today, but that doesn't mean we have
> > > to keep using them for PASID use cases, they should be replaced with a
> > > 'do dma from this pasid on /dev/ioasid' interface certainly not a
> > > 'here is a pasid from /dev/ioasid, go ahead and configure it youself'
> > > interface
> >
> > So it looks like the PASID was bound to SVA in this design. I think it's not
> > necessairly the case:
> 
> No, I wish people would stop talking about SVA.
> 
> SVA and vSVA are a very special narrow configuration of a PASID. There
> are lots of other PASID configurations! That is the whole point, a
> PASID is complicated, there are many configuration scenarios, they
> need to be in one place with a very clearly defined uAPI
> 

I feel it also makes sense to allow a subsystem to specify which configurations
are permitted when allowing a PASID on its device, e.g. excluding things like
GPA mappings that existing subsystems (VFIO/VDPA) already handle well:

- Share GPA mappings between multiple devices (w/ or w/o PASID) for 
better IOTLB efficiency;

- Share GPA mappings between transactions w/ PASID and transactions w/o
PASID from the same device (e.g. GPU) for better IOTLB efficiency;

- Use the same page table for GPA mappings before and after the guest 
turns on/off the PASID capability;

All above are given as long as we continue to let VFIO/VDPA manage the
iommu domain and associated GPA mappings for PASID. The IOMMU driver 
already ensures a nested PASID entry linking to the established GPA paging 
structure of the domain when the 1st-level pgtable is bound through 
/dev/ioasid. 

In contrast, above merits are lost if forcing a model where GPA mappings
for PASID must be constructed through /dev/ioasid, as this will lead to
multiple paging structures for the same GPA mappings implying worse 
IOTLB usage and unnecessary cost of invalidations.

Therefore, I envision a scheme where the subsystem could specify 
permitted PASID configurations when doing ALLOW_PASID, and then 
userspace queries per-PASID capability to learn which operations
are allowed, e.g.:

1) To enable vSVA, VFIO/VDPA allows pgtable binding and related invalidation/
fault ops through /dev/ioasid;

2) for vDPA control vq usage, no configuration is allowed through /dev/ioasid;

3) for new subsystem which doesn't carry any legacy or similar usage as 
VFIO/VDPA, it could permit all configurations through /dev/ioasid including 
1st-level binding and 2nd-level mapping ops;

This approach also allows us to grow the uAPI in a staging approach. Now 
focus on 1) and 2) as VFIO/VDPA are the only two users for now with good
legacy to cover the GPA mappings. More ops can be introduced for 3) when 
there is a real example to show what exact ops are required for such a new 
subsystem.

Is this a good strategy to move forward?

btw this discussion was raised when discussing the I/O page fault handling
process. Currently the IOMMU layer implements a per-device fault reporting
mechanism, which requires VFIO to register a handler to receive all faults 
on its device and then forwards to ioasid if it's due to 1st-level. Possibly it 
makes more sense to convert it into a per-pgtable reporting scheme, and 
then the owner of each pgtable should register its own handler. It means
for 1) VFIO will register a 2nd-level pgtable handler while /dev/ioasid
will register a 1st-level pgtable handler, while for 3) /dev/ioasid will 
register 
handlers for both 1st-level and 2nd-level pgtable. Jean? also want to know 
your thoughts...  

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-06 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, April 6, 2021 8:21 PM
> 
> On Tue, Apr 06, 2021 at 01:02:05AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Tuesday, April 6, 2021 7:40 AM
> > >
> > > On Fri, Apr 02, 2021 at 07:58:02AM +, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe 
> > > > > Sent: Thursday, April 1, 2021 9:47 PM
> > > > >
> > > > > On Thu, Apr 01, 2021 at 01:43:36PM +, Liu, Yi L wrote:
> > > > > > > From: Jason Gunthorpe 
> > > > > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > > > > >
> > > > > > > On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > > > > > > > > From: Jason Gunthorpe 
> > > > > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > > > > [...]
> > > > > > > > > I'm worried Intel views the only use of PASID in a guest is 
> > > > > > > > > with
> > > > > > > > > ENQCMD, but that is not consistent with the industry. We need
> to
> > > see
> > > > > > > > > normal nested PASID support with assigned PCI VFs.
> > > > > > > >
> > > > > > > > I'm not quire flow here. Intel also allows PASID usage in guest
> > > without
> > > > > > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it
> without
> > > > > > > ENQCMD.
> > > > > > >
> > > > > > > Then you need all the parts, the hypervisor calls from the vIOMMU,
> > > and
> > > > > > > you can't really use a vPASID.
> > > > > >
> > > > > > This is a diagram shows the vSVA setup.
> > > > >
> > > > > I'm not talking only about vSVA. Generic PASID support with arbitary
> > > > > mappings.
> > > > >
> > > > > And how do you deal with the vPASID vs pPASID issue if the system
> has
> > > > > a mix of physical devices and mdevs?
> > > > >
> > > >
> > > > We plan to support two schemes. One is vPASID identity-mapped to
> > > > pPASID then the mixed scenario just works, with the limitation of
> > > > lacking of live migration support. The other is non-identity-mapped
> > > > scheme, where live migration is supported but physical devices and
> > > > mdevs should not be mixed in one VM if both expose SVA capability
> > > > (requires some filtering check in Qemu).
> > >
> > > That just becomes "block vPASID support if any device that
> > > doesn't use ENQCMD is plugged into the guest"
> >
> > The limitation is only for physical device. and in reality it is not that
> > bad. To support live migration with physical device we anyway need
> > additional work to migrate the device state (e.g. based on Max's work),
> > then it's not unreasonable to also mediate guest programming of
> > device specific PASID register to enable vPASID (need to translate in
> > the whole VM lifespan but likely is not a hot path).
> 
> IMHO that is pretty unreasonable.. More likely we end up with vPASID
> tables in each migratable device like KVM has.

just like mdev needs to maintain allowed PASID list, this extends it to
all migratable devices.

> 
> > > Which needs a special VFIO capability of some kind so qemu knows to
> > > block it. This really needs to all be layed out together so someone
> > > can understand it :(
> >
> > Or could simply based on whether the VFIO device supports live migration.
> 
> You need to define affirmative caps that indicate that vPASID will be
> supported by the VFIO device.

Yes, this is required as I acked in another mail.

> 
> > > Why doesn't the siov cookbook explaining this stuff??
> > >
> > > > We hope the /dev/ioasid can support both schemes, with the minimal
> > > > requirement of allowing userspace to tag a vPASID to a pPASID and
> > > > allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> > > > the guest will always use pPASID.
> > >
> > > What I'm a unclear of is if /dev/ioasid even needs to care about
> > > vPASID or if vPASID is just a hidden artifact of the KVM connection to
> > > setup the translation table and the vIOMMU driver in qemu.
> >
> > Not just for KVM. Also required by mdev, which needs to translate
> > vPASID into pPASID when ENQCMD is not used.
> 
> Do we have any mdev's that will do this?

definitely. Actually any mdev which doesn't do ENQCMD needs to do this.
In normal case, the PASID is programmed to a MMIO register (or in-memory
context) associate with the backend resource of the mdev. The value 
programmed from the guest is vPASID, thus must be translated into pPASID
before updating the physical register.

> 
> > should only care about the operations related to pPASID. VFIO could
> > carry vPASID information to mdev.
> 
> It depends how common this is, I suppose
> 

based on above I think it's a common case.

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-06 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, April 6, 2021 8:35 PM
> 
> On Tue, Apr 06, 2021 at 01:27:15AM +, Tian, Kevin wrote:
> >
> > and here is one example why using existing VFIO/VDPA interface makes
> > sense. say dev1 (w/ sva) and dev2 (w/o sva) are placed in a single VFIO
> > container.
> 
> Forget about SVA, it is an irrelevant detail of how a PASID is
> configured.
> 
> > The container is associated to an iommu domain which contains a
> > single 2nd-level page table, shared by both devices (when attached
> > to the domain).
> 
> This level should be described by an ioasid.
> 
> > The VFIO MAP operation is applied to the 2nd-level
> > page table thus naturally applied to both devices. Then userspace
> > could use /dev/ioasid to further allocate IOASIDs and bind multiple
> > 1st-level page tables for dev1, nested on the shared 2nd-level page
> > table.
> 
> Because if you don't then we enter insane world where a PASID is being
> created under /dev/ioasid but its translation path flows through setup
> done by VFIO and the whole user API becomes an incomprehensible mess.
> 
> How will you even associate the PASID with the other translation??

PASID is attached to a specific iommu domain (created by VFIO/VDPA), which
has GPA->HPA mappings already configured. If we view that mapping as an
attribute of the iommu domain, it's reasonable to have the userspace-bound
pgtable through /dev/ioasid to nest on it.


> 
> The entire translation path for any ioasid or PASID should be defined
> only by /dev/ioasid. Everything else is a legacy API.
> 
> > If following your suggestion then VFIO must deny VFIO MAP operations
> > on sva1 (assume userspace should not mix sva1 and sva2 in the same
> > container and instead use /dev/ioasid to map for sva1)?
> 
> No, userspace creates an iosaid for the guest physical mapping and
> passes this ioasid to VFIO PCI which will assign it as the first layer
> mapping on the RID

Is it an dummy ioasid just for providing GPA mappings for nesting purpose
of other IOASIDs? Then we waste one per VM?

> 
> When PASIDs are allocated the uAPI will be told to logically nested
> under the first ioasid. When VFIO authorizes a PASID for a RID it
> checks that all the HW rules are being followed.

As I explained above, why cannot we just use iommu domain to connect 
the dots? Every passthrough framework needs to create an iommu domain
first. and It needs to support both devices w/ PASID and devices w/o PASID.
For devices w/o PASID it needs to invent its own MAP interface anyway.
Then why do we bother creating another MAP interface through /dev/ioasid
which not only duplicates but also creating transition burden between 
two set of MAP interfaces when the guest turns on/off the pasid capability
on the device?

> 
> If there are rules like groups of VFIO devices must always use the
> same IOASID then VFIO will check these too (and realistically qemu
> will have only one guest physical map ioasid anyhow)
> 
> There is no real difference between setting up an IOMMU table for a
> (RID,PASID) tuple or just a RID. We can do it universally with
> one interface for all consumers.
> 

'universally' upon from which angle you look at this problem. From IOASID
p.o.v possibly yes, but from device passthrough p.o.v. it's the opposite
since the passthrough framework needs to handle devices w/o PASID anyway
(or even for device w/ PASID it could send traffic w/o PASID) thus 'universally'
makes more sense if the passthrough framework can use one interface of its
own to manage GPA mappings for all consumers (apply to the case when a
PASID is allowed/authorized).

Thanks
Kevin


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-06 Thread Jason Wang



在 2021/4/6 下午8:42, Jason Gunthorpe 写道:

On Tue, Apr 06, 2021 at 09:35:17AM +0800, Jason Wang wrote:


VFIO and VDPA has no buisness having map/unmap interfaces once we have
/dev/ioasid. That all belongs in the iosaid side.

I know they have those interfaces today, but that doesn't mean we have
to keep using them for PASID use cases, they should be replaced with a
'do dma from this pasid on /dev/ioasid' interface certainly not a
'here is a pasid from /dev/ioasid, go ahead and configure it youself'
interface
  
So it looks like the PASID was bound to SVA in this design. I think it's not

necessairly the case:

No, I wish people would stop talking about SVA.

SVA and vSVA are a very special narrow configuration of a PASID. There
are lots of other PASID configurations! That is the whole point, a
PASID is complicated, there are many configuration scenarios, they
need to be in one place with a very clearly defined uAPI



Right, that's my understanding as well.





1) PASID can be implemented without SVA, in this case a map/unmap interface
is still required

Any interface to manipulate a PASID should be under /dev/ioasid. We do
not want to duplicate this into every subsystem.



Yes.





2) For the case that hypervisor want to do some mediation in the middle for
a virtqueue. e.g in the case of control vq that is implemented in the
VF/ADI/SF itself, the hardware virtqueue needs to be controlled by Qemu,
Though binding qemu's page table to cvq can work but it looks like a
overkill, a small dedicated buffers that is mapped for this PASID seems more
suitalbe.

/dev/ioasid should allow userspace to setup any PASID configuration it
wants. There are many choices. That is the whole point, instead of
copying all the PASID configuration option into every
subsystem we have on place to configure it.

If you want a PASID (or generic ioasid) that has the guest physical
map, which is probably all that VDPA would ever want, then /dev/ioasid
should be able to prepare that.

If you just want to map a few buffers into a PASID then it should be
able to do that too.


So do you mean the device should not expose the PASID confiugration API to
guest? I think it could happen if we assign the whole device and let guest
to configure it for nested VMs.

This always needs co-operating with the vIOMMU driver. We can't have
nested PASID use without both parts working together.

The vIOMMU driver configures the PASID and assigns the mappings
(however complicated that turns out to be)

The VDPA/mdev driver authorizes the HW to use the ioasid mapping, eg
by authorizing a queue to issue PCIe TLPs with a specific PASID.

The authorization is triggered by the guest telling the vIOMMU to
allow a vRID to talk to a PASID, which qemu would have to translate to
telling something like the VDPA driver under the vRID that it can use
a PASID from /dev/ioasid

For security a VDPA/mdev device MUST NOT issue PASIDs that the vIOMMU
has not authorized its vRID to use. Otherwise the security model of
something like VFIO in the guest becomes completely broken.



Yes, that's how it should work.

Thanks




Jason





Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-06 Thread Jason Gunthorpe
On Tue, Apr 06, 2021 at 09:35:17AM +0800, Jason Wang wrote:

> > VFIO and VDPA has no buisness having map/unmap interfaces once we have
> > /dev/ioasid. That all belongs in the iosaid side.
> > 
> > I know they have those interfaces today, but that doesn't mean we have
> > to keep using them for PASID use cases, they should be replaced with a
> > 'do dma from this pasid on /dev/ioasid' interface certainly not a
> > 'here is a pasid from /dev/ioasid, go ahead and configure it youself'
> > interface
>  
> So it looks like the PASID was bound to SVA in this design. I think it's not
> necessairly the case:

No, I wish people would stop talking about SVA.

SVA and vSVA are a very special narrow configuration of a PASID. There
are lots of other PASID configurations! That is the whole point, a
PASID is complicated, there are many configuration scenarios, they
need to be in one place with a very clearly defined uAPI

> 1) PASID can be implemented without SVA, in this case a map/unmap interface
> is still required

Any interface to manipulate a PASID should be under /dev/ioasid. We do
not want to duplicate this into every subsystem.

> 2) For the case that hypervisor want to do some mediation in the middle for
> a virtqueue. e.g in the case of control vq that is implemented in the
> VF/ADI/SF itself, the hardware virtqueue needs to be controlled by Qemu,
> Though binding qemu's page table to cvq can work but it looks like a
> overkill, a small dedicated buffers that is mapped for this PASID seems more
> suitalbe.

/dev/ioasid should allow userspace to setup any PASID configuration it
wants. There are many choices. That is the whole point, instead of
copying all the PASID configuration option into every
subsystem we have on place to configure it.

If you want a PASID (or generic ioasid) that has the guest physical
map, which is probably all that VDPA would ever want, then /dev/ioasid
should be able to prepare that.

If you just want to map a few buffers into a PASID then it should be
able to do that too.

> So do you mean the device should not expose the PASID confiugration API to
> guest? I think it could happen if we assign the whole device and let guest
> to configure it for nested VMs.

This always needs co-operating with the vIOMMU driver. We can't have
nested PASID use without both parts working together.

The vIOMMU driver configures the PASID and assigns the mappings
(however complicated that turns out to be)

The VDPA/mdev driver authorizes the HW to use the ioasid mapping, eg
by authorizing a queue to issue PCIe TLPs with a specific PASID.

The authorization is triggered by the guest telling the vIOMMU to
allow a vRID to talk to a PASID, which qemu would have to translate to
telling something like the VDPA driver under the vRID that it can use
a PASID from /dev/ioasid

For security a VDPA/mdev device MUST NOT issue PASIDs that the vIOMMU
has not authorized its vRID to use. Otherwise the security model of
something like VFIO in the guest becomes completely broken.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-06 Thread Jason Gunthorpe
On Tue, Apr 06, 2021 at 01:27:15AM +, Tian, Kevin wrote:
> 
> and here is one example why using existing VFIO/VDPA interface makes
> sense. say dev1 (w/ sva) and dev2 (w/o sva) are placed in a single VFIO 
> container. 

Forget about SVA, it is an irrelevant detail of how a PASID is
configured.

> The container is associated to an iommu domain which contains a
> single 2nd-level page table, shared by both devices (when attached
> to the domain).

This level should be described by an ioasid.

> The VFIO MAP operation is applied to the 2nd-level
> page table thus naturally applied to both devices. Then userspace
> could use /dev/ioasid to further allocate IOASIDs and bind multiple
> 1st-level page tables for dev1, nested on the shared 2nd-level page
> table.

Because if you don't then we enter insane world where a PASID is being
created under /dev/ioasid but its translation path flows through setup
done by VFIO and the whole user API becomes an incomprehensible mess.

How will you even associate the PASID with the other translation??

The entire translation path for any ioasid or PASID should be defined
only by /dev/ioasid. Everything else is a legacy API.

> If following your suggestion then VFIO must deny VFIO MAP operations
> on sva1 (assume userspace should not mix sva1 and sva2 in the same
> container and instead use /dev/ioasid to map for sva1)? 

No, userspace creates an iosaid for the guest physical mapping and
passes this ioasid to VFIO PCI which will assign it as the first layer
mapping on the RID

When PASIDs are allocated the uAPI will be told to logically nested
under the first ioasid. When VFIO authorizes a PASID for a RID it
checks that all the HW rules are being followed.

If there are rules like groups of VFIO devices must always use the
same IOASID then VFIO will check these too (and realistically qemu
will have only one guest physical map ioasid anyhow)

There is no real difference between setting up an IOMMU table for a
(RID,PASID) tuple or just a RID. We can do it universally with
one interface for all consumers.

I wanted this when we were doing VDPA for the first time, now that we
are doing pasid and more difficult stuff I view it as essential.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-06 Thread Jason Gunthorpe
On Tue, Apr 06, 2021 at 01:02:05AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Tuesday, April 6, 2021 7:40 AM
> > 
> > On Fri, Apr 02, 2021 at 07:58:02AM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Thursday, April 1, 2021 9:47 PM
> > > >
> > > > On Thu, Apr 01, 2021 at 01:43:36PM +, Liu, Yi L wrote:
> > > > > > From: Jason Gunthorpe 
> > > > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > > > >
> > > > > > On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > > > > > > > From: Jason Gunthorpe 
> > > > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > > > [...]
> > > > > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > > > > ENQCMD, but that is not consistent with the industry. We need to
> > see
> > > > > > > > normal nested PASID support with assigned PCI VFs.
> > > > > > >
> > > > > > > I'm not quire flow here. Intel also allows PASID usage in guest
> > without
> > > > > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > > > > > ENQCMD.
> > > > > >
> > > > > > Then you need all the parts, the hypervisor calls from the vIOMMU,
> > and
> > > > > > you can't really use a vPASID.
> > > > >
> > > > > This is a diagram shows the vSVA setup.
> > > >
> > > > I'm not talking only about vSVA. Generic PASID support with arbitary
> > > > mappings.
> > > >
> > > > And how do you deal with the vPASID vs pPASID issue if the system has
> > > > a mix of physical devices and mdevs?
> > > >
> > >
> > > We plan to support two schemes. One is vPASID identity-mapped to
> > > pPASID then the mixed scenario just works, with the limitation of
> > > lacking of live migration support. The other is non-identity-mapped
> > > scheme, where live migration is supported but physical devices and
> > > mdevs should not be mixed in one VM if both expose SVA capability
> > > (requires some filtering check in Qemu).
> > 
> > That just becomes "block vPASID support if any device that
> > doesn't use ENQCMD is plugged into the guest"
> 
> The limitation is only for physical device. and in reality it is not that
> bad. To support live migration with physical device we anyway need 
> additional work to migrate the device state (e.g. based on Max's work), 
> then it's not unreasonable to also mediate guest programming of 
> device specific PASID register to enable vPASID (need to translate in
> the whole VM lifespan but likely is not a hot path).

IMHO that is pretty unreasonable.. More likely we end up with vPASID
tables in each migratable device like KVM has.

> > Which needs a special VFIO capability of some kind so qemu knows to
> > block it. This really needs to all be layed out together so someone
> > can understand it :(
> 
> Or could simply based on whether the VFIO device supports live migration.

You need to define affirmative caps that indicate that vPASID will be
supported by the VFIO device.

> > Why doesn't the siov cookbook explaining this stuff??
> > 
> > > We hope the /dev/ioasid can support both schemes, with the minimal
> > > requirement of allowing userspace to tag a vPASID to a pPASID and
> > > allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> > > the guest will always use pPASID.
> > 
> > What I'm a unclear of is if /dev/ioasid even needs to care about
> > vPASID or if vPASID is just a hidden artifact of the KVM connection to
> > setup the translation table and the vIOMMU driver in qemu.
> 
> Not just for KVM. Also required by mdev, which needs to translate
> vPASID into pPASID when ENQCMD is not used.

Do we have any mdev's that will do this?

> should only care about the operations related to pPASID. VFIO could
> carry vPASID information to mdev.

It depends how common this is, I suppose

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-06 Thread Jason Gunthorpe
On Tue, Apr 06, 2021 at 12:37:35AM +, Tian, Kevin wrote:

> With nested translation it is GVA->GPA->HPA. The kernel needs to
> fix fault related to GPA->HPA (managed by VFIO/VDPA) while 
> handle_mm_fault only handles HVA->HPA. In this case, the 2nd-level
> page fault is expected to be delivered to VFIO/VDPA first which then
> find HVA related to GPA, call handle_mm_fault to fix HVA->HPA,
> and then call iommu_map to fix GPA->HPA in the IOMMU page table.
> This is exactly like how CPU EPT violation is handled.

No, it should all be in the /dev/ioasid layer not duplicated into
every user.

> > If the fault needs to be fixed in the guest, then it needs to be
> > delivered over /dev/ioasid in some way and injected into the
> > vIOMMU. VFIO and VDPA have nothing to do with vIOMMU driver in quemu.
> > 
> > You need to have an interface under /dev/ioasid to create both page
> > table levels and part of that will be to tell the kernel what VA is
> > mapped and how to handle faults.
> 
> VFIO/VDPA already have their own interface to manage GPA->HPA
> mappings. Why do we want to duplicate it in /dev/ioasid? 

They have their own interface to manage other types of HW, we should
not duplicate PASID programming into there too.

Jason


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-05 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Tuesday, April 6, 2021 9:02 AM
> 
> > From: Jason Gunthorpe 
> > Sent: Tuesday, April 6, 2021 7:40 AM
> >
> > On Fri, Apr 02, 2021 at 07:58:02AM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Thursday, April 1, 2021 9:47 PM
> > > >
> > > > On Thu, Apr 01, 2021 at 01:43:36PM +, Liu, Yi L wrote:
> > > > > > From: Jason Gunthorpe 
> > > > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > > > >
> > > > > > On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > > > > > > > From: Jason Gunthorpe 
> > > > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > > > [...]
> > > > > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > > > > ENQCMD, but that is not consistent with the industry. We need
> to
> > see
> > > > > > > > normal nested PASID support with assigned PCI VFs.
> > > > > > >
> > > > > > > I'm not quire flow here. Intel also allows PASID usage in guest
> > without
> > > > > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > > > > > ENQCMD.
> > > > > >
> > > > > > Then you need all the parts, the hypervisor calls from the vIOMMU,
> > and
> > > > > > you can't really use a vPASID.
> > > > >
> > > > > This is a diagram shows the vSVA setup.
> > > >
> > > > I'm not talking only about vSVA. Generic PASID support with arbitary
> > > > mappings.
> > > >
> > > > And how do you deal with the vPASID vs pPASID issue if the system has
> > > > a mix of physical devices and mdevs?
> > > >
> > >
> > > We plan to support two schemes. One is vPASID identity-mapped to
> > > pPASID then the mixed scenario just works, with the limitation of
> > > lacking of live migration support. The other is non-identity-mapped
> > > scheme, where live migration is supported but physical devices and
> > > mdevs should not be mixed in one VM if both expose SVA capability
> > > (requires some filtering check in Qemu).
> >
> > That just becomes "block vPASID support if any device that
> > doesn't use ENQCMD is plugged into the guest"
> 
> The limitation is only for physical device. and in reality it is not that
> bad. To support live migration with physical device we anyway need
> additional work to migrate the device state (e.g. based on Max's work),
> then it's not unreasonable to also mediate guest programming of
> device specific PASID register to enable vPASID (need to translate in
> the whole VM lifespan but likely is not a hot path).
> 
> >
> > Which needs a special VFIO capability of some kind so qemu knows to
> > block it. This really needs to all be layed out together so someone
> > can understand it :(
> 
> Or could simply based on whether the VFIO device supports live migration.

Actually you are right on this point. VFIO should provide a per-device
capability to indicate whether vPASID is allowed on this device. likely 
yes for mdev, by default no for pdev (unless explicitly opt in). Qemu
should enable vPASID only if all assigned devices support it, and then 
provide vPASID information when using VFIO API to allow pPASIDs.

> 
> >
> > Why doesn't the siov cookbook explaining this stuff??
> >
> > > We hope the /dev/ioasid can support both schemes, with the minimal
> > > requirement of allowing userspace to tag a vPASID to a pPASID and
> > > allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> > > the guest will always use pPASID.
> >
> > What I'm a unclear of is if /dev/ioasid even needs to care about
> > vPASID or if vPASID is just a hidden artifact of the KVM connection to
> > setup the translation table and the vIOMMU driver in qemu.
> 
> Not just for KVM. Also required by mdev, which needs to translate
> vPASID into pPASID when ENQCMD is not used. As I replied in another
> mail, possibly we don't need /dev/ioasid to know this fact, which
> should only care about the operations related to pPASID. VFIO could
> carry vPASID information to mdev. KVM should have its own interface
> to know this information, as you suggested earlier.
> 
> Thanks
> Kevin


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-05 Thread Jason Wang



在 2021/4/6 上午7:42, Jason Gunthorpe 写道:

On Fri, Apr 02, 2021 at 08:22:28AM +, Tian, Kevin wrote:

From: Jason Gunthorpe 
Sent: Tuesday, March 30, 2021 9:29 PM


First, userspace may use ioasid in a non-SVA scenario where ioasid is
bound to specific security context (e.g. a control vq in vDPA) instead of
tying to mm. In this case there is no pgtable binding initiated from user
space. Instead, ioasid is allocated from /dev/ioasid and then programmed
to the intended security context through specific passthrough framework
which manages that context.

This sounds like the exact opposite of what I'd like to see.

I do not want to see every subsystem gaining APIs to program a
PASID. All of that should be consolidated in *one place*.

I do not want to see VDPA and VFIO have two nearly identical sets of
APIs to control the PASID.

Drivers consuming a PASID, like VDPA, should consume the PASID and do
nothing more than authorize the HW to use it.

quemu should have general code under the viommu driver that drives
/dev/ioasid to create PASID's and manage the IO mapping according to
the guest's needs.

Drivers like VDPA and VFIO should simply accept that PASID and
configure/authorize their HW to do DMA's with its tag.


I agree with you on consolidating things in one place (especially for the
general SVA support). But here I was referring to an usage without
pgtable binding (Possibly Jason. W can say more here), where the
userspace just wants to allocate PASIDs, program/accept PASIDs to
various workqueues (device specific), and then use MAP/UNMAP
interface to manage address spaces associated with each PASID.
I just wanted to point out that the latter two steps are through
VFIO/VDPA specific interfaces.

No, don't do that.

VFIO and VDPA has no buisness having map/unmap interfaces once we have
/dev/ioasid. That all belongs in the iosaid side.

I know they have those interfaces today, but that doesn't mean we have
to keep using them for PASID use cases, they should be replaced with a
'do dma from this pasid on /dev/ioasid' interface certainly not a
'here is a pasid from /dev/ioasid, go ahead and configure it youself'
interface



So it looks like the PASID was bound to SVA in this design. I think it's 
not necessairly the case:


1) PASID can be implemented without SVA, in this case a map/unmap 
interface is still required
2) For the case that hypervisor want to do some mediation in the middle 
for a virtqueue. e.g in the case of control vq that is implemented in 
the VF/ADI/SF itself, the hardware virtqueue needs to be controlled by 
Qemu, Though binding qemu's page table to cvq can work but it looks like 
a overkill, a small dedicated buffers that is mapped for this PASID 
seems more suitalbe.





This is because PASID is *complicated* in the general case! For
instance all the two level stuff you are talking about must not leak
into every user!

Jason



So do you mean the device should not expose the PASID confiugration API 
to guest? I think it could happen if we assign the whole device and let 
guest to configure it for nested VMs.


Thanks








RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-05 Thread Tian, Kevin
> From: Jason Gunthorpe
> Sent: Tuesday, April 6, 2021 7:43 AM
> 
> On Fri, Apr 02, 2021 at 08:22:28AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Tuesday, March 30, 2021 9:29 PM
> > >
> > > >
> > > > First, userspace may use ioasid in a non-SVA scenario where ioasid is
> > > > bound to specific security context (e.g. a control vq in vDPA) instead 
> > > > of
> > > > tying to mm. In this case there is no pgtable binding initiated from 
> > > > user
> > > > space. Instead, ioasid is allocated from /dev/ioasid and then
> programmed
> > > > to the intended security context through specific passthrough
> framework
> > > > which manages that context.
> > >
> > > This sounds like the exact opposite of what I'd like to see.
> > >
> > > I do not want to see every subsystem gaining APIs to program a
> > > PASID. All of that should be consolidated in *one place*.
> > >
> > > I do not want to see VDPA and VFIO have two nearly identical sets of
> > > APIs to control the PASID.
> > >
> > > Drivers consuming a PASID, like VDPA, should consume the PASID and do
> > > nothing more than authorize the HW to use it.
> > >
> > > quemu should have general code under the viommu driver that drives
> > > /dev/ioasid to create PASID's and manage the IO mapping according to
> > > the guest's needs.
> > >
> > > Drivers like VDPA and VFIO should simply accept that PASID and
> > > configure/authorize their HW to do DMA's with its tag.
> > >
> >
> > I agree with you on consolidating things in one place (especially for the
> > general SVA support). But here I was referring to an usage without
> > pgtable binding (Possibly Jason. W can say more here), where the
> > userspace just wants to allocate PASIDs, program/accept PASIDs to
> > various workqueues (device specific), and then use MAP/UNMAP
> > interface to manage address spaces associated with each PASID.
> > I just wanted to point out that the latter two steps are through
> > VFIO/VDPA specific interfaces.
> 
> No, don't do that.
> 
> VFIO and VDPA has no buisness having map/unmap interfaces once we have
> /dev/ioasid. That all belongs in the iosaid side.
> 
> I know they have those interfaces today, but that doesn't mean we have
> to keep using them for PASID use cases, they should be replaced with a
> 'do dma from this pasid on /dev/ioasid' interface certainly not a
> 'here is a pasid from /dev/ioasid, go ahead and configure it youself'
> interface
> 
> This is because PASID is *complicated* in the general case! For
> instance all the two level stuff you are talking about must not leak
> into every user!
> 

Hi, Jason,

I didn't get your last comment how the two level stuff is leaked into every
user. Could you elaborate it a bit?

and here is one example why using existing VFIO/VDPA interface makes
sense. say dev1 (w/ sva) and dev2 (w/o sva) are placed in a single VFIO 
container. The container is associated to an iommu domain which contains 
a single 2nd-level page table, shared by both devices (when attached to
the domain). The VFIO MAP operation is applied to the 2nd-level page 
table thus naturally applied to both devices. Then userspace could use 
/dev/ioasid to further allocate IOASIDs and bind multiple 1st-level page 
tables for dev1, nested on the shared 2nd-level page table. 

If following your suggestion then VFIO must deny VFIO MAP operations
on sva1 (assume userspace should not mix sva1 and sva2 in the same
container and instead use /dev/ioasid to map for sva1)? and even for 
a sva-capable device there is a window before the guest actually enables 
sva on that device then VFIO should still accept MAP in that window 
and then deny it after sva is enabled by the guest? This all sounds
unnecessary complex while there is already a clean way to achieve it...

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-05 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, April 6, 2021 7:40 AM
> 
> On Fri, Apr 02, 2021 at 07:58:02AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Thursday, April 1, 2021 9:47 PM
> > >
> > > On Thu, Apr 01, 2021 at 01:43:36PM +, Liu, Yi L wrote:
> > > > > From: Jason Gunthorpe 
> > > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > > >
> > > > > On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > > > > > > From: Jason Gunthorpe 
> > > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > > [...]
> > > > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > > > ENQCMD, but that is not consistent with the industry. We need to
> see
> > > > > > > normal nested PASID support with assigned PCI VFs.
> > > > > >
> > > > > > I'm not quire flow here. Intel also allows PASID usage in guest
> without
> > > > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > > > > ENQCMD.
> > > > >
> > > > > Then you need all the parts, the hypervisor calls from the vIOMMU,
> and
> > > > > you can't really use a vPASID.
> > > >
> > > > This is a diagram shows the vSVA setup.
> > >
> > > I'm not talking only about vSVA. Generic PASID support with arbitary
> > > mappings.
> > >
> > > And how do you deal with the vPASID vs pPASID issue if the system has
> > > a mix of physical devices and mdevs?
> > >
> >
> > We plan to support two schemes. One is vPASID identity-mapped to
> > pPASID then the mixed scenario just works, with the limitation of
> > lacking of live migration support. The other is non-identity-mapped
> > scheme, where live migration is supported but physical devices and
> > mdevs should not be mixed in one VM if both expose SVA capability
> > (requires some filtering check in Qemu).
> 
> That just becomes "block vPASID support if any device that
> doesn't use ENQCMD is plugged into the guest"

The limitation is only for physical device. and in reality it is not that
bad. To support live migration with physical device we anyway need 
additional work to migrate the device state (e.g. based on Max's work), 
then it's not unreasonable to also mediate guest programming of 
device specific PASID register to enable vPASID (need to translate in
the whole VM lifespan but likely is not a hot path).

> 
> Which needs a special VFIO capability of some kind so qemu knows to
> block it. This really needs to all be layed out together so someone
> can understand it :(

Or could simply based on whether the VFIO device supports live migration.

> 
> Why doesn't the siov cookbook explaining this stuff??
> 
> > We hope the /dev/ioasid can support both schemes, with the minimal
> > requirement of allowing userspace to tag a vPASID to a pPASID and
> > allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> > the guest will always use pPASID.
> 
> What I'm a unclear of is if /dev/ioasid even needs to care about
> vPASID or if vPASID is just a hidden artifact of the KVM connection to
> setup the translation table and the vIOMMU driver in qemu.

Not just for KVM. Also required by mdev, which needs to translate
vPASID into pPASID when ENQCMD is not used. As I replied in another
mail, possibly we don't need /dev/ioasid to know this fact, which 
should only care about the operations related to pPASID. VFIO could
carry vPASID information to mdev. KVM should have its own interface
to know this information, as you suggested earlier.

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-05 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, April 6, 2021 7:35 AM
> 
> On Fri, Apr 02, 2021 at 07:30:23AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Friday, April 2, 2021 12:04 AM
> > >
> > > On Thu, Apr 01, 2021 at 02:08:17PM +, Liu, Yi L wrote:
> > >
> > > > DMA page faults are delivered to root-complex via page request
> message
> > > and
> > > > it is per-device according to PCIe spec. Page request handling flow is:
> > > >
> > > > 1) iommu driver receives a page request from device
> > > > 2) iommu driver parses the page request message. Get the RID,PASID,
> > > faulted
> > > >page and requested permissions etc.
> > > > 3) iommu driver triggers fault handler registered by device driver with
> > > >iommu_report_device_fault()
> > >
> > > This seems confused.
> > >
> > > The PASID should define how to handle the page fault, not the driver.
> > >
> > > I don't remember any device specific actions in ATS, so what is the
> > > driver supposed to do?
> > >
> > > > 4) device driver's fault handler signals an event FD to notify userspace
> to
> > > >fetch the information about the page fault. If it's VM case, inject 
> > > > the
> > > >page fault to VM and let guest to solve it.
> > >
> > > If the PASID is set to 'report page fault to userspace' then some
> > > event should come out of /dev/ioasid, or be reported to a linked
> > > eventfd, or whatever.
> > >
> > > If the PASID is set to 'SVM' then the fault should be passed to
> > > handle_mm_fault
> > >
> > > And so on.
> > >
> > > Userspace chooses what happens based on how they configure the PASID
> > > through /dev/ioasid.
> > >
> > > Why would a device driver get involved here?
> > >
> > > > Eric has sent below series for the page fault reporting for VM with
> passthru
> > > > device.
> > > > https://lore.kernel.org/kvm/20210223210625.604517-5-
> > > eric.au...@redhat.com/
> > >
> > > It certainly should not be in vfio pci. Everything using a PASID needs
> > > this infrastructure, VDPA, mdev, PCI, CXL, etc.
> > >
> >
> > This touches an interesting fact:
> >
> > The fault may be triggered in either 1st-level or 2nd-level page table,
> > when nested translation is enabled (in vSVA case). The 1st-level is bound
> > by the user space, which therefore needs to receive the fault event. The
> > 2nd-level is managed by VFIO (or vDPA), which needs to fix the fault in
> > kernel (e.g. find HVA per faulting GPA, call handle_mm_fault and map
> > GPA->HPA to IOMMU). Yi's current proposal lets VFIO to register the
> > device fault handler, which then forward the event through /dev/ioasid
> > to userspace only if it is a 1st-level fault. Are you suggesting a pgtable-
> > centric fault reporting mechanism to separate handlers in each level,
> > i.e. letting VFIO register handler only for 2nd-level fault and then /dev/
> > ioasid register handler for 1st-level fault?
> 
> This I'm struggling to understand. /dev/ioasid should handle all the
> faults cases, why would VFIO ever get involved in a fault? What would
> it even do?
> 
> If the fault needs to be fixed in the hypervisor then it is a kernel
> fault and it does handle_mm_fault. This absolutely should not be in
> VFIO or VDPA

With nested translation it is GVA->GPA->HPA. The kernel needs to
fix fault related to GPA->HPA (managed by VFIO/VDPA) while 
handle_mm_fault only handles HVA->HPA. In this case, the 2nd-level
page fault is expected to be delivered to VFIO/VDPA first which then
find HVA related to GPA, call handle_mm_fault to fix HVA->HPA,
and then call iommu_map to fix GPA->HPA in the IOMMU page table.
This is exactly like how CPU EPT violation is handled.

> 
> If the fault needs to be fixed in the guest, then it needs to be
> delivered over /dev/ioasid in some way and injected into the
> vIOMMU. VFIO and VDPA have nothing to do with vIOMMU driver in quemu.
> 
> You need to have an interface under /dev/ioasid to create both page
> table levels and part of that will be to tell the kernel what VA is
> mapped and how to handle faults.

VFIO/VDPA already have their own interface to manage GPA->HPA
mappings. Why do we want to duplicate it in /dev/ioasid? 

Thanks
Kevin


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-05 Thread Jason Gunthorpe
On Fri, Apr 02, 2021 at 08:22:28AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Tuesday, March 30, 2021 9:29 PM
> > 
> > >
> > > First, userspace may use ioasid in a non-SVA scenario where ioasid is
> > > bound to specific security context (e.g. a control vq in vDPA) instead of
> > > tying to mm. In this case there is no pgtable binding initiated from user
> > > space. Instead, ioasid is allocated from /dev/ioasid and then programmed
> > > to the intended security context through specific passthrough framework
> > > which manages that context.
> > 
> > This sounds like the exact opposite of what I'd like to see.
> > 
> > I do not want to see every subsystem gaining APIs to program a
> > PASID. All of that should be consolidated in *one place*.
> > 
> > I do not want to see VDPA and VFIO have two nearly identical sets of
> > APIs to control the PASID.
> > 
> > Drivers consuming a PASID, like VDPA, should consume the PASID and do
> > nothing more than authorize the HW to use it.
> > 
> > quemu should have general code under the viommu driver that drives
> > /dev/ioasid to create PASID's and manage the IO mapping according to
> > the guest's needs.
> > 
> > Drivers like VDPA and VFIO should simply accept that PASID and
> > configure/authorize their HW to do DMA's with its tag.
> > 
> 
> I agree with you on consolidating things in one place (especially for the
> general SVA support). But here I was referring to an usage without 
> pgtable binding (Possibly Jason. W can say more here), where the 
> userspace just wants to allocate PASIDs, program/accept PASIDs to 
> various workqueues (device specific), and then use MAP/UNMAP 
> interface to manage address spaces associated with each PASID. 
> I just wanted to point out that the latter two steps are through 
> VFIO/VDPA specific interfaces. 

No, don't do that.

VFIO and VDPA has no buisness having map/unmap interfaces once we have
/dev/ioasid. That all belongs in the iosaid side.

I know they have those interfaces today, but that doesn't mean we have
to keep using them for PASID use cases, they should be replaced with a
'do dma from this pasid on /dev/ioasid' interface certainly not a
'here is a pasid from /dev/ioasid, go ahead and configure it youself'
interface

This is because PASID is *complicated* in the general case! For
instance all the two level stuff you are talking about must not leak
into every user!

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-05 Thread Jason Gunthorpe
On Fri, Apr 02, 2021 at 07:58:02AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, April 1, 2021 9:47 PM
> > 
> > On Thu, Apr 01, 2021 at 01:43:36PM +, Liu, Yi L wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > >
> > > > On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > > > > > From: Jason Gunthorpe 
> > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > [...]
> > > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > > ENQCMD, but that is not consistent with the industry. We need to see
> > > > > > normal nested PASID support with assigned PCI VFs.
> > > > >
> > > > > I'm not quire flow here. Intel also allows PASID usage in guest 
> > > > > without
> > > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > > > ENQCMD.
> > > >
> > > > Then you need all the parts, the hypervisor calls from the vIOMMU, and
> > > > you can't really use a vPASID.
> > >
> > > This is a diagram shows the vSVA setup.
> > 
> > I'm not talking only about vSVA. Generic PASID support with arbitary
> > mappings.
> > 
> > And how do you deal with the vPASID vs pPASID issue if the system has
> > a mix of physical devices and mdevs?
> > 
> 
> We plan to support two schemes. One is vPASID identity-mapped to
> pPASID then the mixed scenario just works, with the limitation of 
> lacking of live migration support. The other is non-identity-mapped 
> scheme, where live migration is supported but physical devices and 
> mdevs should not be mixed in one VM if both expose SVA capability 
> (requires some filtering check in Qemu). 

That just becomes "block vPASID support if any device that
doesn't use ENQCMD is plugged into the guest"

Which needs a special VFIO capability of some kind so qemu knows to
block it. This really needs to all be layed out together so someone
can understand it :(

Why doesn't the siov cookbook explaining this stuff??

> We hope the /dev/ioasid can support both schemes, with the minimal
> requirement of allowing userspace to tag a vPASID to a pPASID and
> allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> the guest will always use pPASID.

What I'm a unclear of is if /dev/ioasid even needs to care about
vPASID or if vPASID is just a hidden artifact of the KVM connection to
setup the translation table and the vIOMMU driver in qemu.

Since the physical HW never sees the vPASID I'm inclined to think the
latter.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-05 Thread Jason Gunthorpe
On Fri, Apr 02, 2021 at 07:30:23AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Friday, April 2, 2021 12:04 AM
> > 
> > On Thu, Apr 01, 2021 at 02:08:17PM +, Liu, Yi L wrote:
> > 
> > > DMA page faults are delivered to root-complex via page request message
> > and
> > > it is per-device according to PCIe spec. Page request handling flow is:
> > >
> > > 1) iommu driver receives a page request from device
> > > 2) iommu driver parses the page request message. Get the RID,PASID,
> > faulted
> > >page and requested permissions etc.
> > > 3) iommu driver triggers fault handler registered by device driver with
> > >iommu_report_device_fault()
> > 
> > This seems confused.
> > 
> > The PASID should define how to handle the page fault, not the driver.
> > 
> > I don't remember any device specific actions in ATS, so what is the
> > driver supposed to do?
> > 
> > > 4) device driver's fault handler signals an event FD to notify userspace 
> > > to
> > >fetch the information about the page fault. If it's VM case, inject the
> > >page fault to VM and let guest to solve it.
> > 
> > If the PASID is set to 'report page fault to userspace' then some
> > event should come out of /dev/ioasid, or be reported to a linked
> > eventfd, or whatever.
> > 
> > If the PASID is set to 'SVM' then the fault should be passed to
> > handle_mm_fault
> > 
> > And so on.
> > 
> > Userspace chooses what happens based on how they configure the PASID
> > through /dev/ioasid.
> > 
> > Why would a device driver get involved here?
> > 
> > > Eric has sent below series for the page fault reporting for VM with 
> > > passthru
> > > device.
> > > https://lore.kernel.org/kvm/20210223210625.604517-5-
> > eric.au...@redhat.com/
> > 
> > It certainly should not be in vfio pci. Everything using a PASID needs
> > this infrastructure, VDPA, mdev, PCI, CXL, etc.
> > 
> 
> This touches an interesting fact:
> 
> The fault may be triggered in either 1st-level or 2nd-level page table, 
> when nested translation is enabled (in vSVA case). The 1st-level is bound 
> by the user space, which therefore needs to receive the fault event. The 
> 2nd-level is managed by VFIO (or vDPA), which needs to fix the fault in 
> kernel (e.g. find HVA per faulting GPA, call handle_mm_fault and map 
> GPA->HPA to IOMMU). Yi's current proposal lets VFIO to register the 
> device fault handler, which then forward the event through /dev/ioasid 
> to userspace only if it is a 1st-level fault. Are you suggesting a pgtable-
> centric fault reporting mechanism to separate handlers in each level, 
> i.e. letting VFIO register handler only for 2nd-level fault and then /dev/
> ioasid register handler for 1st-level fault?

This I'm struggling to understand. /dev/ioasid should handle all the
faults cases, why would VFIO ever get involved in a fault? What would
it even do?

If the fault needs to be fixed in the hypervisor then it is a kernel
fault and it does handle_mm_fault. This absolutely should not be in
VFIO or VDPA

If the fault needs to be fixed in the guest, then it needs to be
delivered over /dev/ioasid in some way and injected into the
vIOMMU. VFIO and VDPA have nothing to do with vIOMMU driver in quemu.

You need to have an interface under /dev/ioasid to create both page
table levels and part of that will be to tell the kernel what VA is
mapped and how to handle faults.

VFIO/VDPA do *nothing* more than authorize the physical device to use
the given PASID.

In the VDPA case you might need to have SW access to the PASID, but
that should be provided by a generic iommu layer interface like
'copy_to/from_pasid()' not by involving VDPA in the address mapping.

Jason


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-02 Thread Liu, Yi L
Hi Jason,

> From: Jason Gunthorpe 
> Sent: Thursday, April 1, 2021 7:54 PM
> 
> On Thu, Apr 01, 2021 at 07:04:01AM +, Liu, Yi L wrote:
> 
> > After reading your reply in https://lore.kernel.org/linux-
> iommu/20210331123801.gd1463...@nvidia.com/#t
> > So you mean /dev/ioasid FD is per-VM instead of per-ioasid, so above
> skeleton
> > doesn't suit your idea.
> 
> You can do it one PASID per FD or multiple PASID's per FD. Most likely
> we will have high numbers of PASID's in a qemu process so I assume
> that number of FDs will start to be a contraining factor, thus
> multiplexing is reasonable.
> 
> It doesn't really change anything about the basic flow.
> 
> digging deeply into it either seems like a reasonable choice.
> 
> > +-+---+
> > |  userspace  |   kernel space  
> >   |
> > +-+---+
> > | ioasid_fd = | /dev/ioasid does below: 
> >   |
> > | open("/dev/ioasid", O_RDWR);|   struct ioasid_fd_ctx {
> >   |
> > | |   struct list_head ioasid_list; 
> >   |
> > | |   ...   
> >   |
> > | |   } ifd_ctx; // ifd_ctx is per ioasid_fd
> >   |
> 
> Sure, possibly an xarray not a list
> 
> > +-+---+
> > | ioctl(ioasid_fd,| /dev/ioasid does below: 
> >   |
> > |   ALLOC, );  |   struct ioasid_data {  
> >   |
> > | |   ioasid_t ioasid;  
> >   |
> > | |   struct list_head device_list; 
> >   |
> > | |   struct list_head next;
> >   |
> > | |   ...   
> >   |
> > | |   } id_data; // id_data is per ioasid   
> >   |
> > | | 
> >   |
> > | |   list_add(_data.next,   
> >   |
> > | |_ctx.ioasid_list);
> > |
> 
> Yes, this should have a kref in it too
> 
> > +-+---+
> > | ioctl(device_fd,| VFIO does below:
> >   |
> > |   DEVICE_ALLOW_IOASID,  | 1) get ioasid_fd, check if ioasid_fd is 
> > valid |
> > |   ioasid_fd,| 2) check if ioasid is allocated from 
> > ioasid_fd|
> > |   ioasid);  | 3) register device/domain info to 
> > /dev/ioasid |
> > | |tracked in id_data.device_list   
> >   |
> > | | 4) record the ioasid in VFIO's per-device   
> >   |
> > | |ioasid list for future security check
> >   |
> 
> You would provide a function that does steps 1&2 look at eventfd for
> instance.
> 
> I'm not sure we need to register the device with the ioasid. device
> should incr the kref on the ioasid_data at this point.
> 
> > +-+---+
> > | ioctl(ioasid_fd,| /dev/ioasid does below: 
> >   |
> > |   BIND_PGTBL,   | 1) find ioasid's id_data
> >   |
> > |   pgtbl_data,   | 2) loop the id_data.device_list and tell 
> > iommu|
> > |   ioasid);  |give ioasid access to the devices
> > |
> 
> This seems backwards, DEVICE_ALLOW_IOASID should tell the iommu to
> give the ioasid to the device.
> 
> Here the ioctl should be about assigning a memory map from the the
> current
> mm_struct to the pasid
> 
> > +-+---+
> > | ioctl(ioasid_fd,| /dev/ioasid does below: 
> >   |
> > |   UNBIND_PGTBL, | 1) find ioasid's id_data
> >   |
> > |   ioasid);  | 2) loop the id_data.device_list and tell 
> > iommu|
> > | |clear ioasid access to the devices   
> >   |
> 
> Also seems backwards. The ioctl here should be 'destroy ioasid' which
> wipes out the page table, halts DMA access and parks the PASID until
> all users are done.
> 
> > +-+---+
> > | ioctl(device_fd,| VFIO does below:
> >   |
> > |  DEVICE_DISALLOW_IOASID,| 1) check if ioasid is associated in VFIO's  
> >   |
> > |   ioasid_fd,|device ioasid list. 

RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-02 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Friday, April 2, 2021 3:58 PM
> 
> > From: Jason Gunthorpe 
> > Sent: Thursday, April 1, 2021 9:47 PM
> >
> > On Thu, Apr 01, 2021 at 01:43:36PM +, Liu, Yi L wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Thursday, April 1, 2021 9:16 PM
> > > >
> > > > On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > > > > > From: Jason Gunthorpe 
> > > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > > [...]
> > > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > > ENQCMD, but that is not consistent with the industry. We need to
> see
> > > > > > normal nested PASID support with assigned PCI VFs.
> > > > >
> > > > > I'm not quire flow here. Intel also allows PASID usage in guest 
> > > > > without
> > > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > > > ENQCMD.
> > > >
> > > > Then you need all the parts, the hypervisor calls from the vIOMMU, and
> > > > you can't really use a vPASID.
> > >
> > > This is a diagram shows the vSVA setup.
> >
> > I'm not talking only about vSVA. Generic PASID support with arbitary
> > mappings.
> >
> > And how do you deal with the vPASID vs pPASID issue if the system has
> > a mix of physical devices and mdevs?
> >
> 
> We plan to support two schemes. One is vPASID identity-mapped to
> pPASID then the mixed scenario just works, with the limitation of
> lacking of live migration support. The other is non-identity-mapped
> scheme, where live migration is supported but physical devices and
> mdevs should not be mixed in one VM if both expose SVA capability
> (requires some filtering check in Qemu). Although we have some
> idea relaxing this restriction in the non-identity scheme, it requires
> more thinking given how the vSVA uAPI is being refactored.
> 
> In both cases the virtual VT-d will report a virtual capability to the guest,
> indicating that the guest must request PASID through a vcmd register
> instead of creating its own namespace. The vIOMMU returns a vPASID
> to the guest upon request. The vPASID could be directly mapped to a
> pPASID or allocated from a new namespace based on user configuration.
> 
> We hope the /dev/ioasid can support both schemes, with the minimal
> requirement of allowing userspace to tag a vPASID to a pPASID and
> allowing mdev to translate vPASID into pPASID, i.e. not assuming that
> the guest will always use pPASID.
> 

Per your comments in other threads I suppose this requirement should
be implemented in VFIO_ALLOW_PASID command instead of going 
through /dev/ioasid which only needs to know pPASID and its pgtable
management.

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-02 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, March 30, 2021 9:29 PM
> 
> >
> > First, userspace may use ioasid in a non-SVA scenario where ioasid is
> > bound to specific security context (e.g. a control vq in vDPA) instead of
> > tying to mm. In this case there is no pgtable binding initiated from user
> > space. Instead, ioasid is allocated from /dev/ioasid and then programmed
> > to the intended security context through specific passthrough framework
> > which manages that context.
> 
> This sounds like the exact opposite of what I'd like to see.
> 
> I do not want to see every subsystem gaining APIs to program a
> PASID. All of that should be consolidated in *one place*.
> 
> I do not want to see VDPA and VFIO have two nearly identical sets of
> APIs to control the PASID.
> 
> Drivers consuming a PASID, like VDPA, should consume the PASID and do
> nothing more than authorize the HW to use it.
> 
> quemu should have general code under the viommu driver that drives
> /dev/ioasid to create PASID's and manage the IO mapping according to
> the guest's needs.
> 
> Drivers like VDPA and VFIO should simply accept that PASID and
> configure/authorize their HW to do DMA's with its tag.
> 

I agree with you on consolidating things in one place (especially for the
general SVA support). But here I was referring to an usage without 
pgtable binding (Possibly Jason. W can say more here), where the 
userspace just wants to allocate PASIDs, program/accept PASIDs to 
various workqueues (device specific), and then use MAP/UNMAP 
interface to manage address spaces associated with each PASID. 
I just wanted to point out that the latter two steps are through 
VFIO/VDPA specific interfaces. 

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-02 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, April 1, 2021 9:47 PM
> 
> On Thu, Apr 01, 2021 at 01:43:36PM +, Liu, Yi L wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Thursday, April 1, 2021 9:16 PM
> > >
> > > On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > > > > From: Jason Gunthorpe 
> > > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > > [...]
> > > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > > ENQCMD, but that is not consistent with the industry. We need to see
> > > > > normal nested PASID support with assigned PCI VFs.
> > > >
> > > > I'm not quire flow here. Intel also allows PASID usage in guest without
> > > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > > ENQCMD.
> > >
> > > Then you need all the parts, the hypervisor calls from the vIOMMU, and
> > > you can't really use a vPASID.
> >
> > This is a diagram shows the vSVA setup.
> 
> I'm not talking only about vSVA. Generic PASID support with arbitary
> mappings.
> 
> And how do you deal with the vPASID vs pPASID issue if the system has
> a mix of physical devices and mdevs?
> 

We plan to support two schemes. One is vPASID identity-mapped to
pPASID then the mixed scenario just works, with the limitation of 
lacking of live migration support. The other is non-identity-mapped 
scheme, where live migration is supported but physical devices and 
mdevs should not be mixed in one VM if both expose SVA capability 
(requires some filtering check in Qemu). Although we have some 
idea relaxing this restriction in the non-identity scheme, it requires 
more thinking given how the vSVA uAPI is being refactored.

In both cases the virtual VT-d will report a virtual capability to the guest,
indicating that the guest must request PASID through a vcmd register
instead of creating its own namespace. The vIOMMU returns a vPASID 
to the guest upon request. The vPASID could be directly mapped to a 
pPASID or allocated from a new namespace based on user configuration.

We hope the /dev/ioasid can support both schemes, with the minimal
requirement of allowing userspace to tag a vPASID to a pPASID and
allowing mdev to translate vPASID into pPASID, i.e. not assuming that
the guest will always use pPASID.

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-02 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, April 2, 2021 12:04 AM
> 
> On Thu, Apr 01, 2021 at 02:08:17PM +, Liu, Yi L wrote:
> 
> > DMA page faults are delivered to root-complex via page request message
> and
> > it is per-device according to PCIe spec. Page request handling flow is:
> >
> > 1) iommu driver receives a page request from device
> > 2) iommu driver parses the page request message. Get the RID,PASID,
> faulted
> >page and requested permissions etc.
> > 3) iommu driver triggers fault handler registered by device driver with
> >iommu_report_device_fault()
> 
> This seems confused.
> 
> The PASID should define how to handle the page fault, not the driver.
> 
> I don't remember any device specific actions in ATS, so what is the
> driver supposed to do?
> 
> > 4) device driver's fault handler signals an event FD to notify userspace to
> >fetch the information about the page fault. If it's VM case, inject the
> >page fault to VM and let guest to solve it.
> 
> If the PASID is set to 'report page fault to userspace' then some
> event should come out of /dev/ioasid, or be reported to a linked
> eventfd, or whatever.
> 
> If the PASID is set to 'SVM' then the fault should be passed to
> handle_mm_fault
> 
> And so on.
> 
> Userspace chooses what happens based on how they configure the PASID
> through /dev/ioasid.
> 
> Why would a device driver get involved here?
> 
> > Eric has sent below series for the page fault reporting for VM with passthru
> > device.
> > https://lore.kernel.org/kvm/20210223210625.604517-5-
> eric.au...@redhat.com/
> 
> It certainly should not be in vfio pci. Everything using a PASID needs
> this infrastructure, VDPA, mdev, PCI, CXL, etc.
> 

This touches an interesting fact:

The fault may be triggered in either 1st-level or 2nd-level page table, 
when nested translation is enabled (in vSVA case). The 1st-level is bound 
by the user space, which therefore needs to receive the fault event. The 
2nd-level is managed by VFIO (or vDPA), which needs to fix the fault in 
kernel (e.g. find HVA per faulting GPA, call handle_mm_fault and map 
GPA->HPA to IOMMU). Yi's current proposal lets VFIO to register the 
device fault handler, which then forward the event through /dev/ioasid 
to userspace only if it is a 1st-level fault. Are you suggesting a pgtable-
centric fault reporting mechanism to separate handlers in each level, 
i.e. letting VFIO register handler only for 2nd-level fault and then /dev/
ioasid register handler for 1st-level fault?

Thanks
Kevin



RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Thursday, April 1, 2021 7:47 PM
[...]
> I'm worried Intel views the only use of PASID in a guest is with
> ENQCMD, but that is not consistent with the industry. We need to see
> normal nested PASID support with assigned PCI VFs.

I'm not quire flow here. Intel also allows PASID usage in guest without
ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without ENQCMD.

[...]

> I'm sure there will be some small differences, and you should clearly
> explain the entire uAPI surface so that soneone from AMD and ARM can
> review it.

good suggestion, will do.

> > - this per-ioasid SVA operations is not aligned with the native SVA usage
> >   model. Native SVA bind is per-device.
> 
> Seems like that is an error in native SVA.
> 
> SVA is a particular mode of the PASID's memory mapping table, it has
> nothing to do with a device.

I think it still has relationship with device. This is determined by the
DMA remapping hierarchy in hardware. e.g. Intel VT-d, the DMA isolation is
enforced first in device granularity and then PASID granularity. SVA makes
usage of both PASID and device granularity isolation.

Regards,
Yi Liu

> Jason


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Thursday, April 1, 2021 9:43 PM
> 
> On Thu, Apr 01, 2021 at 01:38:46PM +, Liu, Yi L wrote:
> > > From: Jean-Philippe Brucker 
> > > Sent: Thursday, April 1, 2021 8:05 PM
> > [...]
> > >
> > > Also wondering about:
> > >
> > > * Querying IOMMU nesting capabilities before binding page tables
> (which
> > >   page table formats are supported?). We were planning to have a VFIO
> cap,
> > >   but I'm guessing we need to go back to the sysfs solution?
> >
> > I think it can also be with /dev/ioasid.
> 
> Sure, anything to do with page table formats and setting page tables
> should go through ioasid.
> 
> > > * Invalidation, probably an ioasid_fd ioctl?
> >
> > yeah, if we are doing bind/unbind_pagtable via ioasid_fd, then yes,
> > invalidation should go this way as well. This is why I worried it may
> > fail to meet the requirement from you and Eric.
> 
> Yes, all manipulation of page tables, including removing memory ranges, or
> setting memory ranges to trigger a page fault behavior should go
> through here.
> 
> > > * Page faults, page response. From and to devices, and don't necessarily
> > >   have a PASID. But needed by vdpa as well, so that's also going through
> > >   /dev/ioasid?
> >
> > page faults should still be per-device, but the fault event fd may be stored
> > in /dev/ioasid. page response would be in /dev/ioasid just like 
> > invalidation.
> 
> Here you mean non-SVA page faults that are delegated to userspace to
> handle?

no, just SVA page faults. otherwise, no need to let userspace handle.

> 
> Why would that be per-device?
>
> Can you show the flow you imagine?

DMA page faults are delivered to root-complex via page request message and
it is per-device according to PCIe spec. Page request handling flow is:

1) iommu driver receives a page request from device
2) iommu driver parses the page request message. Get the RID,PASID, faulted
   page and requested permissions etc.
3) iommu driver triggers fault handler registered by device driver with
   iommu_report_device_fault()
4) device driver's fault handler signals an event FD to notify userspace to
   fetch the information about the page fault. If it's VM case, inject the
   page fault to VM and let guest to solve it.

Eric has sent below series for the page fault reporting for VM with passthru
device.
https://lore.kernel.org/kvm/20210223210625.604517-5-eric.au...@redhat.com/

Regards,
Yi Liu

> Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 07:04:01AM +, Liu, Yi L wrote:

> After reading your reply in 
> https://lore.kernel.org/linux-iommu/20210331123801.gd1463...@nvidia.com/#t
> So you mean /dev/ioasid FD is per-VM instead of per-ioasid, so above skeleton
> doesn't suit your idea.

You can do it one PASID per FD or multiple PASID's per FD. Most likely
we will have high numbers of PASID's in a qemu process so I assume
that number of FDs will start to be a contraining factor, thus
multiplexing is reasonable.

It doesn't really change anything about the basic flow.

digging deeply into it either seems like a reasonable choice.

> +-+---+
> |  userspace  |   kernel space
> |
> +-+---+
> | ioasid_fd = | /dev/ioasid does below:   
> |
> | open("/dev/ioasid", O_RDWR);|   struct ioasid_fd_ctx {  
> |
> | |   struct list_head ioasid_list;   
> |
> | |   ... 
> |
> | |   } ifd_ctx; // ifd_ctx is per ioasid_fd  
> |

Sure, possibly an xarray not a list

> +-+---+
> | ioctl(ioasid_fd,| /dev/ioasid does below:   
> |
> |   ALLOC, );  |   struct ioasid_data {
> |
> | |   ioasid_t ioasid;
> |
> | |   struct list_head device_list;   
> |
> | |   struct list_head next;  
> |
> | |   ... 
> |
> | |   } id_data; // id_data is per ioasid 
> |
> | |   
> |
> | |   list_add(_data.next, 
> |
> | |_ctx.ioasid_list);
> |

Yes, this should have a kref in it too

> +-+---+
> | ioctl(device_fd,| VFIO does below:  
> |
> |   DEVICE_ALLOW_IOASID,  | 1) get ioasid_fd, check if ioasid_fd is valid 
> |
> |   ioasid_fd,| 2) check if ioasid is allocated from 
> ioasid_fd|
> |   ioasid);  | 3) register device/domain info to /dev/ioasid 
> |
> | |tracked in id_data.device_list 
> |
> | | 4) record the ioasid in VFIO's per-device 
> |
> | |ioasid list for future security check  
> |

You would provide a function that does steps 1&2 look at eventfd for
instance.

I'm not sure we need to register the device with the ioasid. device
should incr the kref on the ioasid_data at this point.

> +-+---+
> | ioctl(ioasid_fd,| /dev/ioasid does below:   
> |
> |   BIND_PGTBL,   | 1) find ioasid's id_data  
> |
> |   pgtbl_data,   | 2) loop the id_data.device_list and tell 
> iommu|
> |   ioasid);  |give ioasid access to the devices
> |

This seems backwards, DEVICE_ALLOW_IOASID should tell the iommu to
give the ioasid to the device.

Here the ioctl should be about assigning a memory map from the the current
mm_struct to the pasid

> +-+---+
> | ioctl(ioasid_fd,| /dev/ioasid does below:   
> |
> |   UNBIND_PGTBL, | 1) find ioasid's id_data  
> |
> |   ioasid);  | 2) loop the id_data.device_list and tell 
> iommu|
> | |clear ioasid access to the devices 
> |

Also seems backwards. The ioctl here should be 'destroy ioasid' which
wipes out the page table, halts DMA access and parks the PASID until
all users are done.

> +-+---+
> | ioctl(device_fd,| VFIO does below:  
> |
> |  DEVICE_DISALLOW_IOASID,| 1) check if ioasid is associated in VFIO's
> |
> |   ioasid_fd,|device ioasid list.
> |
> |   ioasid);  | 2) unregister device/domain info from 
> |
> | |/dev/ioasid, clear in id_data.device_list  
> |

This should disconnect the iommu and kref_put the ioasid_data

Remember the layering, only the device_fd knows what the pci_device 

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 01:43:36PM +, Liu, Yi L wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, April 1, 2021 9:16 PM
> > 
> > On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Thursday, April 1, 2021 7:47 PM
> > > [...]
> > > > I'm worried Intel views the only use of PASID in a guest is with
> > > > ENQCMD, but that is not consistent with the industry. We need to see
> > > > normal nested PASID support with assigned PCI VFs.
> > >
> > > I'm not quire flow here. Intel also allows PASID usage in guest without
> > > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> > ENQCMD.
> > 
> > Then you need all the parts, the hypervisor calls from the vIOMMU, and
> > you can't really use a vPASID.
> 
> This is a diagram shows the vSVA setup.

I'm not talking only about vSVA. Generic PASID support with arbitary
mappings.

And how do you deal with the vPASID vs pPASID issue if the system has
a mix of physical devices and mdevs?

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, April 1, 2021 7:47 PM
> [...]
> > I'm worried Intel views the only use of PASID in a guest is with
> > ENQCMD, but that is not consistent with the industry. We need to see
> > normal nested PASID support with assigned PCI VFs.
> 
> I'm not quire flow here. Intel also allows PASID usage in guest without
> ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without ENQCMD.

Then you need all the parts, the hypervisor calls from the vIOMMU, and
you can't really use a vPASID.

I'm not sure how Intel intends to resolve all of this.

> > > - this per-ioasid SVA operations is not aligned with the native SVA usage
> > >   model. Native SVA bind is per-device.
> > 
> > Seems like that is an error in native SVA.
> > 
> > SVA is a particular mode of the PASID's memory mapping table, it has
> > nothing to do with a device.
> 
> I think it still has relationship with device. This is determined by the
> DMA remapping hierarchy in hardware. e.g. Intel VT-d, the DMA isolation is
> enforced first in device granularity and then PASID granularity. SVA makes
> usage of both PASID and device granularity isolation.

When the device driver authorizes a PASID the VT-d stuff should setup
the isolation parameters for the give pci_device and PASID.

Do not leak implementation details like this as uAPI. Authorization
and memory map are distinct ideas with distinct interfaces. Do not mix
them.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 01:38:46PM +, Liu, Yi L wrote:
> > From: Jean-Philippe Brucker 
> > Sent: Thursday, April 1, 2021 8:05 PM
> [...]
> > 
> > Also wondering about:
> > 
> > * Querying IOMMU nesting capabilities before binding page tables (which
> >   page table formats are supported?). We were planning to have a VFIO cap,
> >   but I'm guessing we need to go back to the sysfs solution?
> 
> I think it can also be with /dev/ioasid.

Sure, anything to do with page table formats and setting page tables
should go through ioasid.

> > * Invalidation, probably an ioasid_fd ioctl?
> 
> yeah, if we are doing bind/unbind_pagtable via ioasid_fd, then yes,
> invalidation should go this way as well. This is why I worried it may
> fail to meet the requirement from you and Eric.

Yes, all manipulation of page tables, including removing memory ranges, or
setting memory ranges to trigger a page fault behavior should go
through here.

> > * Page faults, page response. From and to devices, and don't necessarily
> >   have a PASID. But needed by vdpa as well, so that's also going through
> >   /dev/ioasid?
> 
> page faults should still be per-device, but the fault event fd may be stored
> in /dev/ioasid. page response would be in /dev/ioasid just like invalidation.

Here you mean non-SVA page faults that are delegated to userspace to handle?

Why would that be per-device?

Can you show the flow you imagine?

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 02:05:00PM +0200, Jean-Philippe Brucker wrote:
> On Thu, Apr 01, 2021 at 07:04:01AM +, Liu, Yi L wrote:
> > > - how about AMD and ARM's vSVA support? Their PASID allocation and page
> > > table
> > >   happens within guest. They only need to bind the guest PASID table to
> > > host.
> 
> In this case each VM has its own IOASID space, and the host IOASID
> allocator doesn't participate. Plus this only makes sense when assigning a
> whole VF to a guest, and VFIO is the tool for this. So I wouldn't shoehorn
> those ops into /dev/ioasid, though we do need a transport for invalidate
> commands.

How does security work? Devices still have to be authorized to use the
PASID and this approach seems like it completely excludes mdev/vdpa
from ever using a PASID, and those are the most logical users.

> > >   Above model seems unable to fit them. (Jean, Eric, Jacob please feel 
> > > free
> > >   to correct me)
> > > - this per-ioasid SVA operations is not aligned with the native SVA usage
> > >   model. Native SVA bind is per-device.
> 
> Bare-metal SVA doesn't need /dev/ioasid either. 

It depends what you are doing. /dev/ioasid would provide fine grained
control over the memory mapping. It is not strict SVA, but I can see
applications where using a GPU with a pre-configured optimized mapping
could be interesting.

> A program uses a device handle to either ask whether SVA is enabled,
> or to enable it explicitly.  With or without /dev/ioasid, that step
> is required. OpenCL uses the first method - automatically enable
> "fine-grain system SVM" if available, and provide a flag to
> userspace.

SVA can be done with ioasid, we can decide if it makes sense to have
shortcuts in every driver

> So userspace does not need to know about PASID. It's only one method for
> doing SVA (some GPUs are context-switching page tables instead).

Sure, there are lots of approaches. Here we are only talking about
PASID enablement. PASID has more options.
 
> * Page faults, page response. From and to devices, and don't necessarily
>   have a PASID. But needed by vdpa as well, so that's also going through
>   /dev/ioasid?

Only real PASID's should use this interface. All the not-PASID stuff
is on its own.

VPDA should accept a PASID from here and configure the real
HW to attach the PASID to all DMA's connected to the virtio queues.

Jason


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Thursday, April 1, 2021 9:16 PM
> 
> On Thu, Apr 01, 2021 at 01:10:48PM +, Liu, Yi L wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Thursday, April 1, 2021 7:47 PM
> > [...]
> > > I'm worried Intel views the only use of PASID in a guest is with
> > > ENQCMD, but that is not consistent with the industry. We need to see
> > > normal nested PASID support with assigned PCI VFs.
> >
> > I'm not quire flow here. Intel also allows PASID usage in guest without
> > ENQCMD. e.g. Passthru a PF to guest, and use PASID on it without
> ENQCMD.
> 
> Then you need all the parts, the hypervisor calls from the vIOMMU, and
> you can't really use a vPASID.

This is a diagram shows the vSVA setup.

.-.  .---.
|   vIOMMU|  | Guest process CR3, FL only|
| |  '---'
./
| PASID Entry |--- PASID cache flush -
'-'   |
| |   V
| |CR3 in GPA
'-'
Guest
--| Shadow |--|
  vv  v
Host
.-.  .--.
|   pIOMMU|  | Bind FL for GVA-GPA  |
| |  '--'
./  |
| PASID Entry | V (Nested xlate)
'\.--.
| |   |SL for GPA-HPA, default domain|
| |   '--'
'-'
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables

https://lore.kernel.org/linux-iommu/20210302203545.436623-1-yi.l@intel.com/

> 
> I'm not sure how Intel intends to resolve all of this.
> 
> > > > - this per-ioasid SVA operations is not aligned with the native SVA
> usage
> > > >   model. Native SVA bind is per-device.
> > >
> > > Seems like that is an error in native SVA.
> > >
> > > SVA is a particular mode of the PASID's memory mapping table, it has
> > > nothing to do with a device.
> >
> > I think it still has relationship with device. This is determined by the
> > DMA remapping hierarchy in hardware. e.g. Intel VT-d, the DMA isolation
> is
> > enforced first in device granularity and then PASID granularity. SVA makes
> > usage of both PASID and device granularity isolation.
> 
> When the device driver authorizes a PASID the VT-d stuff should setup
> the isolation parameters for the give pci_device and PASID.

yes, both device and PASID is needed to setup VT-d stuff.

> Do not leak implementation details like this as uAPI. Authorization
> and memory map are distinct ideas with distinct interfaces. Do not mix
> them.

got you. Let's focus on the uAPI things here and leave implementation details
in RFC patches.

Thanks,
Yi Liu

> Jason


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Liu, Yi L
> From: Jean-Philippe Brucker 
> Sent: Thursday, April 1, 2021 8:05 PM
[...]
> 
> Also wondering about:
> 
> * Querying IOMMU nesting capabilities before binding page tables (which
>   page table formats are supported?). We were planning to have a VFIO cap,
>   but I'm guessing we need to go back to the sysfs solution?

I think it can also be with /dev/ioasid.

> 
> * Invalidation, probably an ioasid_fd ioctl?

yeah, if we are doing bind/unbind_pagtable via ioasid_fd, then yes,
invalidation should go this way as well. This is why I worried it may
fail to meet the requirement from you and Eric.

> * Page faults, page response. From and to devices, and don't necessarily
>   have a PASID. But needed by vdpa as well, so that's also going through
>   /dev/ioasid?

page faults should still be per-device, but the fault event fd may be stored
in /dev/ioasid. page response would be in /dev/ioasid just like invalidation.

Regards,
Yi Liu

> 
> Thanks,
> Jean


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jacob Pan
Hi Jason,

On Wed, 31 Mar 2021 21:37:05 -0300, Jason Gunthorpe  wrote:

> On Wed, Mar 31, 2021 at 04:46:21PM -0700, Jacob Pan wrote:
> > Hi Jason,
> > 
> > On Wed, 31 Mar 2021 09:38:01 -0300, Jason Gunthorpe 
> > wrote: 
> > > > > Get rid of the ioasid set.
> > > > >
> > > > > Each driver has its own list of allowed ioasids.
> > >  [...]  
> > > 
> > > The /dev/ioasid FD replaces this security check. By becoming FD
> > > centric you don't need additional kernel security objects.
> > > 
> > > Any process with access to the /dev/ioasid FD is allowed to control
> > > those PASID. The seperation between VMs falls naturally from the
> > > seperation of FDs without creating additional, complicated, security
> > > infrastrucure in the kernel.
> > > 
> > > This is why all APIs must be FD focused, and you need to have a
> > > logical layering of responsibility.
> > > 
> > >  Allocate a /dev/ioasid FD
> > >  Allocate PASIDs inside the FD
Just to be super clear. Do we allocate a FD for each PASID and return the
FD to the user? Or return the plain PASID number back to the user space?

> > >  Assign memory to the PASIDS
> > > 
> > >  Open a device FD, eg from VFIO or VDP
> > >  Instruct the device FD to authorize the device to access PASID A in
> > >  an ioasid FD  
> > How do we know user provided PASID A was allocated by the ioasid FD?  
> 
> You pass in the ioasid FD and use a 'get pasid from fdno' API to
> extract the required kernel structure.
> 
Seems you are talking about two FDs:
- /dev/ioasid FD
- per IOASID FD
This API ioasid = get_pasid_from_fd(dev_ioasid_fd, ioasid_fd);
dev_ioasid_fd will find the xarray for all the PASIDs allocated under it,
ioasid_fd wil be the index into the xarray to retrieve the actual ioasid.
Correct?

> > Shouldn't we validate user input by tracking which PASIDs are
> > allocated by which ioasid FD?  
> 
> Yes, but it is integral to the ioasid FD, not something separated.
> 
OK, if we have per IOASID FD in addition to the /dev/ioasid FD we can
validate user input.

> > > VFIO extracts some kernel representation of the ioasid from the ioasid
> > > fd using an API
> > >   
> > This lookup API seems to be asking for per ioasid FD storage array.
> > Today, the ioasid_set is per mm and contains a Xarray.   
> 
> Right, put the xarray per FD. A set per mm is fairly nonsensical, we
> don't use the mm as that kind of security key.
> 
Sounds good, one xarray per /dev/ioasid FD.

> > Since each VM, KVM can only open one ioasid FD, this per FD array
> > would be equivalent to the per mm ioasid_set, right?  
> 
> Why only one?  Each interaction with the other FDs should include the
> PASID/FD pair. There is no restriction to just one.
> 
OK, one per subsystem-VM. For example, if a VM has a VFIO and a VDPA
device, it should only two /dev/ioasid FDs respectively. Correct?

> > > VFIO does some kernel call to IOMMU/IOASID layer that says 'tell the
> > > IOMMU that this PCI device is allowed to use this PASID'  
> >
> > Would it be redundant to what iommu_uapi_sva_bind_gpasid() does? I
> > thought the idea is to use ioasid FD IOCTL to issue IOMMU uAPI calls.
> > Or we can skip this step for now and wait for the user to do SVA bind.  
> 
> I'm not sure what you are asking.
> 
> Possibly some of the IOMMU API will need a bit adjusting to make
> things split.
> 
> The act of programming the page tables and the act of authorizing a
> PCI BDF to use a PASID are distinct things with two different IOCTLs.
> 
Why separate? I don't see a use case to just authorize a PASID but don't
bind it with a page table. The very act of bind page table *is* the
authorization.

> iommu_uapi_sva_bind_gpasid() is never called by anything, and it's
> uAPI is never implemented.
> 
Just a little background here. We have been working on the vSVA stack
since 2017. At the time, VFIO was the de facto interface for IOMMU-aware
driver framework. These uAPIs were always developed alone side with the
accompanying VFIO patches served as consumers. By the time these IOMMU uAPIs
were merged after reviews from most vendors, the VFIO patchset was
approaching maturity in around v7. This is when we suddenly saw a new
request to support VDPA, which attempted VFIO earlier but ultimately moved
away.

For a complex stack like vSVA, I feel we have to reduce moving parts and do
some divide and conquer.

> Joerg? Why did you merge dead uapi and dead code?
> 
> Jason


Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 10:23:55AM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Wed, 31 Mar 2021 21:37:05 -0300, Jason Gunthorpe  wrote:
> 
> > On Wed, Mar 31, 2021 at 04:46:21PM -0700, Jacob Pan wrote:
> > > Hi Jason,
> > > 
> > > On Wed, 31 Mar 2021 09:38:01 -0300, Jason Gunthorpe 
> > > wrote: 
> > > > > > Get rid of the ioasid set.
> > > > > >
> > > > > > Each driver has its own list of allowed ioasids.
> > > >  [...]  
> > > > 
> > > > The /dev/ioasid FD replaces this security check. By becoming FD
> > > > centric you don't need additional kernel security objects.
> > > > 
> > > > Any process with access to the /dev/ioasid FD is allowed to control
> > > > those PASID. The seperation between VMs falls naturally from the
> > > > seperation of FDs without creating additional, complicated, security
> > > > infrastrucure in the kernel.
> > > > 
> > > > This is why all APIs must be FD focused, and you need to have a
> > > > logical layering of responsibility.
> > > > 
> > > >  Allocate a /dev/ioasid FD
> > > >  Allocate PASIDs inside the FD
> Just to be super clear. Do we allocate a FD for each PASID and return the
> FD to the user? Or return the plain PASID number back to the user space?

I would do multiple PASID's per /dev/ioasid FD because we expect alot
of PASIDs to be in use and we'd run into FDno limits.

> > > >  Assign memory to the PASIDS
> > > > 
> > > >  Open a device FD, eg from VFIO or VDP
> > > >  Instruct the device FD to authorize the device to access PASID A in
> > > >  an ioasid FD  
> > > How do we know user provided PASID A was allocated by the ioasid FD?  
> > 
> > You pass in the ioasid FD and use a 'get pasid from fdno' API to
> > extract the required kernel structure.
> > 
> Seems you are talking about two FDs:
> - /dev/ioasid FD

No, just this one.

> - per IOASID FD
> This API ioasid = get_pasid_from_fd(dev_ioasid_fd, ioasid_fd);
> dev_ioasid_fd will find the xarray for all the PASIDs allocated under it,
> ioasid_fd wil be the index into the xarray to retrieve the actual ioasid.
> Correct?

'ioasid_fd' is just the ioasid number in whatever numberspace the
/dev/ioasid FD's use.

> > Why only one?  Each interaction with the other FDs should include the
> > PASID/FD pair. There is no restriction to just one.

> OK, one per subsystem-VM. For example, if a VM has a VFIO and a VDPA
> device, it should only two /dev/ioasid FDs respectively. Correct?

No, only one.

For something like qemu's use case I mostly expect the vIOMMU driver
will open /dev/ioasid for each vIOMMU instance it creates (basically
only one)

> > The act of programming the page tables and the act of authorizing a
> > PCI BDF to use a PASID are distinct things with two different IOCTLs.
> > 
> Why separate? 

Because they have different owners and different layers in the
software.

It is not about use case, it is about putting the control points where
they naturally belong.

> For a complex stack like vSVA, I feel we have to reduce moving parts and do
> some divide and conquer.

uAPI should have all come together with a user and user application.

uAPI is hardest and most important part.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 02:08:17PM +, Liu, Yi L wrote:

> DMA page faults are delivered to root-complex via page request message and
> it is per-device according to PCIe spec. Page request handling flow is:
> 
> 1) iommu driver receives a page request from device
> 2) iommu driver parses the page request message. Get the RID,PASID, faulted
>page and requested permissions etc.
> 3) iommu driver triggers fault handler registered by device driver with
>iommu_report_device_fault()

This seems confused.

The PASID should define how to handle the page fault, not the driver.

I don't remember any device specific actions in ATS, so what is the
driver supposed to do?

> 4) device driver's fault handler signals an event FD to notify userspace to
>fetch the information about the page fault. If it's VM case, inject the
>page fault to VM and let guest to solve it.

If the PASID is set to 'report page fault to userspace' then some
event should come out of /dev/ioasid, or be reported to a linked
eventfd, or whatever.

If the PASID is set to 'SVM' then the fault should be passed to
handle_mm_fault

And so on.

Userspace chooses what happens based on how they configure the PASID
through /dev/ioasid.

Why would a device driver get involved here?

> Eric has sent below series for the page fault reporting for VM with passthru
> device.
> https://lore.kernel.org/kvm/20210223210625.604517-5-eric.au...@redhat.com/

It certainly should not be in vfio pci. Everything using a PASID needs
this infrastructure, VDPA, mdev, PCI, CXL, etc.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 04:38:44AM +, Liu, Yi L wrote:
> > From: Jason Gunthorpe 
> > Sent: Wednesday, March 31, 2021 8:41 PM
> > 
> > On Wed, Mar 31, 2021 at 07:38:36AM +, Liu, Yi L wrote:
> > 
> > > The reason is /dev/ioasid FD is per-VM since the ioasid allocated to
> > > the VM should be able to be shared by all assigned device for the VM.
> > > But the SVA operations (bind/unbind page table, cache_invalidate) should
> > > be per-device.
> > 
> > It is not *per-device* it is *per-ioasid*
> >
> > And as /dev/ioasid is an interface for controlling multiple ioasid's
> > there is no issue to also multiplex the page table manipulation for
> > multiple ioasids as well.
> > 
> > What you should do next is sketch out in some RFC the exactl ioctls
> > each FD would have and show how the parts I outlined would work and
> > point out any remaining gaps.
> > 
> > The device FD is something like the vfio_device FD from VFIO, it has
> > *nothing* to do with PASID beyond having a single ioctl to authorize
> > the device to use the PASID. All control of the PASID is in
> > /dev/ioasid.
> 
> good to see this reply. Your idea is much clearer to me now. If I'm getting
> you correctly. I think the skeleton is something like below:
> 
> 1) userspace opens a /dev/ioasid, meanwhile there will be an ioasid
>allocated and a per-ioasid context which can be used to do bind page
>table and cache invalidate, an ioasid FD returned to userspace.
> 2) userspace passes the ioasid FD to VFIO, let it associated with a device
>FD (like vfio_device FD).
> 3) userspace binds page table on the ioasid FD with the page table info.
> 4) userspace unbinds the page table on the ioasid FD
> 5) userspace de-associates the ioasid FD and device FD
> 
> Does above suit your outline?

Seems so

> If yes, I still have below concern and wish to see your opinion.
> - the ioasid FD and device association will happen at runtime instead of
>   just happen in the setup phase.

Of course, this is required for security. The vIOMMU must perform the
device association when the guest requires it. Otherwise a guest
cannot isolate a PASID to a single process/device pair.

I'm worried Intel views the only use of PASID in a guest is with
ENQCMD, but that is not consistent with the industry. We need to see
normal nested PASID support with assigned PCI VFs.

> - how about AMD and ARM's vSVA support? Their PASID allocation and page table
>   happens within guest. They only need to bind the guest PASID table to host.
>   Above model seems unable to fit them. (Jean, Eric, Jacob please feel free
>   to correct me)

No, everything needs the device association step or it is not
secure. 

You can give a PASID to a guest and allow it to manipulate it's memory
map directly, nested under the guest's CPU page tables.

However the guest cannot authorize a PCI BDF to utilize that PASID
without going through some kind of step in the hypervisor. A Guest
should not be able to authorize a PASID for a BDF it doesn't have
access to - only the hypervisor can enforce this.

This all must also fit into the mdev model where only the
device-specific mdev driver can do the device specific PASID
authorization. A hypercall is essential, or we need to stop pretending
mdev is a good idea.

I'm sure there will be some small differences, and you should clearly
explain the entire uAPI surface so that soneone from AMD and ARM can
review it.

> - this per-ioasid SVA operations is not aligned with the native SVA usage
>   model. Native SVA bind is per-device.

Seems like that is an error in native SVA.

SVA is a particular mode of the PASID's memory mapping table, it has
nothing to do with a device.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Jean-Philippe Brucker
On Thu, Apr 01, 2021 at 07:04:01AM +, Liu, Yi L wrote:
> > - how about AMD and ARM's vSVA support? Their PASID allocation and page
> > table
> >   happens within guest. They only need to bind the guest PASID table to
> > host.

In this case each VM has its own IOASID space, and the host IOASID
allocator doesn't participate. Plus this only makes sense when assigning a
whole VF to a guest, and VFIO is the tool for this. So I wouldn't shoehorn
those ops into /dev/ioasid, though we do need a transport for invalidate
commands.

> >   Above model seems unable to fit them. (Jean, Eric, Jacob please feel free
> >   to correct me)
> > - this per-ioasid SVA operations is not aligned with the native SVA usage
> >   model. Native SVA bind is per-device.

Bare-metal SVA doesn't need /dev/ioasid either. A program uses a device
handle to either ask whether SVA is enabled, or to enable it explicitly.
With or without /dev/ioasid, that step is required. OpenCL uses the first
method - automatically enable "fine-grain system SVM" if available, and
provide a flag to userspace.

So userspace does not need to know about PASID. It's only one method for
doing SVA (some GPUs are context-switching page tables instead).

> After reading your reply in 
> https://lore.kernel.org/linux-iommu/20210331123801.gd1463...@nvidia.com/#t
> So you mean /dev/ioasid FD is per-VM instead of per-ioasid, so above skeleton
> doesn't suit your idea. I draft below skeleton to see if our mind is the
> same. But I still believe there is an open on how to fit ARM and AMD's
> vSVA support in this the per-ioasid SVA operation model. thoughts?
> 
> +-+---+
> |  userspace  |   kernel space
> |
> +-+---+
> | ioasid_fd = | /dev/ioasid does below:   
> |
> | open("/dev/ioasid", O_RDWR);|   struct ioasid_fd_ctx {  
> |
> | |   struct list_head ioasid_list;   
> |
> | |   ... 
> |
> | |   } ifd_ctx; // ifd_ctx is per ioasid_fd  
> |
> +-+---+
> | ioctl(ioasid_fd,| /dev/ioasid does below:   
> |
> |   ALLOC, );  |   struct ioasid_data {
> |
> | |   ioasid_t ioasid;
> |
> | |   struct list_head device_list;   
> |
> | |   struct list_head next;  
> |
> | |   ... 
> |
> | |   } id_data; // id_data is per ioasid 
> |
> | |   
> |
> | |   list_add(_data.next, 
> |
> | |_ctx.ioasid_list); 
> |
> +-+---+
> | ioctl(device_fd,| VFIO does below:  
> |
> |   DEVICE_ALLOW_IOASID,  | 1) get ioasid_fd, check if ioasid_fd is valid 
> |
> |   ioasid_fd,| 2) check if ioasid is allocated from 
> ioasid_fd|
> |   ioasid);  | 3) register device/domain info to /dev/ioasid 
> |
> | |tracked in id_data.device_list 
> |
> | | 4) record the ioasid in VFIO's per-device 
> |
> | |ioasid list for future security check  
> |
> +-+---+
> | ioctl(ioasid_fd,| /dev/ioasid does below:   
> |
> |   BIND_PGTBL,   | 1) find ioasid's id_data  
> |
> |   pgtbl_data,   | 2) loop the id_data.device_list and tell 
> iommu|
> |   ioasid);  |give ioasid access to the devices  
> |
> +-+---+
> | ioctl(ioasid_fd,| /dev/ioasid does below:   
> |
> |   UNBIND_PGTBL, | 1) find ioasid's id_data  
> |
> |   ioasid);  | 2) loop the id_data.device_list and tell 
> iommu|
> | |clear ioasid access to the devices 
> |
> +-+---+
> | ioctl(device_fd,| VFIO does below:  
> |
> |  DEVICE_DISALLOW_IOASID,| 1) check if ioasid is associated in VFIO's
> |
> |  

RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-04-01 Thread Liu, Yi L
Hi Jason,

> From: Liu, Yi L 
> Sent: Thursday, April 1, 2021 12:39 PM
> 
> > From: Jason Gunthorpe 
> > Sent: Wednesday, March 31, 2021 8:41 PM
> >
> > On Wed, Mar 31, 2021 at 07:38:36AM +, Liu, Yi L wrote:
> >
> > > The reason is /dev/ioasid FD is per-VM since the ioasid allocated to
> > > the VM should be able to be shared by all assigned device for the VM.
> > > But the SVA operations (bind/unbind page table, cache_invalidate)
> should
> > > be per-device.
> >
> > It is not *per-device* it is *per-ioasid*
> >
> > And as /dev/ioasid is an interface for controlling multiple ioasid's
> > there is no issue to also multiplex the page table manipulation for
> > multiple ioasids as well.
> >
> > What you should do next is sketch out in some RFC the exactl ioctls
> > each FD would have and show how the parts I outlined would work and
> > point out any remaining gaps.
> >
> > The device FD is something like the vfio_device FD from VFIO, it has
> > *nothing* to do with PASID beyond having a single ioctl to authorize
> > the device to use the PASID. All control of the PASID is in
> > /dev/ioasid.
> 
> good to see this reply. Your idea is much clearer to me now. If I'm getting
> you correctly. I think the skeleton is something like below:
> f
> 1) userspace opens a /dev/ioasid, meanwhile there will be an ioasid
>allocated and a per-ioasid context which can be used to do bind page
>table and cache invalidate, an ioasid FD returned to userspace.
> 2) userspace passes the ioasid FD to VFIO, let it associated with a device
>FD (like vfio_device FD).
> 3) userspace binds page table on the ioasid FD with the page table info.
> 4) userspace unbinds the page table on the ioasid FD
> 5) userspace de-associates the ioasid FD and device FD
> 
> Does above suit your outline?
> 
> If yes, I still have below concern and wish to see your opinion.
> - the ioasid FD and device association will happen at runtime instead of
>   just happen in the setup phase.
> - how about AMD and ARM's vSVA support? Their PASID allocation and page
> table
>   happens within guest. They only need to bind the guest PASID table to
> host.
>   Above model seems unable to fit them. (Jean, Eric, Jacob please feel free
>   to correct me)
> - this per-ioasid SVA operations is not aligned with the native SVA usage
>   model. Native SVA bind is per-device.

After reading your reply in 
https://lore.kernel.org/linux-iommu/20210331123801.gd1463...@nvidia.com/#t
So you mean /dev/ioasid FD is per-VM instead of per-ioasid, so above skeleton
doesn't suit your idea. I draft below skeleton to see if our mind is the
same. But I still believe there is an open on how to fit ARM and AMD's
vSVA support in this the per-ioasid SVA operation model. thoughts?

+-+---+
|  userspace  |   kernel space|
+-+---+
| ioasid_fd = | /dev/ioasid does below:   |
| open("/dev/ioasid", O_RDWR);|   struct ioasid_fd_ctx {  |
| |   struct list_head ioasid_list;   |
| |   ... |
| |   } ifd_ctx; // ifd_ctx is per ioasid_fd  |
+-+---+
| ioctl(ioasid_fd,| /dev/ioasid does below:   |
|   ALLOC, );  |   struct ioasid_data {|
| |   ioasid_t ioasid;|
| |   struct list_head device_list;   |
| |   struct list_head next;  |
| |   ... |
| |   } id_data; // id_data is per ioasid |
| |   |
| |   list_add(_data.next, |
| |_ctx.ioasid_list); |
+-+---+
| ioctl(device_fd,| VFIO does below:  |
|   DEVICE_ALLOW_IOASID,  | 1) get ioasid_fd, check if ioasid_fd is valid |
|   ioasid_fd,| 2) check if ioasid is allocated from ioasid_fd|
|   ioasid);  | 3) register device/domain info to /dev/ioasid |
| |tracked in id_data.device_list |
| | 4) record the ioasid in VFIO's per-device |
| |ioasid list for future security check  |
+-+---+
| 

RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Wednesday, March 31, 2021 8:41 PM
> 
> On Wed, Mar 31, 2021 at 07:38:36AM +, Liu, Yi L wrote:
> 
> > The reason is /dev/ioasid FD is per-VM since the ioasid allocated to
> > the VM should be able to be shared by all assigned device for the VM.
> > But the SVA operations (bind/unbind page table, cache_invalidate) should
> > be per-device.
> 
> It is not *per-device* it is *per-ioasid*
>
> And as /dev/ioasid is an interface for controlling multiple ioasid's
> there is no issue to also multiplex the page table manipulation for
> multiple ioasids as well.
> 
> What you should do next is sketch out in some RFC the exactl ioctls
> each FD would have and show how the parts I outlined would work and
> point out any remaining gaps.
> 
> The device FD is something like the vfio_device FD from VFIO, it has
> *nothing* to do with PASID beyond having a single ioctl to authorize
> the device to use the PASID. All control of the PASID is in
> /dev/ioasid.

good to see this reply. Your idea is much clearer to me now. If I'm getting
you correctly. I think the skeleton is something like below:

1) userspace opens a /dev/ioasid, meanwhile there will be an ioasid
   allocated and a per-ioasid context which can be used to do bind page
   table and cache invalidate, an ioasid FD returned to userspace.
2) userspace passes the ioasid FD to VFIO, let it associated with a device
   FD (like vfio_device FD).
3) userspace binds page table on the ioasid FD with the page table info.
4) userspace unbinds the page table on the ioasid FD
5) userspace de-associates the ioasid FD and device FD

Does above suit your outline?

If yes, I still have below concern and wish to see your opinion.
- the ioasid FD and device association will happen at runtime instead of
  just happen in the setup phase.
- how about AMD and ARM's vSVA support? Their PASID allocation and page table
  happens within guest. They only need to bind the guest PASID table to host.
  Above model seems unable to fit them. (Jean, Eric, Jacob please feel free
  to correct me)
- this per-ioasid SVA operations is not aligned with the native SVA usage
  model. Native SVA bind is per-device.

Regards,
Yi Liu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 04:46:21PM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Wed, 31 Mar 2021 09:38:01 -0300, Jason Gunthorpe  wrote:
> 
> > > > Get rid of the ioasid set.
> > > >
> > > > Each driver has its own list of allowed ioasids.  
> >  [...]  
> > 
> > The /dev/ioasid FD replaces this security check. By becoming FD
> > centric you don't need additional kernel security objects.
> > 
> > Any process with access to the /dev/ioasid FD is allowed to control
> > those PASID. The seperation between VMs falls naturally from the
> > seperation of FDs without creating additional, complicated, security
> > infrastrucure in the kernel.
> > 
> > This is why all APIs must be FD focused, and you need to have a
> > logical layering of responsibility.
> > 
> >  Allocate a /dev/ioasid FD
> >  Allocate PASIDs inside the FD
> >  Assign memory to the PASIDS
> > 
> >  Open a device FD, eg from VFIO or VDP
> >  Instruct the device FD to authorize the device to access PASID A in
> >  an ioasid FD
> How do we know user provided PASID A was allocated by the ioasid FD?

You pass in the ioasid FD and use a 'get pasid from fdno' API to
extract the required kernel structure.

> Shouldn't we validate user input by tracking which PASIDs are
> allocated by which ioasid FD?

Yes, but it is integral to the ioasid FD, not something separated.

> > VFIO extracts some kernel representation of the ioasid from the ioasid
> > fd using an API
> > 
> This lookup API seems to be asking for per ioasid FD storage array. Today,
> the ioasid_set is per mm and contains a Xarray. 

Right, put the xarray per FD. A set per mm is fairly nonsensical, we
don't use the mm as that kind of security key.

> Since each VM, KVM can only open one ioasid FD, this per FD array
> would be equivalent to the per mm ioasid_set, right?

Why only one?  Each interaction with the other FDs should include the
PASID/FD pair. There is no restriction to just one.

> > VFIO does some kernel call to IOMMU/IOASID layer that says 'tell the
> > IOMMU that this PCI device is allowed to use this PASID'
>
> Would it be redundant to what iommu_uapi_sva_bind_gpasid() does? I thought
> the idea is to use ioasid FD IOCTL to issue IOMMU uAPI calls. Or we can
> skip this step for now and wait for the user to do SVA bind.

I'm not sure what you are asking.

Possibly some of the IOMMU API will need a bit adjusting to make
things split.

The act of programming the page tables and the act of authorizing a
PCI BDF to use a PASID are distinct things with two different IOCTLs.

iommu_uapi_sva_bind_gpasid() is never called by anything, and it's
uAPI is never implemented.

Joerg? Why did you merge dead uapi and dead code?

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jacob Pan
Hi Jason,

On Wed, 31 Mar 2021 09:38:01 -0300, Jason Gunthorpe  wrote:

> > > Get rid of the ioasid set.
> > >
> > > Each driver has its own list of allowed ioasids.  
>  [...]  
> 
> The /dev/ioasid FD replaces this security check. By becoming FD
> centric you don't need additional kernel security objects.
> 
> Any process with access to the /dev/ioasid FD is allowed to control
> those PASID. The seperation between VMs falls naturally from the
> seperation of FDs without creating additional, complicated, security
> infrastrucure in the kernel.
> 
> This is why all APIs must be FD focused, and you need to have a
> logical layering of responsibility.
> 
>  Allocate a /dev/ioasid FD
>  Allocate PASIDs inside the FD
>  Assign memory to the PASIDS
> 
>  Open a device FD, eg from VFIO or VDP
>  Instruct the device FD to authorize the device to access PASID A in
>  an ioasid FD
How do we know user provided PASID A was allocated by the ioasid FD?
Shouldn't we validate user input by tracking which PASIDs are allocated by
which ioasid FD? This is one reason why we have ioasid_set and its xarray.

>* Prior to being authorized the device will have NO access to any
>  PASID
>* Presenting BOTH the device FD and the ioasid FD to the kernel
>  is the security check. Any process with both FDs is allowed to
>  make the connection. This is normal Unix FD centric thinking.
> 
> > > Register a ioasid in the driver's list by passing the fd and ioasid #
> > >  
> > 
> > The fd here is a device fd. Am I right?   
> 
> It would be the vfio_device FD, for instance, and a VFIO IOCTL.
> 
> > If yes, your idea is ioasid is allocated via /dev/ioasid and
> > associated with device fd via either VFIO or vDPA ioctl. right?
> > sorry I may be asking silly questions but really need to ensure we
> > are talking in the same page.  
> 
> Yes, this is right
> 
> > > No listening to events. A simple understandable security model.  
> > 
> > For this suggestion, I have a little bit concern if we may have A-B/B-A
> > lock sequence issue since it requires the /dev/ioasid (if it supports)
> > to call back into VFIO/VDPA to check if the ioasid has been registered
> > to device FD and record it in the per-device list. right? Let's have
> > more discussion based on the skeleton sent by Kevin.  
> 
> Callbacks would be backwards.
> 
> User calls vfio with vfio_device fd and dev/ioasid fd
> 
> VFIO extracts some kernel representation of the ioasid from the ioasid
> fd using an API
> 
This lookup API seems to be asking for per ioasid FD storage array. Today,
the ioasid_set is per mm and contains a Xarray. Since each VM, KVM can only
open one ioasid FD, this per FD array would be equivalent to the per mm
ioasid_set, right?

> VFIO does some kernel call to IOMMU/IOASID layer that says 'tell the
> IOMMU that this PCI device is allowed to use this PASID'
> 
Would it be redundant to what iommu_uapi_sva_bind_gpasid() does? I thought
the idea is to use ioasid FD IOCTL to issue IOMMU uAPI calls. Or we can
skip this step for now and wait for the user to do SVA bind.

> VFIO mdev drivers then record that the PASID is allowed in its own
> device specific struct for later checking during other system calls.


Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jacob Pan
Hi Jason,

On Wed, 31 Mar 2021 15:33:24 -0300, Jason Gunthorpe  wrote:

> On Wed, Mar 31, 2021 at 11:20:30AM -0700, Jacob Pan wrote:
> > Hi Jason,
> > 
> > On Wed, 31 Mar 2021 14:31:48 -0300, Jason Gunthorpe 
> > wrote: 
> > > > > We should try to avoid hidden behind the scenes kernel
> > > > > interconnections between subsystems.
> > > > > 
>  [...]  
>  [...]  
> > yes, this is done in this patchset.
> >   
>  [...]  
> > Just to clarify, you are saying (when FREE happens before proper
> > teardown) there is no need to proactively notify all users of the
> > IOASID to drop their reference. Instead, just wait for the other
> > parties to naturally close and drop their references. Am I
> > understanding you correctly?  
> 
> Yes. What are receivers going to do when you notify them anyhow? What
> will a mdev do? This is how you get into they crazy locking problems.
> 
The receivers perform cleanup work similar to normal unbind. Drain/Abort
PASID. Locking is an issue in that the atomic notifier is under IOASID
spinlock, so I provided a common ordered workqueue to let mdev drivers
queue cleanup work that cannot be done in atomic context. Not ideal. Also
need to prevent nested notifications for certain cases.

> It is an error for userspace to shutdown like this, recover sensibly
> and don't crash the kernel. PCIe error TLPs are expected, supress
> them. That is what we decided on the mmu notifier discussion.
> 
> > I feel having the notifications can add two values:
> > 1. Shorten the duration of errors (as you mentioned below), FD close can
> > take a long and unpredictable time. e.g. FD shared.  
> 
> Only if userspace exits in some uncontrolled way. In a controlled exit
> it can close all the FDs in the right order.
> 
> It is OK if userspace does something weird and ends up with disabled
> IOASIDs. It shouldn't do that if it cares.
> 
Agreed.

> > 2. Provide teardown ordering among PASID users. i.e. vCPU, IOMMU, mdev.
> >  
> 
> This is a hard ask too, there is no natural ordering here I can see,
> obviously we want vcpu, mdev, iommu for qemu but that doesn't seem to
> fall out unless we explicitly hard wire it into the kernel.
> 
The ordering problem as I understood is that it is difficult for KVM to
rendezvous all vCPUs before updating PASID translation table. So there
could be in-flight enqcmd with the stale PASID after the PASID table update
and refcount drop.

If KVM is the last one to drop the PASID refcount, the PASID could be
immediately reused and starts a new life. The in-flight enqcmd with the
stale PASID could cause problems. The likelihood and window is very small.

If we ensure KVM does PASID table update before IOMMU and mdev driver, the
stale PASID in the in-flight enqcmd would be be drained before starting
a new life.

Perhaps Yi and Kevin can explain this better.

> Doesn't kvm always kill the vCPU first based on the mmu notifier
> shooting down all the memory? IIRC this happens before FD close?
> 
I don't know the answer, Kevin & Yi?

> > > The duration between unmapping the ioasid and releasing all HW access
> > > will have HW see PCIE TLP errors due to the blocked access. If
> > > userspace messes up the order it is fine to cause this. We already had
> > > this dicussion when talking about how to deal with process exit in the
> > > simple SVA case.  
> > Yes, we have disabled fault reporting during this period. The slight
> > differences vs. the simple SVA case is that KVM is also involved and
> > there might be an ordering requirement to stop vCPU first.  
> 
> KVM can continue to use the PASIDs, they are parked and DMA is
> permanently blocked. When KVM reaches a natural point in its teardown
> it can release them.
> 
> If you have to stop the vcpu from a iommu notifier you are in the
> crazy locking world I mentioned. IMHO don't create exciting locking
> problems just to avoid PCI errors in uncontrolled shutdown.
> 
> Suppress the errors instead.
> 
I agree, this simplify things a lot. Just need to clarify the in-flight
enqcmd case.

> Jason


Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 11:20:30AM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Wed, 31 Mar 2021 14:31:48 -0300, Jason Gunthorpe  wrote:
> 
> > > > We should try to avoid hidden behind the scenes kernel
> > > > interconnections between subsystems.
> > > >   
> > > Can we? in case of exception. Since all these IOCTLs are coming from the
> > > unreliable user space, we must deal all exceptions.
> > >
> > > For example, when user closes /dev/ioasid FD before (or w/o) unbind
> > > IOCTL for VFIO, KVM, kernel must do cleanup and coordinate among
> > > subsystems. In this patchset, we have a per mm(ioasid_set) notifier to
> > > inform mdev, KVM to clean up and drop its refcount. Do you have any
> > > suggestion on this?  
> > 
> > The ioasid should be a reference counted object.
> > 
> yes, this is done in this patchset.
> 
> > When the FD is closed, or the ioasid is "destroyed" it just blocks DMA
> > and parks the PASID until *all* places release it. Upon a zero
> > refcount the PASID is recycled for future use.
> > 
> Just to clarify, you are saying (when FREE happens before proper
> teardown) there is no need to proactively notify all users of the IOASID to
> drop their reference. Instead, just wait for the other parties to naturally
> close and drop their references. Am I understanding you correctly?

Yes. What are receivers going to do when you notify them anyhow? What
will a mdev do? This is how you get into they crazy locking problems.

It is an error for userspace to shutdown like this, recover sensibly
and don't crash the kernel. PCIe error TLPs are expected, supress
them. That is what we decided on the mmu notifier discussion.

> I feel having the notifications can add two values:
> 1. Shorten the duration of errors (as you mentioned below), FD close can
> take a long and unpredictable time. e.g. FD shared.

Only if userspace exits in some uncontrolled way. In a controlled exit
it can close all the FDs in the right order.

It is OK if userspace does something weird and ends up with disabled
IOASIDs. It shouldn't do that if it cares.

> 2. Provide teardown ordering among PASID users. i.e. vCPU, IOMMU, mdev.

This is a hard ask too, there is no natural ordering here I can see,
obviously we want vcpu, mdev, iommu for qemu but that doesn't seem to
fall out unless we explicitly hard wire it into the kernel.

Doesn't kvm always kill the vCPU first based on the mmu notifier
shooting down all the memory? IIRC this happens before FD close?

> > The duration between unmapping the ioasid and releasing all HW access
> > will have HW see PCIE TLP errors due to the blocked access. If
> > userspace messes up the order it is fine to cause this. We already had
> > this dicussion when talking about how to deal with process exit in the
> > simple SVA case.
> Yes, we have disabled fault reporting during this period. The slight
> differences vs. the simple SVA case is that KVM is also involved and there
> might be an ordering requirement to stop vCPU first.

KVM can continue to use the PASIDs, they are parked and DMA is
permanently blocked. When KVM reaches a natural point in its teardown
it can release them.

If you have to stop the vcpu from a iommu notifier you are in the
crazy locking world I mentioned. IMHO don't create exciting locking
problems just to avoid PCI errors in uncontrolled shutdown.

Suppress the errors instead.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jacob Pan
Hi Jason,

On Wed, 31 Mar 2021 14:31:48 -0300, Jason Gunthorpe  wrote:

> > > We should try to avoid hidden behind the scenes kernel
> > > interconnections between subsystems.
> > >   
> > Can we? in case of exception. Since all these IOCTLs are coming from the
> > unreliable user space, we must deal all exceptions.
> >
> > For example, when user closes /dev/ioasid FD before (or w/o) unbind
> > IOCTL for VFIO, KVM, kernel must do cleanup and coordinate among
> > subsystems. In this patchset, we have a per mm(ioasid_set) notifier to
> > inform mdev, KVM to clean up and drop its refcount. Do you have any
> > suggestion on this?  
> 
> The ioasid should be a reference counted object.
> 
yes, this is done in this patchset.

> When the FD is closed, or the ioasid is "destroyed" it just blocks DMA
> and parks the PASID until *all* places release it. Upon a zero
> refcount the PASID is recycled for future use.
> 
Just to clarify, you are saying (when FREE happens before proper
teardown) there is no need to proactively notify all users of the IOASID to
drop their reference. Instead, just wait for the other parties to naturally
close and drop their references. Am I understanding you correctly?

I feel having the notifications can add two values:
1. Shorten the duration of errors (as you mentioned below), FD close can
take a long and unpredictable time. e.g. FD shared.
2. Provide teardown ordering among PASID users. i.e. vCPU, IOMMU, mdev.

> The duration between unmapping the ioasid and releasing all HW access
> will have HW see PCIE TLP errors due to the blocked access. If
> userspace messes up the order it is fine to cause this. We already had
> this dicussion when talking about how to deal with process exit in the
> simple SVA case.
Yes, we have disabled fault reporting during this period. The slight
differences vs. the simple SVA case is that KVM is also involved and there
might be an ordering requirement to stop vCPU first.

Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 09:34:57AM -0700, Jacob Pan wrote:

> "3.3 PASID translation
> To support PASID isolation for Shared Work Queues used by VMs, the CPU must
> provide a way for the PASID to be communicated to the device in the DMWr
> transaction. On Intel CPUs, the CPU provides a PASID translation table in
> the vCPUs virtual machine control structures. During ENQCMD/ENQCMDS
> instruction execution in a VM, the PASID translation table is used by the
> CPU to replace the guest PASID in the work descriptor with a host PASID
> before the descriptor is sent to the device.3.3 PASID translation"

Yikes, a special ENQCMD table in the hypervisor!

Still, pass the /dev/ioasid into a KVM IOCTL and tell it to populate
this table. KVM only adds to the table when userspace presents a
/dev/ioasid FD.

> > Doesn't work submission go either to the mdev driver or through the
> > secure PASID of #1?
> 
> No, once a PASID is bound with IOMMU, KVM, and the mdev, work
> submission is all done in HW.  But I don't think this will change
> for either uAPI design.

The big note here is "only for things that use ENQCMD" and that is
basically nothing these days.

> > Everything should revolve around the /dev/ioasid FD. qemu should pass
> > it to all places that need to know about PASID's in the VM.
> 
> I guess we need to extend KVM interface to support PASIDs. Our original
> intention was to avoid introducing new interfaces.

New features need new interfaces, especially if there is a security
sensitivity! KVM should *not* automatically opt into security
sensitive stuff without being explicitly told what to do.

Here you'd need to authorized *two* things for IDXD:
 - The mdev needs to be told it is allowed to use PASID, this tells
   the IOMMU driver to connect the pci device under the mdev
 - KVM needs to be told to populate a vPASID to the 'ENQCMD'
   security table translated to a physical PASID.

If qemu doesn't explicitly enable the ENQCMD security table it should
be *left disabled* by KVM - even if someone else is using PASID in the
same process. And the API should be narrow like this just to the
EQNCMD table as who knows what will come down the road, or how it will
work.

Having a PASID wrongly leak out into the VM would be a security
disaster. Be explicit.

> > We should try to avoid hidden behind the scenes kernel
> > interconnections between subsystems.
> > 
> Can we? in case of exception. Since all these IOCTLs are coming from the
> unreliable user space, we must deal all exceptions.
>
> For example, when user closes /dev/ioasid FD before (or w/o) unbind IOCTL
> for VFIO, KVM, kernel must do cleanup and coordinate among subsystems.
> In this patchset, we have a per mm(ioasid_set) notifier to inform mdev, KVM
> to clean up and drop its refcount. Do you have any suggestion on this?

The ioasid should be a reference counted object.

When the FD is closed, or the ioasid is "destroyed" it just blocks DMA
and parks the PASID until *all* places release it. Upon a zero
refcount the PASID is recycled for future use.

The duration between unmapping the ioasid and releasing all HW access
will have HW see PCIE TLP errors due to the blocked access. If
userspace messes up the order it is fine to cause this. We already had
this dicussion when talking about how to deal with process exit in the
simple SVA case.

> Let me try to paraphrase, you are suggesting common helper code and data
> format but still driver specific storage of the mapping, correct?

The driver just needs to hold the datastructure in its memory.

Like an xarray, the driver can have an xarray inside its struct
device, but the xarray library provides all the implementation.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jacob Pan
Hi Jason,

On Wed, 31 Mar 2021 09:28:05 -0300, Jason Gunthorpe  wrote:

> On Tue, Mar 30, 2021 at 05:10:41PM -0700, Jacob Pan wrote:
>  [...]  
>  [...]  
>  [...]  
> > This requires the mdev driver to obtain a list of allowed
> > PASIDs(possibly during PASID bind time) prior to do enforcement. IMHO,
> > the PASID enforcement points are:
> > 1. During WQ configuration (e.g.program MSI)
> > 2. During work submission
> > 
> > For VT-d shared workqueue, there is no way to enforce #2 in mdev driver
> > in that the PASID is obtained from PASID MSR from the CPU and submitted
> > w/o driver involvement.  
> 
> I assume that the PASID MSR is privileged and only qemu can program
> it? Otherwise this seems like a security problem.
> 
yes.

> If qemu controls it then the idxd userspace driver in qemu must ensure
> it is only ever programmed to an authorized PASID.
> 
it is ensured for #1.

> > The enforcement for #2 is in the KVM PASID translation table, which
> > is per VM.  
> 
> I don't understand why KVM gets involved in PASID??
> 
Here is an excerpt from the SIOV spec.
https://software.intel.com/content/www/us/en/develop/download/intel-scalable-io-virtualization-technical-specification.html

"3.3 PASID translation
To support PASID isolation for Shared Work Queues used by VMs, the CPU must
provide a way for the PASID to be communicated to the device in the DMWr
transaction. On Intel CPUs, the CPU provides a PASID translation table in
the vCPUs virtual machine control structures. During ENQCMD/ENQCMDS
instruction execution in a VM, the PASID translation table is used by the
CPU to replace the guest PASID in the work descriptor with a host PASID
before the descriptor is sent to the device.3.3 PASID translation"

> Doesn't work submission go either to the mdev driver or through the
> secure PASID of #1?
> 
No, once a PASID is bound with IOMMU, KVM, and the mdev, work submission is
all done in HW.
But I don't think this will change for either uAPI design.

> > For our current VFIO mdev model, bind guest page table does not involve
> > mdev driver. So this is a gap we must fill, i.e. include a callback from
> > mdev driver?  
> 
> No not a callback, tell the mdev driver with a VFIO IOCTL that it is
> authorized to use a specific PASID because the vIOMMU was told to
> allow it by the guest kernel. Simple and straightforward.
> 
Make sense.

> > > ioasid_set doesn't seem to help at all, certainly not as a concept
> > > tied to /dev/ioasid.
> > >   
> > Yes, we can take the security role off ioasid_set once we have per mdev
> > list. However, ioasid_set being a per VM/mm entity also bridge
> > communications among kernel subsystems that don't have direct call path.
> > e.g. KVM, VDCM and IOMMU.  
> 
> Everything should revolve around the /dev/ioasid FD. qemu should pass
> it to all places that need to know about PASID's in the VM.
> 
I guess we need to extend KVM interface to support PASIDs. Our original
intention was to avoid introducing new interfaces.

> We should try to avoid hidden behind the scenes kernel
> interconnections between subsystems.
> 
Can we? in case of exception. Since all these IOCTLs are coming from the
unreliable user space, we must deal all exceptions.

For example, when user closes /dev/ioasid FD before (or w/o) unbind IOCTL
for VFIO, KVM, kernel must do cleanup and coordinate among subsystems.
In this patchset, we have a per mm(ioasid_set) notifier to inform mdev, KVM
to clean up and drop its refcount. Do you have any suggestion on this?

> 
> > > So when you 'allow' a mdev to access a PASID you want to say:
> > >  Allow Guest PASID A, map it to host PASID B on this /dev/ioasid FD
> > >   
> 
> > Host and guest PASID value, as well as device info are available through
> > iommu_uapi_sva_bind_gpasid(), we just need to feed that info to mdev
> > driver.  
> 
> You need that IOCTL to exist on the *mdev driver*. It is a VFIO ioctl,
> not a iommu or ioasid or sva IOCTL.
>
OK. A separate IOCTL and separate step.

> > > That seems like a good helper library to provide for drivers to use,
> > > but it should be a construct entirely contained in the driver.  
> > why? would it be cleaner if it is in the common code?  
> 
> No, it is the "mid layer" problematic design.
> 
> Having the iommu layer store driver-specific data on behalf of a
> driver will just make a mess. Use the natural layering we have and
> store driver specific data in the driver structs.
> 
> Add a library to help build the datastructure if it necessary.
> 
Let me try to paraphrase, you are suggesting common helper code and data
format but still driver specific storage of the mapping, correct?

Will try this out, seems cleaner.

> Jason


Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 07:38:36AM +, Liu, Yi L wrote:

> The reason is /dev/ioasid FD is per-VM since the ioasid allocated to
> the VM should be able to be shared by all assigned device for the VM.
> But the SVA operations (bind/unbind page table, cache_invalidate) should
> be per-device.

It is not *per-device* it is *per-ioasid*

And as /dev/ioasid is an interface for controlling multiple ioasid's
there is no issue to also multiplex the page table manipulation for
multiple ioasids as well.

What you should do next is sketch out in some RFC the exactl ioctls
each FD would have and show how the parts I outlined would work and
point out any remaining gaps.

The device FD is something like the vfio_device FD from VFIO, it has
*nothing* to do with PASID beyond having a single ioctl to authorize
the device to use the PASID. All control of the PASID is in
/dev/ioasid.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 07:41:40AM +, Liu, Yi L wrote:
> > From: Jason Gunthorpe 
> > Sent: Tuesday, March 30, 2021 9:28 PM
> > 
> > On Tue, Mar 30, 2021 at 04:14:58AM +, Tian, Kevin wrote:
> > 
> > > One correction. The mdev should still construct the list of allowed 
> > > PASID's
> > as
> > > you said (by listening to IOASID_BIND/UNBIND event), in addition to the
> > ioasid
> > > set maintained per VM (updated when a PASID is allocated/freed). The
> > per-VM
> > > set is required for inter-VM isolation (verified when a pgtable is bound 
> > > to
> > the
> > > mdev/PASID), while the mdev's own list is necessary for intra-VM isolation
> > when
> > > multiple mdevs are assigned to the same VM (verified before loading a
> > PASID
> > > to the mdev). This series just handles the general part i.e. per-VM ioasid
> > set and
> > > leaves the mdev's own list to be managed by specific mdev driver which
> > listens
> > > to various IOASID events).
> > 
> > This is better, but I don't understand why we need such a convoluted
> > design.
> > 
> > Get rid of the ioasid set.
> >
> > Each driver has its own list of allowed ioasids.
> 
> First, I agree with you it's necessary to have a per-device allowed ioasid
> list. But besides it, I think we still need to ensure the ioasid used by a
> VM is really allocated to this VM. A VM should not use an ioasid allocated
> to another VM. right? Actually, this is the major intention for introducing
> ioasid_set.

The /dev/ioasid FD replaces this security check. By becoming FD
centric you don't need additional kernel security objects.

Any process with access to the /dev/ioasid FD is allowed to control
those PASID. The seperation between VMs falls naturally from the
seperation of FDs without creating additional, complicated, security
infrastrucure in the kernel.

This is why all APIs must be FD focused, and you need to have a
logical layering of responsibility.

 Allocate a /dev/ioasid FD
 Allocate PASIDs inside the FD
 Assign memory to the PASIDS

 Open a device FD, eg from VFIO or VDP
 Instruct the device FD to authorize the device to access PASID A in
 an ioasid FD
   * Prior to being authorized the device will have NO access to any
 PASID
   * Presenting BOTH the device FD and the ioasid FD to the kernel
 is the security check. Any process with both FDs is allowed to
 make the connection. This is normal Unix FD centric thinking.

> > Register a ioasid in the driver's list by passing the fd and ioasid #
> 
> The fd here is a device fd. Am I right? 

It would be the vfio_device FD, for instance, and a VFIO IOCTL.

> If yes, your idea is ioasid is allocated via /dev/ioasid and
> associated with device fd via either VFIO or vDPA ioctl. right?
> sorry I may be asking silly questions but really need to ensure we
> are talking in the same page.

Yes, this is right

> > No listening to events. A simple understandable security model.
> 
> For this suggestion, I have a little bit concern if we may have A-B/B-A
> lock sequence issue since it requires the /dev/ioasid (if it supports)
> to call back into VFIO/VDPA to check if the ioasid has been registered to
> device FD and record it in the per-device list. right? Let's have more
> discussion based on the skeleton sent by Kevin.

Callbacks would be backwards.

User calls vfio with vfio_device fd and dev/ioasid fd

VFIO extracts some kernel representation of the ioasid from the ioasid
fd using an API

VFIO does some kernel call to IOMMU/IOASID layer that says 'tell the
IOMMU that this PCI device is allowed to use this PASID'

VFIO mdev drivers then record that the PASID is allowed in its own
device specific struct for later checking during other system calls.

No lock inversions. No callbacks. Why do we need callbacks?? Simplify.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Jason Gunthorpe
On Tue, Mar 30, 2021 at 05:10:41PM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Tue, 30 Mar 2021 10:43:13 -0300, Jason Gunthorpe  wrote:
> 
> > > If two mdevs from the same PF dev are assigned to two VMs, the PASID
> > > table will be shared. IOASID set ensures one VM cannot program another
> > > VM's PASIDs. I assume 'secure context' is per VM when it comes to host
> > > PASID.  
> > 
> > No, the mdev device driver must enforce this directly. It is the one
> > that programms the physical shared HW, it is the one that needs a list
> > of PASID's it is allowed to program *for each mdev*
> > 
> This requires the mdev driver to obtain a list of allowed PASIDs(possibly
> during PASID bind time) prior to do enforcement. IMHO, the PASID enforcement
> points are:
> 1. During WQ configuration (e.g.program MSI)
> 2. During work submission
> 
> For VT-d shared workqueue, there is no way to enforce #2 in mdev driver in
> that the PASID is obtained from PASID MSR from the CPU and submitted w/o
> driver involvement.

I assume that the PASID MSR is privileged and only qemu can program
it? Otherwise this seems like a security problem.

If qemu controls it then the idxd userspace driver in qemu must ensure
it is only ever programmed to an authorized PASID.

> The enforcement for #2 is in the KVM PASID translation table, which
> is per VM.

I don't understand why KVM gets involved in PASID??

Doesn't work submission go either to the mdev driver or through the
secure PASID of #1?

> For our current VFIO mdev model, bind guest page table does not involve
> mdev driver. So this is a gap we must fill, i.e. include a callback from
> mdev driver?

No not a callback, tell the mdev driver with a VFIO IOCTL that it is
authorized to use a specific PASID because the vIOMMU was told to
allow it by the guest kernel. Simple and straightforward.

> > ioasid_set doesn't seem to help at all, certainly not as a concept
> > tied to /dev/ioasid.
> > 
> Yes, we can take the security role off ioasid_set once we have per mdev
> list. However, ioasid_set being a per VM/mm entity also bridge
> communications among kernel subsystems that don't have direct call path.
> e.g. KVM, VDCM and IOMMU.

Everything should revolve around the /dev/ioasid FD. qemu should pass
it to all places that need to know about PASID's in the VM.

We should try to avoid hidden behind the scenes kernel
interconnections between subsystems.


> > So when you 'allow' a mdev to access a PASID you want to say:
> >  Allow Guest PASID A, map it to host PASID B on this /dev/ioasid FD
> > 

> Host and guest PASID value, as well as device info are available through
> iommu_uapi_sva_bind_gpasid(), we just need to feed that info to mdev driver.

You need that IOCTL to exist on the *mdev driver*. It is a VFIO ioctl,
not a iommu or ioasid or sva IOCTL.
 
> > That seems like a good helper library to provide for drivers to use,
> > but it should be a construct entirely contained in the driver.
> why? would it be cleaner if it is in the common code?

No, it is the "mid layer" problematic design.

Having the iommu layer store driver-specific data on behalf of a
driver will just make a mess. Use the natural layering we have and
store driver specific data in the driver structs.

Add a library to help build the datastructure if it necessary.

Jason


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Tuesday, March 30, 2021 9:43 PM
[..]
> No, the mdev device driver must enforce this directly. It is the one
> that programms the physical shared HW, it is the one that needs a list
> of PASID's it is allowed to program *for each mdev*
> 
> ioasid_set doesn't seem to help at all, certainly not as a concept
> tied to /dev/ioasid.
> 

As replied in another thread. We introduced ioasid_set based on the
motivation to have per-VM ioasid track, which is required when user
space tries to bind an ioasid with a device. Should ensure the ioasid
it is using was allocated to it. otherwise, we may suffer inter-VM ioasid
problem. It may not necessaty to be ioasid_set but a per-VM ioasid track
is necessary.

Regards,
Yi Liu


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Liu, Yi L
> From: Jason Gunthorpe 
> Sent: Tuesday, March 30, 2021 9:28 PM
> 
> On Tue, Mar 30, 2021 at 04:14:58AM +, Tian, Kevin wrote:
> 
> > One correction. The mdev should still construct the list of allowed PASID's
> as
> > you said (by listening to IOASID_BIND/UNBIND event), in addition to the
> ioasid
> > set maintained per VM (updated when a PASID is allocated/freed). The
> per-VM
> > set is required for inter-VM isolation (verified when a pgtable is bound to
> the
> > mdev/PASID), while the mdev's own list is necessary for intra-VM isolation
> when
> > multiple mdevs are assigned to the same VM (verified before loading a
> PASID
> > to the mdev). This series just handles the general part i.e. per-VM ioasid
> set and
> > leaves the mdev's own list to be managed by specific mdev driver which
> listens
> > to various IOASID events).
> 
> This is better, but I don't understand why we need such a convoluted
> design.
> 
> Get rid of the ioasid set.
>
> Each driver has its own list of allowed ioasids.

First, I agree with you it's necessary to have a per-device allowed ioasid
list. But besides it, I think we still need to ensure the ioasid used by a
VM is really allocated to this VM. A VM should not use an ioasid allocated
to another VM. right? Actually, this is the major intention for introducing
ioasid_set.

> Register a ioasid in the driver's list by passing the fd and ioasid #

The fd here is a device fd. Am I right? If yes, your idea is ioasid is
allocated via /dev/ioasid and associated with device fd via either VFIO
or vDPA ioctl. right? sorry I may be asking silly questions but really
need to ensure we are talking in the same page.

> No listening to events. A simple understandable security model.

For this suggestion, I have a little bit concern if we may have A-B/B-A
lock sequence issue since it requires the /dev/ioasid (if it supports)
to call back into VFIO/VDPA to check if the ioasid has been registered to
device FD and record it in the per-device list. right? Let's have more
discussion based on the skeleton sent by Kevin.

Regards,
Yi Liu


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-31 Thread Liu, Yi L
Hi Jason,

> From: Jason Gunthorpe 
> Sent: Tuesday, March 30, 2021 9:29 PM
> 
> On Tue, Mar 30, 2021 at 01:37:05AM +, Tian, Kevin wrote:
[...]
> > Hi, Jason,
> >
> > Actually above is a major open while we are refactoring vSVA uAPI toward
> > this direction. We have two concerns about merging /dev/ioasid with
> > /dev/sva, and would like to hear your thought whether they are valid.
> >
> > First, userspace may use ioasid in a non-SVA scenario where ioasid is
> > bound to specific security context (e.g. a control vq in vDPA) instead of
> > tying to mm. In this case there is no pgtable binding initiated from user
> > space. Instead, ioasid is allocated from /dev/ioasid and then programmed
> > to the intended security context through specific passthrough framework
> > which manages that context.
> 
> This sounds like the exact opposite of what I'd like to see.
> 
> I do not want to see every subsystem gaining APIs to program a
> PASID. All of that should be consolidated in *one place*.
> 
> I do not want to see VDPA and VFIO have two nearly identical sets of
> APIs to control the PASID.
> 
> Drivers consuming a PASID, like VDPA, should consume the PASID and do
> nothing more than authorize the HW to use it.
> 
> quemu should have general code under the viommu driver that drives
> /dev/ioasid to create PASID's and manage the IO mapping according to
> the guest's needs.
> 
> Drivers like VDPA and VFIO should simply accept that PASID and
> configure/authorize their HW to do DMA's with its tag.
> 
> > Second, ioasid is managed per process/VM while pgtable binding is a
> > device-wise operation.  The userspace flow looks like below for an integral
> > /dev/ioasid interface:
> >
> > - ioctl(container->fd, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU)
> > - ioasid_fd = open(/dev/ioasid)
> > - ioctl(ioasid_fd, IOASID_GET_USVA_FD, _fd) //an empty context
> > - ioctl(device->fd, VFIO_DEVICE_SET_SVA, _fd); //sva_fd ties to
> device
> > - ioctl(sva_fd, USVA_GET_INFO, _info);
> > - ioctl(ioasid_fd, IOMMU_ALLOC_IOASID, );
> > - ioctl(sva_fd, USVA_BIND_PGTBL, _data);
> > - ioctl(sva_fd, USVA_FLUSH_CACHE, _info);
> > - ioctl(sva_fd, USVA_UNBIND_PGTBL, _data);
> > - ioctl(device->fd, VFIO_DEVICE_UNSET_SVA, _fd);
> > - close(sva_fd)
> > - close(ioasid_fd)
> >
> > Our hesitation here is based on one of your earlier comments that
> > you are not a fan of constructing fd's through ioctl. Are you OK with
> > above flow or have a better idea of handling it?
> 
> My reaction is to squash 'sva' and ioasid fds together, I can't see
> why you'd need two fds to manipulate a PASID.

The reason is /dev/ioasid FD is per-VM since the ioasid allocated to
the VM should be able to be shared by all assigned device for the VM.
But the SVA operations (bind/unbind page table, cache_invalidate) should
be per-device. If squashing the two fds to be one, then requires a device
tag for each vSVA ioctl. I'm not sure if it is good. Per me, it looks
better to have a SVA FD and associated with a device FD so that any ioctl
on it will be in the device level. This also benefits ARM and AMD's vSVA
support since they binds guest PASID table to host instead of binding
guest page tables to specific PASIDs.

Regards,
Yi Liu


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-30 Thread Jacob Pan
Hi Jason,

On Tue, 30 Mar 2021 10:43:13 -0300, Jason Gunthorpe  wrote:

> > If two mdevs from the same PF dev are assigned to two VMs, the PASID
> > table will be shared. IOASID set ensures one VM cannot program another
> > VM's PASIDs. I assume 'secure context' is per VM when it comes to host
> > PASID.  
> 
> No, the mdev device driver must enforce this directly. It is the one
> that programms the physical shared HW, it is the one that needs a list
> of PASID's it is allowed to program *for each mdev*
> 
This requires the mdev driver to obtain a list of allowed PASIDs(possibly
during PASID bind time) prior to do enforcement. IMHO, the PASID enforcement
points are:
1. During WQ configuration (e.g.program MSI)
2. During work submission

For VT-d shared workqueue, there is no way to enforce #2 in mdev driver in
that the PASID is obtained from PASID MSR from the CPU and submitted w/o
driver involvement. The enforcement for #2 is in the KVM PASID translation
table, which is per VM.

For our current VFIO mdev model, bind guest page table does not involve
mdev driver. So this is a gap we must fill, i.e. include a callback from
mdev driver?

> ioasid_set doesn't seem to help at all, certainly not as a concept
> tied to /dev/ioasid.
> 
Yes, we can take the security role off ioasid_set once we have per mdev
list. However, ioasid_set being a per VM/mm entity also bridge
communications among kernel subsystems that don't have direct call path.
e.g. KVM, VDCM and IOMMU.

> > No. the mdev driver consults with IOASID core When the guest programs a
> > guest PASID on to he mdev. VDCM driver does a lookup:
> > host_pasid = ioasid_find_by_spid(ioasid_set, guest_pasid);  
> 
> This is the wrong layering. Tell the mdev device directly what it is
> allowed to do. Do not pollute the ioasid core with security stuff.
> 
> > > I'd say you shoul have a single /dev/ioasid per VM and KVM should
> > > attach to that - it should get all the global events/etc that are not
> > > device specific.
> > >   
> > You mean a single /dev/ioasid FD per VM and KVM? I think that is what we
> > are doing in this set. A VM process can only open /dev/ioasid once, then
> > use the FD for allocation and pass the PASID for bind page table etc.  
> 
> Yes, I think that is reasonable.
> 
> Tag all the IOCTL's with the IOASID number.
>  
> > > Not sure what guest-host PASID means, these have to be 1:1 for device
> > > assignment to work - why would use something else for mdev?
> > >   
> > We have G-H PASID translation. They don't have to be 1:1.
> > IOASID Set Private ID (SPID) is intended as a generic solution for
> > guest PASID. Could you review the secion Section: IOASID Set Private ID
> > (SPID) in the doc patch?  
> 
> Again this only works for MDEV? How would you do translation for a
> real PF/VF?
> 
Right, we will need some mediation for PF/VF.

> So when you 'allow' a mdev to access a PASID you want to say:
>  Allow Guest PASID A, map it to host PASID B on this /dev/ioasid FD
> 
> ?
> 
Host and guest PASID value, as well as device info are available through
iommu_uapi_sva_bind_gpasid(), we just need to feed that info to mdev driver.

> That seems like a good helper library to provide for drivers to use,
> but it should be a construct entirely contained in the driver.
why? would it be cleaner if it is in the common code?

Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-30 Thread Jason Gunthorpe
On Tue, Mar 30, 2021 at 03:42:24PM +0200, Jean-Philippe Brucker wrote:
> On Tue, Mar 30, 2021 at 10:07:55AM -0300, Jason Gunthorpe wrote:
> > On Fri, Mar 26, 2021 at 09:06:42AM +0100, Jean-Philippe Brucker wrote:
> > 
> > > It's not inconceivable to have a control queue doing DMA tagged with
> > > PASID. The devices I know either use untagged DMA, or have a choice to use
> > > a PASID.
> > 
> > I don't think we should encourage that. A PASID and all the related is
> > so expensive compared to just doing normal untagged kernel DMA.
> 
> How is it expensive?  Low number of PASIDs, or slowing down DMA
> transactions?  PASIDs aren't a scarce resource on Arm systems, they have
> almost 1M unused PASIDs per VM.

There may be lots of PASIDs, but they are not without cost. The page
table behind them costs memory and cache occupancy, doing the lookups
hurts DMA performance.

Compare to a physical addressed kernel DMA (like x86 often sets up)
the runtime overheads from unnecessary PASID use is quite big.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-30 Thread Jason Gunthorpe
On Mon, Mar 29, 2021 at 03:55:26PM -0700, Jacob Pan wrote:

> In one of the earlier discussions, I was made aware of some use cases (by
> AMD, iirc) where PASID can be used w/o IOMMU. That is why I tried to keep
> ioasid a separate subsystem. Other than that, I don't see an issue
> combining the two.

That sounds like nonsense. A freshly created ioasid should have *NO
DMA*. Every access to it should result in a PCI error until a mapping
for the address space is defined. It is called IO *address space* for
a reason.

So, what exactly do you do with a PASID without an IOMMU? You
certainly can't expose it through this interface because you can't
establish the first requirement of *NO DMA*.

While there may be an interesting use case, it looks to be kernel-only
and not relavent here.

> it matters when it comes to which interface to choose. Use /dev/ioasid to
> allocate if PASID value cannot be hidden. Use some other interface for bind
> current and allocate if a PASID is not visible to the user.

I just view it as a shortcut, it has less to do with "hidden" and more
to do if the shortcut is a valuable savings. If you swap four ioctls
with one ioctl I'd say that is not enough of a win 
 
> Yes, I think we are on the same page. For example, today's uacce or idxd
> driver creates a hidden PASID when user does open(), where a new WQ is
> provisioned and bound to current mm. This is the case where /dev/ioasid is
> not needed.

So that is a probelm for uacce, they shouldn't have created PASIDs at
open() time, there is no option to customize what is happening there.

> Sorry, I am not following. In the doc, I have an example to show the
> ioasid_set to VM/mm mapping. We use mm as the ioasid_set token to identify
> who the owner of an IOASID is. i.e. who allocated the IOASID. Non-owner
> cannot perform bind page table or free operations.

As I said to Kevin this seems very over complicated.

Access to the /dev/ioasid FD is the only authorization the kernel
needs.

> Yes, each PF/VF has its own PASID table. The device can do whatever
> it wants as long as the PASID is present in the table. Programming of the
> pIOMMU PASID table entry, however, is controlled by the host.
> 
> IMHO, there are two levels of security here:
> 1. A PASID can only be used by a secure context
> 2. A device can only use allowed PASIDs (PASID namespace is system-wide but
> PASID table storage is per PF/VF)
> 
> IOASID set is designed for #1.

#1 sounds like the mdev case, and as I said to Kevin each and every
mdev needs its own allow'd PASID list. There is no need for an ioasid
set to implement that.

> > If a device is sharing a single PCI function with different security
> > contexts (eg vfio mdev) then the device itself is responsible to
> > ensure that only the secure interface can program a PASID and a less
> > secure context can never self-enroll. 
> 
> If two mdevs from the same PF dev are assigned to two VMs, the PASID
> table will be shared. IOASID set ensures one VM cannot program another VM's
> PASIDs. I assume 'secure context' is per VM when it comes to host PASID.

No, the mdev device driver must enforce this directly. It is the one
that programms the physical shared HW, it is the one that needs a list
of PASID's it is allowed to program *for each mdev*

ioasid_set doesn't seem to help at all, certainly not as a concept
tied to /dev/ioasid.

> No. the mdev driver consults with IOASID core When the guest programs a
> guest PASID on to he mdev. VDCM driver does a lookup:
> host_pasid = ioasid_find_by_spid(ioasid_set, guest_pasid);

This is the wrong layering. Tell the mdev device directly what it is
allowed to do. Do not pollute the ioasid core with security stuff.

> > I'd say you shoul have a single /dev/ioasid per VM and KVM should
> > attach to that - it should get all the global events/etc that are not
> > device specific.
> > 
> You mean a single /dev/ioasid FD per VM and KVM? I think that is what we
> are doing in this set. A VM process can only open /dev/ioasid once, then
> use the FD for allocation and pass the PASID for bind page table etc.

Yes, I think that is reasonable.

Tag all the IOCTL's with the IOASID number.
 
> > Not sure what guest-host PASID means, these have to be 1:1 for device
> > assignment to work - why would use something else for mdev?
> > 
> We have G-H PASID translation. They don't have to be 1:1.
> IOASID Set Private ID (SPID) is intended as a generic solution for guest 
> PASID.
> Could you review the secion Section: IOASID Set Private ID (SPID) in the
> doc patch?

Again this only works for MDEV? How would you do translation for a
real PF/VF?

So when you 'allow' a mdev to access a PASID you want to say:
 Allow Guest PASID A, map it to host PASID B on this /dev/ioasid FD

?

That seems like a good helper library to provide for drivers to use,
but it should be a construct entirely contained in the driver.

> We also had some slides from last year. Slide 3s-6 mostly.
> 

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-30 Thread Jean-Philippe Brucker
On Tue, Mar 30, 2021 at 10:07:55AM -0300, Jason Gunthorpe wrote:
> On Fri, Mar 26, 2021 at 09:06:42AM +0100, Jean-Philippe Brucker wrote:
> 
> > It's not inconceivable to have a control queue doing DMA tagged with
> > PASID. The devices I know either use untagged DMA, or have a choice to use
> > a PASID.
> 
> I don't think we should encourage that. A PASID and all the related is
> so expensive compared to just doing normal untagged kernel DMA.

How is it expensive?  Low number of PASIDs, or slowing down DMA
transactions?  PASIDs aren't a scarce resource on Arm systems, they have
almost 1M unused PASIDs per VM.

Thanks,
Jean

> I assume HW has these features because virtualization use cases might
> use them, eg by using mdev to assign a command queue - then it would
> need be be contained by a PASID.
> 
> Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-30 Thread Jason Gunthorpe
On Tue, Mar 30, 2021 at 01:37:05AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Tuesday, March 30, 2021 12:32 AM
> > 
> > On Wed, Mar 24, 2021 at 12:05:28PM -0700, Jacob Pan wrote:
> > 
> > > > IMHO a use created PASID is either bound to a mm (current) at creation
> > > > time, or it will never be bound to a mm and its page table is under
> > > > user control via /dev/ioasid.
> > > >
> > > True for PASID used in native SVA bind. But for binding with a guest mm,
> > > PASID is allocated first (VT-d virtual cmd interface Spec 10.4.44), the
> > > bind with the host IOMMU when vIOMMU PASID cache is invalidated.
> > >
> > > Our intention is to have two separate interfaces:
> > > 1. /dev/ioasid (allocation/free only)
> > > 2. /dev/sva (handles all SVA related activities including page tables)
> > 
> > I'm not sure I understand why you'd want to have two things. Doesn't
> > that just complicate everything?
> > 
> > Manipulating the ioasid, including filling it with page tables, seems
> > an integral inseperable part of the whole interface. Why have two ?
> 
> Hi, Jason,
> 
> Actually above is a major open while we are refactoring vSVA uAPI toward
> this direction. We have two concerns about merging /dev/ioasid with
> /dev/sva, and would like to hear your thought whether they are valid.
> 
> First, userspace may use ioasid in a non-SVA scenario where ioasid is 
> bound to specific security context (e.g. a control vq in vDPA) instead of 
> tying to mm. In this case there is no pgtable binding initiated from user
> space. Instead, ioasid is allocated from /dev/ioasid and then programmed
> to the intended security context through specific passthrough framework
> which manages that context.

This sounds like the exact opposite of what I'd like to see.

I do not want to see every subsystem gaining APIs to program a
PASID. All of that should be consolidated in *one place*.

I do not want to see VDPA and VFIO have two nearly identical sets of
APIs to control the PASID.

Drivers consuming a PASID, like VDPA, should consume the PASID and do
nothing more than authorize the HW to use it.

quemu should have general code under the viommu driver that drives
/dev/ioasid to create PASID's and manage the IO mapping according to
the guest's needs.

Drivers like VDPA and VFIO should simply accept that PASID and
configure/authorize their HW to do DMA's with its tag.

> Second, ioasid is managed per process/VM while pgtable binding is a
> device-wise operation.  The userspace flow looks like below for an integral
> /dev/ioasid interface:
> 
> - ioctl(container->fd, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU)
> - ioasid_fd = open(/dev/ioasid)
> - ioctl(ioasid_fd, IOASID_GET_USVA_FD, _fd) //an empty context
> - ioctl(device->fd, VFIO_DEVICE_SET_SVA, _fd); //sva_fd ties to device
> - ioctl(sva_fd, USVA_GET_INFO, _info);
> - ioctl(ioasid_fd, IOMMU_ALLOC_IOASID, );
> - ioctl(sva_fd, USVA_BIND_PGTBL, _data);
> - ioctl(sva_fd, USVA_FLUSH_CACHE, _info);
> - ioctl(sva_fd, USVA_UNBIND_PGTBL, _data);
> - ioctl(device->fd, VFIO_DEVICE_UNSET_SVA, _fd);
> - close(sva_fd)
> - close(ioasid_fd)
> 
> Our hesitation here is based on one of your earlier comments that
> you are not a fan of constructing fd's through ioctl. Are you OK with
> above flow or have a better idea of handling it?

My reaction is to squash 'sva' and ioasid fds together, I can't see
why you'd need two fds to manipulate a PASID.

DEVICE_SET_SVA seems like the wrong language too, it should be more
like DEVICE_ALLOW_IOASID which only tells the iommu and driver to alow
the pci_device to use the IOASID.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-30 Thread Jason Gunthorpe
On Tue, Mar 30, 2021 at 04:14:58AM +, Tian, Kevin wrote:

> One correction. The mdev should still construct the list of allowed PASID's as
> you said (by listening to IOASID_BIND/UNBIND event), in addition to the 
> ioasid 
> set maintained per VM (updated when a PASID is allocated/freed). The per-VM
> set is required for inter-VM isolation (verified when a pgtable is bound to 
> the 
> mdev/PASID), while the mdev's own list is necessary for intra-VM isolation 
> when 
> multiple mdevs are assigned to the same VM (verified before loading a PASID 
> to the mdev). This series just handles the general part i.e. per-VM ioasid 
> set and 
> leaves the mdev's own list to be managed by specific mdev driver which listens
> to various IOASID events).

This is better, but I don't understand why we need such a convoluted
design.

Get rid of the ioasid set.

Each driver has its own list of allowed ioasids.

Register a ioasid in the driver's list by passing the fd and ioasid #

No listening to events. A simple understandable security model.

Look - it took you three emails to even correctly explain the security
model you are striving for here, it is *obviously* too complicated for
anyone to understand or successfully implement. simplify smiplify
simplify.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-30 Thread Jason Gunthorpe
On Tue, Mar 30, 2021 at 02:24:09AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Tuesday, March 30, 2021 12:32 AM
> > > In terms of usage for guest SVA, an ioasid_set is mostly tied to a host 
> > > mm,
> > > the use case is as the following:
> > 
> > From that doc:
> > 
> >   It is imperative to enforce
> >   VM-IOASID ownership such that a malicious guest cannot target DMA
> >   traffic outside its own IOASIDs, or free an active IOASID that belongs
> >   to another VM.
> > 
> > Huh?
> > 
> > Security in a PASID world comes from the IOMMU blocking access to the
> > PASID except from approved PCI-ID's. If a VF/PF is assigned to a guest
> > then that guest can cause the device to issue any PASID by having
> > complete control and the vIOMMU is supposed to tell the real IOMMU
> > what PASID's the device is alowed to access.
> > 
> > If a device is sharing a single PCI function with different security
> > contexts (eg vfio mdev) then the device itself is responsible to
> > ensure that only the secure interface can program a PASID and a less
> > secure context can never self-enroll.
> > 
> > Here the mdev driver would have to consule with the vIOMMU to ensure
> > the mdev device is allowed to access the PASID - is that what this
> > set stuff is about?
> > 
> > If yes, it is backwards. The MDEV is the thing doing the security, the
> > MDEV should have the list of allowed PASID's and a single PASID
> > created under /dev/ioasid should be loaded into MDEV with some 'Ok you
> > can use PASID xyz from FD abc' command.
> > 
> 
> The 'set' is per-VM. Once the mdev is assigned to a VM, all valid PASID's
> in the set of that VM are considered legitimate on this mdev.

No! That is a major security problem!

PASID authorization is *PER DEVICE*.

If I map a device into VFIO in userspace with full control over the HW
that device MUST ONLY have access to PASID's that have been registered
with vfio.

This means each time you register a PASID vfio must tell the IOMMU
driver to authorize the pci_device to access the PASID, the vIOMMU
driver must tell the hypervisor and the mdev under the PCI device MUST
have a per-device list of allowed PASIDs.

Otherwise userspace in a VM with vfio could tell the mdev driver to
talk to a PASID in the same VM but *that process doesn't own*. This is
absolutely not allowed.

Most likely the entire ioasid set and related need to be deleted as a
kernel concept.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-30 Thread Jason Gunthorpe
On Fri, Mar 26, 2021 at 09:06:42AM +0100, Jean-Philippe Brucker wrote:

> It's not inconceivable to have a control queue doing DMA tagged with
> PASID. The devices I know either use untagged DMA, or have a choice to use
> a PASID.

I don't think we should encourage that. A PASID and all the related is
so expensive compared to just doing normal untagged kernel DMA.

I assume HW has these features because virtualization use cases might
use them, eg by using mdev to assign a command queue - then it would
need be be contained by a PASID.

Jason


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-29 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Tuesday, March 30, 2021 10:24 AM
> 
> > From: Jason Gunthorpe 
> > Sent: Tuesday, March 30, 2021 12:32 AM
> > > In terms of usage for guest SVA, an ioasid_set is mostly tied to a host 
> > > mm,
> > > the use case is as the following:
> >
> > From that doc:
> >
> >   It is imperative to enforce
> >   VM-IOASID ownership such that a malicious guest cannot target DMA
> >   traffic outside its own IOASIDs, or free an active IOASID that belongs
> >   to another VM.
> >
> > Huh?
> >
> > Security in a PASID world comes from the IOMMU blocking access to the
> > PASID except from approved PCI-ID's. If a VF/PF is assigned to a guest
> > then that guest can cause the device to issue any PASID by having
> > complete control and the vIOMMU is supposed to tell the real IOMMU
> > what PASID's the device is alowed to access.
> >
> > If a device is sharing a single PCI function with different security
> > contexts (eg vfio mdev) then the device itself is responsible to
> > ensure that only the secure interface can program a PASID and a less
> > secure context can never self-enroll.
> >
> > Here the mdev driver would have to consule with the vIOMMU to ensure
> > the mdev device is allowed to access the PASID - is that what this
> > set stuff is about?
> >
> > If yes, it is backwards. The MDEV is the thing doing the security, the
> > MDEV should have the list of allowed PASID's and a single PASID
> > created under /dev/ioasid should be loaded into MDEV with some 'Ok you
> > can use PASID xyz from FD abc' command.
> >
> 
> The 'set' is per-VM. Once the mdev is assigned to a VM, all valid PASID's
> in the set of that VM are considered legitimate on this mdev. The mdev
> driver will mediate guest operations which program PASID to the backend
> context and load the PASID only if it is within the 'set' (i.e. already
> allocated through /dev/ioasid). This prevents a malicious VM from attacking
> others. Though it's not mdev which directly maintaining the list of allowed
> PASID's, the effect is the same in concept.
> 

One correction. The mdev should still construct the list of allowed PASID's as
you said (by listening to IOASID_BIND/UNBIND event), in addition to the ioasid 
set maintained per VM (updated when a PASID is allocated/freed). The per-VM
set is required for inter-VM isolation (verified when a pgtable is bound to the 
mdev/PASID), while the mdev's own list is necessary for intra-VM isolation when 
multiple mdevs are assigned to the same VM (verified before loading a PASID 
to the mdev). This series just handles the general part i.e. per-VM ioasid set 
and 
leaves the mdev's own list to be managed by specific mdev driver which listens
to various IOASID events).

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-29 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, March 30, 2021 12:32 AM
> > In terms of usage for guest SVA, an ioasid_set is mostly tied to a host mm,
> > the use case is as the following:
> 
> From that doc:
> 
>   It is imperative to enforce
>   VM-IOASID ownership such that a malicious guest cannot target DMA
>   traffic outside its own IOASIDs, or free an active IOASID that belongs
>   to another VM.
> 
> Huh?
> 
> Security in a PASID world comes from the IOMMU blocking access to the
> PASID except from approved PCI-ID's. If a VF/PF is assigned to a guest
> then that guest can cause the device to issue any PASID by having
> complete control and the vIOMMU is supposed to tell the real IOMMU
> what PASID's the device is alowed to access.
> 
> If a device is sharing a single PCI function with different security
> contexts (eg vfio mdev) then the device itself is responsible to
> ensure that only the secure interface can program a PASID and a less
> secure context can never self-enroll.
> 
> Here the mdev driver would have to consule with the vIOMMU to ensure
> the mdev device is allowed to access the PASID - is that what this
> set stuff is about?
> 
> If yes, it is backwards. The MDEV is the thing doing the security, the
> MDEV should have the list of allowed PASID's and a single PASID
> created under /dev/ioasid should be loaded into MDEV with some 'Ok you
> can use PASID xyz from FD abc' command.
> 

The 'set' is per-VM. Once the mdev is assigned to a VM, all valid PASID's
in the set of that VM are considered legitimate on this mdev. The mdev
driver will mediate guest operations which program PASID to the backend
context and load the PASID only if it is within the 'set' (i.e. already 
allocated through /dev/ioasid). This prevents a malicious VM from attacking
others. Though it's not mdev which directly maintaining the list of allowed 
PASID's, the effect is the same in concept.

Thanks
Kevin


RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-29 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, March 30, 2021 12:32 AM
> 
> On Wed, Mar 24, 2021 at 12:05:28PM -0700, Jacob Pan wrote:
> 
> > > IMHO a use created PASID is either bound to a mm (current) at creation
> > > time, or it will never be bound to a mm and its page table is under
> > > user control via /dev/ioasid.
> > >
> > True for PASID used in native SVA bind. But for binding with a guest mm,
> > PASID is allocated first (VT-d virtual cmd interface Spec 10.4.44), the
> > bind with the host IOMMU when vIOMMU PASID cache is invalidated.
> >
> > Our intention is to have two separate interfaces:
> > 1. /dev/ioasid (allocation/free only)
> > 2. /dev/sva (handles all SVA related activities including page tables)
> 
> I'm not sure I understand why you'd want to have two things. Doesn't
> that just complicate everything?
> 
> Manipulating the ioasid, including filling it with page tables, seems
> an integral inseperable part of the whole interface. Why have two ?

Hi, Jason,

Actually above is a major open while we are refactoring vSVA uAPI toward
this direction. We have two concerns about merging /dev/ioasid with
/dev/sva, and would like to hear your thought whether they are valid.

First, userspace may use ioasid in a non-SVA scenario where ioasid is 
bound to specific security context (e.g. a control vq in vDPA) instead of 
tying to mm. In this case there is no pgtable binding initiated from user
space. Instead, ioasid is allocated from /dev/ioasid and then programmed
to the intended security context through specific passthrough framework
which manages that context.

Second, ioasid is managed per process/VM while pgtable binding is a
device-wise operation.  The userspace flow looks like below for an integral
/dev/ioasid interface:

---initialization--
- ioctl(container->fd, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU)
- ioasid_fd = open(/dev/ioasid)
- ioctl(ioasid_fd, IOASID_GET_USVA_FD, _fd) //an empty context
- ioctl(device->fd, VFIO_DEVICE_SET_SVA, _fd); //sva_fd ties to device
- ioctl(sva_fd, USVA_GET_INFO, _info);
---runtime
- ioctl(ioasid_fd, IOMMU_ALLOC_IOASID, );
- ioctl(sva_fd, USVA_BIND_PGTBL, _data);
- ioctl(sva_fd, USVA_FLUSH_CACHE, _info);
- ioctl(sva_fd, USVA_UNBIND_PGTBL, _data);
---destroy
- ioctl(device->fd, VFIO_DEVICE_UNSET_SVA, _fd);
- close(sva_fd)
- close(ioasid_fd)

Our hesitation here is based on one of your earlier comments that
you are not a fan of constructing fd's through ioctl. Are you OK with
above flow or have a better idea of handling it?

With separate interfaces then userspace just opens /dev/sva instead 
of getting it through ioasid_fd:

- ioasid_fd = open(/dev/ioasid)
- sva_fd = open(/dev/sva)

Thanks
Kevin


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-29 Thread Jacob Pan
Hi Jason,

On Mon, 29 Mar 2021 13:31:47 -0300, Jason Gunthorpe  wrote:

> On Wed, Mar 24, 2021 at 12:05:28PM -0700, Jacob Pan wrote:
> 
> > > IMHO a use created PASID is either bound to a mm (current) at creation
> > > time, or it will never be bound to a mm and its page table is under
> > > user control via /dev/ioasid.
> > >   
> > True for PASID used in native SVA bind. But for binding with a guest mm,
> > PASID is allocated first (VT-d virtual cmd interface Spec 10.4.44), the
> > bind with the host IOMMU when vIOMMU PASID cache is invalidated.
> > 
> > Our intention is to have two separate interfaces:
> > 1. /dev/ioasid (allocation/free only)
> > 2. /dev/sva (handles all SVA related activities including page tables)  
> 
> I'm not sure I understand why you'd want to have two things. Doesn't
> that just complicate everything?
> 
> Manipulating the ioasid, including filling it with page tables, seems
> an integral inseperable part of the whole interface. Why have two ?
> 
In one of the earlier discussions, I was made aware of some use cases (by
AMD, iirc) where PASID can be used w/o IOMMU. That is why I tried to keep
ioasid a separate subsystem. Other than that, I don't see an issue
combining the two.

> > > I thought the whole point of something like a /dev/ioasid was to get
> > > away from each and every device creating its own PASID interface?
> > >   
> > yes, but only for the use cases that need to expose PASID to the
> > userspace.  
> 
> Why "but only"? This thing should reach for a higher generality, not
> just be contained to solve some problem within qemu.
> 
I totally agree in terms of generality. I was just trying to point out
existing framework or drivers such as uacce and idxd driver does not have a
need to use /dev/ioasid.

> > > It maybe somewhat reasonable that some devices could have some easy
> > > 'make a SVA PASID on current' interface built in,  
> > I agree, this is the case PASID is hidden from the userspace, right?
> > e.g. uacce.  
> 
> "hidden", I guess, but does it matter so much?
> 
it matters when it comes to which interface to choose. Use /dev/ioasid to
allocate if PASID value cannot be hidden. Use some other interface for bind
current and allocate if a PASID is not visible to the user.

> The PASID would still consume a cgroup credit
> 
yes, credit still consumed. Just the PASID value is hidden.

> > > but anything more
> > > complicated should use /dev/ioasid, and anything consuming PASID
> > > should also have an API to import and attach a PASID from /dev/ioasid.
> > >   
> > Would the above two use cases constitute the "complicated" criteria? Or
> > we should say anything that need the explicit PASID value has to through
> > /dev/ioasid?  
> 
> Anything that needs more that creating a hidden PASID link'd to
> current should use the full interface.
> 
Yes, I think we are on the same page. For example, today's uacce or idxd
driver creates a hidden PASID when user does open(), where a new WQ is
provisioned and bound to current mm. This is the case where /dev/ioasid is
not needed.

> > In terms of usage for guest SVA, an ioasid_set is mostly tied to a host
> > mm, the use case is as the following:  
> 
> From that doc:
> 
>   It is imperative to enforce
>   VM-IOASID ownership such that a malicious guest cannot target DMA
>   traffic outside its own IOASIDs, or free an active IOASID that belongs
>   to another VM.
> 
> Huh?
> 
Sorry, I am not following. In the doc, I have an example to show the
ioasid_set to VM/mm mapping. We use mm as the ioasid_set token to identify
who the owner of an IOASID is. i.e. who allocated the IOASID. Non-owner
cannot perform bind page table or free operations.

Section: IOASID Set Private ID (SPID)
 .--..--.
 |   VM 1   ||   VM 2   |
 |  ||  |
 |--||--|
 | GPASID/SPID 101  || GPASID/SPID 101  |
 '--'---' Guest
 __|__|
   |  |   Host
   v  v
 .--..--.
 | Host IOASID 201  || Host IOASID 202  |
 '--''--'
 |   IOASID set 1   ||   IOASID set 2   |
 '--''--'


> Security in a PASID world comes from the IOMMU blocking access to the
> PASID except from approved PCI-ID's. If a VF/PF is assigned to a guest
> then that guest can cause the device to issue any PASID by having
> complete control and the vIOMMU is supposed to tell the real IOMMU
> what PASID's the device is alowed to access.
> 
Yes, each PF/VF has its own PASID table. The device can do whatever
it wants as long as the PASID is present in the table. Programming of the
pIOMMU PASID table entry, however, is controlled by the host.

IMHO, there are two levels of security here:
1. A PASID can only be 

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-29 Thread Jason Gunthorpe
On Wed, Mar 24, 2021 at 12:05:28PM -0700, Jacob Pan wrote:

> > IMHO a use created PASID is either bound to a mm (current) at creation
> > time, or it will never be bound to a mm and its page table is under
> > user control via /dev/ioasid.
> > 
> True for PASID used in native SVA bind. But for binding with a guest mm,
> PASID is allocated first (VT-d virtual cmd interface Spec 10.4.44), the
> bind with the host IOMMU when vIOMMU PASID cache is invalidated.
> 
> Our intention is to have two separate interfaces:
> 1. /dev/ioasid (allocation/free only)
> 2. /dev/sva (handles all SVA related activities including page tables)

I'm not sure I understand why you'd want to have two things. Doesn't
that just complicate everything?

Manipulating the ioasid, including filling it with page tables, seems
an integral inseperable part of the whole interface. Why have two ?

> > I thought the whole point of something like a /dev/ioasid was to get
> > away from each and every device creating its own PASID interface?
> > 
> yes, but only for the use cases that need to expose PASID to the
> userspace.

Why "but only"? This thing should reach for a higher generality, not
just be contained to solve some problem within qemu.

> > It maybe somewhat reasonable that some devices could have some easy
> > 'make a SVA PASID on current' interface built in,
> I agree, this is the case PASID is hidden from the userspace, right? e.g.
> uacce.

"hidden", I guess, but does it matter so much?

The PASID would still consume a cgroup credit

> > but anything more
> > complicated should use /dev/ioasid, and anything consuming PASID
> > should also have an API to import and attach a PASID from /dev/ioasid.
> > 
> Would the above two use cases constitute the "complicated" criteria? Or we
> should say anything that need the explicit PASID value has to through
> /dev/ioasid?

Anything that needs more that creating a hidden PASID link'd to
current should use the full interface.

> In terms of usage for guest SVA, an ioasid_set is mostly tied to a host mm,
> the use case is as the following:

>From that doc:

  It is imperative to enforce
  VM-IOASID ownership such that a malicious guest cannot target DMA
  traffic outside its own IOASIDs, or free an active IOASID that belongs
  to another VM.

Huh?

Security in a PASID world comes from the IOMMU blocking access to the
PASID except from approved PCI-ID's. If a VF/PF is assigned to a guest
then that guest can cause the device to issue any PASID by having
complete control and the vIOMMU is supposed to tell the real IOMMU
what PASID's the device is alowed to access.

If a device is sharing a single PCI function with different security
contexts (eg vfio mdev) then the device itself is responsible to
ensure that only the secure interface can program a PASID and a less
secure context can never self-enroll. 

Here the mdev driver would have to consule with the vIOMMU to ensure
the mdev device is allowed to access the PASID - is that what this
set stuff is about? 

If yes, it is backwards. The MDEV is the thing doing the security, the
MDEV should have the list of allowed PASID's and a single PASID
created under /dev/ioasid should be loaded into MDEV with some 'Ok you
can use PASID xyz from FD abc' command.

Because you absolutely don't want to have a generic 'set' that all the
mdevs are sharing as that violates the basic security principle at the
start - each and every device must have a unique list of what PASID's
it can talk to.

> 1. Identify a pool of PASIDs for permission checking (below to the same VM),
> e.g. only allow SVA binding for PASIDs allocated from the same set.
> 
> 2. Allow different PASID-aware kernel subsystems to associate, e.g. KVM,
> device drivers, and IOMMU driver. i.e. each KVM instance only cares about
> the ioasid_set associated with the VM. Events notifications are also within
> the ioasid_set to synchronize PASID states.
> 
> 3. Guest-Host PASID look up (each set has its own XArray to store the
> mapping)
> 
> 4. Quota control (going away once we have cgroup)

It sounds worrysome things have gone this way.

I'd say you shoul have a single /dev/ioasid per VM and KVM should
attach to that - it should get all the global events/etc that are not
device specific.

permission checking *must* be done on a per-device level, either inside the
mdev driver, or inside the IOMMU at a per-PCI device level.

Not sure what guest-host PASID means, these have to be 1:1 for device
assignment to work - why would use something else for mdev?

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-26 Thread Jean-Philippe Brucker
On Thu, Mar 25, 2021 at 02:16:45PM -0300, Jason Gunthorpe wrote:
> On Thu, Mar 25, 2021 at 10:02:36AM -0700, Jacob Pan wrote:
> > Hi Jean-Philippe,
> > 
> > On Thu, 25 Mar 2021 11:21:40 +0100, Jean-Philippe Brucker
> >  wrote:
> > 
> > > On Wed, Mar 24, 2021 at 03:12:30PM -0700, Jacob Pan wrote:
> > > > Hi Jason,
> > > > 
> > > > On Wed, 24 Mar 2021 14:03:38 -0300, Jason Gunthorpe 
> > > > wrote: 
> > > > > On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:  
> > > > > > > Also wondering about device driver allocating auxiliary domains
> > > > > > > for their private use, to do iommu_map/unmap on private PASIDs (a
> > > > > > > clean replacement to super SVA, for example). Would that go
> > > > > > > through the same path as /dev/ioasid and use the cgroup of
> > > > > > > current task?
> > > > > >
> > > > > > For the in-kernel private use, I don't think we should restrict
> > > > > > based on cgroup, since there is no affinity to user processes. I
> > > > > > also think the PASID allocation should just use kernel API instead
> > > > > > of /dev/ioasid. Why would user space need to know the actual PASID
> > > > > > # for device private domains? Maybe I missed your idea?
> > > > > 
> > > > > There is not much in the kernel that isn't triggered by a process, I
> > > > > would be careful about the idea that there is a class of users that
> > > > > can consume a cgroup controlled resource without being inside the
> > > > > cgroup.
> > > > > 
> > > > > We've got into trouble before overlooking this and with something
> > > > > greenfield like PASID it would be best built in to the API to prevent
> > > > > a mistake. eg accepting a cgroup or process input to the allocator.
> > > > >   
> > > > Make sense. But I think we only allow charging the current cgroup, how
> > > > about I add the following to ioasid_alloc():
> > > > 
> > > > misc_cg = get_current_misc_cg();
> > > > ret = misc_cg_try_charge(MISC_CG_RES_IOASID, misc_cg, 1);
> > > > if (ret) {
> > > > put_misc_cg(misc_cg);
> > > > return ret;
> > > > }  
> > > 
> > > Does that allow PASID allocation during driver probe, in kernel_init or
> > > modprobe context?
> > > 
> > Good point. Yes, you can get cgroup subsystem state in kernel_init for
> > charging/uncharging. I would think module_init should work also since it is
> > after kernel_init. I have tried the following:
> > static int __ref kernel_init(void *unused)
> >  {
> > int ret;
> > +   struct cgroup_subsys_state *css;
> > +   css = task_get_css(current, pids_cgrp_id);
> > 
> > But that would imply:
> > 1. IOASID has to be built-in, not as module

If IOASID is a module, the device driver will probe once the IOMMU module
is available, which I think always happens in probe deferral kworker.

> > 2. IOASIDs charged on PID1/init would not subject to cgroup limit since it
> > will be in the root cgroup and we don't support migration nor will migrate.
> > 
> > Then it comes back to the question of why do we try to limit in-kernel
> > users per cgroup if we can't enforce these cases.

It may be better to explicitly pass a cgroup during allocation as Jason
suggested. That way anyone using the API will have to be aware of this and
pass the root cgroup if that's what they want.

> Are these real use cases? Why would a driver binding to a device
> create a single kernel pasid at bind time? Why wouldn't it use
> untagged DMA?

It's not inconceivable to have a control queue doing DMA tagged with
PASID. The devices I know either use untagged DMA, or have a choice to use
a PASID. We're not outright forbidding PASID allocation at boot (I don't
think we can or should) and we won't be able to check every use of the
API, so I'm trying to figure out whether it will always default to root
cgroup, or crash in some corner case.

Thanks,
Jean


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-25 Thread Jacob Pan
Hi Jason,

On Thu, 25 Mar 2021 14:16:45 -0300, Jason Gunthorpe  wrote:

> On Thu, Mar 25, 2021 at 10:02:36AM -0700, Jacob Pan wrote:
> > Hi Jean-Philippe,
> > 
> > On Thu, 25 Mar 2021 11:21:40 +0100, Jean-Philippe Brucker
> >  wrote:
> >   
> > > On Wed, Mar 24, 2021 at 03:12:30PM -0700, Jacob Pan wrote:  
> > > > Hi Jason,
> > > > 
> > > > On Wed, 24 Mar 2021 14:03:38 -0300, Jason Gunthorpe 
> > > > wrote:   
> > > > > On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:
> > > > > > > Also wondering about device driver allocating auxiliary
> > > > > > > domains for their private use, to do iommu_map/unmap on
> > > > > > > private PASIDs (a clean replacement to super SVA, for
> > > > > > > example). Would that go through the same path as /dev/ioasid
> > > > > > > and use the cgroup of current task?  
> > > > > >
> > > > > > For the in-kernel private use, I don't think we should restrict
> > > > > > based on cgroup, since there is no affinity to user processes. I
> > > > > > also think the PASID allocation should just use kernel API
> > > > > > instead of /dev/ioasid. Why would user space need to know the
> > > > > > actual PASID # for device private domains? Maybe I missed your
> > > > > > idea?  
> > > > > 
> > > > > There is not much in the kernel that isn't triggered by a
> > > > > process, I would be careful about the idea that there is a class
> > > > > of users that can consume a cgroup controlled resource without
> > > > > being inside the cgroup.
> > > > > 
> > > > > We've got into trouble before overlooking this and with something
> > > > > greenfield like PASID it would be best built in to the API to
> > > > > prevent a mistake. eg accepting a cgroup or process input to the
> > > > > allocator. 
> > > > Make sense. But I think we only allow charging the current cgroup,
> > > > how about I add the following to ioasid_alloc():
> > > > 
> > > > misc_cg = get_current_misc_cg();
> > > > ret = misc_cg_try_charge(MISC_CG_RES_IOASID, misc_cg, 1);
> > > > if (ret) {
> > > > put_misc_cg(misc_cg);
> > > > return ret;
> > > > }
> > > 
> > > Does that allow PASID allocation during driver probe, in kernel_init
> > > or modprobe context?
> > >   
> > Good point. Yes, you can get cgroup subsystem state in kernel_init for
> > charging/uncharging. I would think module_init should work also since
> > it is after kernel_init. I have tried the following:
> > static int __ref kernel_init(void *unused)
> >  {
> > int ret;
> > +   struct cgroup_subsys_state *css;
> > +   css = task_get_css(current, pids_cgrp_id);
> > 
> > But that would imply:
> > 1. IOASID has to be built-in, not as module
> > 2. IOASIDs charged on PID1/init would not subject to cgroup limit since
> > it will be in the root cgroup and we don't support migration nor will
> > migrate.
> > 
> > Then it comes back to the question of why do we try to limit in-kernel
> > users per cgroup if we can't enforce these cases.  
> 
> Are these real use cases? Why would a driver binding to a device
> create a single kernel pasid at bind time? Why wouldn't it use
> untagged DMA?
> 
For VT-d, I don't see such use cases. All PASID allocations by the kernel
drivers has proper process context.

> When someone needs it they can rework it and explain why they are
> doing something sane.
> 
Agreed.

> Jason


Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-25 Thread Jason Gunthorpe
On Thu, Mar 25, 2021 at 10:02:36AM -0700, Jacob Pan wrote:
> Hi Jean-Philippe,
> 
> On Thu, 25 Mar 2021 11:21:40 +0100, Jean-Philippe Brucker
>  wrote:
> 
> > On Wed, Mar 24, 2021 at 03:12:30PM -0700, Jacob Pan wrote:
> > > Hi Jason,
> > > 
> > > On Wed, 24 Mar 2021 14:03:38 -0300, Jason Gunthorpe 
> > > wrote: 
> > > > On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:  
> > > > > > Also wondering about device driver allocating auxiliary domains
> > > > > > for their private use, to do iommu_map/unmap on private PASIDs (a
> > > > > > clean replacement to super SVA, for example). Would that go
> > > > > > through the same path as /dev/ioasid and use the cgroup of
> > > > > > current task?
> > > > >
> > > > > For the in-kernel private use, I don't think we should restrict
> > > > > based on cgroup, since there is no affinity to user processes. I
> > > > > also think the PASID allocation should just use kernel API instead
> > > > > of /dev/ioasid. Why would user space need to know the actual PASID
> > > > > # for device private domains? Maybe I missed your idea?
> > > > 
> > > > There is not much in the kernel that isn't triggered by a process, I
> > > > would be careful about the idea that there is a class of users that
> > > > can consume a cgroup controlled resource without being inside the
> > > > cgroup.
> > > > 
> > > > We've got into trouble before overlooking this and with something
> > > > greenfield like PASID it would be best built in to the API to prevent
> > > > a mistake. eg accepting a cgroup or process input to the allocator.
> > > >   
> > > Make sense. But I think we only allow charging the current cgroup, how
> > > about I add the following to ioasid_alloc():
> > > 
> > >   misc_cg = get_current_misc_cg();
> > >   ret = misc_cg_try_charge(MISC_CG_RES_IOASID, misc_cg, 1);
> > >   if (ret) {
> > >   put_misc_cg(misc_cg);
> > >   return ret;
> > >   }  
> > 
> > Does that allow PASID allocation during driver probe, in kernel_init or
> > modprobe context?
> > 
> Good point. Yes, you can get cgroup subsystem state in kernel_init for
> charging/uncharging. I would think module_init should work also since it is
> after kernel_init. I have tried the following:
> static int __ref kernel_init(void *unused)
>  {
> int ret;
> +   struct cgroup_subsys_state *css;
> +   css = task_get_css(current, pids_cgrp_id);
> 
> But that would imply:
> 1. IOASID has to be built-in, not as module
> 2. IOASIDs charged on PID1/init would not subject to cgroup limit since it
> will be in the root cgroup and we don't support migration nor will migrate.
> 
> Then it comes back to the question of why do we try to limit in-kernel
> users per cgroup if we can't enforce these cases.

Are these real use cases? Why would a driver binding to a device
create a single kernel pasid at bind time? Why wouldn't it use
untagged DMA?

When someone needs it they can rework it and explain why they are
doing something sane.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-25 Thread Jacob Pan
Hi Jean-Philippe,

On Thu, 25 Mar 2021 11:21:40 +0100, Jean-Philippe Brucker
 wrote:

> On Wed, Mar 24, 2021 at 03:12:30PM -0700, Jacob Pan wrote:
> > Hi Jason,
> > 
> > On Wed, 24 Mar 2021 14:03:38 -0300, Jason Gunthorpe 
> > wrote: 
> > > On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:  
> > > > > Also wondering about device driver allocating auxiliary domains
> > > > > for their private use, to do iommu_map/unmap on private PASIDs (a
> > > > > clean replacement to super SVA, for example). Would that go
> > > > > through the same path as /dev/ioasid and use the cgroup of
> > > > > current task?
> > > >
> > > > For the in-kernel private use, I don't think we should restrict
> > > > based on cgroup, since there is no affinity to user processes. I
> > > > also think the PASID allocation should just use kernel API instead
> > > > of /dev/ioasid. Why would user space need to know the actual PASID
> > > > # for device private domains? Maybe I missed your idea?
> > > 
> > > There is not much in the kernel that isn't triggered by a process, I
> > > would be careful about the idea that there is a class of users that
> > > can consume a cgroup controlled resource without being inside the
> > > cgroup.
> > > 
> > > We've got into trouble before overlooking this and with something
> > > greenfield like PASID it would be best built in to the API to prevent
> > > a mistake. eg accepting a cgroup or process input to the allocator.
> > >   
> > Make sense. But I think we only allow charging the current cgroup, how
> > about I add the following to ioasid_alloc():
> > 
> > misc_cg = get_current_misc_cg();
> > ret = misc_cg_try_charge(MISC_CG_RES_IOASID, misc_cg, 1);
> > if (ret) {
> > put_misc_cg(misc_cg);
> > return ret;
> > }  
> 
> Does that allow PASID allocation during driver probe, in kernel_init or
> modprobe context?
> 
Good point. Yes, you can get cgroup subsystem state in kernel_init for
charging/uncharging. I would think module_init should work also since it is
after kernel_init. I have tried the following:
static int __ref kernel_init(void *unused)
 {
int ret;
+   struct cgroup_subsys_state *css;
+   css = task_get_css(current, pids_cgrp_id);

But that would imply:
1. IOASID has to be built-in, not as module
2. IOASIDs charged on PID1/init would not subject to cgroup limit since it
will be in the root cgroup and we don't support migration nor will migrate.

Then it comes back to the question of why do we try to limit in-kernel
users per cgroup if we can't enforce these cases.

> Thanks,
> Jean
> 


Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-25 Thread Jean-Philippe Brucker
On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:
> > And a flag IOMMU_SVA_BIND_SUPERVISOR (not that I plan to implement it in
> > the SMMU, but I think we need to clean the current usage)
> > 
> You mean move #define SVM_FLAG_SUPERVISOR_MODE out of Intel code to be a
> generic flag in iommu-sva-lib.h called IOMMU_SVA_BIND_SUPERVISOR?

Yes, though it would need to be in iommu.h since it's used by device
drivers

> > Also wondering about device driver allocating auxiliary domains for their
> > private use, to do iommu_map/unmap on private PASIDs (a clean replacement
> > to super SVA, for example). Would that go through the same path as
> > /dev/ioasid and use the cgroup of current task?
> >
> For the in-kernel private use, I don't think we should restrict based on
> cgroup, since there is no affinity to user processes. I also think the
> PASID allocation should just use kernel API instead of /dev/ioasid. Why
> would user space need to know the actual PASID # for device private domains?
> Maybe I missed your idea?

No that's my bad, I didn't get the role of /dev/ioasid. Let me give the
series a proper read.

Thanks,
Jean


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-25 Thread Jean-Philippe Brucker
On Wed, Mar 24, 2021 at 03:12:30PM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Wed, 24 Mar 2021 14:03:38 -0300, Jason Gunthorpe  wrote:
> 
> > On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:
> > > > Also wondering about device driver allocating auxiliary domains for
> > > > their private use, to do iommu_map/unmap on private PASIDs (a clean
> > > > replacement to super SVA, for example). Would that go through the
> > > > same path as /dev/ioasid and use the cgroup of current task?  
> > >
> > > For the in-kernel private use, I don't think we should restrict based on
> > > cgroup, since there is no affinity to user processes. I also think the
> > > PASID allocation should just use kernel API instead of /dev/ioasid. Why
> > > would user space need to know the actual PASID # for device private
> > > domains? Maybe I missed your idea?  
> > 
> > There is not much in the kernel that isn't triggered by a process, I
> > would be careful about the idea that there is a class of users that
> > can consume a cgroup controlled resource without being inside the
> > cgroup.
> > 
> > We've got into trouble before overlooking this and with something
> > greenfield like PASID it would be best built in to the API to prevent
> > a mistake. eg accepting a cgroup or process input to the allocator.
> > 
> Make sense. But I think we only allow charging the current cgroup, how about
> I add the following to ioasid_alloc():
> 
>   misc_cg = get_current_misc_cg();
>   ret = misc_cg_try_charge(MISC_CG_RES_IOASID, misc_cg, 1);
>   if (ret) {
>   put_misc_cg(misc_cg);
>   return ret;
>   }

Does that allow PASID allocation during driver probe, in kernel_init or
modprobe context?

Thanks,
Jean



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-24 Thread Jacob Pan
Hi Jason,

On Wed, 24 Mar 2021 14:03:38 -0300, Jason Gunthorpe  wrote:

> On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:
> > > Also wondering about device driver allocating auxiliary domains for
> > > their private use, to do iommu_map/unmap on private PASIDs (a clean
> > > replacement to super SVA, for example). Would that go through the
> > > same path as /dev/ioasid and use the cgroup of current task?  
> >
> > For the in-kernel private use, I don't think we should restrict based on
> > cgroup, since there is no affinity to user processes. I also think the
> > PASID allocation should just use kernel API instead of /dev/ioasid. Why
> > would user space need to know the actual PASID # for device private
> > domains? Maybe I missed your idea?  
> 
> There is not much in the kernel that isn't triggered by a process, I
> would be careful about the idea that there is a class of users that
> can consume a cgroup controlled resource without being inside the
> cgroup.
> 
> We've got into trouble before overlooking this and with something
> greenfield like PASID it would be best built in to the API to prevent
> a mistake. eg accepting a cgroup or process input to the allocator.
> 
Make sense. But I think we only allow charging the current cgroup, how about
I add the following to ioasid_alloc():

misc_cg = get_current_misc_cg();
ret = misc_cg_try_charge(MISC_CG_RES_IOASID, misc_cg, 1);
if (ret) {
put_misc_cg(misc_cg);
return ret;
}

BTW, IOASID will be one of the resources under the proposed misc cgroup.

Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-24 Thread Jacob Pan
Hi Jason,

On Mon, 22 Mar 2021 09:03:00 -0300, Jason Gunthorpe  wrote:

> On Fri, Mar 19, 2021 at 11:22:21AM -0700, Jacob Pan wrote:
> > Hi Jason,
> > 
> > On Fri, 19 Mar 2021 10:54:32 -0300, Jason Gunthorpe 
> > wrote: 
> > > On Fri, Mar 19, 2021 at 02:41:32PM +0100, Jean-Philippe Brucker
> > > wrote:  
> > > > On Fri, Mar 19, 2021 at 09:46:45AM -0300, Jason Gunthorpe wrote:
> > > > > On Fri, Mar 19, 2021 at 10:58:41AM +0100, Jean-Philippe Brucker
> > > > > wrote: 
> > > > > > Although there is no use for it at the moment (only two upstream
> > > > > > users and it looks like amdkfd always uses current too), I quite
> > > > > > like the client-server model where the privileged process does
> > > > > > bind() and programs the hardware queue on behalf of the client
> > > > > > process.
> > > > > 
> > > > > This creates a lot complexity, how do does process A get a secure
> > > > > reference to B? How does it access the memory in B to setup the
> > > > > HW?
> > > > 
> > > > mm_access() for example, and passing addresses via IPC
> > > 
> > > I'd rather the source process establish its own PASID and then pass
> > > the rights to use it to some other process via FD passing than try to
> > > go the other way. There are lots of security questions with something
> > > like mm_access.
> > >   
> > 
> > Thank you all for the input, it sounds like we are OK to remove mm
> > argument from iommu_sva_bind_device() and iommu_sva_alloc_pasid() for
> > now?
> > 
> > Let me try to summarize PASID allocation as below:
> > 
> > Interfaces  | Usage |  Limit| bind¹ |User visible
> > /dev/ioasid²| G-SVA/IOVA|  cgroup   | No
> > |Yes char dev³  | SVA   |  cgroup   |
> > Yes |No iommu driver| default PASID|  no
> > | No|No kernel  | super SVA | no
> > | yes   |No
> > 
> > ¹ Allocated during SVA bind
> > ² PASIDs allocated via /dev/ioasid are not bound to any mm. But its
> >   ownership is assigned to the process that does the allocation.  
> 
> What does "not bound to a mm" mean?
> 
I meant, the IOASID allocated via /dev/ioasid is in a clean state (just a
number). It's initial state is not bound to an mm. Unlike, sva_bind_device()
where the IOASID is allocated during bind time.

The use case is to support guest SVA bind, where allocation and bind are in
two separate steps.

> IMHO a use created PASID is either bound to a mm (current) at creation
> time, or it will never be bound to a mm and its page table is under
> user control via /dev/ioasid.
> 
True for PASID used in native SVA bind. But for binding with a guest mm,
PASID is allocated first (VT-d virtual cmd interface Spec 10.4.44), the
bind with the host IOMMU when vIOMMU PASID cache is invalidated.

Our intention is to have two separate interfaces:
1. /dev/ioasid (allocation/free only)
2. /dev/sva (handles all SVA related activities including page tables)

> I thought the whole point of something like a /dev/ioasid was to get
> away from each and every device creating its own PASID interface?
> 
yes, but only for the use cases that need to expose PASID to the userspace.
AFAICT, the cases are:
1. guest SVA (bind guest mm)
2. full PF/VF assignment(not mediated) where guest driver want to program
the actual PASID onto the device.

> It maybe somewhat reasonable that some devices could have some easy
> 'make a SVA PASID on current' interface built in,
I agree, this is the case PASID is hidden from the userspace, right? e.g.
uacce.

> but anything more
> complicated should use /dev/ioasid, and anything consuming PASID
> should also have an API to import and attach a PASID from /dev/ioasid.
> 
Would the above two use cases constitute the "complicated" criteria? Or we
should say anything that need the explicit PASID value has to through
/dev/ioasid?

Could you give some highlevel hint on the APIs that hook up IOASID
allocated from /dev/ioasid and use cases that combine device and domain
information? Yi is working on /dev/sva RFC, it would be good to have a
direction check.

> > Currently, the proposed /dev/ioasid interface does not map individual
> > PASID with an FD. The FD is at the ioasid_set granularity and bond to
> > the current mm. We could extend the IOCTLs to cover individual PASID-FD
> > passing case when use cases arise. Would this work?  
> 
> Is it a good idea that the FD is per ioasid_set ?
We were thinking the allocation IOCTL is on a per set basis, then we know
the ownership of between PASIDs and its set. If per PASID FD is needed, we
can extend.

> What is the set used
> for?
> 
I tried to document the concept in
https://lore.kernel.org/lkml/1614463286-97618-2-git-send-email-jacob.jun@linux.intel.com/

In terms of usage for guest SVA, an ioasid_set is mostly tied to a host mm,
the use case is as the following:
1. Identify a pool of PASIDs for permission checking (below to the same VM),
e.g. only allow SVA binding for PASIDs allocated from the same set.

2. Allow 

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-24 Thread Jason Gunthorpe
On Wed, Mar 24, 2021 at 10:02:46AM -0700, Jacob Pan wrote:
> > Also wondering about device driver allocating auxiliary domains for their
> > private use, to do iommu_map/unmap on private PASIDs (a clean replacement
> > to super SVA, for example). Would that go through the same path as
> > /dev/ioasid and use the cgroup of current task?
>
> For the in-kernel private use, I don't think we should restrict based on
> cgroup, since there is no affinity to user processes. I also think the
> PASID allocation should just use kernel API instead of /dev/ioasid. Why
> would user space need to know the actual PASID # for device private domains?
> Maybe I missed your idea?

There is not much in the kernel that isn't triggered by a process, I
would be careful about the idea that there is a class of users that
can consume a cgroup controlled resource without being inside the
cgroup.

We've got into trouble before overlooking this and with something
greenfield like PASID it would be best built in to the API to prevent
a mistake. eg accepting a cgroup or process input to the allocator.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-24 Thread Jacob Pan
Hi Jean-Philippe,

On Mon, 22 Mar 2021 10:24:00 +0100, Jean-Philippe Brucker
 wrote:

> On Fri, Mar 19, 2021 at 11:22:21AM -0700, Jacob Pan wrote:
> > Hi Jason,
> > 
> > On Fri, 19 Mar 2021 10:54:32 -0300, Jason Gunthorpe 
> > wrote: 
> > > On Fri, Mar 19, 2021 at 02:41:32PM +0100, Jean-Philippe Brucker
> > > wrote:  
> > > > On Fri, Mar 19, 2021 at 09:46:45AM -0300, Jason Gunthorpe wrote:
> > > > > On Fri, Mar 19, 2021 at 10:58:41AM +0100, Jean-Philippe Brucker
> > > > > wrote: 
> > > > > > Although there is no use for it at the moment (only two upstream
> > > > > > users and it looks like amdkfd always uses current too), I quite
> > > > > > like the client-server model where the privileged process does
> > > > > > bind() and programs the hardware queue on behalf of the client
> > > > > > process.
> > > > > 
> > > > > This creates a lot complexity, how do does process A get a secure
> > > > > reference to B? How does it access the memory in B to setup the
> > > > > HW?
> > > > 
> > > > mm_access() for example, and passing addresses via IPC
> > > 
> > > I'd rather the source process establish its own PASID and then pass
> > > the rights to use it to some other process via FD passing than try to
> > > go the other way. There are lots of security questions with something
> > > like mm_access.
> > >   
> > 
> > Thank you all for the input, it sounds like we are OK to remove mm
> > argument from iommu_sva_bind_device() and iommu_sva_alloc_pasid() for
> > now?  
> 
> Fine by me. By the way the IDXD currently missues the bind API for
> supervisor PASID, and the drvdata parameter isn't otherwise used. This
> would be a good occasion to clean both. The new bind prototype could be:
> 
> struct iommu_sva *iommu_sva_bind_device(struct device *dev, int flags)
> 
yes, we really just hijacked drvdata as flags, it would be cleaner to use
flags explicitly.

> And a flag IOMMU_SVA_BIND_SUPERVISOR (not that I plan to implement it in
> the SMMU, but I think we need to clean the current usage)
> 
You mean move #define SVM_FLAG_SUPERVISOR_MODE out of Intel code to be a
generic flag in iommu-sva-lib.h called IOMMU_SVA_BIND_SUPERVISOR?

I agree if that is the proposal.

> > 
> > Let me try to summarize PASID allocation as below:
> > 
> > Interfaces  | Usage |  Limit| bind¹ |User visible
> > 
> > /dev/ioasid²| G-SVA/IOVA|  cgroup   | No
> > |Yes
> > 
> > char dev³   | SVA   |  cgroup   | Yes   |No
> > 
> > iommu driver| default PASID|  no| No|No
> >  
> 
> Is this PASID #0?
> 
True for native case but not limited to PASID#0 for guest case. E.g. for
mdev assignment with guest IOVA, the guest PASID would #0, but the host aux
domain default PASID can be non-zero. Here I meant to include both cases.

> > 
> > kernel  | super SVA | no| yes   |No
> >   
> 
> Also wondering about device driver allocating auxiliary domains for their
> private use, to do iommu_map/unmap on private PASIDs (a clean replacement
> to super SVA, for example). Would that go through the same path as
> /dev/ioasid and use the cgroup of current task?
>
For the in-kernel private use, I don't think we should restrict based on
cgroup, since there is no affinity to user processes. I also think the
PASID allocation should just use kernel API instead of /dev/ioasid. Why
would user space need to know the actual PASID # for device private domains?
Maybe I missed your idea?

> Thanks,
> Jean
> 
> > 
> > ¹ Allocated during SVA bind
> > ² PASIDs allocated via /dev/ioasid are not bound to any mm. But its
> >   ownership is assigned to the process that does the allocation.
> > ³ Include uacce, other private device driver char dev such as idxd
> > 
> > Currently, the proposed /dev/ioasid interface does not map individual
> > PASID with an FD. The FD is at the ioasid_set granularity and bond to
> > the current mm. We could extend the IOCTLs to cover individual PASID-FD
> > passing case when use cases arise. Would this work?
> > 
> > Thanks,
> > 
> > Jacob  


Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-22 Thread Jason Gunthorpe
On Fri, Mar 19, 2021 at 11:22:21AM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Fri, 19 Mar 2021 10:54:32 -0300, Jason Gunthorpe  wrote:
> 
> > On Fri, Mar 19, 2021 at 02:41:32PM +0100, Jean-Philippe Brucker wrote:
> > > On Fri, Mar 19, 2021 at 09:46:45AM -0300, Jason Gunthorpe wrote:  
> > > > On Fri, Mar 19, 2021 at 10:58:41AM +0100, Jean-Philippe Brucker wrote:
> > > >   
> > > > > Although there is no use for it at the moment (only two upstream
> > > > > users and it looks like amdkfd always uses current too), I quite
> > > > > like the client-server model where the privileged process does
> > > > > bind() and programs the hardware queue on behalf of the client
> > > > > process.  
> > > > 
> > > > This creates a lot complexity, how do does process A get a secure
> > > > reference to B? How does it access the memory in B to setup the HW?  
> > > 
> > > mm_access() for example, and passing addresses via IPC  
> > 
> > I'd rather the source process establish its own PASID and then pass
> > the rights to use it to some other process via FD passing than try to
> > go the other way. There are lots of security questions with something
> > like mm_access.
> > 
> 
> Thank you all for the input, it sounds like we are OK to remove mm argument
> from iommu_sva_bind_device() and iommu_sva_alloc_pasid() for now?
> 
> Let me try to summarize PASID allocation as below:
> 
> Interfaces| Usage |  Limit| bind¹ |User visible
> /dev/ioasid²  | G-SVA/IOVA|  cgroup   | No|Yes
> char dev³ | SVA   |  cgroup   | Yes   |No
> iommu driver  | default PASID|  no| No|No
> kernel| super SVA | no| yes   |No
> 
> ¹ Allocated during SVA bind
> ² PASIDs allocated via /dev/ioasid are not bound to any mm. But its
>   ownership is assigned to the process that does the allocation.

What does "not bound to a mm" mean?

IMHO a use created PASID is either bound to a mm (current) at creation
time, or it will never be bound to a mm and its page table is under
user control via /dev/ioasid.

I thought the whole point of something like a /dev/ioasid was to get
away from each and every device creating its own PASID interface?

It maybe somewhat reasonable that some devices could have some easy
'make a SVA PASID on current' interface built in, but anything more
complicated should use /dev/ioasid, and anything consuming PASID
should also have an API to import and attach a PASID from /dev/ioasid.

> Currently, the proposed /dev/ioasid interface does not map individual PASID
> with an FD. The FD is at the ioasid_set granularity and bond to the current
> mm. We could extend the IOCTLs to cover individual PASID-FD passing case
> when use cases arise. Would this work?

Is it a good idea that the FD is per ioasid_set ? What is the set used
for?

Usually kernel interfaces work nicer with a one fd/one object model.

But even if it is a set, you could pass the set between co-operating
processes and the PASID can be created in the correct 'current'. But
there is all kinds of security questsions as soon as you start doing
anything like this - is there really a use case?

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-22 Thread Jean-Philippe Brucker
On Fri, Mar 19, 2021 at 11:22:21AM -0700, Jacob Pan wrote:
> Hi Jason,
> 
> On Fri, 19 Mar 2021 10:54:32 -0300, Jason Gunthorpe  wrote:
> 
> > On Fri, Mar 19, 2021 at 02:41:32PM +0100, Jean-Philippe Brucker wrote:
> > > On Fri, Mar 19, 2021 at 09:46:45AM -0300, Jason Gunthorpe wrote:  
> > > > On Fri, Mar 19, 2021 at 10:58:41AM +0100, Jean-Philippe Brucker wrote:
> > > >   
> > > > > Although there is no use for it at the moment (only two upstream
> > > > > users and it looks like amdkfd always uses current too), I quite
> > > > > like the client-server model where the privileged process does
> > > > > bind() and programs the hardware queue on behalf of the client
> > > > > process.  
> > > > 
> > > > This creates a lot complexity, how do does process A get a secure
> > > > reference to B? How does it access the memory in B to setup the HW?  
> > > 
> > > mm_access() for example, and passing addresses via IPC  
> > 
> > I'd rather the source process establish its own PASID and then pass
> > the rights to use it to some other process via FD passing than try to
> > go the other way. There are lots of security questions with something
> > like mm_access.
> > 
> 
> Thank you all for the input, it sounds like we are OK to remove mm argument
> from iommu_sva_bind_device() and iommu_sva_alloc_pasid() for now?

Fine by me. By the way the IDXD currently missues the bind API for
supervisor PASID, and the drvdata parameter isn't otherwise used. This
would be a good occasion to clean both. The new bind prototype could be:

struct iommu_sva *iommu_sva_bind_device(struct device *dev, int flags)

And a flag IOMMU_SVA_BIND_SUPERVISOR (not that I plan to implement it in
the SMMU, but I think we need to clean the current usage)

> 
> Let me try to summarize PASID allocation as below:
> 
> Interfaces| Usage |  Limit| bind¹ |User visible
> 
> /dev/ioasid²  | G-SVA/IOVA|  cgroup   | No|Yes
> 
> char dev³ | SVA   |  cgroup   | Yes   |No
> 
> iommu driver  | default PASID|  no| No|No

Is this PASID #0?

> 
> kernel| super SVA | no| yes   |No
> 

Also wondering about device driver allocating auxiliary domains for their
private use, to do iommu_map/unmap on private PASIDs (a clean replacement
to super SVA, for example). Would that go through the same path as
/dev/ioasid and use the cgroup of current task?

Thanks,
Jean

> 
> ¹ Allocated during SVA bind
> ² PASIDs allocated via /dev/ioasid are not bound to any mm. But its
>   ownership is assigned to the process that does the allocation.
> ³ Include uacce, other private device driver char dev such as idxd
> 
> Currently, the proposed /dev/ioasid interface does not map individual PASID
> with an FD. The FD is at the ioasid_set granularity and bond to the current
> mm. We could extend the IOCTLs to cover individual PASID-FD passing case
> when use cases arise. Would this work?
> 
> Thanks,
> 
> Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-19 Thread Jacob Pan
Hi Jason,

On Fri, 19 Mar 2021 10:54:32 -0300, Jason Gunthorpe  wrote:

> On Fri, Mar 19, 2021 at 02:41:32PM +0100, Jean-Philippe Brucker wrote:
> > On Fri, Mar 19, 2021 at 09:46:45AM -0300, Jason Gunthorpe wrote:  
> > > On Fri, Mar 19, 2021 at 10:58:41AM +0100, Jean-Philippe Brucker wrote:
> > >   
> > > > Although there is no use for it at the moment (only two upstream
> > > > users and it looks like amdkfd always uses current too), I quite
> > > > like the client-server model where the privileged process does
> > > > bind() and programs the hardware queue on behalf of the client
> > > > process.  
> > > 
> > > This creates a lot complexity, how do does process A get a secure
> > > reference to B? How does it access the memory in B to setup the HW?  
> > 
> > mm_access() for example, and passing addresses via IPC  
> 
> I'd rather the source process establish its own PASID and then pass
> the rights to use it to some other process via FD passing than try to
> go the other way. There are lots of security questions with something
> like mm_access.
> 

Thank you all for the input, it sounds like we are OK to remove mm argument
from iommu_sva_bind_device() and iommu_sva_alloc_pasid() for now?

Let me try to summarize PASID allocation as below:

Interfaces  | Usage |  Limit| bind¹ |User visible

/dev/ioasid²| G-SVA/IOVA|  cgroup   | No|Yes

char dev³   | SVA   |  cgroup   | Yes   |No

iommu driver| default PASID|  no| No|No

kernel  | super SVA | no| yes   |No


¹ Allocated during SVA bind
² PASIDs allocated via /dev/ioasid are not bound to any mm. But its
  ownership is assigned to the process that does the allocation.
³ Include uacce, other private device driver char dev such as idxd

Currently, the proposed /dev/ioasid interface does not map individual PASID
with an FD. The FD is at the ioasid_set granularity and bond to the current
mm. We could extend the IOCTLs to cover individual PASID-FD passing case
when use cases arise. Would this work?

Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-19 Thread Jacob Pan
Hi Jean-Philippe,

On Fri, 19 Mar 2021 10:58:41 +0100, Jean-Philippe Brucker
 wrote:

> > Slightly off the title. As we are moving to use cgroup to limit PASID
> > allocations, it would be much simpler if we enforce on the current
> > task.  
> 
> Yes I think we should do that. Is there a problem with charging the
> process that does the PASID allocation even if the PASID indexes some
> other mm?
Besides complexity, my second concern is that we are sharing the misc
cgroup controller with other resources that do not have such behavior.

Cgroup v2 also has unified hierarchy which also requires coherent behavior
among controllers.

Thanks,

Jacob


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-19 Thread Jason Gunthorpe
On Fri, Mar 19, 2021 at 02:41:32PM +0100, Jean-Philippe Brucker wrote:
> On Fri, Mar 19, 2021 at 09:46:45AM -0300, Jason Gunthorpe wrote:
> > On Fri, Mar 19, 2021 at 10:58:41AM +0100, Jean-Philippe Brucker wrote:
> > 
> > > Although there is no use for it at the moment (only two upstream users and
> > > it looks like amdkfd always uses current too), I quite like the
> > > client-server model where the privileged process does bind() and programs
> > > the hardware queue on behalf of the client process.
> > 
> > This creates a lot complexity, how do does process A get a secure
> > reference to B? How does it access the memory in B to setup the HW?
> 
> mm_access() for example, and passing addresses via IPC

I'd rather the source process establish its own PASID and then pass
the rights to use it to some other process via FD passing than try to
go the other way. There are lots of security questions with something
like mm_access.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-19 Thread Jean-Philippe Brucker
On Fri, Mar 19, 2021 at 09:46:45AM -0300, Jason Gunthorpe wrote:
> On Fri, Mar 19, 2021 at 10:58:41AM +0100, Jean-Philippe Brucker wrote:
> 
> > Although there is no use for it at the moment (only two upstream users and
> > it looks like amdkfd always uses current too), I quite like the
> > client-server model where the privileged process does bind() and programs
> > the hardware queue on behalf of the client process.
> 
> This creates a lot complexity, how do does process A get a secure
> reference to B? How does it access the memory in B to setup the HW?

mm_access() for example, and passing addresses via IPC

> Why do we need separation anyhow? SVM devices are supposed to be
> secure or they shouldn't do SVM.

Right

Thanks,
Jean


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-19 Thread Jason Gunthorpe
On Fri, Mar 19, 2021 at 10:58:41AM +0100, Jean-Philippe Brucker wrote:

> Although there is no use for it at the moment (only two upstream users and
> it looks like amdkfd always uses current too), I quite like the
> client-server model where the privileged process does bind() and programs
> the hardware queue on behalf of the client process.

This creates a lot complexity, how do does process A get a secure
reference to B? How does it access the memory in B to setup the HW?

Why do we need separation anyhow? SVM devices are supposed to be
secure or they shouldn't do SVM.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-19 Thread Jean-Philippe Brucker
Hi Jacob,

On Thu, Mar 18, 2021 at 05:22:34PM -0700, Jacob Pan wrote:
> Hi Jean,
> 
> Slightly off the title. As we are moving to use cgroup to limit PASID
> allocations, it would be much simpler if we enforce on the current task.

Yes I think we should do that. Is there a problem with charging the
process that does the PASID allocation even if the PASID indexes some
other mm?

> However, iommu_sva_alloc_pasid() takes an mm_struct pointer as argument
> which implies it can be something other the the current task mm. So far all
> kernel callers use current task mm. Is there a use case for doing PASID
> allocation on behalf of another mm? If not, can we remove the mm argument?

This would effectively remove the mm parameter from
iommu_sva_bind_device(). I'm not opposed to that, but reintroducing it
later will be difficult if IOMMU drivers start assuming that the bound mm
is from current.

Although there is no use for it at the moment (only two upstream users and
it looks like amdkfd always uses current too), I quite like the
client-server model where the privileged process does bind() and programs
the hardware queue on behalf of the client process.

Thanks,
Jean



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-18 Thread Jacob Pan
Hi Jean,

Slightly off the title. As we are moving to use cgroup to limit PASID
allocations, it would be much simpler if we enforce on the current task.

However, iommu_sva_alloc_pasid() takes an mm_struct pointer as argument
which implies it can be something other the the current task mm. So far all
kernel callers use current task mm. Is there a use case for doing PASID
allocation on behalf of another mm? If not, can we remove the mm argument?

Thanks,

Jacob

>  /**
>   * iommu_sva_alloc_pasid - Allocate a PASID for the mm
> @@ -35,11 +44,11 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm,
> ioasid_t min, ioasid_t max) mutex_lock(_sva_lock);
>   if (mm->pasid) {
>   if (mm->pasid >= min && mm->pasid <= max)
> - ioasid_get(mm->pasid);
> + ioasid_get(iommu_sva_pasid, mm->pasid);
>   else
>   ret = -EOVERFLOW;
>   } else {
> - pasid = ioasid_alloc(_sva_pasid, min, max, mm);
> + pasid = ioasid_alloc(iommu_sva_pasid, min, max, mm);
>   if (pasid == INVALID_IOASID)
>   ret = -ENOMEM;

Thanks,

Jacob


  1   2   >