Re: [PATCH v7 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-05-27 Thread Jason Wang


On 2021/5/28 11:54 AM, Yongji Xie wrote:

On Fri, May 28, 2021 at 9:33 AM Jason Wang  wrote:


On 2021/5/27 6:14 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 4:43 PM Jason Wang  wrote:

On 2021/5/27 4:41 PM, Jason Wang wrote:

On 2021/5/27 3:34 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 1:40 PM Jason Wang  wrote:

On 2021/5/27 1:08 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 1:00 PM Jason Wang 
wrote:

On 2021/5/27 12:57 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 12:13 PM Jason Wang 
wrote:

On 2021/5/17 5:55 PM, Xie Yongji wrote:

+
+static int vduse_dev_msg_sync(struct vduse_dev *dev,
+   struct vduse_dev_msg *msg)
+{
+ init_waitqueue_head(&msg->waitq);
+ spin_lock(&dev->msg_lock);
+ vduse_enqueue_msg(&dev->send_list, msg);
+ wake_up(&dev->waitq);
+ spin_unlock(&dev->msg_lock);
+ wait_event_killable(msg->waitq, msg->completed);

What happens if the (malicious) userspace doesn't give a response
forever?

It looks like a DoS. If yes, we need to consider a way to fix that.


How about using wait_event_killable_timeout() instead?

Probably, and then we need to choose a suitable timeout and, more
importantly, we need to report the failure to virtio.


Makes sense to me. But it looks like some
vdpa_config_ops/virtio_config_ops such as set_status() don't have a
return value. Now I add a WARN_ON() for the failure. Do you mean we
need to add some changes to the virtio core to handle the failure?

Maybe, but I'm not sure how hard we can do that.


We need to change all virtio device drivers in this way.

Probably.



We had NEEDS_RESET but it looks we don't implement it.


Could it handle the failure of get_feature() and get/set_config()?

Looks not:

"

The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
that a reset is needed. If DRIVER_OK is set, after it sets
DEVICE_NEEDS_RESET, the device MUST send a device configuration change
notification to the driver.

"

This seems to imply that NEEDS_RESET may only work after the device is
probed. But in the current design, even the reset() is not reliable.
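
For concreteness, the device-side behaviour quoted from the spec could
look roughly like the sketch below; the helper and the status field are
illustrative assumptions, only the two status bits come from the virtio
spec:

static void vduse_dev_enter_error_state(struct vduse_dev *dev)
{
        dev->status |= VIRTIO_CONFIG_S_NEEDS_RESET;

        /* per spec: if DRIVER_OK is set, send a config change notification */
        if (dev->status & VIRTIO_CONFIG_S_DRIVER_OK)
                vduse_dev_inject_config_irq(dev);   /* hypothetical helper */
}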



Or a rough idea is that maybe we need some relaxing so the device is
coupled loosely with userspace. E.g. the device (control path) is
implemented in the kernel but the datapath is implemented in userspace,
like TUN/TAP.


I think it can work for most cases. One problem is that the set_config
might change the behavior of the data path at runtime, e.g.
virtnet_set_mac_address() in the virtio-net driver and
cache_type_store() in the virtio-blk driver. Not sure if this path is
able to return before the datapath is aware of this change.

Good point.

But set_config() should be rare:

E.g. in the case of virtio-net with VERSION_1, the config space is read
only, and it is set via the control vq.

For block, we can

1) start from without WCE or
2) we add a config change notification to userspace or
3) extend the spec to use vq instead of config space

Thanks

Another thing if we want to go this way:

We need to find a way to terminate the data path from the kernel side, to
implement the reset semantics.


Do you mean terminating the data path in vdpa_reset()?


Yes.



   Is it ok to just
notify userspace to stop data path asynchronously?


For well-behaved userspace, yes but no for buggy or malicious ones.


But the buggy or malicious daemons can't do anything if my
understanding is correct.



You're right. I originally thought there could still be bouncing. But
considering we don't do that during a fault, it should be safe.





I had an idea: how about terminating the IOTLB in this case? Then we
would in fact turn the datapath off.


Sorry, I didn't get your point here. What do you mean by terminating
IOTLB?



I meant terminating the bouncing, but it looks safe on second thought :)

Thanks



  Remove iotlb mapping? But userspace can still access the mapped
region.

Thanks,
Yongji
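
Coming back to the wait_event_killable() discussion above, a minimal
sketch of the timeout-based variant, assuming the msg->list member used
for queueing in the posted patch; the timeout value, the -ETIMEDOUT
return and the list_del() cleanup are illustrative assumptions, not the
actual submission:

#define VDUSE_MSG_TIMEOUT (30 * HZ)   /* illustrative value */

static int vduse_dev_msg_sync(struct vduse_dev *dev,
                              struct vduse_dev_msg *msg)
{
        long left;

        init_waitqueue_head(&msg->waitq);
        spin_lock(&dev->msg_lock);
        vduse_enqueue_msg(&dev->send_list, msg);
        wake_up(&dev->waitq);
        spin_unlock(&dev->msg_lock);

        /* 0 on timeout, -ERESTARTSYS when killed, > 0 on completion */
        left = wait_event_killable_timeout(msg->waitq, msg->completed,
                                           VDUSE_MSG_TIMEOUT);

        spin_lock(&dev->msg_lock);
        if (!msg->completed) {
                /* userspace never answered: drop the message and fail */
                list_del(&msg->list);
                spin_unlock(&dev->msg_lock);
                return left < 0 ? left : -ETIMEDOUT;
        }
        spin_unlock(&dev->msg_lock);

        return 0;
}

The failure would then still have to be propagated through
vdpa_config_ops/virtio_config_ops, which is the part discussed above that
needs return values (or something like NEEDS_RESET) in the virtio core.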




Re: Re: [PATCH v7 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-05-27 Thread Yongji Xie
On Fri, May 28, 2021 at 9:33 AM Jason Wang  wrote:
>
>
> On 2021/5/27 6:14 PM, Yongji Xie wrote:
> > On Thu, May 27, 2021 at 4:43 PM Jason Wang  wrote:
> >>
> >> On 2021/5/27 4:41 PM, Jason Wang wrote:
> >>> On 2021/5/27 3:34 PM, Yongji Xie wrote:
>  On Thu, May 27, 2021 at 1:40 PM Jason Wang  wrote:
> > On 2021/5/27 1:08 PM, Yongji Xie wrote:
> >> On Thu, May 27, 2021 at 1:00 PM Jason Wang 
> >> wrote:
> >>> On 2021/5/27 12:57 PM, Yongji Xie wrote:
>  On Thu, May 27, 2021 at 12:13 PM Jason Wang 
>  wrote:
> > On 2021/5/17 5:55 PM, Xie Yongji wrote:
> >> +
> >> +static int vduse_dev_msg_sync(struct vduse_dev *dev,
> >> +   struct vduse_dev_msg *msg)
> >> +{
> >> + init_waitqueue_head(&msg->waitq);
> >> + spin_lock(&dev->msg_lock);
> >> + vduse_enqueue_msg(&dev->send_list, msg);
> >> + wake_up(&dev->waitq);
> >> + spin_unlock(&dev->msg_lock);
> >> + wait_event_killable(msg->waitq, msg->completed);
> > What happens if the (malicious) userspace doesn't give a response
> > forever?
> >
> > It looks like a DOS. If yes, we need to consider a way to fix that.
> >
>  How about using wait_event_killable_timeout() instead?
> >>> Probably, and then we need choose a suitable timeout and more
> >>> important,
> >>> need to report the failure to virtio.
> >>>
> >> Makes sense to me. But it looks like some
> >> vdpa_config_ops/virtio_config_ops such as set_status() didn't have a
> >> return value.  Now I add a WARN_ON() for the failure. Do you mean we
> >> need to add some change for virtio core to handle the failure?
> > Maybe, but I'm not sure how hard we can do that.
> >
>  We need to change all virtio device drivers in this way.
> >>>
> >>> Probably.
> >>>
> >>>
> > We had NEEDS_RESET but it looks we don't implement it.
> >
>  Could it handle the failure of get_feature() and get/set_config()?
> >>>
> >>> Looks not:
> >>>
> >>> "
> >>>
> >>> The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
> >>> that a reset is needed. If DRIVER_OK is set, after it sets
> >>> DEVICE_NEEDS_RESET, the device MUST send a device configuration change
> >>> notification to the driver.
> >>>
> >>> "
> >>>
> >>> This looks implies that NEEDS_RESET may only work after device is
> >>> probed. But in the current design, even the reset() is not reliable.
> >>>
> >>>
> > Or a rough idea is that maybe need some relaxing to be coupled loosely
> > with userspace. E.g the device (control path) is implemented in the
> > kernel but the datapath is implemented in the userspace like TUN/TAP.
> >
>  I think it can work for most cases. One problem is that the set_config
>  might change the behavior of the data path at runtime, e.g.
>  virtnet_set_mac_address() in the virtio-net driver and
>  cache_type_store() in the virtio-blk driver. Not sure if this path is
>  able to return before the datapath is aware of this change.
> >>>
> >>> Good point.
> >>>
> >>> But set_config() should be rare:
> >>>
> >>> E.g in the case of virtio-net with VERSION_1, config space is read
> >>> only, and it was set via control vq.
> >>>
> >>> For block, we can
> >>>
> >>> 1) start from without WCE or
> >>> 2) we add a config change notification to userspace or
> >>> 3) extend the spec to use vq instead of config space
> >>>
> >>> Thanks
> >>
> >> Another thing if we want to go this way:
> >>
> >> We need find a way to terminate the data path from the kernel side, to
> >> implement to reset semantic.
> >>
> > Do you mean terminate the data path in vdpa_reset().
>
>
> Yes.
>
>
> >   Is it ok to just
> > notify userspace to stop data path asynchronously?
>
>
> For well-behaved userspace, yes but no for buggy or malicious ones.
>

But the buggy or malicious daemons can't do anything if my
understanding is correct.

> I had an idea, how about terminate IOTLB in this case? Then we're in
> fact turn datapath off.
>

Sorry, I didn't get your point here. What do you mean by terminating
IOTLB? Remove iotlb mapping? But userspace can still access the mapped
region.

Thanks,
Yongji

Re: [PATCH v7 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-05-27 Thread Jason Wang


On 2021/5/27 9:17 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 4:41 PM Jason Wang  wrote:


On 2021/5/27 3:34 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 1:40 PM Jason Wang  wrote:

On 2021/5/27 1:08 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 1:00 PM Jason Wang  wrote:

On 2021/5/27 12:57 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 12:13 PM Jason Wang  wrote:

On 2021/5/17 5:55 PM, Xie Yongji wrote:

+
+static int vduse_dev_msg_sync(struct vduse_dev *dev,
+   struct vduse_dev_msg *msg)
+{
+ init_waitqueue_head(&msg->waitq);
+ spin_lock(&dev->msg_lock);
+ vduse_enqueue_msg(&dev->send_list, msg);
+ wake_up(&dev->waitq);
+ spin_unlock(&dev->msg_lock);
+ wait_event_killable(msg->waitq, msg->completed);

What happens if the (malicious) userspace doesn't give a response forever?

It looks like a DoS. If yes, we need to consider a way to fix that.


How about using wait_event_killable_timeout() instead?

Probably, and then we need to choose a suitable timeout and, more
importantly, we need to report the failure to virtio.


Makes sense to me. But it looks like some
vdpa_config_ops/virtio_config_ops such as set_status() don't have a
return value. Now I add a WARN_ON() for the failure. Do you mean we
need to add some changes to the virtio core to handle the failure?

Maybe, but I'm not sure how hard we can do that.


We need to change all virtio device drivers in this way.


Probably.



We had NEEDS_RESET but it looks we don't implement it.


Could it handle the failure of get_feature() and get/set_config()?


Looks not:

"

The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
that a reset is needed. If DRIVER_OK is set, after it sets
DEVICE_NEEDS_RESET, the device MUST send a device configuration change
notification to the driver.

"

This seems to imply that NEEDS_RESET may only work after the device is
probed. But in the current design, even the reset() is not reliable.



Or a rough idea is that maybe we need some relaxing so the device is
coupled loosely with userspace. E.g. the device (control path) is
implemented in the kernel but the datapath is implemented in userspace,
like TUN/TAP.


I think it can work for most cases. One problem is that the set_config
might change the behavior of the data path at runtime, e.g.
virtnet_set_mac_address() in the virtio-net driver and
cache_type_store() in the virtio-blk driver. Not sure if this path is
able to return before the datapath is aware of this change.


Good point.

But set_config() should be rare:

E.g. in the case of virtio-net with VERSION_1, the config space is read
only, and it is set via the control vq.

For block, we can

1) start from without WCE or
2) we add a config change notification to userspace or

I prefer this way. And I think we also need to do similar things for
set/get_vq_state().



Yes, I agree.

Thanks




Thanks,
Yongji




Re: [RFC] /dev/ioasid uAPI proposal

2021-05-27 Thread Jason Wang


On 2021/5/27 3:58 PM, Tian, Kevin wrote:

/dev/ioasid provides an unified interface for managing I/O page tables for
devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
etc.) are expected to use this interface instead of creating their own logic to
isolate untrusted device DMAs initiated by userspace.



Not a native speaker but /dev/ioas seems better?




This proposal describes the uAPI of /dev/ioasid and also sample sequences
with VFIO as an example of typical usages. The driver-facing kernel API provided
by the iommu layer is still TBD, which can be discussed after consensus is
reached on this uAPI.

It's based on a lengthy discussion starting from here:
https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/

It ends up being a long write-up due to the many things to be summarized and
the non-trivial effort required to connect them into a complete proposal.
Hope it provides a clean base to converge on.

TOC

1. Terminologies and Concepts
2. uAPI Proposal
 2.1. /dev/ioasid uAPI
 2.2. /dev/vfio uAPI
 2.3. /dev/kvm uAPI
3. Sample structures and helper functions
4. PASID virtualization
5. Use Cases and Flows
 5.1. A simple example
 5.2. Multiple IOASIDs (no nesting)
 5.3. IOASID nesting (software)
 5.4. IOASID nesting (hardware)
 5.5. Guest SVA (vSVA)
 5.6. I/O page fault
 5.7. BIND_PASID_TABLE


1. Terminologies and Concepts
-

IOASID FD is the container holding multiple I/O address spaces. The user
manages those address spaces through FD operations. Multiple FDs are
allowed per process, but with this proposal one FD should be sufficient for
all intended usages.

IOASID is the FD-local software handle representing an I/O address space.
Each IOASID is associated with a single I/O page table. IOASIDs can be
nested together, implying the output address from one I/O page table
(represented by child IOASID) must be further translated by another I/O
page table (represented by parent IOASID).

I/O address space can be managed through two protocols, according to
whether the corresponding I/O page table is constructed by the kernel or
the user. When kernel-managed, a dma mapping protocol (similar to
existing VFIO iommu type1) is provided for the user to explicitly specify
how the I/O address space is mapped. Otherwise, a different protocol is
provided for the user to bind a user-managed I/O page table to the
IOMMU, plus necessary commands for iotlb invalidation and I/O fault
handling.

The pgtable binding protocol can be used only on child IOASIDs, implying
IOASID nesting must be enabled. This is because the kernel doesn't trust
userspace. Nesting allows the kernel to enforce its DMA isolation policy
through the parent IOASID.

IOASID nesting can be implemented in two ways: hardware nesting and
software nesting. With hardware support the child and parent I/O page
tables are walked consecutively by the IOMMU to form a nested translation.
When it's implemented in software, the ioasid driver



Need to explain what "ioasid driver" means.

I guess it's the module that implements the IOASID abstraction:

1) RID
2) RID+PASID
3) others

And if yes, does it allow device-specific software implementations:

1) swiotlb or
2) device specific IOASID implementation



is responsible for
merging the two-level mappings into a single-level shadow I/O page table.
Software nesting requires both child/parent page tables operated through
the dma mapping protocol, so any change in either level can be captured
by the kernel to update the corresponding shadow mapping.
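
As a rough illustration of the software-nesting (shadowing) scheme
described above, with every function and field name below being
hypothetical rather than a real kernel API:

static int ioasid_shadow_map_one(struct ioasid_data *child,
                                 struct ioasid_data *parent,
                                 unsigned long iova, size_t size)
{
        /* 1) child I/O page table: I/O virtual address -> guest physical */
        unsigned long gpa = child_pgtable_translate(child, iova);

        /* 2) parent I/O page table: guest physical -> host physical */
        phys_addr_t hpa = parent_pgtable_translate(parent, gpa);

        /*
         * 3) install iova -> hpa into the single-level shadow I/O page
         *    table that the IOMMU actually walks
         */
        return shadow_pgtable_map(child->shadow_pgtbl, iova, hpa, size);
}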

An I/O address space takes effect in the IOMMU only after it is attached
to a device. The device in the /dev/ioasid context always refers to a
physical one or 'pdev' (PF or VF).

One I/O address space could be attached to multiple devices. In this case,
/dev/ioasid uAPI applies to all attached devices under the specified IOASID.

Based on the underlying IOMMU capability one device might be allowed
to attach to multiple I/O address spaces, with DMAs accessing them by
carrying different routing information. One of them is the default I/O
address space routed by PCI Requestor ID (RID) or ARM Stream ID. The
remaining are routed by RID + Process Address Space ID (PASID) or
Stream+Substream ID. For simplicity the following context uses RID and
PASID when talking about the routing information for I/O address spaces.

Device attachment is initiated through the passthrough framework uAPI (we use
VFIO for simplicity in the following context). VFIO is responsible for identifying
the routing information and registering it to the ioasid driver when calling
the ioasid attach helper function. It could be RID if the assigned device is
a pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition,
the user might also provide its view of virtual routing information (vPASID) in
the attach call, e.g. when multiple user-managed I/O address spaces are
attached to the vfio_device. I

Re: [PATCH v7 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-05-27 Thread Jason Wang


On 2021/5/27 6:14 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 4:43 PM Jason Wang  wrote:


On 2021/5/27 4:41 PM, Jason Wang wrote:

On 2021/5/27 3:34 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 1:40 PM Jason Wang  wrote:

On 2021/5/27 1:08 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 1:00 PM Jason Wang 
wrote:

On 2021/5/27 12:57 PM, Yongji Xie wrote:

On Thu, May 27, 2021 at 12:13 PM Jason Wang 
wrote:

On 2021/5/17 5:55 PM, Xie Yongji wrote:

+
+static int vduse_dev_msg_sync(struct vduse_dev *dev,
+   struct vduse_dev_msg *msg)
+{
+ init_waitqueue_head(&msg->waitq);
+ spin_lock(&dev->msg_lock);
+ vduse_enqueue_msg(&dev->send_list, msg);
+ wake_up(&dev->waitq);
+ spin_unlock(&dev->msg_lock);
+ wait_event_killable(msg->waitq, msg->completed);

What happens if the (malicious) userspace doesn't give a response
forever?

It looks like a DoS. If yes, we need to consider a way to fix that.


How about using wait_event_killable_timeout() instead?

Probably, and then we need to choose a suitable timeout and, more
importantly, we need to report the failure to virtio.


Makes sense to me. But it looks like some
vdpa_config_ops/virtio_config_ops such as set_status() don't have a
return value. Now I add a WARN_ON() for the failure. Do you mean we
need to add some changes to the virtio core to handle the failure?

Maybe, but I'm not sure how hard we can do that.


We need to change all virtio device drivers in this way.


Probably.



We had NEEDS_RESET but it looks we don't implement it.


Could it handle the failure of get_feature() and get/set_config()?


Looks not:

"

The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
that a reset is needed. If DRIVER_OK is set, after it sets
DEVICE_NEEDS_RESET, the device MUST send a device configuration change
notification to the driver.

"

This seems to imply that NEEDS_RESET may only work after the device is
probed. But in the current design, even the reset() is not reliable.



Or a rough idea is that maybe we need some relaxing so the device is
coupled loosely with userspace. E.g. the device (control path) is
implemented in the kernel but the datapath is implemented in userspace,
like TUN/TAP.


I think it can work for most cases. One problem is that the set_config
might change the behavior of the data path at runtime, e.g.
virtnet_set_mac_address() in the virtio-net driver and
cache_type_store() in the virtio-blk driver. Not sure if this path is
able to return before the datapath is aware of this change.


Good point.

But set_config() should be rare:

E.g. in the case of virtio-net with VERSION_1, the config space is read
only, and it is set via the control vq.

For block, we can

1) start from without WCE or
2) we add a config change notification to userspace or
3) extend the spec to use vq instead of config space

Thanks


Another thing if we want to go this way:

We need to find a way to terminate the data path from the kernel side, to
implement the reset semantics.


Do you mean terminating the data path in vdpa_reset()?



Yes.



  Is it ok to just
notify userspace to stop data path asynchronously?



For well-behaved userspace, yes but no for buggy or malicious ones.

I had an idea: how about terminating the IOTLB in this case? Then we
would in fact turn the datapath off.


Thanks



  Userspace should
not be able to do any I/O at that time because the iotlb mapping is
already removed.

Thanks,
Yongji




[PATCH 2/2] iommu: Drop unnecessary of_iommu.h includes

2021-05-27 Thread Rob Herring
The only place of_iommu.h is needed is in drivers/of/device.c. Remove it
from everywhere else.

Cc: Will Deacon 
Cc: Robin Murphy 
Cc: Joerg Roedel 
Cc: Rob Clark 
Cc: Marek Szyprowski 
Cc: Krzysztof Kozlowski 
Cc: Bjorn Andersson 
Cc: Yong Wu 
Cc: Matthias Brugger 
Cc: Heiko Stuebner 
Cc: Jean-Philippe Brucker 
Cc: Frank Rowand 
Cc: linux-arm-ker...@lists.infradead.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Rob Herring 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 1 -
 drivers/iommu/arm/arm-smmu/arm-smmu.c   | 1 -
 drivers/iommu/arm/arm-smmu/qcom_iommu.c | 1 -
 drivers/iommu/exynos-iommu.c| 1 -
 drivers/iommu/ipmmu-vmsa.c  | 1 -
 drivers/iommu/msm_iommu.c   | 1 -
 drivers/iommu/mtk_iommu.c   | 1 -
 drivers/iommu/mtk_iommu_v1.c| 1 -
 drivers/iommu/omap-iommu.c  | 1 -
 drivers/iommu/rockchip-iommu.c  | 1 -
 drivers/iommu/virtio-iommu.c| 1 -
 drivers/of/platform.c   | 1 -
 12 files changed, 12 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 54b2f27b81d4..2ddc3cd5a7d1 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -23,7 +23,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 6f72c4d208ca..dba15f312cbd 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -31,7 +31,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c 
b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
index 4294abe389b2..021cf8f65ffc 100644
--- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c
+++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
@@ -25,7 +25,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index 7623d8c371f5..d0fbf1d10e18 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -17,7 +17,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index aaa6a4d59057..51ea6f00db2f 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -19,7 +19,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 7880f307cb2d..3a38352b603f 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -18,7 +18,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index e06b8a0e2b56..6f7c69688ce2 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -19,7 +19,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index 5915d7b38211..778e66f5f1aa 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -22,7 +22,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 26e517eb0dd3..91749654fd49 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -22,7 +22,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/iommu/rockchip-iommu.c b/drivers/iommu/rockchip-iommu.c
index 7a2932772fdf..bb50e015b1d5 100644
--- a/drivers/iommu/rockchip-iommu.c
+++ b/drivers/iommu/rockchip-iommu.c
@@ -21,7 +21,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 7c02481a81b4..d9f46f2c3058 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -14,7 +14,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/of/platform.c b/drivers/of/platform.c
index 25d448f5af91..74afbb7a4f5e 100644
--- a/drivers/of/platform.c
+++ b/drivers/of/platform.c
@@ -17,7 +17,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
2.27.0



[PATCH 1/2] iommu: Remove unused of_get_dma_window()

2021-05-27 Thread Rob Herring
of_get_dma_window() was added in 2012, and its last user was removed in
2014 in commit 891846516317 ("memory: Add NVIDIA Tegra memory controller
support"). Remove it and simplify the header to use forward declarations
for structs rather than includes.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Frank Rowand 
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Rob Herring 
---
 drivers/iommu/of_iommu.c | 68 
 include/linux/of_iommu.h | 17 ++
 2 files changed, 3 insertions(+), 82 deletions(-)

diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
index a9d2df001149..5696314ae69e 100644
--- a/drivers/iommu/of_iommu.c
+++ b/drivers/iommu/of_iommu.c
@@ -19,74 +19,6 @@
 
 #define NO_IOMMU   1
 
-/**
- * of_get_dma_window - Parse *dma-window property and returns 0 if found.
- *
- * @dn: device node
- * @prefix: prefix for property name if any
- * @index: index to start to parse
- * @busno: Returns busno if supported. Otherwise pass NULL
- * @addr: Returns address that DMA starts
- * @size: Returns the range that DMA can handle
- *
- * This supports different formats flexibly. "prefix" can be
- * configured if any. "busno" and "index" are optionally
- * specified. Set 0(or NULL) if not used.
- */
-int of_get_dma_window(struct device_node *dn, const char *prefix, int index,
- unsigned long *busno, dma_addr_t *addr, size_t *size)
-{
-   const __be32 *dma_window, *end;
-   int bytes, cur_index = 0;
-   char propname[NAME_MAX], addrname[NAME_MAX], sizename[NAME_MAX];
-
-   if (!dn || !addr || !size)
-   return -EINVAL;
-
-   if (!prefix)
-   prefix = "";
-
-   snprintf(propname, sizeof(propname), "%sdma-window", prefix);
-   snprintf(addrname, sizeof(addrname), "%s#dma-address-cells", prefix);
-   snprintf(sizename, sizeof(sizename), "%s#dma-size-cells", prefix);
-
-   dma_window = of_get_property(dn, propname, &bytes);
-   if (!dma_window)
-   return -ENODEV;
-   end = dma_window + bytes / sizeof(*dma_window);
-
-   while (dma_window < end) {
-   u32 cells;
-   const void *prop;
-
-   /* busno is one cell if supported */
-   if (busno)
-   *busno = be32_to_cpup(dma_window++);
-
-   prop = of_get_property(dn, addrname, NULL);
-   if (!prop)
-   prop = of_get_property(dn, "#address-cells", NULL);
-
-   cells = prop ? be32_to_cpup(prop) : of_n_addr_cells(dn);
-   if (!cells)
-   return -EINVAL;
-   *addr = of_read_number(dma_window, cells);
-   dma_window += cells;
-
-   prop = of_get_property(dn, sizename, NULL);
-   cells = prop ? be32_to_cpup(prop) : of_n_size_cells(dn);
-   if (!cells)
-   return -EINVAL;
-   *size = of_read_number(dma_window, cells);
-   dma_window += cells;
-
-   if (cur_index++ == index)
-   break;
-   }
-   return 0;
-}
-EXPORT_SYMBOL_GPL(of_get_dma_window);
-
 static int of_iommu_xlate(struct device *dev,
  struct of_phandle_args *iommu_spec)
 {
diff --git a/include/linux/of_iommu.h b/include/linux/of_iommu.h
index 16f4b3e87f20..55c1eb300a86 100644
--- a/include/linux/of_iommu.h
+++ b/include/linux/of_iommu.h
@@ -2,29 +2,18 @@
 #ifndef __OF_IOMMU_H
 #define __OF_IOMMU_H
 
-#include 
-#include 
-#include 
+struct device;
+struct device_node;
+struct iommu_ops;
 
 #ifdef CONFIG_OF_IOMMU
 
-extern int of_get_dma_window(struct device_node *dn, const char *prefix,
-int index, unsigned long *busno, dma_addr_t *addr,
-size_t *size);
-
 extern const struct iommu_ops *of_iommu_configure(struct device *dev,
struct device_node *master_np,
const u32 *id);
 
 #else
 
-static inline int of_get_dma_window(struct device_node *dn, const char *prefix,
-   int index, unsigned long *busno, dma_addr_t *addr,
-   size_t *size)
-{
-   return -EINVAL;
-}
-
 static inline const struct iommu_ops *of_iommu_configure(struct device *dev,
 struct device_node *master_np,
 const u32 *id)
-- 
2.27.0



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-27 Thread Jason Gunthorpe
On Thu, May 27, 2021 at 02:53:42PM +1000, David Gibson wrote:

> > > If the physical device had a bug which meant the mdevs *weren't*
> > > properly isolated from each other, then those mdevs would share a
> > > group, and you *would* care about it.  Depending on how the isolation
> > > failed the mdevs might or might not also share a group with the parent
> > > physical device.
> > 
> > That isn't a real scenario.. mdevs that can't be isolated just
> > wouldn't be useful to exist
> 
> Really?  So what do you do when you discover some mdevs you thought
> were isolated actually aren't due to a hardware bug?  Drop support
> from the driver entirely?  In which case what do you say to the people
> who understandably complain "but... we had all the mdevs in one guest
> anyway, we don't care if they're not isolated"?

I've never said to eliminate groups entirely. 

What I'm saying is that all the cases we have for mdev today do not
require groups, but are forced to create a fake group anyhow just to
satisfy the odd VFIO requirement to have a group FD.

If some future mdev needs groups then sure, add the appropriate group
stuff.

But that doesn't affect the decision to have a VFIO group FD, or not.

> > > It ensures that they're parked at the moment the group moves from
> > > kernel to userspace ownership, but it can't prevent dpdk from
> > > accessing and unparking those devices via peer to peer DMA.
> > 
> > Right, and adding all this group stuff did nothing to alert the poor
> > admin that is running DPDK to this risk.
> 
> Didn't it?  Seems to me the admin that in order to give the group to
> DPDK, the admin had to find and unbind all the things in it... so is
> therefore aware that they're giving everything in it to DPDK.

Again, I've never said the *group* should be removed. I'm only
concerned about the *group FD*

When the admin found and unbound they didn't use the *group FD* in any
way.

> > You put the same security labels you'd put on the group to the devices
> > that constitute the group. It is only more tricky in the sense that the
> > script that would have to do this will need to do more than ID the
> > group to label but also ID the device members of the group and label
> > their char nodes.
> 
> Well, I guess, if you take the view that root is allowed to break the
> kernel.  I tend to prefer that although root can obviously break the
> kernel if they intend do, we should make it hard to do by accident -
> which in this case would mean the kernel *enforcing* that the devices
> in the group have the same security labels, which I can't really see
> how to do without an exposed group.

How is this "break the kernel"? It has nothing to do with the
kernel. Security labels are a user space concern.

Jason


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-27 Thread Jason Gunthorpe
On Thu, May 27, 2021 at 02:58:30PM +1000, David Gibson wrote:
> On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > 
> > > 2. iommu backed mdev devices for SRIOV where mdev device is created per
> > > VF (mdev device == VF device) then that mdev device has same iommu
> > > protection scope as VF associated to it. 
> > 
> > This doesn't require, and certainly shouldn't create, a fake group.
> 
> It's only fake if you start with a narrow view of what a group is. 

A group is connected to drivers/iommu. A group object without *any*
relation to drivers/iommu is just a complete fiction, IMHO.

> > Only the VF's real IOMMU group should be used to model an iommu domain
> > linked to a VF. Injecting fake groups that are proxies for real groups
> > only opens the possibility of security problems like David is
> > concerned with.
> 
> It's not a proxy for a real group, it's a group of its own.  If you
> discover that (due to a hardware bug, for example) the mdev is *not*

What Kirti is talking about here is the case where a mdev is wrapped
around a VF and the DMA isolation stems directly from the SRIOV VF's
inherent DMA isolation, not anything the mdev wrapper did.

The group providing the isolation is the VF's group.

The group mdev implicitly creates is just a fake proxy that comes
along with mdev API. It doesn't do anything and it doesn't mean
anything.

> properly isolated from its parent PCI device, then both the mdev
> virtual device *and* the physical PCI device are in the same group.
> Groups including devices of different types and on different buses
> were considered from the start, and are precedented, if rare.

This is far too theoretical for me. A security broken mdev is
functionally useless.

We don't need to support it, and we don't need complicated software to
model it.

Jason


Re: [git pull] IOMMU Fixes for Linux v5.13-rc3

2021-05-27 Thread pr-tracker-bot
The pull request you sent on Thu, 27 May 2021 19:57:25 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git 
> tags/iommu-fixes-v5.13-rc3

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/96c132f837ff0639702d04d229da190f636a48b5

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-27 Thread Kirti Wankhede




On 5/27/2021 10:30 AM, David Gibson wrote:

On Wed, May 26, 2021 at 02:48:03AM +0530, Kirti Wankhede wrote:



On 5/26/2021 1:22 AM, Jason Gunthorpe wrote:

On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:


2. iommu backed mdev devices for SRIOV where mdev device is created per
VF (mdev device == VF device) then that mdev device has same iommu
protection scope as VF associated to it.


This doesn't require, and certainly shouldn't create, a fake group.

Only the VF's real IOMMU group should be used to model an iommu domain
linked to a VF. Injecting fake groups that are proxies for real groups
only opens the possibility of security problems like David is
concerned with.



I think this security issue should be addressed by letting mdev device
inherit its parent's iommu_group, i.e. VF's iommu_group here.


No, that doesn't work.  AIUI part of the whole point of mdevs is to
allow chunks of a single PCI function to be handed out to different
places, because they're isolated from each other not by the system
IOMMU, but by a combination of MMU hardware in the hardware (e.g. in a
GPU card) and software in the mdev driver. 


That's correct for non-iommu backed mdev devices.


If mdevs inherited the
group of their parent device they wouldn't count as isolated from each
other, which they should.



For iommu-backed mdev devices for SR-IOV, where there can be a single mdev
device per parent (here the parent device is a VF), there can't be multiple
mdev devices associated with that VF. In this case the mdev can inherit the
group of the parent device.


Kirti


Re: [git pull] IOMMU Fixes for Linux v5.13-rc3

2021-05-27 Thread Nadav Amit


> On May 27, 2021, at 10:57 AM, Joerg Roedel  wrote:
> 
> Signed PGP part
> Hi Linus,
> 
> The following changes since commit d07f6ca923ea0927a1024dfccafc5b53b61cfecc:
> 
>  Linux 5.13-rc2 (2021-05-16 15:27:44 -0700)

For 5.13-rc3? Not -rc4?



[git pull] IOMMU Fixes for Linux v5.13-rc3

2021-05-27 Thread Joerg Roedel
Hi Linus,

The following changes since commit d07f6ca923ea0927a1024dfccafc5b53b61cfecc:

  Linux 5.13-rc2 (2021-05-16 15:27:44 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git 
tags/iommu-fixes-v5.13-rc3

for you to fetch changes up to 0ee74d5a48635c848c20f152d0d488bf84641304:

  iommu/vt-d: Fix sysfs leak in alloc_iommu() (2021-05-27 16:07:08 +0200)


IOMMU Fixes for Linux v5.13-rc3

Including:

- Important fix for the AMD IOMMU driver in the recently added
  page-specific invalidation code, correcting a wrong calculation.

- Fix a NULL-ptr dereference in the AMD IOMMU driver when a
  device switches domain types.

- Fixes for the Intel VT-d driver to check for allocation
  failure and do correct cleanup.

- Another fix for Intel VT-d to not allow supervisor page
  requests from devices when using second level page
  translation.

- Add a MODULE_DEVICE_TABLE to the VIRTIO IOMMU driver


Bixuan Cui (1):
  iommu/virtio: Add missing MODULE_DEVICE_TABLE

Dan Carpenter (1):
  iommu/vt-d: Check for allocation failure in aux_detach_device()

Jean-Philippe Brucker (1):
  iommu/amd: Clear DMA ops when switching domain

Lu Baolu (1):
  iommu/vt-d: Use user privilege for RID2PASID translation

Nadav Amit (1):
  iommu/amd: Fix wrong parentheses on page-specific invalidations

Rolf Eike Beer (1):
  iommu/vt-d: Fix sysfs leak in alloc_iommu()

 drivers/iommu/amd/iommu.c| 4 +++-
 drivers/iommu/intel/dmar.c   | 4 +++-
 drivers/iommu/intel/iommu.c  | 9 +++--
 drivers/iommu/intel/pasid.c  | 3 ++-
 drivers/iommu/virtio-iommu.c | 1 +
 5 files changed, 16 insertions(+), 5 deletions(-)

Please pull.

Thanks,

Joerg



Re: [PATCH v7 01/15] swiotlb: Refactor swiotlb init functions

2021-05-27 Thread Tom Lendacky
On 5/27/21 9:41 AM, Tom Lendacky wrote:
> On 5/27/21 8:02 AM, Christoph Hellwig wrote:
>> On Wed, May 19, 2021 at 11:50:07AM -0700, Florian Fainelli wrote:
>>> You convert this call site with swiotlb_init_io_tlb_mem() which did not
>>> do the set_memory_decrypted()+memset(). Is this okay or should
>>> swiotlb_init_io_tlb_mem() add an additional argument to do this
>>> conditionally?
>>
>> The zeroing is useful and was missing before.  I think having a clean
>> state here is the right thing.
>>
>> Not sure about the set_memory_decrypted, swiotlb_update_mem_attributes
>> kinda suggests it is too early to set the memory decrypted.
>>
>> Adding Tom who should know about all this.
> 
> The reason for adding swiotlb_update_mem_attributes() was because having
> the call to set_memory_decrypted() in swiotlb_init_with_tbl() triggered a
> BUG_ON() related to interrupts not being enabled yet during boot. So that
> call had to be delayed until interrupts were enabled.

I pulled down and tested the patch set and booted with SME enabled. The
following was seen during the boot:

[0.134184] BUG: Bad page state in process swapper  pfn:108002
[0.134196] page:(ptrval) refcount:0 mapcount:-128 
mapping: index:0x0 pfn:0x108002
[0.134201] flags: 0x17c000(node=0|zone=2|lastcpupid=0x1f)
[0.134208] raw: 0017c000 88847f355e28 88847f355e28 

[0.134210] raw:  0001 ff7f 

[0.134212] page dumped because: nonzero mapcount
[0.134213] Modules linked in:
[0.134218] CPU: 0 PID: 0 Comm: swapper Not tainted 5.13.0-rc2-sos-custom #3
[0.134221] Hardware name: ...
[0.134224] Call Trace:
[0.134233]  dump_stack+0x76/0x94
[0.134244]  bad_page+0xa6/0xf0
[0.134252]  __free_pages_ok+0x331/0x360
[0.134256]  memblock_free_all+0x158/0x1c1
[0.134267]  mem_init+0x1f/0x14c
[0.134273]  start_kernel+0x290/0x574
[0.134279]  secondary_startup_64_no_verify+0xb0/0xbb

I see this about 40 times during the boot, each with a different PFN. The
system boots (which seemed odd), but I don't know if there will be side
effects to this (I didn't stress the system).

I modified the code to add a flag to not do the set_memory_decrypted(), as
suggested by Florian, when invoked from swiotlb_init_with_tbl(), and that
eliminated the bad page state BUG.

Thanks,
Tom

> 
> Thanks,
> Tom
> 
>>


Re: [PATCH v7 01/15] swiotlb: Refactor swiotlb init functions

2021-05-27 Thread Tom Lendacky
On 5/27/21 8:02 AM, Christoph Hellwig wrote:
> On Wed, May 19, 2021 at 11:50:07AM -0700, Florian Fainelli wrote:
>> You convert this call site with swiotlb_init_io_tlb_mem() which did not
>> do the set_memory_decrypted()+memset(). Is this okay or should
>> swiotlb_init_io_tlb_mem() add an additional argument to do this
>> conditionally?
> 
> The zeroing is useful and was missing before.  I think having a clean
> state here is the right thing.
> 
> Not sure about the set_memory_decrypted, swiotlb_update_mem_attributes
> kinda suggests it is too early to set the memory decrypted.
> 
> Adding Tom who should know about all this.

The reason for adding swiotlb_update_mem_attributes() was because having
the call to set_memory_decrypted() in swiotlb_init_with_tbl() triggered a
BUG_ON() related to interrupts not being enabled yet during boot. So that
call had to be delayed until interrupts were enabled.

Thanks,
Tom

> 


Re: [PATCH v8 00/15] Restricted DMA

2021-05-27 Thread Christoph Hellwig
I just finished reviewing v7, sorry.  Let me find some time to see what
difference this version makes.


Re: [PATCH v7 13/15] dma-direct: Allocate memory from restricted DMA pool if available

2021-05-27 Thread Christoph Hellwig
> +#ifdef CONFIG_DMA_RESTRICTED_POOL
> + if (swiotlb_free(dev, page, size))
> + return;
> +#endif

Please avoid the ifdefs by either stubbing out the function to be a no-op
or by using IS_ENABLED.

> +#ifdef CONFIG_DMA_RESTRICTED_POOL
> + page = swiotlb_alloc(dev, size);
> + if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
> + __dma_direct_free_pages(dev, page, size);
> + page = NULL;
> + }
> +#endif

Same here, for the stub it would just return NULL.
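
For illustration, a sketch of what the stubbed variant could look like;
the declarations mirror the names already used in this series, and the
exact placement in swiotlb.h is an assumption:

#ifdef CONFIG_DMA_RESTRICTED_POOL
struct page *swiotlb_alloc(struct device *dev, size_t size);
bool swiotlb_free(struct device *dev, struct page *page, size_t size);
#else
static inline struct page *swiotlb_alloc(struct device *dev, size_t size)
{
        return NULL;    /* no restricted pool: caller falls back */
}
static inline bool swiotlb_free(struct device *dev, struct page *page,
                                size_t size)
{
        return false;   /* nothing freed here, take the normal path */
}
#endif

so that the dma-direct call sites above simply become:

        if (swiotlb_free(dev, page, size))
                return;
        ...
        page = swiotlb_alloc(dev, size);
        if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
                __dma_direct_free_pages(dev, page, size);
                page = NULL;
        }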


Re: [PATCH v7 07/15] swiotlb: Update is_swiotlb_active to add a struct device argument

2021-05-27 Thread Christoph Hellwig
> + if (is_swiotlb_active(NULL)) {

Passing a NULL argument to this doesn't make sense.  They all should have
a struct device at hand, you'll just need to dig for it.

And this function should be about to go away anyway, but until then we
need to do this properly.


Re: [PATCH v7 04/15] swiotlb: Add restricted DMA pool initialization

2021-05-27 Thread Christoph Hellwig
I'd still much prefer to always have the pointer in struct device.
Especially as we're also looking into things like a global 64-bit bounce
buffer.  Something like this untested patch on top of your series:


diff --git a/drivers/base/core.c b/drivers/base/core.c
index 628e33939aca..3cb95fa29f70 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include  /* for dma_default_coherent */
+#include 
 
 #include "base.h"
 #include "power/power.h"
@@ -2814,6 +2815,9 @@ void device_initialize(struct device *dev)
 defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
dev->dma_coherent = dma_default_coherent;
 #endif
+#ifdef CONFIG_SWIOTLB
+   dev->dma_io_tlb_mem = &io_tlb_default_mem;
+#endif
 }
 EXPORT_SYMBOL_GPL(device_initialize);
 
diff --git a/include/linux/device.h b/include/linux/device.h
index 4987608ea4ff..6aca6fa0facc 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -416,7 +416,7 @@ struct dev_links_info {
  * @dma_pools: Dma pools (if dma'ble device).
  * @dma_mem:   Internal for coherent mem override.
  * @cma_area:  Contiguous memory area for dma allocations
- * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override.
+ * @dma_io_tlb_mem: Pointer to the swiotlb pool used.  Not for driver use.
  * @archdata:  For arch-specific additions.
  * @of_node:   Associated device tree node.
  * @fwnode:Associated device node supplied by platform firmware.
@@ -523,7 +523,7 @@ struct device {
struct cma *cma_area;   /* contiguous memory area for dma
   allocations */
 #endif
-#ifdef CONFIG_DMA_RESTRICTED_POOL
+#ifdef CONFIG_SWIOTLB
struct io_tlb_mem *dma_io_tlb_mem;
 #endif
/* arch specific additions */
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index e8cf49bd90c5..c153cd0c469c 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -95,6 +95,7 @@ struct io_tlb_mem {
spinlock_t lock;
struct dentry *debugfs;
bool late_alloc;
+   bool force_swiotlb;
struct io_tlb_slot {
phys_addr_t orig_addr;
size_t alloc_size;
@@ -103,30 +104,16 @@ struct io_tlb_mem {
 };
 extern struct io_tlb_mem *io_tlb_default_mem;
 
-static inline struct io_tlb_mem *get_io_tlb_mem(struct device *dev)
-{
-#ifdef CONFIG_DMA_RESTRICTED_POOL
-   if (dev && dev->dma_io_tlb_mem)
-   return dev->dma_io_tlb_mem;
-#endif /* CONFIG_DMA_RESTRICTED_POOL */
-
-   return io_tlb_default_mem;
-}
-
 static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
 {
-   struct io_tlb_mem *mem = get_io_tlb_mem(dev);
+   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
 
return mem && paddr >= mem->start && paddr < mem->end;
 }
 
 static inline bool is_dev_swiotlb_force(struct device *dev)
 {
-#ifdef CONFIG_DMA_RESTRICTED_POOL
-   if (dev->dma_io_tlb_mem)
-   return true;
-#endif /* CONFIG_DMA_RESTRICTED_POOL */
-   return false;
+   return dev->dma_io_tlb_mem->force_swiotlb;
 }
 
 void __init swiotlb_exit(void);
@@ -134,10 +121,8 @@ unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
 bool is_swiotlb_active(struct device *dev);
 void __init swiotlb_adjust_size(unsigned long size);
-#ifdef CONFIG_DMA_RESTRICTED_POOL
 struct page *swiotlb_alloc(struct device *dev, size_t size);
 bool swiotlb_free(struct device *dev, struct page *page, size_t size);
-#endif /* CONFIG_DMA_RESTRICTED_POOL */
 #else
 #define swiotlb_force SWIOTLB_NO_FORCE
 static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index d3fa4669229b..69d62e18f493 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -347,7 +347,7 @@ void __init swiotlb_exit(void)
 static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t 
size,
   enum dma_data_direction dir)
 {
-   struct io_tlb_mem *mem = get_io_tlb_mem(dev);
+   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
phys_addr_t orig_addr = mem->slots[index].orig_addr;
size_t alloc_size = mem->slots[index].alloc_size;
@@ -429,7 +429,7 @@ static unsigned int wrap_index(struct io_tlb_mem *mem, 
unsigned int index)
 static int find_slots(struct device *dev, phys_addr_t orig_addr,
size_t alloc_size)
 {
-   struct io_tlb_mem *mem = get_io_tlb_mem(dev);
+   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
unsigned long boundary_mask = dma_get_seg_boundary(dev);
dma_addr_t tbl_dma_addr =
phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
@@ -510,7 +510,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, 
phys_addr_t orig_addr,
size_t mapping_size, size_t alloc_size,
enum dma_data_direction dir, unsig

Re: [PATCH v7 04/15] swiotlb: Add restricted DMA pool initialization

2021-05-27 Thread Christoph Hellwig
On Mon, May 24, 2021 at 11:49:34AM -0400, Konrad Rzeszutek Wilk wrote:
> rmem_swiotlb_setup
> 
> ?
> 
> Which is ARM specific and inside the generic code?

I don't think it is arm specific at all.  It is OF specific, but just
about every platform but x86 uses OF.  And I can think of an ACPI version
of this as well.


Re: [PATCH v7 03/15] swiotlb: Add DMA_RESTRICTED_POOL

2021-05-27 Thread Christoph Hellwig
On Tue, May 18, 2021 at 02:42:03PM +0800, Claire Chang wrote:
> Add a new kconfig symbol, DMA_RESTRICTED_POOL, for restricted DMA pool.

Please merge this with the actual code that is getting added.


Re: [PATCH v7 02/15] swiotlb: Refactor swiotlb_create_debugfs

2021-05-27 Thread Christoph Hellwig
On Tue, May 18, 2021 at 02:42:02PM +0800, Claire Chang wrote:
>  struct io_tlb_mem *io_tlb_default_mem;
> +static struct dentry *debugfs_dir;
>  
>  /*
>   * Max segment that we can provide which (if pages are contingous) will
> @@ -662,18 +663,30 @@ EXPORT_SYMBOL_GPL(is_swiotlb_active);
>  
>  #ifdef CONFIG_DEBUG_FS
>  
> +static void swiotlb_create_debugfs(struct io_tlb_mem *mem, const char *name)
>  {
>   if (!mem)
> + return;

I don't think this check makes much sense here.

> +}
> +
> +static int __init swiotlb_create_default_debugfs(void)
> +{
> + struct io_tlb_mem *mem = io_tlb_default_mem;
> +
> + if (mem) {
> + swiotlb_create_debugfs(mem, "swiotlb");
> + debugfs_dir = mem->debugfs;
> + } else {
> + debugfs_dir = debugfs_create_dir("swiotlb", NULL);
> + }

This also looks rather strange.  I'd much rather move the
directory creation out of swiotlb_create_debugfs.  E.g. something like:

static void swiotlb_create_debugfs_files(struct io_tlb_mem *mem)
{
debugfs_create_ulong("io_tlb_nslabs", 0400, mem->debugfs, &mem->nslabs);
debugfs_create_ulong("io_tlb_used", 0400, mem->debugfs, &mem->used);
}

static int __init swiotlb_init_debugfs(void)
{
debugfs_dir = debugfs_create_dir("swiotlb", NULL);
if (io_tlb_default_mem) {
io_tlb_default_mem->debugfs = debugfs_dir;
swiotlb_create_debugfs_files(io_tlb_default_mem);
}
return 0;
}
late_initcall(swiotlb_init_debugfs);

...

static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
struct device *dev)
{
...
mem->debugfs = debugfs_create_dir(rmem->name, debugfs_dir);
swiotlb_create_debugfs_files(mem);


}


Re: Re: [PATCH v7 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-05-27 Thread Yongji Xie
On Thu, May 27, 2021 at 4:41 PM Jason Wang  wrote:
>
>
> > On 2021/5/27 3:34 PM, Yongji Xie wrote:
> > On Thu, May 27, 2021 at 1:40 PM Jason Wang  wrote:
> >>
> >> On 2021/5/27 1:08 PM, Yongji Xie wrote:
> >>> On Thu, May 27, 2021 at 1:00 PM Jason Wang  wrote:
>  On 2021/5/27 12:57 PM, Yongji Xie wrote:
> > On Thu, May 27, 2021 at 12:13 PM Jason Wang  wrote:
> >> On 2021/5/17 5:55 PM, Xie Yongji wrote:
> >>> +
> >>> +static int vduse_dev_msg_sync(struct vduse_dev *dev,
> >>> +   struct vduse_dev_msg *msg)
> >>> +{
> >>> + init_waitqueue_head(&msg->waitq);
> >>> + spin_lock(&dev->msg_lock);
> >>> + vduse_enqueue_msg(&dev->send_list, msg);
> >>> + wake_up(&dev->waitq);
> >>> + spin_unlock(&dev->msg_lock);
> >>> + wait_event_killable(msg->waitq, msg->completed);
> >> What happens if the (malicious) userspace doesn't give a response
> >> forever?
> >>
> >> It looks like a DOS. If yes, we need to consider a way to fix that.
> >>
> > How about using wait_event_killable_timeout() instead?
>  Probably, and then we need choose a suitable timeout and more important,
>  need to report the failure to virtio.
> 
> >>> Makes sense to me. But it looks like some
> >>> vdpa_config_ops/virtio_config_ops such as set_status() didn't have a
> >>> return value.  Now I add a WARN_ON() for the failure. Do you mean we
> >>> need to add some change for virtio core to handle the failure?
> >>
> >> Maybe, but I'm not sure how hard we can do that.
> >>
> > We need to change all virtio device drivers in this way.
>
>
> Probably.
>
>
> >
> >> We had NEEDS_RESET but it looks we don't implement it.
> >>
> > Could it handle the failure of get_feature() and get/set_config()?
>
>
> Looks not:
>
> "
>
> The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
> that a reset is needed. If DRIVER_OK is set, after it sets
> DEVICE_NEEDS_RESET, the device MUST send a device configuration change
> notification to the driver.
>
> "
>
> This looks implies that NEEDS_RESET may only work after device is
> probed. But in the current design, even the reset() is not reliable.
>
>
> >
> >> Or a rough idea is that maybe need some relaxing to be coupled loosely
> >> with userspace. E.g the device (control path) is implemented in the
> >> kernel but the datapath is implemented in the userspace like TUN/TAP.
> >>
> > I think it can work for most cases. One problem is that the set_config
> > might change the behavior of the data path at runtime, e.g.
> > virtnet_set_mac_address() in the virtio-net driver and
> > cache_type_store() in the virtio-blk driver. Not sure if this path is
> > able to return before the datapath is aware of this change.
>
>
> Good point.
>
> But set_config() should be rare:
>
> E.g in the case of virtio-net with VERSION_1, config space is read only,
> and it was set via control vq.
>
> For block, we can
>
> 1) start from without WCE or
> 2) we add a config change notification to userspace or

I prefer this way. And I think we also need to do similar things for
set/get_vq_state().

Thanks,
Yongji

[PATCH v8 15/15] of: Add plumbing for restricted DMA pool

2021-05-27 Thread Claire Chang
If a device is not behind an IOMMU, we look up the device node and set
up the restricted DMA when the restricted-dma-pool is present.

Signed-off-by: Claire Chang 
---
 drivers/of/address.c| 33 +
 drivers/of/device.c |  3 +++
 drivers/of/of_private.h |  6 ++
 3 files changed, 42 insertions(+)

diff --git a/drivers/of/address.c b/drivers/of/address.c
index aca94c348bd4..6cc7eaaf7e11 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1112,6 +1113,38 @@ bool of_dma_is_coherent(struct device_node *np)
 }
 EXPORT_SYMBOL_GPL(of_dma_is_coherent);
 
+int of_dma_set_restricted_buffer(struct device *dev, struct device_node *np)
+{
+   struct device_node *node, *of_node = dev->of_node;
+   int count, i;
+
+   count = of_property_count_elems_of_size(of_node, "memory-region",
+   sizeof(u32));
+   /*
+* If dev->of_node doesn't exist or doesn't contain memory-region, try
+* the OF node having DMA configuration.
+*/
+   if (count <= 0) {
+   of_node = np;
+   count = of_property_count_elems_of_size(
+   of_node, "memory-region", sizeof(u32));
+   }
+
+   for (i = 0; i < count; i++) {
+   node = of_parse_phandle(of_node, "memory-region", i);
+   /*
+* There might be multiple memory regions, but only one
+* restricted-dma-pool region is allowed.
+*/
+   if (of_device_is_compatible(node, "restricted-dma-pool") &&
+   of_device_is_available(node))
+   return of_reserved_mem_device_init_by_idx(dev, of_node,
+ i);
+   }
+
+   return 0;
+}
+
 /**
  * of_mmio_is_nonposted - Check if device uses non-posted MMIO
  * @np:device node
diff --git a/drivers/of/device.c b/drivers/of/device.c
index c5a9473a5fb1..2defdca418ec 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -165,6 +165,9 @@ int of_dma_configure_id(struct device *dev, struct 
device_node *np,
 
arch_setup_dma_ops(dev, dma_start, size, iommu, coherent);
 
+   if (!iommu)
+   return of_dma_set_restricted_buffer(dev, np);
+
return 0;
 }
 EXPORT_SYMBOL_GPL(of_dma_configure_id);
diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
index d717efbd637d..8fde97565d11 100644
--- a/drivers/of/of_private.h
+++ b/drivers/of/of_private.h
@@ -163,12 +163,18 @@ struct bus_dma_region;
 #if defined(CONFIG_OF_ADDRESS) && defined(CONFIG_HAS_DMA)
 int of_dma_get_range(struct device_node *np,
const struct bus_dma_region **map);
+int of_dma_set_restricted_buffer(struct device *dev, struct device_node *np);
 #else
 static inline int of_dma_get_range(struct device_node *np,
const struct bus_dma_region **map)
 {
return -ENODEV;
 }
+static inline int of_dma_set_restricted_buffer(struct device *dev,
+  struct device_node *np)
+{
+   return -ENODEV;
+}
 #endif
 
 #endif /* _LINUX_OF_PRIVATE_H */
-- 
2.31.1.818.g46aad6cb9e-goog



[PATCH v8 14/15] dt-bindings: of: Add restricted DMA pool

2021-05-27 Thread Claire Chang
Introduce the new compatible string, restricted-dma-pool, for restricted
DMA. One can specify the address and length of the restricted DMA memory
region by restricted-dma-pool in the reserved-memory node.

Signed-off-by: Claire Chang 
---
 .../reserved-memory/reserved-memory.txt   | 36 +--
 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git 
a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt 
b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
index e8d3096d922c..46804f24df05 100644
--- a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
+++ b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
@@ -51,6 +51,23 @@ compatible (optional) - standard definition
   used as a shared pool of DMA buffers for a set of devices. It can
   be used by an operating system to instantiate the necessary pool
   management subsystem if necessary.
+- restricted-dma-pool: This indicates a region of memory meant to be
+  used as a pool of restricted DMA buffers for a set of devices. The
+  memory region would be the only region accessible to those devices.
+  When using this, the no-map and reusable properties must not be set,
+  so the operating system can create a virtual mapping that will be used
+  for synchronization. The main purpose for restricted DMA is to
+  mitigate the lack of DMA access control on systems without an IOMMU,
+  which could result in the DMA accessing the system memory at
+  unexpected times and/or unexpected addresses, possibly leading to data
+  leakage or corruption. The feature on its own provides a basic level
+  of protection against the DMA overwriting buffer contents at
+  unexpected times. However, to protect against general data leakage and
+  system memory corruption, the system needs to provide a way to lock down
+  the memory access, e.g., MPU. Note that since coherent allocation
+  needs remapping, one must set up another device coherent pool by
+  shared-dma-pool and use dma_alloc_from_dev_coherent instead for atomic
+  coherent allocation.
 - vendor specific string in the form ,[-]
 no-map (optional) - empty property
 - Indicates the operating system must not create a virtual mapping
@@ -85,10 +102,11 @@ memory-region-names (optional) - a list of names, one for 
each corresponding
 
 Example
 ---
-This example defines 3 contiguous regions are defined for Linux kernel:
+This example defines 4 contiguous regions for Linux kernel:
 one default of all device drivers (named linux,cma@7200 and 64MiB in size),
-one dedicated to the framebuffer device (named framebuffer@7800, 8MiB), and
-one for multimedia processing (named multimedia-memory@7700, 64MiB).
+one dedicated to the framebuffer device (named framebuffer@7800, 8MiB),
+one for multimedia processing (named multimedia-memory@7700, 64MiB), and
+one for restricted dma pool (named restricted_dma_reserved@0x5000, 64MiB).
 
 / {
#address-cells = <1>;
@@ -120,6 +138,11 @@ one for multimedia processing (named 
multimedia-memory@7700, 64MiB).
compatible = "acme,multimedia-memory";
reg = <0x7700 0x400>;
};
+
+   restricted_dma_reserved: restricted_dma_reserved {
+   compatible = "restricted-dma-pool";
+   reg = <0x5000 0x400>;
+   };
};
 
/* ... */
@@ -138,4 +161,11 @@ one for multimedia processing (named 
multimedia-memory@7700, 64MiB).
memory-region = <&multimedia_reserved>;
/* ... */
};
+
+   pcie_device: pcie_device@0,0 {
+   reg = <0x8301 0x0 0x 0x0 0x0010
+  0x8301 0x0 0x0010 0x0 0x0010>;
+   memory-region = <&restricted_dma_mem_reserved>;
+   /* ... */
+   };
 };
-- 
2.31.1.818.g46aad6cb9e-goog



[PATCH v8 13/15] dma-direct: Allocate memory from restricted DMA pool if available

2021-05-27 Thread Claire Chang
The restricted DMA pool is preferred if available.

The restricted DMA pools provide a basic level of protection against the
DMA overwriting buffer contents at unexpected times. However, to protect
against general data leakage and system memory corruption, the system
needs to provide a way to lock down the memory access, e.g., MPU.

Note that since coherent allocation needs remapping, one must set up
another device coherent pool by shared-dma-pool and use
dma_alloc_from_dev_coherent instead for atomic coherent allocation.
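
From a driver's point of view (a hedged sketch with made-up function and size
names, assuming the device's DT node points at a restricted-dma-pool region),
the allocation call itself does not change: a blocking dma_alloc_coherent() is
simply backed by the restricted pool, while GFP_ATOMIC allocations still need a
per-device coherent pool as noted above.

#include <linux/dma-mapping.h>
#include <linux/sizes.h>

/* Hypothetical helper; 'dev' is assumed to have a restricted DMA pool. */
static int example_alloc_from_restricted_pool(struct device *dev)
{
        dma_addr_t dma;
        void *cpu;

        /* Blocking context: the backing pages come from the restricted pool. */
        cpu = dma_alloc_coherent(dev, SZ_4K, &dma, GFP_KERNEL);
        if (!cpu)
                return -ENOMEM;

        /* ... program the device with 'dma', touch the buffer via 'cpu' ... */

        dma_free_coherent(dev, SZ_4K, cpu, dma);
        return 0;
}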

Signed-off-by: Claire Chang 
---
 kernel/dma/direct.c | 38 +-
 1 file changed, 29 insertions(+), 9 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index eb4098323bbc..0d521f78c7b9 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -78,6 +78,10 @@ static bool dma_coherent_ok(struct device *dev, phys_addr_t 
phys, size_t size)
 static void __dma_direct_free_pages(struct device *dev, struct page *page,
size_t size)
 {
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   if (swiotlb_free(dev, page, size))
+   return;
+#endif
dma_free_contiguous(dev, page, size);
 }
 
@@ -92,7 +96,17 @@ static struct page *__dma_direct_alloc_pages(struct device 
*dev, size_t size,
 
gfp |= dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask,
   &phys_limit);
-   page = dma_alloc_contiguous(dev, size, gfp);
+
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   page = swiotlb_alloc(dev, size);
+   if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
+   __dma_direct_free_pages(dev, page, size);
+   page = NULL;
+   }
+#endif
+
+   if (!page)
+   page = dma_alloc_contiguous(dev, size, gfp);
if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
dma_free_contiguous(dev, page, size);
page = NULL;
@@ -148,7 +162,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
gfp |= __GFP_NOWARN;
 
if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
-   !force_dma_unencrypted(dev)) {
+   !force_dma_unencrypted(dev) && !is_dev_swiotlb_force(dev)) {
page = __dma_direct_alloc_pages(dev, size, gfp & ~__GFP_ZERO);
if (!page)
return NULL;
@@ -161,18 +175,23 @@ void *dma_direct_alloc(struct device *dev, size_t size,
}
 
if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) &&
-   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
-   !dev_is_dma_coherent(dev))
+   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) &&
+   !is_dev_swiotlb_force(dev))
return arch_dma_alloc(dev, size, dma_handle, gfp, attrs);
 
/*
 * Remapping or decrypting memory may block. If either is required and
 * we can't block, allocate the memory from the atomic pools.
+* If restricted DMA (i.e., is_dev_swiotlb_force) is required, one must
+* set up another device coherent pool by shared-dma-pool and use
+* dma_alloc_from_dev_coherent instead.
 */
if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
!gfpflags_allow_blocking(gfp) &&
(force_dma_unencrypted(dev) ||
-(IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev))))
+(IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
+ !dev_is_dma_coherent(dev))) &&
+   !is_dev_swiotlb_force(dev))
return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
 
/* we always manually zero the memory once we are done */
@@ -253,15 +272,15 @@ void dma_direct_free(struct device *dev, size_t size,
unsigned int page_order = get_order(size);
 
if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
-   !force_dma_unencrypted(dev)) {
+   !force_dma_unencrypted(dev) && !is_dev_swiotlb_force(dev)) {
/* cpu_addr is a struct page cookie, not a kernel address */
dma_free_contiguous(dev, cpu_addr, size);
return;
}
 
if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) &&
-   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
-   !dev_is_dma_coherent(dev)) {
+   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) &&
+   !is_dev_swiotlb_force(dev)) {
arch_dma_free(dev, size, cpu_addr, dma_addr, attrs);
return;
}
@@ -289,7 +308,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, 
size_t size,
void *ret;
 
if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
-   force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp))
+   force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp) &&
+   !is_dev_swiotlb_force(dev))
return dma_direct_alloc_from_pool(dev, size, dma_han

[PATCH v8 12/15] swiotlb: Add restricted DMA alloc/free support.

2021-05-27 Thread Claire Chang
Add the functions, swiotlb_{alloc,free} to support the memory allocation
from restricted DMA pool.
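
A minimal sketch of how a caller is expected to pair the two helpers (the real
consumer is the dma-direct change in patch 13/15; the wrapper names and the
page-allocator fallback below are illustrative, not from this series).

#include <linux/swiotlb.h>
#include <linux/gfp.h>

/* Hypothetical wrappers around the new helpers. */
static struct page *restricted_page_get(struct device *dev, size_t size)
{
        /* NULL if the device has no restricted pool, or the pool is full. */
        return swiotlb_alloc(dev, size);
}

static void restricted_page_put(struct device *dev, struct page *page,
                                size_t size)
{
        /* swiotlb_free() reports whether the page belonged to the pool. */
        if (!swiotlb_free(dev, page, size))
                __free_pages(page, get_order(size));    /* illustrative fallback */
}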

Signed-off-by: Claire Chang 
---
 include/linux/swiotlb.h |  4 
 kernel/dma/swiotlb.c| 35 +--
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 0c5a18d9cf89..e8cf49bd90c5 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -134,6 +134,10 @@ unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
 bool is_swiotlb_active(struct device *dev);
 void __init swiotlb_adjust_size(unsigned long size);
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+struct page *swiotlb_alloc(struct device *dev, size_t size);
+bool swiotlb_free(struct device *dev, struct page *page, size_t size);
+#endif /* CONFIG_DMA_RESTRICTED_POOL */
 #else
 #define swiotlb_force SWIOTLB_NO_FORCE
 static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index c4fc2e444e7a..648bfdde4b0c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -457,8 +457,9 @@ static int find_slots(struct device *dev, phys_addr_t 
orig_addr,
 
index = wrap = wrap_index(mem, ALIGN(mem->index, stride));
do {
-   if ((slot_addr(tbl_dma_addr, index) & iotlb_align_mask) !=
-   (orig_addr & iotlb_align_mask)) {
+   if (orig_addr &&
+   (slot_addr(tbl_dma_addr, index) & iotlb_align_mask) !=
+   (orig_addr & iotlb_align_mask)) {
index = wrap_index(mem, index + 1);
continue;
}
@@ -704,6 +705,36 @@ late_initcall(swiotlb_create_default_debugfs);
 #endif
 
 #ifdef CONFIG_DMA_RESTRICTED_POOL
+struct page *swiotlb_alloc(struct device *dev, size_t size)
+{
+   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+   phys_addr_t tlb_addr;
+   int index;
+
+   if (!mem)
+   return NULL;
+
+   index = find_slots(dev, 0, size);
+   if (index == -1)
+   return NULL;
+
+   tlb_addr = slot_addr(mem->start, index);
+
+   return pfn_to_page(PFN_DOWN(tlb_addr));
+}
+
+bool swiotlb_free(struct device *dev, struct page *page, size_t size)
+{
+   phys_addr_t tlb_addr = page_to_phys(page);
+
+   if (!is_swiotlb_buffer(dev, tlb_addr))
+   return false;
+
+   release_slots(dev, tlb_addr);
+
+   return true;
+}
+
 static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
struct device *dev)
 {
-- 
2.31.1.818.g46aad6cb9e-goog



[PATCH v8 11/15] dma-direct: Add a new wrapper __dma_direct_free_pages()

2021-05-27 Thread Claire Chang
Add a new wrapper __dma_direct_free_pages() that will be useful later
for swiotlb_free().
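
The wrapper looks like a no-op here; the point is where it is heading. A sketch
of its eventual shape once the restricted pool exists (mirroring patches 12 and
13 later in this digest, slightly simplified):

static void __dma_direct_free_pages_sketch(struct device *dev,
                                           struct page *page, size_t size)
{
#ifdef CONFIG_DMA_RESTRICTED_POOL
        /* Offer the page back to the device's restricted pool first. */
        if (swiotlb_free(dev, page, size))
                return;
#endif
        dma_free_contiguous(dev, page, size);
}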

Signed-off-by: Claire Chang 
---
 kernel/dma/direct.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 078f7087e466..eb4098323bbc 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -75,6 +75,12 @@ static bool dma_coherent_ok(struct device *dev, phys_addr_t 
phys, size_t size)
min_not_zero(dev->coherent_dma_mask, dev->bus_dma_limit);
 }
 
+static void __dma_direct_free_pages(struct device *dev, struct page *page,
+   size_t size)
+{
+   dma_free_contiguous(dev, page, size);
+}
+
 static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
gfp_t gfp)
 {
@@ -237,7 +243,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
return NULL;
}
 out_free_pages:
-   dma_free_contiguous(dev, page, size);
+   __dma_direct_free_pages(dev, page, size);
return NULL;
 }
 
@@ -273,7 +279,7 @@ void dma_direct_free(struct device *dev, size_t size,
else if (IS_ENABLED(CONFIG_ARCH_HAS_DMA_CLEAR_UNCACHED))
arch_dma_clear_uncached(cpu_addr, size);
 
-   dma_free_contiguous(dev, dma_direct_to_page(dev, dma_addr), size);
+   __dma_direct_free_pages(dev, dma_direct_to_page(dev, dma_addr), size);
 }
 
 struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
@@ -310,7 +316,7 @@ struct page *dma_direct_alloc_pages(struct device *dev, 
size_t size,
*dma_handle = phys_to_dma_direct(dev, page_to_phys(page));
return page;
 out_free_pages:
-   dma_free_contiguous(dev, page, size);
+   __dma_direct_free_pages(dev, page, size);
return NULL;
 }
 
@@ -329,7 +335,7 @@ void dma_direct_free_pages(struct device *dev, size_t size,
if (force_dma_unencrypted(dev))
set_memory_encrypted((unsigned long)vaddr, 1 << page_order);
 
-   dma_free_contiguous(dev, page, size);
+   __dma_direct_free_pages(dev, page, size);
 }
 
 #if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \
-- 
2.31.1.818.g46aad6cb9e-goog



Re: [PATCH v7 01/15] swiotlb: Refactor swiotlb init functions

2021-05-27 Thread Christoph Hellwig
On Wed, May 19, 2021 at 11:50:07AM -0700, Florian Fainelli wrote:
> You convert this call site with swiotlb_init_io_tlb_mem() which did not
> do the set_memory_decrypted()+memset(). Is this okay or should
> swiotlb_init_io_tlb_mem() add an additional argument to do this
> conditionally?

The zeroing is useful and was missing before.  I think having a clean
state here is the right thing.

Not sure about the set_memory_decrypted, swiotlb_update_mem_attributes
kinda suggests it is too early to set the memory decrypted.

Adding Tom, who should know about all this.


Re: [PATCH v7 00/15] Restricted DMA

2021-05-27 Thread Claire Chang
v8 here: https://lore.kernel.org/patchwork/cover/1437112/


[PATCH v8 10/15] swiotlb: Refactor swiotlb_tbl_unmap_single

2021-05-27 Thread Claire Chang
Add a new function, release_slots, to make the code reusable for supporting
different bounce buffer pools, e.g. restricted DMA pool.
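
A condensed view of the split (paraphrased from the diff below together with
its consumer in patch 12/15; names are as in the series): the streaming unmap
path keeps the CPU sync before releasing, while the upcoming swiotlb_free()
can release slots directly.

/* Streaming DMA unmap: sync back to the original buffer, then release. */
static void unmap_sketch(struct device *dev, phys_addr_t tlb_addr, size_t size,
                         enum dma_data_direction dir, unsigned long attrs)
{
        if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
            (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
                swiotlb_bounce(dev, tlb_addr, size, DMA_FROM_DEVICE);
        release_slots(dev, tlb_addr);
}

/* Restricted-pool free (patch 12): no sync needed, just give the slots back. */
static bool free_sketch(struct device *dev, struct page *page, size_t size)
{
        if (!is_swiotlb_buffer(dev, page_to_phys(page)))
                return false;
        release_slots(dev, page_to_phys(page));
        return true;
}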

Signed-off-by: Claire Chang 
---
 kernel/dma/swiotlb.c | 35 ---
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 88b3471ac6a8..c4fc2e444e7a 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -550,27 +550,15 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, 
phys_addr_t orig_addr,
return tlb_addr;
 }
 
-/*
- * tlb_addr is the physical address of the bounce buffer to unmap.
- */
-void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr,
- size_t mapping_size, enum dma_data_direction dir,
- unsigned long attrs)
+static void release_slots(struct device *dev, phys_addr_t tlb_addr)
 {
-   struct io_tlb_mem *mem = get_io_tlb_mem(hwdev);
+   struct io_tlb_mem *mem = get_io_tlb_mem(dev);
unsigned long flags;
-   unsigned int offset = swiotlb_align_offset(hwdev, tlb_addr);
+   unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
int nslots = nr_slots(mem->slots[index].alloc_size + offset);
int count, i;
 
-   /*
-* First, sync the memory before unmapping the entry
-*/
-   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
-   (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(hwdev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
-
/*
 * Return the buffer to the free list by setting the corresponding
 * entries to indicate the number of contiguous entries available.
@@ -605,6 +593,23 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, 
phys_addr_t tlb_addr,
spin_unlock_irqrestore(&mem->lock, flags);
 }
 
+/*
+ * tlb_addr is the physical address of the bounce buffer to unmap.
+ */
+void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
+ size_t mapping_size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+   /*
+* First, sync the memory before unmapping the entry
+*/
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
+   (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
+   swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
+
+   release_slots(dev, tlb_addr);
+}
+
 void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
size_t size, enum dma_data_direction dir)
 {
-- 
2.31.1.818.g46aad6cb9e-goog



[PATCH v8 09/15] swiotlb: Move alloc_size to find_slots

2021-05-27 Thread Claire Chang
Move the maintenance of alloc_size to find_slots for better code
reusability later.
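
The reusability being bought here (a sketch of the dependency, not upstream
code): the release path sizes the free purely from per-slot metadata, so a
future caller that allocates via find_slots() alone, swiotlb_alloc() in patch
12/15, still hands back the correct number of slots.

static unsigned int slots_to_release_sketch(struct io_tlb_mem *mem,
                                            int index, unsigned int offset)
{
        /* Works only because find_slots() now records alloc_size itself. */
        return nr_slots(mem->slots[index].alloc_size + offset);
}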

Signed-off-by: Claire Chang 
---
 kernel/dma/swiotlb.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index fa7f23fffc81..88b3471ac6a8 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -482,8 +482,11 @@ static int find_slots(struct device *dev, phys_addr_t 
orig_addr,
return -1;
 
 found:
-   for (i = index; i < index + nslots; i++)
+   for (i = index; i < index + nslots; i++) {
mem->slots[i].list = 0;
+   mem->slots[i].alloc_size =
+   alloc_size - ((i - index) << IO_TLB_SHIFT);
+   }
for (i = index - 1;
 io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 &&
 mem->slots[i].list; i--)
@@ -538,11 +541,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, 
phys_addr_t orig_addr,
 * This is needed when we sync the memory.  Then we sync the buffer if
 * needed.
 */
-   for (i = 0; i < nr_slots(alloc_size + offset); i++) {
+   for (i = 0; i < nr_slots(alloc_size + offset); i++)
mem->slots[index + i].orig_addr = slot_addr(orig_addr, i);
-   mem->slots[index + i].alloc_size =
-   alloc_size - (i << IO_TLB_SHIFT);
-   }
tlb_addr = slot_addr(mem->start, index) + offset;
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
(dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
-- 
2.31.1.818.g46aad6cb9e-goog



[PATCH v8 08/15] swiotlb: Bounce data from/to restricted DMA pool if available

2021-05-27 Thread Claire Chang
Regardless of swiotlb setting, the restricted DMA pool is preferred if
available.

The restricted DMA pools provide a basic level of protection against the
DMA overwriting buffer contents at unexpected times. However, to protect
against general data leakage and system memory corruption, the system
needs to provide a way to lock down the memory access, e.g., MPU.

Note that is_dev_swiotlb_force doesn't check if
swiotlb_force == SWIOTLB_FORCE. Otherwise the memory allocation behavior
with the default swiotlb would be changed by the following patch
("dma-direct: Allocate memory from restricted DMA pool if available").

Signed-off-by: Claire Chang 
---
 include/linux/swiotlb.h | 13 +
 kernel/dma/direct.c |  3 ++-
 kernel/dma/direct.h |  3 ++-
 kernel/dma/swiotlb.c|  8 
 4 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index c530c976d18b..0c5a18d9cf89 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -120,6 +120,15 @@ static inline bool is_swiotlb_buffer(struct device *dev, 
phys_addr_t paddr)
return mem && paddr >= mem->start && paddr < mem->end;
 }
 
+static inline bool is_dev_swiotlb_force(struct device *dev)
+{
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   if (dev->dma_io_tlb_mem)
+   return true;
+#endif /* CONFIG_DMA_RESTRICTED_POOL */
+   return false;
+}
+
 void __init swiotlb_exit(void);
 unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
@@ -131,6 +140,10 @@ static inline bool is_swiotlb_buffer(struct device *dev, 
phys_addr_t paddr)
 {
return false;
 }
+static inline bool is_dev_swiotlb_force(struct device *dev)
+{
+   return false;
+}
 static inline void swiotlb_exit(void)
 {
 }
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 7a88c34d0867..078f7087e466 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -496,7 +496,8 @@ size_t dma_direct_max_mapping_size(struct device *dev)
 {
/* If SWIOTLB is active, use its maximum mapping size */
if (is_swiotlb_active(dev) &&
-   (dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE))
+   (dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE ||
+is_dev_swiotlb_force(dev)))
return swiotlb_max_mapping_size(dev);
return SIZE_MAX;
 }
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index 13e9e7158d94..f94813674e23 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -87,7 +87,8 @@ static inline dma_addr_t dma_direct_map_page(struct device 
*dev,
phys_addr_t phys = page_to_phys(page) + offset;
dma_addr_t dma_addr = phys_to_dma(dev, phys);
 
-   if (unlikely(swiotlb_force == SWIOTLB_FORCE))
+   if (unlikely(swiotlb_force == SWIOTLB_FORCE) ||
+   is_dev_swiotlb_force(dev))
return swiotlb_map(dev, phys, size, dir, attrs);
 
if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index b2b6503ecd88..fa7f23fffc81 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -347,7 +347,7 @@ void __init swiotlb_exit(void)
 static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t 
size,
   enum dma_data_direction dir)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = get_io_tlb_mem(dev);
int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
phys_addr_t orig_addr = mem->slots[index].orig_addr;
size_t alloc_size = mem->slots[index].alloc_size;
@@ -429,7 +429,7 @@ static unsigned int wrap_index(struct io_tlb_mem *mem, 
unsigned int index)
 static int find_slots(struct device *dev, phys_addr_t orig_addr,
size_t alloc_size)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = get_io_tlb_mem(dev);
unsigned long boundary_mask = dma_get_seg_boundary(dev);
dma_addr_t tbl_dma_addr =
phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
@@ -506,7 +506,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, 
phys_addr_t orig_addr,
size_t mapping_size, size_t alloc_size,
enum dma_data_direction dir, unsigned long attrs)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = get_io_tlb_mem(dev);
unsigned int offset = swiotlb_align_offset(dev, orig_addr);
unsigned int i;
int index;
@@ -557,7 +557,7 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, 
phys_addr_t tlb_addr,
  size_t mapping_size, enum dma_data_direction dir,
  unsigned long attrs)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = get_io_tlb_mem(hwdev);
unsigned long flags;
unsigned int offset = swiotlb_align_of

[PATCH v8 07/15] swiotlb: Update is_swiotlb_active to add a struct device argument

2021-05-27 Thread Claire Chang
Update is_swiotlb_active to add a struct device argument. This will be
useful later to allow for restricted DMA pool.

Signed-off-by: Claire Chang 
---
 drivers/gpu/drm/i915/gem/i915_gem_internal.c | 2 +-
 drivers/gpu/drm/nouveau/nouveau_ttm.c| 2 +-
 drivers/pci/xen-pcifront.c   | 2 +-
 include/linux/swiotlb.h  | 4 ++--
 kernel/dma/direct.c  | 2 +-
 kernel/dma/swiotlb.c | 4 ++--
 6 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_internal.c 
b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
index ce6b664b10aa..7d48c433446b 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_internal.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
@@ -42,7 +42,7 @@ static int i915_gem_object_get_pages_internal(struct 
drm_i915_gem_object *obj)
 
max_order = MAX_ORDER;
 #ifdef CONFIG_SWIOTLB
-   if (is_swiotlb_active()) {
+   if (is_swiotlb_active(NULL)) {
unsigned int max_segment;
 
max_segment = swiotlb_max_segment();
diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c 
b/drivers/gpu/drm/nouveau/nouveau_ttm.c
index 65430912ff72..d0e998b9e2e8 100644
--- a/drivers/gpu/drm/nouveau/nouveau_ttm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c
@@ -270,7 +270,7 @@ nouveau_ttm_init(struct nouveau_drm *drm)
}
 
 #if IS_ENABLED(CONFIG_SWIOTLB) && IS_ENABLED(CONFIG_X86)
-   need_swiotlb = is_swiotlb_active();
+   need_swiotlb = is_swiotlb_active(NULL);
 #endif
 
ret = ttm_device_init(&drm->ttm.bdev, &nouveau_bo_driver, drm->dev->dev,
diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c
index b7a8f3a1921f..6d548ce53ce7 100644
--- a/drivers/pci/xen-pcifront.c
+++ b/drivers/pci/xen-pcifront.c
@@ -693,7 +693,7 @@ static int pcifront_connect_and_init_dma(struct 
pcifront_device *pdev)
 
spin_unlock(&pcifront_dev_lock);
 
-   if (!err && !is_swiotlb_active()) {
+   if (!err && !is_swiotlb_active(NULL)) {
err = pci_xen_swiotlb_init_late();
if (err)
dev_err(&pdev->xdev->dev, "Could not setup SWIOTLB!\n");
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 2a6cca07540b..c530c976d18b 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -123,7 +123,7 @@ static inline bool is_swiotlb_buffer(struct device *dev, 
phys_addr_t paddr)
 void __init swiotlb_exit(void);
 unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
-bool is_swiotlb_active(void);
+bool is_swiotlb_active(struct device *dev);
 void __init swiotlb_adjust_size(unsigned long size);
 #else
 #define swiotlb_force SWIOTLB_NO_FORCE
@@ -143,7 +143,7 @@ static inline size_t swiotlb_max_mapping_size(struct device 
*dev)
return SIZE_MAX;
 }
 
-static inline bool is_swiotlb_active(void)
+static inline bool is_swiotlb_active(struct device *dev)
 {
return false;
 }
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 84c9feb5474a..7a88c34d0867 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -495,7 +495,7 @@ int dma_direct_supported(struct device *dev, u64 mask)
 size_t dma_direct_max_mapping_size(struct device *dev)
 {
/* If SWIOTLB is active, use its maximum mapping size */
-   if (is_swiotlb_active() &&
+   if (is_swiotlb_active(dev) &&
(dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE))
return swiotlb_max_mapping_size(dev);
return SIZE_MAX;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index d99b403144a8..b2b6503ecd88 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -662,9 +662,9 @@ size_t swiotlb_max_mapping_size(struct device *dev)
return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE;
 }
 
-bool is_swiotlb_active(void)
+bool is_swiotlb_active(struct device *dev)
 {
-   return io_tlb_default_mem != NULL;
+   return get_io_tlb_mem(dev) != NULL;
 }
 EXPORT_SYMBOL_GPL(is_swiotlb_active);
 
-- 
2.31.1.818.g46aad6cb9e-goog



[PATCH v8 06/15] swiotlb: Update is_swiotlb_buffer to add a struct device argument

2021-05-27 Thread Claire Chang
Update is_swiotlb_buffer to add a struct device argument. This will be
useful later to allow for restricted DMA pool.
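
A sketch of why the extra argument matters once per-device pools arrive
(get_io_tlb_mem() is from patch 05/15, shown further down this digest): an
address bounced into a restricted pool lies outside io_tlb_default_mem, so the
range check has to be made against the pool the device actually uses.

static bool needs_unbounce_sketch(struct device *dev, phys_addr_t paddr)
{
        struct io_tlb_mem *mem = get_io_tlb_mem(dev); /* per-device or default */

        return mem && paddr >= mem->start && paddr < mem->end;
}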

Signed-off-by: Claire Chang 
---
 drivers/iommu/dma-iommu.c | 12 ++--
 drivers/xen/swiotlb-xen.c |  2 +-
 include/linux/swiotlb.h   |  6 +++---
 kernel/dma/direct.c   |  6 +++---
 kernel/dma/direct.h   |  6 +++---
 5 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7bcdd1205535..a5df35bfd150 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -504,7 +504,7 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, 
dma_addr_t dma_addr,
 
__iommu_dma_unmap(dev, dma_addr, size);
 
-   if (unlikely(is_swiotlb_buffer(phys)))
+   if (unlikely(is_swiotlb_buffer(dev, phys)))
swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
 }
 
@@ -575,7 +575,7 @@ static dma_addr_t __iommu_dma_map_swiotlb(struct device 
*dev, phys_addr_t phys,
}
 
iova = __iommu_dma_map(dev, phys, aligned_size, prot, dma_mask);
-   if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys))
+   if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(dev, phys))
swiotlb_tbl_unmap_single(dev, phys, org_size, dir, attrs);
return iova;
 }
@@ -781,7 +781,7 @@ static void iommu_dma_sync_single_for_cpu(struct device 
*dev,
if (!dev_is_dma_coherent(dev))
arch_sync_dma_for_cpu(phys, size, dir);
 
-   if (is_swiotlb_buffer(phys))
+   if (is_swiotlb_buffer(dev, phys))
swiotlb_sync_single_for_cpu(dev, phys, size, dir);
 }
 
@@ -794,7 +794,7 @@ static void iommu_dma_sync_single_for_device(struct device 
*dev,
return;
 
phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle);
-   if (is_swiotlb_buffer(phys))
+   if (is_swiotlb_buffer(dev, phys))
swiotlb_sync_single_for_device(dev, phys, size, dir);
 
if (!dev_is_dma_coherent(dev))
@@ -815,7 +815,7 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev,
if (!dev_is_dma_coherent(dev))
arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir);
 
-   if (is_swiotlb_buffer(sg_phys(sg)))
+   if (is_swiotlb_buffer(dev, sg_phys(sg)))
swiotlb_sync_single_for_cpu(dev, sg_phys(sg),
sg->length, dir);
}
@@ -832,7 +832,7 @@ static void iommu_dma_sync_sg_for_device(struct device *dev,
return;
 
for_each_sg(sgl, sg, nelems, i) {
-   if (is_swiotlb_buffer(sg_phys(sg)))
+   if (is_swiotlb_buffer(dev, sg_phys(sg)))
swiotlb_sync_single_for_device(dev, sg_phys(sg),
   sg->length, dir);
 
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 24d11861ac7d..0c4fb34f11ab 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -100,7 +100,7 @@ static int is_xen_swiotlb_buffer(struct device *dev, 
dma_addr_t dma_addr)
 * in our domain. Therefore _only_ check address within our domain.
 */
if (pfn_valid(PFN_DOWN(paddr)))
-   return is_swiotlb_buffer(paddr);
+   return is_swiotlb_buffer(dev, paddr);
return 0;
 }
 
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index b469f04cca26..2a6cca07540b 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -113,9 +113,9 @@ static inline struct io_tlb_mem *get_io_tlb_mem(struct 
device *dev)
return io_tlb_default_mem;
 }
 
-static inline bool is_swiotlb_buffer(phys_addr_t paddr)
+static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = get_io_tlb_mem(dev);
 
return mem && paddr >= mem->start && paddr < mem->end;
 }
@@ -127,7 +127,7 @@ bool is_swiotlb_active(void);
 void __init swiotlb_adjust_size(unsigned long size);
 #else
 #define swiotlb_force SWIOTLB_NO_FORCE
-static inline bool is_swiotlb_buffer(phys_addr_t paddr)
+static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
 {
return false;
 }
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index f737e3347059..84c9feb5474a 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -343,7 +343,7 @@ void dma_direct_sync_sg_for_device(struct device *dev,
for_each_sg(sgl, sg, nents, i) {
phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
 
-   if (unlikely(is_swiotlb_buffer(paddr)))
+   if (unlikely(is_swiotlb_buffer(dev, paddr)))
swiotlb_sync_single_for_device(dev, paddr, sg->length,
   dir);
 
@@ -369,7 +369,7 @@ void dma_direct_sync_sg_for_cpu(struct device 

[PATCH v8 05/15] swiotlb: Add a new get_io_tlb_mem getter

2021-05-27 Thread Claire Chang
Add a new getter, get_io_tlb_mem, to help select the io_tlb_mem struct.
The restricted DMA pool is preferred if available.

The reason it was done this way, instead of assigning the active pool to
dev->dma_io_tlb_mem, is that directly using dev->dma_io_tlb_mem might
cause memory allocation issues for existing devices: the pool can't
support atomic coherent allocation, so swiotlb_alloc needs to distinguish
it from the default swiotlb pool.
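
To make that distinction concrete, a rough sketch (hypothetical helper names,
and the dev->dma_io_tlb_mem field assumes CONFIG_DMA_RESTRICTED_POOL) of the
two lookup styles the series ends up with: bounce-buffer paths prefer the
restricted pool and otherwise fall back to the global one, whereas the
allocation path in patch 12/15 must see only the per-device pool so ordinary
devices keep using the normal page allocator.

static struct io_tlb_mem *pool_for_bouncing(struct device *dev)
{
        return get_io_tlb_mem(dev);   /* restricted pool, else io_tlb_default_mem */
}

static struct io_tlb_mem *pool_for_allocation(struct device *dev)
{
        return dev->dma_io_tlb_mem;   /* NULL unless a restricted pool is attached */
}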

Signed-off-by: Claire Chang 
---
 include/linux/swiotlb.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 03ad6e3b4056..b469f04cca26 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -2,6 +2,7 @@
 #ifndef __LINUX_SWIOTLB_H
 #define __LINUX_SWIOTLB_H
 
+#include 
 #include 
 #include 
 #include 
@@ -102,6 +103,16 @@ struct io_tlb_mem {
 };
 extern struct io_tlb_mem *io_tlb_default_mem;
 
+static inline struct io_tlb_mem *get_io_tlb_mem(struct device *dev)
+{
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   if (dev && dev->dma_io_tlb_mem)
+   return dev->dma_io_tlb_mem;
+#endif /* CONFIG_DMA_RESTRICTED_POOL */
+
+   return io_tlb_default_mem;
+}
+
 static inline bool is_swiotlb_buffer(phys_addr_t paddr)
 {
struct io_tlb_mem *mem = io_tlb_default_mem;
-- 
2.31.1.818.g46aad6cb9e-goog



[PATCH v8 04/15] swiotlb: Add restricted DMA pool initialization

2021-05-27 Thread Claire Chang
Add the initialization function to create restricted DMA pools from
matching reserved-memory nodes.
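
As a quick sizing sanity check (an illustrative userspace calculation, assuming
the usual 2 KiB IO_TLB slot size; the exact constants live in
include/linux/swiotlb.h): a 64 MiB restricted-dma-pool node yields 32768 bounce
slots.

#include <stdio.h>

int main(void)
{
        unsigned long pool_bytes = 64UL << 20;  /* 64 MiB reserved region */
        unsigned int io_tlb_shift = 11;         /* IO_TLB_SHIFT: 2 KiB slots */
        unsigned long nslabs = pool_bytes >> io_tlb_shift;

        printf("%lu slots of %u bytes each\n", nslabs, 1U << io_tlb_shift);
        return 0;
}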

Signed-off-by: Claire Chang 
---
 include/linux/device.h  |  4 +++
 include/linux/swiotlb.h |  3 +-
 kernel/dma/swiotlb.c| 76 +
 3 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 959cb9d2c9ab..e78e1ce0b1b1 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -416,6 +416,7 @@ struct dev_links_info {
  * @dma_pools: Dma pools (if dma'ble device).
  * @dma_mem:   Internal for coherent mem override.
  * @cma_area:  Contiguous memory area for dma allocations
+ * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override.
  * @archdata:  For arch-specific additions.
  * @of_node:   Associated device tree node.
  * @fwnode:Associated device node supplied by platform firmware.
@@ -521,6 +522,9 @@ struct device {
 #ifdef CONFIG_DMA_CMA
struct cma *cma_area;   /* contiguous memory area for dma
   allocations */
+#endif
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   struct io_tlb_mem *dma_io_tlb_mem;
 #endif
/* arch specific additions */
struct dev_archdata archdata;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 216854a5e513..03ad6e3b4056 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -72,7 +72,8 @@ extern enum swiotlb_force swiotlb_force;
  * range check to see if the memory was in fact allocated by this
  * API.
  * @nslabs:The number of IO TLB blocks (in groups of 64) between @start and
- * @end. This is command line adjustable via setup_io_tlb_npages.
+ * @end. For default swiotlb, this is command line adjustable via
+ * setup_io_tlb_npages.
  * @used:  The number of used IO TLB block.
  * @list:  The free list describing the number of free entries available
  * from each index.
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index b849b01a446f..d99b403144a8 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -39,6 +39,13 @@
 #ifdef CONFIG_DEBUG_FS
 #include 
 #endif
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+#include 
+#include 
+#include 
+#include 
+#include 
+#endif
 
 #include 
 #include 
@@ -690,3 +697,72 @@ static int __init swiotlb_create_default_debugfs(void)
 late_initcall(swiotlb_create_default_debugfs);
 
 #endif
+
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
+   struct device *dev)
+{
+   struct io_tlb_mem *mem = rmem->priv;
+   unsigned long nslabs = rmem->size >> IO_TLB_SHIFT;
+
+   if (dev->dma_io_tlb_mem)
+   return 0;
+
+   /*
+* Since multiple devices can share the same pool, the private data,
+* io_tlb_mem struct, will be initialized by the first device attached
+* to it.
+*/
+   if (!mem) {
+   mem = kzalloc(struct_size(mem, slots, nslabs), GFP_KERNEL);
+   if (!mem)
+   return -ENOMEM;
+
+   swiotlb_init_io_tlb_mem(mem, rmem->base, nslabs, false);
+
+   rmem->priv = mem;
+
+   if (IS_ENABLED(CONFIG_DEBUG_FS))
+   swiotlb_create_debugfs(mem, rmem->name);
+   }
+
+   dev->dma_io_tlb_mem = mem;
+
+   return 0;
+}
+
+static void rmem_swiotlb_device_release(struct reserved_mem *rmem,
+   struct device *dev)
+{
+   if (dev)
+   dev->dma_io_tlb_mem = NULL;
+}
+
+static const struct reserved_mem_ops rmem_swiotlb_ops = {
+   .device_init = rmem_swiotlb_device_init,
+   .device_release = rmem_swiotlb_device_release,
+};
+
+static int __init rmem_swiotlb_setup(struct reserved_mem *rmem)
+{
+   unsigned long node = rmem->fdt_node;
+
+   if (of_get_flat_dt_prop(node, "reusable", NULL) ||
+   of_get_flat_dt_prop(node, "linux,cma-default", NULL) ||
+   of_get_flat_dt_prop(node, "linux,dma-default", NULL) ||
+   of_get_flat_dt_prop(node, "no-map", NULL))
+   return -EINVAL;
+
+   if (PageHighMem(pfn_to_page(PHYS_PFN(rmem->base)))) {
+       pr_err("Restricted DMA pool must be accessible within the linear mapping.");
+   return -EINVAL;
+   }
+
+   rmem->ops = &rmem_swiotlb_ops;
+   pr_info("Reserved memory: created restricted DMA pool at %pa, size %ld 
MiB\n",
+   &rmem->base, (unsigned long)rmem->size / SZ_1M);
+   return 0;
+}
+
+RESERVEDMEM_OF_DECLARE(dma, "restricted-dma-pool", rmem_swiotlb_setup);
+#endif /* CONFIG_DMA_RESTRICTED_POOL */
-- 
2.31.1.818.g46aad6cb9e-goog



[PATCH v8 03/15] swiotlb: Add DMA_RESTRICTED_POOL

2021-05-27 Thread Claire Chang
Add a new kconfig symbol, DMA_RESTRICTED_POOL, for restricted DMA pool.

Signed-off-by: Claire Chang 
---
 kernel/dma/Kconfig | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index 77b405508743..3e961dc39634 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -80,6 +80,20 @@ config SWIOTLB
bool
select NEED_DMA_MAP_STATE
 
+config DMA_RESTRICTED_POOL
+   bool "DMA Restricted Pool"
+   depends on OF && OF_RESERVED_MEM
+   select SWIOTLB
+   help
+ This enables support for restricted DMA pools which provide a level of
+ DMA memory protection on systems with limited hardware protection
+ capabilities, such as those lacking an IOMMU.
+
+ For more information see
+ 

+ and .
+ If unsure, say "n".
+
 #
 # Should be selected if we can mmap non-coherent mappings to userspace.
 # The only thing that is really required is a way to set an uncached bit
-- 
2.31.1.818.g46aad6cb9e-goog



[PATCH v8 02/15] swiotlb: Refactor swiotlb_create_debugfs

2021-05-27 Thread Claire Chang
Split the debugfs creation to make the code reusable for supporting
different bounce buffer pools, e.g. restricted DMA pool.

Signed-off-by: Claire Chang 
---
 kernel/dma/swiotlb.c | 25 +++--
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index d3232fc19385..b849b01a446f 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -64,6 +64,7 @@
 enum swiotlb_force swiotlb_force;
 
 struct io_tlb_mem *io_tlb_default_mem;
+static struct dentry *debugfs_dir;
 
 /*
  * Max segment that we can provide which (if pages are contingous) will
@@ -662,18 +663,30 @@ EXPORT_SYMBOL_GPL(is_swiotlb_active);
 
 #ifdef CONFIG_DEBUG_FS
 
-static int __init swiotlb_create_debugfs(void)
+static void swiotlb_create_debugfs(struct io_tlb_mem *mem, const char *name)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
-
if (!mem)
-   return 0;
-   mem->debugfs = debugfs_create_dir("swiotlb", NULL);
+   return;
+
+   mem->debugfs = debugfs_create_dir(name, debugfs_dir);
debugfs_create_ulong("io_tlb_nslabs", 0400, mem->debugfs, &mem->nslabs);
debugfs_create_ulong("io_tlb_used", 0400, mem->debugfs, &mem->used);
+}
+
+static int __init swiotlb_create_default_debugfs(void)
+{
+   struct io_tlb_mem *mem = io_tlb_default_mem;
+
+   if (mem) {
+   swiotlb_create_debugfs(mem, "swiotlb");
+   debugfs_dir = mem->debugfs;
+   } else {
+   debugfs_dir = debugfs_create_dir("swiotlb", NULL);
+   }
+
return 0;
 }
 
-late_initcall(swiotlb_create_debugfs);
+late_initcall(swiotlb_create_default_debugfs);
 
 #endif
-- 
2.31.1.818.g46aad6cb9e-goog



[PATCH v8 01/15] swiotlb: Refactor swiotlb init functions

2021-05-27 Thread Claire Chang
Add a new function, swiotlb_init_io_tlb_mem, for the io_tlb_mem struct
initialization to make the code reusable.

Note that we now also call set_memory_decrypted in swiotlb_init_with_tbl.

Signed-off-by: Claire Chang 
---
 kernel/dma/swiotlb.c | 51 ++--
 1 file changed, 25 insertions(+), 26 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 8ca7d505d61c..d3232fc19385 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -168,9 +168,30 @@ void __init swiotlb_update_mem_attributes(void)
memset(vaddr, 0, bytes);
 }
 
-int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
+static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
+   unsigned long nslabs, bool late_alloc)
 {
+   void *vaddr = phys_to_virt(start);
unsigned long bytes = nslabs << IO_TLB_SHIFT, i;
+
+   mem->nslabs = nslabs;
+   mem->start = start;
+   mem->end = mem->start + bytes;
+   mem->index = 0;
+   mem->late_alloc = late_alloc;
+   spin_lock_init(&mem->lock);
+   for (i = 0; i < mem->nslabs; i++) {
+   mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
+   mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
+   mem->slots[i].alloc_size = 0;
+   }
+
+   set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT);
+   memset(vaddr, 0, bytes);
+}
+
+int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
+{
struct io_tlb_mem *mem;
size_t alloc_size;
 
@@ -186,16 +207,8 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long 
nslabs, int verbose)
if (!mem)
panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
  __func__, alloc_size, PAGE_SIZE);
-   mem->nslabs = nslabs;
-   mem->start = __pa(tlb);
-   mem->end = mem->start + bytes;
-   mem->index = 0;
-   spin_lock_init(&mem->lock);
-   for (i = 0; i < mem->nslabs; i++) {
-   mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
-   mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
-   mem->slots[i].alloc_size = 0;
-   }
+
+   swiotlb_init_io_tlb_mem(mem, __pa(tlb), nslabs, false);
 
io_tlb_default_mem = mem;
if (verbose)
@@ -282,7 +295,6 @@ swiotlb_late_init_with_default_size(size_t default_size)
 int
 swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
 {
-   unsigned long bytes = nslabs << IO_TLB_SHIFT, i;
struct io_tlb_mem *mem;
 
if (swiotlb_force == SWIOTLB_NO_FORCE)
@@ -297,20 +309,7 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
if (!mem)
return -ENOMEM;
 
-   mem->nslabs = nslabs;
-   mem->start = virt_to_phys(tlb);
-   mem->end = mem->start + bytes;
-   mem->index = 0;
-   mem->late_alloc = 1;
-   spin_lock_init(&mem->lock);
-   for (i = 0; i < mem->nslabs; i++) {
-   mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
-   mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
-   mem->slots[i].alloc_size = 0;
-   }
-
-   set_memory_decrypted((unsigned long)tlb, bytes >> PAGE_SHIFT);
-   memset(tlb, 0, bytes);
+   swiotlb_init_io_tlb_mem(mem, virt_to_phys(tlb), nslabs, true);
 
io_tlb_default_mem = mem;
swiotlb_print_info();
-- 
2.31.1.818.g46aad6cb9e-goog



[PATCH v8 00/15] Restricted DMA

2021-05-27 Thread Claire Chang
This series implements mitigations for lack of DMA access control on
systems without an IOMMU, which could result in the DMA accessing the
system memory at unexpected times and/or unexpected addresses, possibly
leading to data leakage or corruption.

For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is
not behind an IOMMU. As PCI-e, by design, gives the device full access to
system memory, a vulnerability in the Wi-Fi firmware could easily escalate
to a full system exploit (remote wifi exploits: [1a], [1b] that shows a
full chain of exploits; [2], [3]).

To mitigate the security concerns, we introduce restricted DMA. Restricted
DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a
specially allocated region and does memory allocation from the same region.
The feature on its own provides a basic level of protection against the DMA
overwriting buffer contents at unexpected times. However, to protect
against general data leakage and system memory corruption, the system needs
to provide a way to restrict the DMA to a predefined memory region (this is
usually done at firmware level, e.g. MPU in ATF on some ARM platforms [4]).
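
From a driver's perspective the bouncing is transparent; a hedged sketch
(made-up device and buffer names, standard DMA API only) of a streaming map on
a device whose node references a restricted-dma-pool region:

#include <linux/dma-mapping.h>

/* Hypothetical receive path on a device backed by a restricted pool. */
static int example_rx(struct device *dev, void *buf, size_t len)
{
        dma_addr_t dma;

        /* The device only ever sees the bounce buffer in the restricted region. */
        dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, dma))
                return -ENOMEM;

        /* ... tell the hardware to DMA into 'dma', wait for completion ... */

        /* Unmap copies the bounced data back into 'buf'. */
        dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);
        return 0;
}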

[1a] 
https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_4.html
[1b] 
https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_11.html
[2] https://blade.tencent.com/en/advisories/qualpwn/
[3] 
https://www.bleepingcomputer.com/news/security/vulnerabilities-found-in-highly-popular-firmware-for-wifi-chips/
[4] 
https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132

v8:
- Fix reserved-memory.txt and add the reg property in example.
- Fix sizeof for of_property_count_elems_of_size in
  drivers/of/address.c#of_dma_set_restricted_buffer.
- Apply Will's suggestion to try the OF node having DMA configuration in
  drivers/of/address.c#of_dma_set_restricted_buffer.
- Fix typo in the comment of drivers/of/address.c#of_dma_set_restricted_buffer.
- Add error message for PageHighMem in
  kernel/dma/swiotlb.c#rmem_swiotlb_device_init and move it to
  rmem_swiotlb_setup.
- Fix the message string in rmem_swiotlb_setup.

v7:
Fix debugfs, PageHighMem and comment style in rmem_swiotlb_device_init
https://lore.kernel.org/patchwork/cover/1431031/

v6:
Address the comments in v5
https://lore.kernel.org/patchwork/cover/1423201/

v5:
Rebase on latest linux-next
https://lore.kernel.org/patchwork/cover/1416899/

v4:
- Fix spinlock bad magic
- Use rmem->name for debugfs entry
- Address the comments in v3
https://lore.kernel.org/patchwork/cover/1378113/

v3:
Using only one reserved memory region for both streaming DMA and memory
allocation.
https://lore.kernel.org/patchwork/cover/1360992/

v2:
Building on top of swiotlb.
https://lore.kernel.org/patchwork/cover/1280705/

v1:
Using dma_map_ops.
https://lore.kernel.org/patchwork/cover/1271660/

Claire Chang (15):
  swiotlb: Refactor swiotlb init functions
  swiotlb: Refactor swiotlb_create_debugfs
  swiotlb: Add DMA_RESTRICTED_POOL
  swiotlb: Add restricted DMA pool initialization
  swiotlb: Add a new get_io_tlb_mem getter
  swiotlb: Update is_swiotlb_buffer to add a struct device argument
  swiotlb: Update is_swiotlb_active to add a struct device argument
  swiotlb: Bounce data from/to restricted DMA pool if available
  swiotlb: Move alloc_size to find_slots
  swiotlb: Refactor swiotlb_tbl_unmap_single
  dma-direct: Add a new wrapper __dma_direct_free_pages()
  swiotlb: Add restricted DMA alloc/free support.
  dma-direct: Allocate memory from restricted DMA pool if available
  dt-bindings: of: Add restricted DMA pool
  of: Add plumbing for restricted DMA pool

 .../reserved-memory/reserved-memory.txt   |  36 ++-
 drivers/gpu/drm/i915/gem/i915_gem_internal.c  |   2 +-
 drivers/gpu/drm/nouveau/nouveau_ttm.c |   2 +-
 drivers/iommu/dma-iommu.c |  12 +-
 drivers/of/address.c  |  25 ++
 drivers/of/device.c   |   3 +
 drivers/of/of_private.h   |   5 +
 drivers/pci/xen-pcifront.c|   2 +-
 drivers/xen/swiotlb-xen.c |   2 +-
 include/linux/device.h|   4 +
 include/linux/swiotlb.h   |  41 ++-
 kernel/dma/Kconfig|  14 +
 kernel/dma/direct.c   |  63 +++--
 kernel/dma/direct.h   |   9 +-
 kernel/dma/swiotlb.c  | 242 +-
 15 files changed, 362 insertions(+), 100 deletions(-)

-- 
2.31.1.818.g46aad6cb9e-goog



Re: [PATCH v7 14/15] dt-bindings: of: Add restricted DMA pool

2021-05-27 Thread Claire Chang
On Thu, May 27, 2021 at 7:35 PM Will Deacon  wrote:
>
> On Thu, May 27, 2021 at 07:29:20PM +0800, Claire Chang wrote:
> > On Wed, May 26, 2021 at 11:53 PM Will Deacon  wrote:
> > >
> > > On Wed, May 26, 2021 at 01:13:22PM +0100, Will Deacon wrote:
> > > > On Tue, May 18, 2021 at 02:42:14PM +0800, Claire Chang wrote:
> > > > > @@ -138,4 +160,9 @@ one for multimedia processing (named 
> > > > > multimedia-memory@7700, 64MiB).
> > > > > memory-region = <&multimedia_reserved>;
> > > > > /* ... */
> > > > > };
> > > > > +
> > > > > +   pcie_device: pcie_device@0,0 {
> > > > > +   memory-region = <&restricted_dma_mem_reserved>;
> > > > > +   /* ... */
> > > > > +   };
> > > >
> > > > I still don't understand how this works for individual PCIe devices -- 
> > > > how
> > > > is dev->of_node set to point at the node you have above?
> > > >
> > > > I tried adding the memory-region to the host controller instead, and 
> > > > then
> > > > I see it crop up in dmesg:
> > > >
> > > >   | pci-host-generic 4000.pci: assigned reserved memory node 
> > > > restricted_dma_mem_reserved
> > > >
> > > > but none of the actual PCI devices end up with 'dma_io_tlb_mem' set, and
> > > > so the restricted DMA area is not used. In fact, swiotlb isn't used at 
> > > > all.
> > > >
> > > > What am I missing to make this work with PCIe devices?
> > >
> > > Aha, looks like we're just missing the logic to inherit the DMA
> > > configuration. The diff below gets things working for me.
> >
> > I guess what was missing is the reg property in the pcie_device node.
> > Will update the example dts.
>
> Thanks. I still think something like my diff makes sense, if you wouldn't 
> mind including
> it, as it allows restricted DMA to be used for situations where the PCIe
> topology is not static.
>
> Perhaps we should prefer dev->of_node if it exists, but then use the node
> of the host bridge's parent node otherwise?

Sure. Let me add in the next version.

>
> Will


Re: [PATCH v7 14/15] dt-bindings: of: Add restricted DMA pool

2021-05-27 Thread Will Deacon
On Thu, May 27, 2021 at 08:48:59PM +0800, Claire Chang wrote:
> On Thu, May 27, 2021 at 7:35 PM Will Deacon  wrote:
> >
> > On Thu, May 27, 2021 at 07:29:20PM +0800, Claire Chang wrote:
> > > On Wed, May 26, 2021 at 11:53 PM Will Deacon  wrote:
> > > >
> > > > On Wed, May 26, 2021 at 01:13:22PM +0100, Will Deacon wrote:
> > > > > On Tue, May 18, 2021 at 02:42:14PM +0800, Claire Chang wrote:
> > > > > > @@ -138,4 +160,9 @@ one for multimedia processing (named 
> > > > > > multimedia-memory@7700, 64MiB).
> > > > > > memory-region = <&multimedia_reserved>;
> > > > > > /* ... */
> > > > > > };
> > > > > > +
> > > > > > +   pcie_device: pcie_device@0,0 {
> > > > > > +   memory-region = <&restricted_dma_mem_reserved>;
> > > > > > +   /* ... */
> > > > > > +   };
> > > > >
> > > > > I still don't understand how this works for individual PCIe devices 
> > > > > -- how
> > > > > is dev->of_node set to point at the node you have above?
> > > > >
> > > > > I tried adding the memory-region to the host controller instead, and 
> > > > > then
> > > > > I see it crop up in dmesg:
> > > > >
> > > > >   | pci-host-generic 4000.pci: assigned reserved memory node 
> > > > > restricted_dma_mem_reserved
> > > > >
> > > > > but none of the actual PCI devices end up with 'dma_io_tlb_mem' set, 
> > > > > and
> > > > > so the restricted DMA area is not used. In fact, swiotlb isn't used 
> > > > > at all.
> > > > >
> > > > > What am I missing to make this work with PCIe devices?
> > > >
> > > > Aha, looks like we're just missing the logic to inherit the DMA
> > > > configuration. The diff below gets things working for me.
> > >
> > > I guess what was missing is the reg property in the pcie_device node.
> > > Will update the example dts.
> >
> > Thanks. I still think something like my diff makes sense, if you wouldn't 
> > mind including
> > it, as it allows restricted DMA to be used for situations where the PCIe
> > topology is not static.
> >
> > Perhaps we should prefer dev->of_node if it exists, but then use the node
> > of the host bridge's parent node otherwise?
> 
> Sure. Let me add in the next version.

Brill, thanks! I'll take it for a spin once it lands on the list.

Will


Re: [PATCH v7 14/15] dt-bindings: of: Add restricted DMA pool

2021-05-27 Thread Will Deacon
On Thu, May 27, 2021 at 07:29:20PM +0800, Claire Chang wrote:
> On Wed, May 26, 2021 at 11:53 PM Will Deacon  wrote:
> >
> > On Wed, May 26, 2021 at 01:13:22PM +0100, Will Deacon wrote:
> > > On Tue, May 18, 2021 at 02:42:14PM +0800, Claire Chang wrote:
> > > > @@ -138,4 +160,9 @@ one for multimedia processing (named 
> > > > multimedia-memory@7700, 64MiB).
> > > > memory-region = <&multimedia_reserved>;
> > > > /* ... */
> > > > };
> > > > +
> > > > +   pcie_device: pcie_device@0,0 {
> > > > +   memory-region = <&restricted_dma_mem_reserved>;
> > > > +   /* ... */
> > > > +   };
> > >
> > > I still don't understand how this works for individual PCIe devices -- how
> > > is dev->of_node set to point at the node you have above?
> > >
> > > I tried adding the memory-region to the host controller instead, and then
> > > I see it crop up in dmesg:
> > >
> > >   | pci-host-generic 4000.pci: assigned reserved memory node 
> > > restricted_dma_mem_reserved
> > >
> > > but none of the actual PCI devices end up with 'dma_io_tlb_mem' set, and
> > > so the restricted DMA area is not used. In fact, swiotlb isn't used at 
> > > all.
> > >
> > > What am I missing to make this work with PCIe devices?
> >
> > Aha, looks like we're just missing the logic to inherit the DMA
> > configuration. The diff below gets things working for me.
> 
> I guess what was missing is the reg property in the pcie_device node.
> Will update the example dts.

Thanks. I still think something like my diff makes sense, if you wouldn't mind 
including
it, as it allows restricted DMA to be used for situations where the PCIe
topology is not static.

Perhaps we should prefer dev->of_node if it exists, but then use the node
of the host bridge's parent node otherwise?

Will


Re: [PATCH v7 14/15] dt-bindings: of: Add restricted DMA pool

2021-05-27 Thread Claire Chang
On Wed, May 26, 2021 at 11:53 PM Will Deacon  wrote:
>
> On Wed, May 26, 2021 at 01:13:22PM +0100, Will Deacon wrote:
> > On Tue, May 18, 2021 at 02:42:14PM +0800, Claire Chang wrote:
> > > @@ -138,4 +160,9 @@ one for multimedia processing (named 
> > > multimedia-memory@7700, 64MiB).
> > > memory-region = <&multimedia_reserved>;
> > > /* ... */
> > > };
> > > +
> > > +   pcie_device: pcie_device@0,0 {
> > > +   memory-region = <&restricted_dma_mem_reserved>;
> > > +   /* ... */
> > > +   };
> >
> > I still don't understand how this works for individual PCIe devices -- how
> > is dev->of_node set to point at the node you have above?
> >
> > I tried adding the memory-region to the host controller instead, and then
> > I see it crop up in dmesg:
> >
> >   | pci-host-generic 4000.pci: assigned reserved memory node 
> > restricted_dma_mem_reserved
> >
> > but none of the actual PCI devices end up with 'dma_io_tlb_mem' set, and
> > so the restricted DMA area is not used. In fact, swiotlb isn't used at all.
> >
> > What am I missing to make this work with PCIe devices?
>
> Aha, looks like we're just missing the logic to inherit the DMA
> configuration. The diff below gets things working for me.

I guess what was missing is the reg property in the pcie_device node.
Will update the example dts.


Re: [RFC PATCH v3 0/8] Add IOPF support for VFIO passthrough

2021-05-27 Thread Shenming Lu
On 2021/5/25 6:11, Alex Williamson wrote:
> On Fri, 21 May 2021 14:37:21 +0800
> Shenming Lu  wrote:
> 
>> Hi Alex,
>>
>> On 2021/5/19 2:57, Alex Williamson wrote:
>>> On Fri, 9 Apr 2021 11:44:12 +0800
>>> Shenming Lu  wrote:
>>>   
 Hi,

 Requesting for your comments and suggestions. :-)

 The static pinning and mapping problem in VFIO and possible solutions
 have been discussed a lot [1, 2]. One of the solutions is to add I/O
 Page Fault support for VFIO devices. Different from those relatively
 complicated software approaches such as presenting a vIOMMU that provides
 the DMA buffer information (might include para-virtualized optimizations),
 IOPF mainly depends on the hardware faulting capability, such as the PCIe
 PRI extension or Arm SMMU stall model. What's more, the IOPF support in
 the IOMMU driver has already been implemented in SVA [3]. So we add IOPF
 support for VFIO passthrough based on the IOPF part of SVA in this series. 
  
>>>
>>> The SVA proposals are being reworked to make use of a new IOASID
>>> object, it's not clear to me that this shouldn't also make use of that
>>> work as it does a significant expansion of the type1 IOMMU with fault
>>> handlers that would duplicate new work using that new model.  
>>
>> It seems that the IOASID extension for guest SVA would not affect this 
>> series,
>> will we do any host-guest IOASID translation in the VFIO fault handler?
> 
> Surely it will, we don't currently have any IOMMU fault handling or
> forwarding of IOMMU faults through to the vfio bus driver, both of
> those would be included in an IOASID implementation.  I think Jason's
> vision is to use IOASID to deprecate type1 for all use cases, so even
> if we were to choose to implement IOPF in type1 we should agree on
> common interfaces with IOASID.

Yeah, the guest IOPF (L1) handling may include the host-guest IOASID translation,
which can be placed in the IOASID layer (in fact it can be placed in many places,
such as the vfio-pci driver, since it doesn't really handle the fault event, it
just transfers the event to the vIOMMU).
But the host IOPF (L2) has no relationship with IOASID at all; it needs knowledge
of the vfio_dma ranges.
Could we add the host IOPF support to type1 first (integrated within the MAP
ioctl)?
And we may migrate the generic iommu controls (such as MAP/UNMAP...) from type1
to IOASID in the future (it seems to be a huge amount of work, and I would be
very happy to help with it)... :-)

>  
 We have measured its performance with UADK [4] (passthrough an accelerator
 to a VM(1U16G)) on Hisilicon Kunpeng920 board (and compared with host SVA):

 Run hisi_sec_test...
  - with varying sending times and message lengths
  - with/without IOPF enabled (speed slowdown)

 when msg_len = 1MB (and PREMAP_LEN (in Patch 4) = 1):
                        slowdown (num of faults)
  times   VFIO IOPF        host SVA
  1       63.4% (518)      82.8% (512)
  100     22.9% (1058)     47.9% (1024)
  1000    2.6% (1071)      8.5% (1024)

 when msg_len = 10MB (and PREMAP_LEN = 512):
                        slowdown (num of faults)
  times   VFIO IOPF
  1       32.6% (13)
  100     3.5% (26)
  1000    1.6% (26)
>>>
>>> It seems like this is only an example that you can make a benchmark
>>> show anything you want.  The best results would be to pre-map
>>> everything, which is what we have without this series.  What is an
>>> acceptable overhead to incur to avoid page pinning?  What if userspace
>>> had more fine grained control over which mappings were available for
>>> faulting and which were statically mapped?  I don't really see what
>>> sense the pre-mapping range makes.  If I assume the user is QEMU in a
>>> non-vIOMMU configuration, pre-mapping the beginning of each RAM section
>>> doesn't make any logical sense relative to device DMA.  
>>
>> As you said in Patch 4, we can introduce full end-to-end functionality
>> before trying to improve performance, and I will drop the pre-mapping patch
>> in the current stage...
>>
>> Is there a need that userspace wants more fine grained control over which
>> mappings are available for faulting? If so, we may evolve the MAP ioctl
>> to support for specifying the faulting range.
> 
> You're essentially enabling this for a vfio bus driver via patch 7/8,
> pinning for selective DMA faulting.  How would a driver in userspace
> make equivalent requests?  In the case of performance, the user could
> mlock the page but they have no mechanism here to pre-fault it.  Should
> they?

Makes sense.

It seems that we should additionally iommu_map the pages which are IOPF
enabled and pinned in vfio_iommu_type1_pin_pages, and there is no need
to add more tracking structures in Patch 7...
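
A minimal sketch of that idea, assuming a hypothetical vfio_dma_iopf_enabled()
predicate (the real change would also need to update iopf_mapped_bitmap and
handle errors):

/* Hedged sketch: when vfio_iommu_type1_pin_pages() pins a page that belongs
 * to an IOPF-enabled vfio_dma, also establish the IOMMU mapping so the
 * device never has to fault on it. dma->prot already holds IOMMU_READ/WRITE.
 */
static int vfio_iopf_map_pinned_page(struct vfio_domain *domain,
				     struct vfio_dma *dma,
				     dma_addr_t iova, unsigned long pfn)
{
	if (!vfio_dma_iopf_enabled(dma))	/* hypothetical predicate */
		return 0;

	return iommu_map(domain->domain, iova,
			 (phys_addr_t)pfn << PAGE_SHIFT, PAGE_SIZE, dma->prot);
}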

> 
>> As for the overhead of IOPF, it is unavoidable if enabling on-demand paging
>> (and page faults occur almost only when first accessing), and it se

Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support

2021-05-27 Thread Lu Baolu

Hi Shenming and Alex,

On 5/27/21 7:03 PM, Shenming Lu wrote:

I haven't fully read all the references, but who imposes the fact that
there's only one fault handler per device?  If type1 could register one
handler and the vfio-pci bus driver another for the other level, would
we need this path through vfio-core?

If we could register more than one handler per device, things would become
much simpler, and the path through vfio-core would not be needed.

Hi Baolu,
Is there any restriction for having more than one handler per device?



Currently, each device can only have one fault handler. But one device
might consume multiple page tables. From this point of view, it's more
reasonable to have one handler per page table.
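
As a rough illustration of that model (purely hypothetical, not an existing
IOMMU-core API), the handler state would hang off the page table rather than
the device:

/* Hedged sketch: one fault handler per page table (domain), so type1 could
 * own the handler for the stage-2 table while e.g. vfio-pci owns the one for
 * a nested stage-1 table. Assumes struct iommu_fault from <linux/iommu.h>
 * and <linux/errno.h>.
 */
struct pgtable_fault_handler {
	int (*handle)(struct iommu_fault *fault, void *data);
	void *data;
};

static int pgtable_report_fault(struct pgtable_fault_handler *h,
				struct iommu_fault *fault)
{
	if (!h || !h->handle)
		return -ENODEV;		/* no handler bound to this page table */
	return h->handle(fault, h->data);
}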

Best regards,
baolu


Re: [RFC PATCH v3 2/8] vfio/type1: Add a page fault handler

2021-05-27 Thread Shenming Lu
On 2021/5/25 6:11, Alex Williamson wrote:
> On Fri, 21 May 2021 14:38:52 +0800
> Shenming Lu  wrote:
> 
>> On 2021/5/19 2:58, Alex Williamson wrote:
>>> On Fri, 9 Apr 2021 11:44:14 +0800
>>> Shenming Lu  wrote:
>>>   
 VFIO manages the DMA mapping itself. To support IOPF (on-demand paging)
 for VFIO (IOMMU capable) devices, we add a VFIO page fault handler to
 serve the reported page faults from the IOMMU driver.

 Signed-off-by: Shenming Lu 
 ---
  drivers/vfio/vfio_iommu_type1.c | 114 
  1 file changed, 114 insertions(+)

 diff --git a/drivers/vfio/vfio_iommu_type1.c 
 b/drivers/vfio/vfio_iommu_type1.c
 index 45cbfd4879a5..ab0ff60ee207 100644
 --- a/drivers/vfio/vfio_iommu_type1.c
 +++ b/drivers/vfio/vfio_iommu_type1.c
 @@ -101,6 +101,7 @@ struct vfio_dma {
struct task_struct  *task;
struct rb_root  pfn_list;   /* Ex-user pinned pfn list */
unsigned long   *bitmap;
 +  unsigned long   *iopf_mapped_bitmap;
  };
  
  struct vfio_batch {
 @@ -141,6 +142,16 @@ struct vfio_regions {
size_t len;
  };
  
 +/* A global IOPF enabled group list */
 +static struct rb_root iopf_group_list = RB_ROOT;
 +static DEFINE_MUTEX(iopf_group_list_lock);
 +
 +struct vfio_iopf_group {
 +  struct rb_node  node;
 +  struct iommu_group  *iommu_group;
 +  struct vfio_iommu   *iommu;
 +};
 +
  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)   \
(!list_empty(&iommu->domain_list))
  
 @@ -157,6 +168,10 @@ struct vfio_regions {
  #define DIRTY_BITMAP_PAGES_MAX ((u64)INT_MAX)
  #define DIRTY_BITMAP_SIZE_MAX  
 DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
  
 +#define IOPF_MAPPED_BITMAP_GET(dma, i)\
 +((dma->iopf_mapped_bitmap[(i) / BITS_PER_LONG]
 \
 + >> ((i) % BITS_PER_LONG)) & 0x1)  
>>>
>>>
>>> Can't we just use test_bit()?  
>>
>> Yeah, we can use it.
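
For reference, a minimal sketch of the test_bit()-style accessors suggested
above, using the iopf_mapped_bitmap field added by this patch:

/* Hedged sketch: replace the open-coded IOPF_MAPPED_BITMAP_GET() macro with
 * the standard bitmap helpers from <linux/bitops.h>. */
static bool vfio_dma_iopf_mapped(struct vfio_dma *dma, unsigned long pgoff)
{
	return test_bit(pgoff, dma->iopf_mapped_bitmap);
}

static void vfio_dma_mark_iopf_mapped(struct vfio_dma *dma, unsigned long pgoff)
{
	set_bit(pgoff, dma->iopf_mapped_bitmap);
}
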
>>
>>>
>>>   
 +
  #define WAITED 1
  
  static int put_pfn(unsigned long pfn, int prot);
 @@ -416,6 +431,34 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma 
 *dma, struct vfio_pfn *vpfn)
return ret;
  }
  
 +/*
 + * Helper functions for iopf_group_list
 + */
 +static struct vfio_iopf_group *
 +vfio_find_iopf_group(struct iommu_group *iommu_group)
 +{
 +  struct vfio_iopf_group *iopf_group;
 +  struct rb_node *node;
 +
 +  mutex_lock(&iopf_group_list_lock);
 +
 +  node = iopf_group_list.rb_node;
 +
 +  while (node) {
 +  iopf_group = rb_entry(node, struct vfio_iopf_group, node);
 +
 +  if (iommu_group < iopf_group->iommu_group)
 +  node = node->rb_left;
 +  else if (iommu_group > iopf_group->iommu_group)
 +  node = node->rb_right;
 +  else
 +  break;
 +  }
 +
 +  mutex_unlock(&iopf_group_list_lock);
 +  return node ? iopf_group : NULL;
 +}  
>>>
>>> This looks like a pretty heavy weight operation per DMA fault.
>>>
>>> I'm also suspicious of this validity of this iopf_group after we've
>>> dropped the locking, the ordering of patches makes this very confusing.  
>>
>> My thought was to include the handling of DMA faults completely in the type1
>> backend by introducing the vfio_iopf_group struct. But it seems that
>> introducing a struct with an unknown lifecycle causes more problems...
>> I will use the path from vfio-core as in the v2 for simplicity and validity.
>>
>> Sorry for the confusion, I will reconstruct the series later. :-)
>>
>>>   
 +
  static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
  {
struct mm_struct *mm;
 @@ -3106,6 +3149,77 @@ static int vfio_iommu_type1_dirty_pages(struct 
 vfio_iommu *iommu,
return -EINVAL;
  }
  
 +/* VFIO I/O Page Fault handler */
 +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void 
 *data)  
>>>   
>>> >From the comment, this seems like the IOMMU fault handler (the  
>>> construction of this series makes this difficult to follow) and
>>> eventually it handles more than DMA mapping, for example transferring
>>> faults to the device driver.  "dma_map_iopf" seems like a poorly scoped
>>> name.  
>>
>> Maybe just call it dev_fault_handler?
> 
> Better.
> 
 +{
 +  struct device *dev = (struct device *)data;
 +  struct iommu_group *iommu_group;
 +  struct vfio_iopf_group *iopf_group;
 +  struct vfio_iommu *iommu;
 +  struct vfio_dma *dma;
 +  dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
 +  int access_flags = 0;
 +  unsigned long bit_offset, vaddr, pfn;
 +  int ret;
 +  enum iommu_

Re: [RFC PATCH v3 5/8] vfio/type1: VFIO_IOMMU_ENABLE_IOPF

2021-05-27 Thread Shenming Lu
On 2021/5/25 6:11, Alex Williamson wrote:
> On Fri, 21 May 2021 14:38:25 +0800
> Shenming Lu  wrote:
> 
>> On 2021/5/19 2:58, Alex Williamson wrote:
>>> On Fri, 9 Apr 2021 11:44:17 +0800
>>> Shenming Lu  wrote:
>>>   
 Since enabling IOPF for devices may lead to a slow ramp up of performance,
 we add an ioctl VFIO_IOMMU_ENABLE_IOPF to make it configurable. And the
 IOPF enabling of a VFIO device includes setting IOMMU_DEV_FEAT_IOPF and
 registering the VFIO IOPF handler.

 Note that VFIO_IOMMU_DISABLE_IOPF is not supported since there may be
 inflight page faults when disabling.

 Signed-off-by: Shenming Lu 
 ---
  drivers/vfio/vfio_iommu_type1.c | 223 +++-
  include/uapi/linux/vfio.h   |   6 +
  2 files changed, 226 insertions(+), 3 deletions(-)

 diff --git a/drivers/vfio/vfio_iommu_type1.c 
 b/drivers/vfio/vfio_iommu_type1.c
 index 01e296c6dc9e..7df5711e743a 100644
 --- a/drivers/vfio/vfio_iommu_type1.c
 +++ b/drivers/vfio/vfio_iommu_type1.c
 @@ -71,6 +71,7 @@ struct vfio_iommu {
struct rb_root  dma_list;
struct blocking_notifier_head notifier;
struct mmu_notifier mn;
 +  struct mm_struct*mm;  
>>>
>>> We currently have no requirement that a single mm is used for all
>>> container mappings.  Does enabling IOPF impose such a requirement?
>>> Shouldn't MAP/UNMAP enforce that?  
>>
>> Did you mean that there is a possibility that each vfio_dma in a
>> container has a different mm? If so, could we register one MMU
>> notifier for each vfio_dma in the MAP ioctl?
> 
> We don't prevent it currently.  There needs to be some balance,
> typically we'll have one mm, so would it make sense to have potentially
> thousands of mmu notifiers registered against the same mm?

If we have one MMU notifier per vfio_dma, there is no need to search the
iommu->dma_list when receiving an address invalidation event in
mn_invalidate_range().
Or could we divide the vfio_dma ranges into parts with different mm, and
each part would have one MMU notifier?
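
A minimal sketch of the per-vfio_dma notifier idea, assuming a hypothetical
mn field embedded in struct vfio_dma (error handling and unregistration
omitted):

/* Hedged sketch: one mmu_notifier per vfio_dma, so an invalidation callback
 * already knows its range and never has to search iommu->dma_list. */
static const struct mmu_notifier_ops vfio_dma_mn_ops = {
	/* .invalidate_range = ... would unmap/zap the affected IOVAs ... */
};

static int vfio_dma_register_mn(struct vfio_dma *dma)
{
	dma->mn.ops = &vfio_dma_mn_ops;		/* dma->mn is hypothetical */

	/* Simplification: real code should take a proper reference on the mm. */
	return mmu_notifier_register(&dma->mn, dma->task->mm);
}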

>  
unsigned intdma_avail;
unsigned intvaddr_invalid_count;
uint64_tpgsize_bitmap;
 @@ -81,6 +82,7 @@ struct vfio_iommu {
booldirty_page_tracking;
boolpinned_page_dirty_scope;
boolcontainer_open;
 +  booliopf_enabled;
  };
  
  struct vfio_domain {
 @@ -461,6 +463,38 @@ vfio_find_iopf_group(struct iommu_group *iommu_group)
return node ? iopf_group : NULL;
  }
  
 +static void vfio_link_iopf_group(struct vfio_iopf_group *new)
 +{
 +  struct rb_node **link, *parent = NULL;
 +  struct vfio_iopf_group *iopf_group;
 +
 +  mutex_lock(&iopf_group_list_lock);
 +
 +  link = &iopf_group_list.rb_node;
 +
 +  while (*link) {
 +  parent = *link;
 +  iopf_group = rb_entry(parent, struct vfio_iopf_group, node);
 +
 +  if (new->iommu_group < iopf_group->iommu_group)
 +  link = &(*link)->rb_left;
 +  else
 +  link = &(*link)->rb_right;
 +  }
 +
 +  rb_link_node(&new->node, parent, link);
 +  rb_insert_color(&new->node, &iopf_group_list);
 +
 +  mutex_unlock(&iopf_group_list_lock);
 +}
 +
 +static void vfio_unlink_iopf_group(struct vfio_iopf_group *old)
 +{
 +  mutex_lock(&iopf_group_list_lock);
 +  rb_erase(&old->node, &iopf_group_list);
 +  mutex_unlock(&iopf_group_list_lock);
 +}
 +
  static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
  {
struct mm_struct *mm;
 @@ -2363,6 +2397,68 @@ static void vfio_iommu_iova_insert_copy(struct 
 vfio_iommu *iommu,
list_splice_tail(iova_copy, iova);
  }
  
 +static int vfio_dev_domian_nested(struct device *dev, int *nested)  
>>>
>>>
>>> s/domian/domain/
>>>
>>>   
 +{
 +  struct iommu_domain *domain;
 +
 +  domain = iommu_get_domain_for_dev(dev);
 +  if (!domain)
 +  return -ENODEV;
 +
 +  return iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, nested);  
>>>
>>>
>>> Wouldn't this be easier to use if it returned bool, false on -errno?  
>>
>> Make sense.
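
A minimal sketch of the bool-returning variant suggested above (also fixing
the s/domian/domain/ typo noted earlier):

/* Hedged sketch only; returns false on any error instead of an errno. */
static bool vfio_dev_domain_nested(struct device *dev)
{
	struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
	int nested;

	if (!domain)
		return false;

	if (iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &nested))
		return false;

	return nested;
}
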
>>
>>>
>>>   
 +}
 +
 +static int vfio_iommu_type1_dma_map_iopf(struct iommu_fault *fault, void 
 *data);
 +
 +static int dev_enable_iopf(struct device *dev, void *data)
 +{
 +  int *enabled_dev_cnt = data;
 +  int nested;
 +  u32 flags;
 +  int ret;
 +
 +  ret = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_IOPF);
 +  if (ret)
 +  return ret;
 +
 +  ret = vfio_dev_domian_nested(dev, &nested);
 +  if (ret)
 +  goto out_disable;
 +
 +  if (nested)
 +  

Re: [RFC PATCH v3 8/8] vfio: Add nested IOPF support

2021-05-27 Thread Shenming Lu
On 2021/5/25 6:11, Alex Williamson wrote:
> On Mon, 24 May 2021 21:11:11 +0800
> Shenming Lu  wrote:
> 
>> On 2021/5/21 15:59, Shenming Lu wrote:
>>> On 2021/5/19 2:58, Alex Williamson wrote:  
 On Fri, 9 Apr 2021 11:44:20 +0800
 Shenming Lu  wrote:
  
> To set up nested mode, drivers such as vfio_pci need to register a
> handler to receive stage/level 1 faults from the IOMMU, but since
> currently each device can only have one iommu dev fault handler,
> and if stage 2 IOPF is already enabled (VFIO_IOMMU_ENABLE_IOPF),
> we choose to update the registered handler (a consolidated one) via
> flags (set FAULT_REPORT_NESTED_L1), and further deliver the received
> stage 1 faults in the handler to the guest through a newly added
> vfio_device_ops callback.  

 Are there proposed in-kernel drivers that would use any of these
 symbols?  
>>>
>>> I hope that such as Eric's SMMUv3 Nested Stage Setup series [1] can
>>> use these symbols to consolidate the two page fault handlers into one.
>>>
>>> [1] 
>>> https://patchwork.kernel.org/project/kvm/cover/2021044659.15051-1-eric.au...@redhat.com/
>>>   
  
> Signed-off-by: Shenming Lu 
> ---
>  drivers/vfio/vfio.c | 81 +
>  drivers/vfio/vfio_iommu_type1.c | 49 +++-
>  include/linux/vfio.h| 12 +
>  3 files changed, 141 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 44c8dfabf7de..4245f15914bf 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -2356,6 +2356,87 @@ struct iommu_domain 
> *vfio_group_iommu_domain(struct vfio_group *group)
>  }
>  EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
>  
> +/*
> + * Register/Update the VFIO IOPF handler to receive
> + * nested stage/level 1 faults.
> + */
> +int vfio_iommu_dev_fault_handler_register_nested(struct device *dev)
> +{
> + struct vfio_container *container;
> + struct vfio_group *group;
> + struct vfio_iommu_driver *driver;
> + int ret;
> +
> + if (!dev)
> + return -EINVAL;
> +
> + group = vfio_group_get_from_dev(dev);
> + if (!group)
> + return -ENODEV;
> +
> + ret = vfio_group_add_container_user(group);
> + if (ret)
> + goto out;
> +
> + container = group->container;
> + driver = container->iommu_driver;
> + if (likely(driver && driver->ops->register_handler))
> + ret = driver->ops->register_handler(container->iommu_data, dev);
> + else
> + ret = -ENOTTY;
> +
> + vfio_group_try_dissolve_container(group);
> +
> +out:
> + vfio_group_put(group);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_register_nested);
> +
> +int vfio_iommu_dev_fault_handler_unregister_nested(struct device *dev)
> +{
> + struct vfio_container *container;
> + struct vfio_group *group;
> + struct vfio_iommu_driver *driver;
> + int ret;
> +
> + if (!dev)
> + return -EINVAL;
> +
> + group = vfio_group_get_from_dev(dev);
> + if (!group)
> + return -ENODEV;
> +
> + ret = vfio_group_add_container_user(group);
> + if (ret)
> + goto out;
> +
> + container = group->container;
> + driver = container->iommu_driver;
> + if (likely(driver && driver->ops->unregister_handler))
> + ret = driver->ops->unregister_handler(container->iommu_data, 
> dev);
> + else
> + ret = -ENOTTY;
> +
> + vfio_group_try_dissolve_container(group);
> +
> +out:
> + vfio_group_put(group);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_iommu_dev_fault_handler_unregister_nested);
> +
> +int vfio_transfer_iommu_fault(struct device *dev, struct iommu_fault 
> *fault)
> +{
> + struct vfio_device *device = dev_get_drvdata(dev);
> +
> + if (unlikely(!device->ops->transfer))
> + return -EOPNOTSUPP;
> +
> + return device->ops->transfer(device->device_data, fault);
> +}
> +EXPORT_SYMBOL_GPL(vfio_transfer_iommu_fault);
> +
>  /**
>   * Module/class support
>   */
> diff --git a/drivers/vfio/vfio_iommu_type1.c 
> b/drivers/vfio/vfio_iommu_type1.c
> index ba2b5a1cf6e9..9d1adeddb303 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -3821,13 +3821,32 @@ static int vfio_iommu_type1_dma_map_iopf(struct 
> iommu_fault *fault, void *data)
>   struct vfio_batch batch;
>   struct vfio_range *range;
>   dma_addr_t iova = ALIGN_DOWN(fault->prm.addr, PAGE_SIZE);
> - int access_flags = 0;
> + int access_flags = 0, nested;
>   size_t premap_len, map_len, mapped_len = 0;
>   unsigned long bit_offset, vaddr, pfn, i

Re: Re: [PATCH v7 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-05-27 Thread Yongji Xie
On Thu, May 27, 2021 at 4:43 PM Jason Wang  wrote:
>
>
> 在 2021/5/27 下午4:41, Jason Wang 写道:
> >
> > 在 2021/5/27 下午3:34, Yongji Xie 写道:
> >> On Thu, May 27, 2021 at 1:40 PM Jason Wang  wrote:
> >>>
> >>> 在 2021/5/27 下午1:08, Yongji Xie 写道:
>  On Thu, May 27, 2021 at 1:00 PM Jason Wang 
>  wrote:
> > 在 2021/5/27 下午12:57, Yongji Xie 写道:
> >> On Thu, May 27, 2021 at 12:13 PM Jason Wang 
> >> wrote:
> >>> 在 2021/5/17 下午5:55, Xie Yongji 写道:
>  +
>  +static int vduse_dev_msg_sync(struct vduse_dev *dev,
>  +   struct vduse_dev_msg *msg)
>  +{
>  + init_waitqueue_head(&msg->waitq);
>  + spin_lock(&dev->msg_lock);
>  + vduse_enqueue_msg(&dev->send_list, msg);
>  + wake_up(&dev->waitq);
>  + spin_unlock(&dev->msg_lock);
>  + wait_event_killable(msg->waitq, msg->completed);
> >>> What happens if the userspace(malicous) doesn't give a response
> >>> forever?
> >>>
> >>> It looks like a DOS. If yes, we need to consider a way to fix that.
> >>>
> >> How about using wait_event_killable_timeout() instead?
> > Probably, and then we need choose a suitable timeout and more
> > important,
> > need to report the failure to virtio.
> >
>  Makes sense to me. But it looks like some
>  vdpa_config_ops/virtio_config_ops such as set_status() didn't have a
>  return value.  Now I add a WARN_ON() for the failure. Do you mean we
>  need to add some change for virtio core to handle the failure?
> >>>
> >>> Maybe, but I'm not sure how hard we can do that.
> >>>
> >> We need to change all virtio device drivers in this way.
> >
> >
> > Probably.
> >
> >
> >>
> >>> We had NEEDS_RESET but it looks we don't implement it.
> >>>
> >> Could it handle the failure of get_feature() and get/set_config()?
> >
> >
> > Looks not:
> >
> > "
> >
> > The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
> > that a reset is needed. If DRIVER_OK is set, after it sets
> > DEVICE_NEEDS_RESET, the device MUST send a device configuration change
> > notification to the driver.
> >
> > "
> >
> > This looks implies that NEEDS_RESET may only work after device is
> > probed. But in the current design, even the reset() is not reliable.
> >
> >
> >>
> >>> Or a rough idea is that maybe need some relaxing to be coupled loosely
> >>> with userspace. E.g the device (control path) is implemented in the
> >>> kernel but the datapath is implemented in the userspace like TUN/TAP.
> >>>
> >> I think it can work for most cases. One problem is that the set_config
> >> might change the behavior of the data path at runtime, e.g.
> >> virtnet_set_mac_address() in the virtio-net driver and
> >> cache_type_store() in the virtio-blk driver. Not sure if this path is
> >> able to return before the datapath is aware of this change.
> >
> >
> > Good point.
> >
> > But set_config() should be rare:
> >
> > E.g in the case of virtio-net with VERSION_1, config space is read
> > only, and it was set via control vq.
> >
> > For block, we can
> >
> > 1) start from without WCE or
> > 2) we add a config change notification to userspace or
> > 3) extend the spec to use vq instead of config space
> >
> > Thanks
>
>
> Another thing if we want to go this way:
>
> We need find a way to terminate the data path from the kernel side, to
> implement to reset semantic.
>

Do you mean terminating the data path in vdpa_reset()? Is it OK to just
notify userspace to stop the data path asynchronously? Userspace should
not be able to do any I/O at that time because the iotlb mapping has
already been removed.
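
For reference, a minimal sketch of the wait_event_killable_timeout() variant
discussed above; VDUSE_MSG_TIMEOUT_MS is a hypothetical constant and removal
of the stale message from send_list is omitted:

/* Hedged sketch only: bound the wait and surface the failure to the caller
 * instead of blocking forever on a malicious or crashed userspace daemon. */
static int vduse_dev_msg_sync(struct vduse_dev *dev, struct vduse_dev_msg *msg)
{
	long left;

	init_waitqueue_head(&msg->waitq);
	spin_lock(&dev->msg_lock);
	vduse_enqueue_msg(&dev->send_list, msg);
	wake_up(&dev->waitq);
	spin_unlock(&dev->msg_lock);

	left = wait_event_killable_timeout(msg->waitq, msg->completed,
				msecs_to_jiffies(VDUSE_MSG_TIMEOUT_MS));
	if (left > 0)
		return 0;

	/* Timed out or killed; the caller must report the failure upwards. */
	return left ? left : -ETIMEDOUT;
}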

Thanks,
Yongji

Re: [PATCH v7 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-05-27 Thread Jason Wang


在 2021/5/27 下午4:41, Jason Wang 写道:


在 2021/5/27 下午3:34, Yongji Xie 写道:

On Thu, May 27, 2021 at 1:40 PM Jason Wang  wrote:


在 2021/5/27 下午1:08, Yongji Xie 写道:
On Thu, May 27, 2021 at 1:00 PM Jason Wang  
wrote:

在 2021/5/27 下午12:57, Yongji Xie 写道:
On Thu, May 27, 2021 at 12:13 PM Jason Wang  
wrote:

在 2021/5/17 下午5:55, Xie Yongji 写道:

+
+static int vduse_dev_msg_sync(struct vduse_dev *dev,
+   struct vduse_dev_msg *msg)
+{
+ init_waitqueue_head(&msg->waitq);
+ spin_lock(&dev->msg_lock);
+ vduse_enqueue_msg(&dev->send_list, msg);
+ wake_up(&dev->waitq);
+ spin_unlock(&dev->msg_lock);
+ wait_event_killable(msg->waitq, msg->completed);
What happens if the userspace(malicous) doesn't give a response 
forever?


It looks like a DOS. If yes, we need to consider a way to fix that.


How about using wait_event_killable_timeout() instead?
Probably, and then we need choose a suitable timeout and more 
important,

need to report the failure to virtio.


Makes sense to me. But it looks like some
vdpa_config_ops/virtio_config_ops such as set_status() didn't have a
return value.  Now I add a WARN_ON() for the failure. Do you mean we
need to add some change for virtio core to handle the failure?


Maybe, but I'm not sure how hard we can do that.


We need to change all virtio device drivers in this way.



Probably.





We had NEEDS_RESET but it looks we don't implement it.


Could it handle the failure of get_feature() and get/set_config()?



Looks not:

"

The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state 
that a reset is needed. If DRIVER_OK is set, after it sets 
DEVICE_NEEDS_RESET, the device MUST send a device configuration change 
notification to the driver.


"

This looks implies that NEEDS_RESET may only work after device is 
probed. But in the current design, even the reset() is not reliable.






Or a rough idea is that maybe need some relaxing to be coupled loosely
with userspace. E.g the device (control path) is implemented in the
kernel but the datapath is implemented in the userspace like TUN/TAP.


I think it can work for most cases. One problem is that the set_config
might change the behavior of the data path at runtime, e.g.
virtnet_set_mac_address() in the virtio-net driver and
cache_type_store() in the virtio-blk driver. Not sure if this path is
able to return before the datapath is aware of this change.



Good point.

But set_config() should be rare:

E.g in the case of virtio-net with VERSION_1, config space is read 
only, and it was set via control vq.


For block, we can

1) start from without WCE or
2) we add a config change notification to userspace or
3) extend the spec to use vq instead of config space

Thanks



Another thing if we want to go this way:

We need to find a way to terminate the data path from the kernel side, to
implement the reset semantic.


Thanks







Thanks,
Yongji




Re: [PATCH v7 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-05-27 Thread Jason Wang


在 2021/5/27 下午3:34, Yongji Xie 写道:

On Thu, May 27, 2021 at 1:40 PM Jason Wang  wrote:


在 2021/5/27 下午1:08, Yongji Xie 写道:

On Thu, May 27, 2021 at 1:00 PM Jason Wang  wrote:

在 2021/5/27 下午12:57, Yongji Xie 写道:

On Thu, May 27, 2021 at 12:13 PM Jason Wang  wrote:

在 2021/5/17 下午5:55, Xie Yongji 写道:

+
+static int vduse_dev_msg_sync(struct vduse_dev *dev,
+   struct vduse_dev_msg *msg)
+{
+ init_waitqueue_head(&msg->waitq);
+ spin_lock(&dev->msg_lock);
+ vduse_enqueue_msg(&dev->send_list, msg);
+ wake_up(&dev->waitq);
+ spin_unlock(&dev->msg_lock);
+ wait_event_killable(msg->waitq, msg->completed);

What happens if the userspace(malicous) doesn't give a response forever?

It looks like a DOS. If yes, we need to consider a way to fix that.


How about using wait_event_killable_timeout() instead?

Probably, and then we need choose a suitable timeout and more important,
need to report the failure to virtio.


Makes sense to me. But it looks like some
vdpa_config_ops/virtio_config_ops such as set_status() didn't have a
return value.  Now I add a WARN_ON() for the failure. Do you mean we
need to add some change for virtio core to handle the failure?


Maybe, but I'm not sure how hard we can do that.


We need to change all virtio device drivers in this way.



Probably.





We had NEEDS_RESET but it looks we don't implement it.


Could it handle the failure of get_feature() and get/set_config()?



Looks not:

"

The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state 
that a reset is needed. If DRIVER_OK is set, after it sets 
DEVICE_NEEDS_RESET, the device MUST send a device configuration change 
notification to the driver.


"

This looks implies that NEEDS_RESET may only work after device is 
probed. But in the current design, even the reset() is not reliable.






Or a rough idea is that maybe need some relaxing to be coupled loosely
with userspace. E.g the device (control path) is implemented in the
kernel but the datapath is implemented in the userspace like TUN/TAP.


I think it can work for most cases. One problem is that the set_config
might change the behavior of the data path at runtime, e.g.
virtnet_set_mac_address() in the virtio-net driver and
cache_type_store() in the virtio-blk driver. Not sure if this path is
able to return before the datapath is aware of this change.



Good point.

But set_config() should be rare:

E.g in the case of virtio-net with VERSION_1, config space is read only, 
and it was set via control vq.


For block, we can

1) start from without WCE or
2) we add a config change notification to userspace or
3) extend the spec to use vq instead of config space
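
A rough sketch of what option 2) might look like on the uAPI side; the
message name and layout below are hypothetical, purely to illustrate the
idea:

/* Hedged sketch: a hypothetical message from the kernel to the userspace
 * VDUSE daemon announcing that the driver wrote the config space, so the
 * userspace datapath can pick up e.g. a new cache type before the write is
 * considered complete. Assumes <linux/types.h>. */
struct vduse_dev_set_config {
	__u32 offset;		/* offset into the device config space */
	__u32 length;		/* number of bytes written */
	__u8  buffer[256];	/* the new contents */
};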

Thanks




Thanks,
Yongji




[RFC] /dev/ioasid uAPI proposal

2021-05-27 Thread Tian, Kevin
/dev/ioasid provides an unified interface for managing I/O page tables for 
devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA, 
etc.) are expected to use this interface instead of creating their own logic to 
isolate untrusted device DMAs initiated by userspace. 

This proposal describes the uAPI of /dev/ioasid and also sample sequences 
with VFIO as example in typical usages. The driver-facing kernel API provided 
by the iommu layer is still TBD, which can be discussed after consensus is 
made on this uAPI.

It's based on a lengthy discussion starting from here:

https://lore.kernel.org/linux-iommu/20210330132830.go2356...@nvidia.com/ 

It ends up being a long write-up because many things had to be summarized and
non-trivial effort was required to connect them into a complete proposal.
Hope it provides a clean base to converge on.

TOC

1. Terminologies and Concepts
2. uAPI Proposal
2.1. /dev/ioasid uAPI
2.2. /dev/vfio uAPI
2.3. /dev/kvm uAPI
3. Sample structures and helper functions
4. PASID virtualization
5. Use Cases and Flows
5.1. A simple example
5.2. Multiple IOASIDs (no nesting)
5.3. IOASID nesting (software)
5.4. IOASID nesting (hardware)
5.5. Guest SVA (vSVA)
5.6. I/O page fault
5.7. BIND_PASID_TABLE


1. Terminologies and Concepts
-

The IOASID FD is the container holding multiple I/O address spaces. The user
manages those address spaces through FD operations. Multiple FDs are allowed
per process, but with this proposal one FD should be sufficient for all
intended usages.

IOASID is the FD-local software handle representing an I/O address space. 
Each IOASID is associated with a single I/O page table. IOASIDs can be 
nested together, implying the output address from one I/O page table 
(represented by child IOASID) must be further translated by another I/O 
page table (represented by parent IOASID).

I/O address space can be managed through two protocols, according to 
whether the corresponding I/O page table is constructed by the kernel or 
the user. When kernel-managed, a dma mapping protocol (similar to 
existing VFIO iommu type1) is provided for the user to explicitly specify 
how the I/O address space is mapped. Otherwise, a different protocol is 
provided for the user to bind a user-managed I/O page table to the 
IOMMU, plus necessary commands for iotlb invalidation and I/O fault 
handling. 

The pgtable binding protocol can be used only on child IOASIDs, implying that
IOASID nesting must be enabled. This is because the kernel doesn't trust 
userspace. Nesting allows the kernel to enforce its DMA isolation policy 
through the parent IOASID.

IOASID nesting can be implemented in two ways: hardware nesting and 
software nesting. With hardware support the child and parent I/O page 
tables are walked consecutively by the IOMMU to form a nested translation. 
When it's implemented in software, the ioasid driver is responsible for 
merging the two-level mappings into a single-level shadow I/O page table. 
Software nesting requires both the child and parent page tables to be operated
through the dma mapping protocol, so any change at either level can be captured
by the kernel to update the corresponding shadow mapping.
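
To make the nesting model concrete, here is a rough userspace-side sketch;
every ioctl name and argument below is a hypothetical placeholder standing in
for whatever the final uAPI defines, and only the overall shape
(kernel-managed parent, user-managed child bound on top) follows the text
above:

#include <fcntl.h>
#include <sys/ioctl.h>

/* Placeholder command numbers, for illustration only. */
enum {
	IOASID_ALLOC		= 0,
	IOASID_MAP_DMA		= 1,
	IOASID_CREATE_NESTING	= 2,
	IOASID_BIND_PGTABLE	= 3,
};

static int ioasid_nesting_example(void)
{
	int fd = open("/dev/ioasid", O_RDWR);
	int parent, child;

	if (fd < 0)
		return -1;

	/* Parent IOASID: kernel-managed I/O page table (e.g. GPA -> HPA),
	 * which is what enforces the kernel's DMA isolation policy. */
	parent = ioctl(fd, IOASID_ALLOC, 0);

	/* Populate it through the dma mapping protocol (type1-like). */
	ioctl(fd, IOASID_MAP_DMA, parent /* + iova/vaddr/size arguments */);

	/* Child IOASID nested on the parent ... */
	child = ioctl(fd, IOASID_CREATE_NESTING, parent);

	/* ... to which the user-managed (e.g. guest) page table is bound;
	 * iotlb invalidations and I/O faults then flow through the child. */
	ioctl(fd, IOASID_BIND_PGTABLE, child /* + page table information */);

	return fd;
}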

An I/O address space takes effect in the IOMMU only after it is attached 
to a device. The device in the /dev/ioasid context always refers to a 
physical one or 'pdev' (PF or VF). 

One I/O address space could be attached to multiple devices. In this case, 
/dev/ioasid uAPI applies to all attached devices under the specified IOASID.

Based on the underlying IOMMU capability one device might be allowed 
to attach to multiple I/O address spaces, with DMAs accessing them by 
carrying different routing information. One of them is the default I/O 
address space routed by PCI Requestor ID (RID) or ARM Stream ID. The 
remaining are routed by RID + Process Address Space ID (PASID) or 
Stream+Substream ID. For simplicity the following context uses RID and
PASID when talking about the routing information for I/O address spaces.

Device attachment is initiated through passthrough framework uAPI (use
VFIO for simplicity in following context). VFIO is responsible for identifying 
the routing information and registering it to the ioasid driver when calling 
ioasid attach helper function. It could be RID if the assigned device is 
pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition, 
user might also provide its view of virtual routing information (vPASID) in 
the attach call, e.g. when multiple user-managed I/O address spaces are 
attached to the vfio_device. In this case VFIO must figure out whether 
vPASID should be directly used (for pdev) or converted to a kernel-
allocated one (pPASID, for mdev) for physical routing (see section 4).

Device must be bound to an IOASID FD before attach operation can be
conducted. This is also through VFIO uAPI. In this proposal one device 
should not be 

Re: Re: [PATCH v7 11/12] vduse: Introduce VDUSE - vDPA Device in Userspace

2021-05-27 Thread Yongji Xie
On Thu, May 27, 2021 at 1:40 PM Jason Wang  wrote:
>
>
> 在 2021/5/27 下午1:08, Yongji Xie 写道:
> > On Thu, May 27, 2021 at 1:00 PM Jason Wang  wrote:
> >>
> >> 在 2021/5/27 下午12:57, Yongji Xie 写道:
> >>> On Thu, May 27, 2021 at 12:13 PM Jason Wang  wrote:
>  在 2021/5/17 下午5:55, Xie Yongji 写道:
> > +
> > +static int vduse_dev_msg_sync(struct vduse_dev *dev,
> > +   struct vduse_dev_msg *msg)
> > +{
> > + init_waitqueue_head(&msg->waitq);
> > + spin_lock(&dev->msg_lock);
> > + vduse_enqueue_msg(&dev->send_list, msg);
> > + wake_up(&dev->waitq);
> > + spin_unlock(&dev->msg_lock);
> > + wait_event_killable(msg->waitq, msg->completed);
>  What happens if the userspace(malicous) doesn't give a response forever?
> 
>  It looks like a DOS. If yes, we need to consider a way to fix that.
> 
> >>> How about using wait_event_killable_timeout() instead?
> >>
> >> Probably, and then we need choose a suitable timeout and more important,
> >> need to report the failure to virtio.
> >>
> > Makes sense to me. But it looks like some
> > vdpa_config_ops/virtio_config_ops such as set_status() didn't have a
> > return value.  Now I add a WARN_ON() for the failure. Do you mean we
> > need to add some change for virtio core to handle the failure?
>
>
> Maybe, but I'm not sure how hard we can do that.
>

We need to change all virtio device drivers in this way.

> We had NEEDS_RESET but it looks we don't implement it.
>

Could it handle the failure of get_feature() and get/set_config()?

> Or a rough idea is that maybe need some relaxing to be coupled loosely
> with userspace. E.g the device (control path) is implemented in the
> kernel but the datapath is implemented in the userspace like TUN/TAP.
>

I think it can work for most cases. One problem is that the set_config
might change the behavior of the data path at runtime, e.g.
virtnet_set_mac_address() in the virtio-net driver and
cache_type_store() in the virtio-blk driver. Not sure if this path is
able to return before the datapath is aware of this change.

Thanks,
Yongji

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-27 Thread David Gibson
On Mon, May 24, 2021 at 08:37:44PM -0300, Jason Gunthorpe wrote:
> On Mon, May 24, 2021 at 05:52:58PM +1000, David Gibson wrote:
> 
> > > > I don't really see a semantic distinction between "always one-device
> > > > groups" and "groups don't matter".  Really the only way you can afford
> > > > to not care about groups is if they're singletons.
> > > 
> > > The kernel driver under the mdev may not be in an "always one-device"
> > > group.
> > 
> > I don't really understand what you mean by that.
> 
> I mean the group of the mdev's actual DMA device may have multiple
> things in it.
>  
> > > It is a kernel driver so the only thing we know and care about is that
> > > all devices in the HW group are bound to kernel drivers.
> > > 
> > > The vfio device that spawns from this kernel driver is really a
> > > "groups don't matter" vfio device because at the IOMMU layer it should
> > > be riding on the physical group of the kernel driver.  At the VFIO
> > > layer we no longer care about the group abstraction because the system
> > > guarentees isolation in some other way.
> > 
> > Uh.. I don't really know how mdevs are isolated from each other.  I
> > thought it was because the physical device providing the mdevs
> > effectively had an internal IOMMU (or at least DMA permissioning) to
> > isolate the mdevs, even though the physical device may not be fully
> > isolated.
> > 
> > In that case the virtual mdev is effectively in a singleton group,
> > which is different from the group of its parent device.
> 
> That is one way to view it, but it means creating a whole group
> infrastructure and abusing the IOMMU stack just to create this
> nonsense fiction.

It's a nonsense fiction until it's not, at which point it will bite
you in the arse.

> We also abuse the VFIO container stuff to hackily
> create several different types pf IOMMU uAPIs for the mdev - all of
> which are unrelated to drivers/iommu.
> 
> Basically, there is no drivers/iommu thing involved, thus is no really
> iommu group, for mdev it is all a big hacky lie.

Well, "iommu" group might not be the best name, but hardware isolation
is still a real concern here, even if it's not entirely related to the
IOMMU.

> > If the physical device had a bug which meant the mdevs *weren't*
> > properly isolated from each other, then those mdevs would share a
> > group, and you *would* care about it.  Depending on how the isolation
> > failed the mdevs might or might not also share a group with the parent
> > physical device.
> 
> That isn't a real scenario.. mdevs that can't be isolated just
> wouldn't be useful to exist

Really?  So what do you do when you discover some mdevs you thought
were isolated actually aren't due to a hardware bug?  Drop support
from the driver entirely?  In which case what do you say to the people
who understandably complain "but... we had all the mdevs in one guest
anyway, we don't care if they're not isolated"?

> > > This is today's model, yes. When you run dpdk on a multi-group device
> > > vfio already ensures that all the device groups remained parked and
> > > inaccessible.
> > 
> > I'm not really following what you're saying there.
> > 
> > If you have a multi-device group, and dpdk is using one device in it,
> > VFIO *does not* (and cannot) ensure that other devices in the group
> > are parked and inaccessible.  
> 
> I mean in the sense that no other user space can open those devices
> and no kernel driver can later be attached to them.

Ok.

> > It ensures that they're parked at the moment the group moves from
> > kernel to userspace ownership, but it can't prevent dpdk from
> > accessing and unparking those devices via peer to peer DMA.
> 
> Right, and adding all this group stuff did nothing to alert the poor
> admin that is running DPDK to this risk.

Didn't it?  Seems to me that in order to give the group to DPDK, the
admin had to find and unbind all the things in it... and is therefore
aware that they're giving everything in it to DPDK.

> > > If the administator configures the system with different security
> > > labels for different VFIO devices then yes removing groups makes this
> > > more tricky as all devices in the group should have the same label.
> > 
> > That seems a bigger problem than "more tricky".  How would you propose
> > addressing this with your device-first model?
> 
> You put the same security labels you'd put on the group to the devices
> that consitute the group. It is only more tricky in the sense that the
> script that would have to do this will need to do more than ID the
> group to label but also ID the device members of the group and label
> their char nodes.

Well, I guess, if you take the view that root is allowed to break the
kernel.  I tend to prefer that although root can obviously break the
kernel if they intend to, we should make it hard to do by accident -
which in this case would mean the kernel *enforcing* that the devices
in the group have the same security labels, which I can

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-27 Thread David Gibson
On Tue, May 25, 2021 at 04:52:57PM -0300, Jason Gunthorpe wrote:
> On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> 
> > 2. iommu backed mdev devices for SRIOV where mdev device is created per
> > VF (mdev device == VF device) then that mdev device has same iommu
> > protection scope as VF associated to it. 
> 
> This doesn't require, and certainly shouldn't create, a fake group.

It's only fake if you start with a narrow view of what a group is.  A
group is a set of devices (in the kernel sense of "device", not
necessarily the hardware sense) which can't be isolated from each
other.  The mdev device is a kernel device, and if working as intended
it can be isolated from everything else, and is therefore in an
absolute bona fide group of its own.

> Only the VF's real IOMMU group should be used to model an iommu domain
> linked to a VF. Injecting fake groups that are proxies for real groups
> only opens the possibility of security problems like David is
> concerned with.

It's not a proxy for a real group, it's a group of its own.  If you
discover that (due to a hardware bug, for example) the mdev is *not*
properly isolated from its parent PCI device, then both the mdev
virtual device *and* the physical PCI device are in the same group.
Groups including devices of different types and on different buses
were considered from the start, and are precedented, if rare.

> Max's series approaches this properly by fully linking the struct
> pci_device of the VF throughout the entire VFIO scheme, including the
> group and container, while still allowing override of various VFIO
> operations.
> 
> Jason
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-05-27 Thread David Gibson
On Wed, May 26, 2021 at 02:48:03AM +0530, Kirti Wankhede wrote:
> 
> 
> On 5/26/2021 1:22 AM, Jason Gunthorpe wrote:
> > On Wed, May 26, 2021 at 12:56:30AM +0530, Kirti Wankhede wrote:
> > 
> > > 2. iommu backed mdev devices for SRIOV where mdev device is created per
> > > VF (mdev device == VF device) then that mdev device has same iommu
> > > protection scope as VF associated to it.
> > 
> > This doesn't require, and certainly shouldn't create, a fake group.
> > 
> > Only the VF's real IOMMU group should be used to model an iommu domain
> > linked to a VF. Injecting fake groups that are proxies for real groups
> > only opens the possibility of security problems like David is
> > concerned with.
> > 
> 
> I think this security issue should be addressed by letting mdev device
> inherit its parent's iommu_group, i.e. VF's iommu_group here.

No, that doesn't work.  AIUI part of the whole point of mdevs is to
allow chunks of a single PCI function to be handed out to different
places, because they're isolated from each other not by the system
IOMMU, but by a combination of MMU hardware in the device itself (e.g.
in a GPU card) and software in the mdev driver.  If mdevs inherited the
group of their parent device they wouldn't count as isolated from each
other, which they should.

> 
> Kirti
> 
> > Max's series approaches this properly by fully linking the struct
> > pci_device of the VF throughout the entire VFIO scheme, including the
> > group and container, while still allowing override of various VFIO
> > operations.
> > 
> > Jason
> > 
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

