[PATCH v8 00/13] DMA Mapping P2PDMA Pages

2022-07-10 Thread Logan Gunthorpe
Hi,

This series is the first 13 patches from my previous Userspace P2P
with the CMB series[1]. These patches just do the first part which is to
allow passing P2PDMA pages into dma_map_sg() and support it with both
dma-direct and dma-iommu. The series ends with dropping the old
pci_p2pdma_[un]map_sg() interface.

This will be followed up later with the rest of the series which has
more issues to work out.

This series is based on v5.19-rc5. A git branch is available here:

  https://github.com/sbates130272/linux-p2pmem/  p2pdma_map_v8

Logan

[1] https://lkml.kernel.org/r/20220615161233.17527-1-log...@deltatee.com

--

Changes since v7 (first half of the series only):
  - Rebase onto v5.19-rc5
  - Collect reviewed-by tags from Christoph
  - Change the first patch to use a dma_flags field in the scatterlist
instead of a bit in the page link, per an idea from Robin
  - Rework the iommu-dma implementation to store the bus addresses
in iommu_dma_map_sg() and only copy them in __finalise_sg. This
allows the code to use the same pci_p2pdma_map_segment() helper
that the dma-direct code uses and we can drop the two specialized
helpers. (Per Robin)
  - Instead of using a zero length segment which __iommu_map_sg()
ignores, explicitly skip segments marked as a dma bus address.
(per Robin)
  - Rework the iommu_dma_unmap_sg() function per a suggestion from
Robin.

Changes since v6:
  - Rebase onto v5.19-rc1
  - Rework how the pages are stored in the VMA per Jason's suggestion

Changes since v5:
  - Rebased onto v5.18-rc1 which includes Christoph's cleanup to
free_zone_device_page() (similar to Ralph's patch).
  - Fix bug with concurrent first calls to pci_p2pdma_vma_fault()
that caused a double allocation and lost p2p memory. Noticed
by Andrew Maier.
  - Collected a Reviewed-by tag from Chaitanya.
  - Numerous minor fixes to commit messages

Changes since v4:
  - Rebase onto v5.17-rc1.
  - Included Ralph Campbell's patches, which remove the ZONE_DEVICE page
reference count offset. This is just to demonstrate that this
series is compatible with that direction.
  - Added a comment in pci_p2pdma_map_sg_attrs(), per Chaitanya and
included his Reviewed-by tags.
  - Patch 1 in the last series which cleaned up scatterlist.h
has been upstreamed.
  - Dropped NEED_SG_DMA_BUS_ADDR_FLAG, seeing as "depends on" doesn't
work with selected symbols, per Christoph.
  - Switched iov_iter_get_pages_[alloc_]flags to be exported with
EXPORT_SYMBOL_GPL, per Christoph.
  - Renamed zone_device_pages_are_mergeable() to
zone_device_pages_have_same_pgmap(), per Christoph.
  - Renamed .mmap_file_open operation in nvme_ctrl_ops to
cdev_file_open(), per Christoph.

Changes since v3:
  - Add some comment and commit message cleanups I had missed for v3,
also moved the prototypes for some of the p2pdma helpers to
dma-map-ops.h (which I missed in v3 and was suggested in v2).
  - Add separate cleanup patch for scatterlist.h and change the macros
to functions. (Suggested by Chaitanya and Jason, respectively)
  - Rename sg_dma_mark_pci_p2pdma() and sg_is_dma_pci_p2pdma() to
sg_dma_mark_bus_address() and sg_is_dma_bus_address() which
is a more generic name (As requested by Jason)
  - Fixes to some comments and commit messages as suggested by Bjorn
and Jason.
  - Ensure swiotlb is not used with P2PDMA pages. (Per Jason)
  - The sgtable conversion in RDMA was split out and sent upstream
separately, the new patch is only the removal. (Per Jason)
  - Moved the FOLL_PCI_P2PDMA check outside of get_dev_pagemap() as
Jason suggested this will be removed in the near term.
  - Add two patches to ensure that zone device pages with different
pgmaps are never merged in the block layer or
sg_alloc_append_table_from_pages() (Per Jason)
  - Ensure synchronize_rcu() or call_rcu() is used before returning
pages to the genalloc. (Jason pointed out that pages are not
guaranteed to be unused in all architectures until at least
after an RCU grace period, and that synchronize_rcu() was likely
too slow to use in the vma close operation.)
  - Collected Acks and Reviews by Bjorn, Jason and Max.

--

Logan Gunthorpe (13):
  lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
  PCI/P2PDMA: Attempt to set map_type if it has not been set
  PCI/P2PDMA: Introduce helpers for dma_map_sg implementations
  dma-mapping: allow EREMOTEIO return code for P2PDMA transfers
  dma-direct: support PCI P2PDMA pages in dma-direct map_sg
  dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support
  iommu: Explicitly skip bus address marked segments in __iommu_map_sg()
  iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg
  nvme-pci: check DMA ops when indicating support for PCI P2PDMA
  nvme-pci: convert to using dma_map_sgtable()
  RDMA/core: introduce ib_dma_pci_p2p_dma_supported()
  RDMA/rw: drop pci_p2pdma_[un]map_sg()
  PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()

[PATCH v8 08/13] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg

2022-07-10 Thread Logan Gunthorpe
Call pci_p2pdma_map_segment() when a PCI P2PDMA page is seen so the bus
address is set in the dma address and the segment is marked with
sg_dma_mark_bus_address(). iommu_map_sg() will then skip these segments.
Then, in __finalise_sg(), copy the dma address from the input segment
to the output segment. __invalidate_sg() must also learn to skip these
segments.

A P2PDMA page may have three possible outcomes when being mapped:
  1) If the data path between the two devices doesn't go through
 the root port, then it should be mapped with a PCI bus address
  2) If the data path goes through the host bridge, it should be mapped
 normally with an IOMMU IOVA.
  3) It is not possible for the two devices to communicate and thus
 the mapping operation should fail (and it will return -EREMOTEIO).

Similar to dma-direct, the sg_dma_mark_bus_address() flag is used to
indicate bus address segments. On unmap, P2PDMA segments are skipped
over when determining the start and end IOVA addresses.

With this change, the flags variable in the dma_map_ops is set to
DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.

Signed-off-by: Logan Gunthorpe 
---
 drivers/iommu/dma-iommu.c | 99 +--
 1 file changed, 85 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f90251572a5d..c079836ed2fc 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1053,15 +1054,30 @@ static int __finalise_sg(struct device *dev, struct 
scatterlist *sg, int nents,
 
for_each_sg(sg, s, nents, i) {
/* Restore this segment's original unaligned fields first */
+   dma_addr_t s_dma_addr = sg_dma_address(s);
unsigned int s_iova_off = sg_dma_address(s);
unsigned int s_length = sg_dma_len(s);
unsigned int s_iova_len = s->length;
 
-   s->offset += s_iova_off;
-   s->length = s_length;
sg_dma_address(s) = DMA_MAPPING_ERROR;
sg_dma_len(s) = 0;
 
+   if (sg_is_dma_bus_address(s)) {
+   if (i > 0)
+   cur = sg_next(cur);
+
+   sg_dma_unmark_bus_address(s);
+   sg_dma_address(cur) = s_dma_addr;
+   sg_dma_len(cur) = s_length;
+   sg_dma_mark_bus_address(cur);
+   count++;
+   cur_len = 0;
+   continue;
+   }
+
+   s->offset += s_iova_off;
+   s->length = s_length;
+
/*
 * Now fill in the real DMA data. If...
 * - there is a valid output segment to append to
@@ -1102,10 +1118,14 @@ static void __invalidate_sg(struct scatterlist *sg, int 
nents)
int i;
 
for_each_sg(sg, s, nents, i) {
-   if (sg_dma_address(s) != DMA_MAPPING_ERROR)
-   s->offset += sg_dma_address(s);
-   if (sg_dma_len(s))
-   s->length = sg_dma_len(s);
+   if (sg_is_dma_bus_address(s)) {
+   sg_dma_unmark_bus_address(s);
+   } else {
+   if (sg_dma_address(s) != DMA_MAPPING_ERROR)
+   s->offset += sg_dma_address(s);
+   if (sg_dma_len(s))
+   s->length = sg_dma_len(s);
+   }
sg_dma_address(s) = DMA_MAPPING_ERROR;
sg_dma_len(s) = 0;
}
@@ -1158,6 +1178,8 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
	struct iova_domain *iovad = &cookie->iovad;
struct scatterlist *s, *prev = NULL;
int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
+   struct pci_p2pdma_map_state p2pdma_state = {};
+   enum pci_p2pdma_map_type map;
dma_addr_t iova;
size_t iova_len = 0;
unsigned long mask = dma_get_seg_boundary(dev);
@@ -1187,6 +1209,30 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
size_t s_length = s->length;
size_t pad_len = (mask - iova_len + 1) & mask;
 
+   if (is_pci_p2pdma_page(sg_page(s))) {
+   map = pci_p2pdma_map_segment(&p2pdma_state, dev, s);
+   switch (map) {
+   case PCI_P2PDMA_MAP_BUS_ADDR:
+   /*
+* iommu_map_sg() will skip this segment as
+* it is marked as a bus address,
+* __finalise_sg() will copy the dma address
+* into the output segment.
+*/
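
The hunk is cut off above in this archive. As a rough sketch only (not the
literal patch text, and with the error-label name taken from the existing
function), the remainder of the per-segment handling mirrors the dma-direct
conversion from patch 5 of this series:

	if (is_pci_p2pdma_page(sg_page(s))) {
		map = pci_p2pdma_map_segment(&p2pdma_state, dev, s);
		switch (map) {
		case PCI_P2PDMA_MAP_BUS_ADDR:
			/*
			 * The helper already stored the bus address and
			 * marked the segment; __iommu_map_sg() skips it and
			 * __finalise_sg() copies the address to the output.
			 */
			continue;
		case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
			/* Map through the IOMMU like any other page. */
			break;
		default:
			ret = -EREMOTEIO;
			goto out_restore_sg;
		}
	}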

[PATCH v8 03/13] PCI/P2PDMA: Introduce helpers for dma_map_sg implementations

2022-07-10 Thread Logan Gunthorpe
Add pci_p2pdma_map_segment() as a helper for dma_map_sg()
implementations. It takes a scatterlist segment that must point to a
pci_p2pdma struct page and will map it if the mapping requires a bus
address.

The return value indicates whether the mapping required a bus address
or whether the caller still needs to map the segment normally. If the
segment should not be mapped, -EREMOTEIO is returned.

This helper uses a state structure to track the changes to the
pgmap across calls and avoid needing to lookup into the xarray for
every page.

The prototype for the helper is added to dma-map-ops.h as it is only
useful to dma map implementations and doesn't need to pollute the public
pci-p2pdma header.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
Reviewed-by: Christoph Hellwig 
---
 drivers/pci/p2pdma.c| 44 +-
 include/linux/dma-map-ops.h | 53 +
 2 files changed, 90 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4e8bc457e29a..5d2538aa0778 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -10,6 +10,7 @@
 
 #define pr_fmt(fmt) "pci-p2pdma: " fmt
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -20,13 +21,6 @@
 #include 
 #include 
 
-enum pci_p2pdma_map_type {
-   PCI_P2PDMA_MAP_UNKNOWN = 0,
-   PCI_P2PDMA_MAP_NOT_SUPPORTED,
-   PCI_P2PDMA_MAP_BUS_ADDR,
-   PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
-};
-
 struct pci_p2pdma {
struct gen_pool *pool;
bool p2pmem_published;
@@ -944,6 +938,42 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct 
scatterlist *sg,
 }
 EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
 
+/**
+ * pci_p2pdma_map_segment - map an sg segment determining the mapping type
+ * @state: State structure that should be declared outside of the for_each_sg()
+ * loop and initialized to zero.
+ * @dev: DMA device that's doing the mapping operation
+ * @sg: scatterlist segment to map
+ *
+ * This is a helper to be used by non-IOMMU dma_map_sg() implementations where
+ * the sg segment is the same for the page_link and the dma_address.
+ *
+ * Attempt to map a single segment in an SGL with the PCI bus address.
+ * The segment must point to a PCI P2PDMA page and thus must be
+ * wrapped in an is_pci_p2pdma_page(sg_page(sg)) check.
+ *
+ * Returns the type of mapping used and maps the page if the type is
+ * PCI_P2PDMA_MAP_BUS_ADDR.
+ */
+enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+  struct scatterlist *sg)
+{
+   if (state->pgmap != sg_page(sg)->pgmap) {
+   state->pgmap = sg_page(sg)->pgmap;
+   state->map = pci_p2pdma_map_type(state->pgmap, dev);
+   state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
+   }
+
+   if (state->map == PCI_P2PDMA_MAP_BUS_ADDR) {
+   sg->dma_address = sg_phys(sg) + state->bus_off;
+   sg_dma_len(sg) = sg->length;
+   sg_dma_mark_bus_address(sg);
+   }
+
+   return state->map;
+}
+
 /**
  * pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
  * to enable p2pdma
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d5b06b3a4a6..df27ee3c9afc 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -379,4 +379,57 @@ static inline void debug_dma_dump_mappings(struct device 
*dev)
 
 extern const struct dma_map_ops dma_dummy_ops;
 
+enum pci_p2pdma_map_type {
+   /*
+* PCI_P2PDMA_MAP_UNKNOWN: Used internally for indicating the mapping
+* type hasn't been calculated yet. Functions that return this enum
+* never return this value.
+*/
+   PCI_P2PDMA_MAP_UNKNOWN = 0,
+
+   /*
+* PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
+* traverse the host bridge and the host bridge is not in the
+* allowlist. DMA Mapping routines should return an error when
+* this is returned.
+*/
+   PCI_P2PDMA_MAP_NOT_SUPPORTED,
+
+   /*
+* PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to
+* each other directly through a PCI switch and the transaction will
+* not traverse the host bridge. Such a mapping should program
+* the DMA engine with PCI bus addresses.
+*/
+   PCI_P2PDMA_MAP_BUS_ADDR,
+
+   /*
+* PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
+* to each other, but the transaction traverses a host bridge on the
+* allowlist. In this case, a normal mapping either with CPU physical
+* addresses (in the case of dma-direct) or IOVA addresses (in the
+* case of IOMMUs) should be used to program the DMA engine.
+*/
+   PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
+};
+
+struct pci_p2pdma_map_state {
+   str
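
The definition of struct pci_p2pdma_map_state is truncated above. Judging
from how pci_p2pdma_map_segment() uses it (state->pgmap, state->map and
state->bus_off), the cached state is roughly the following; treat this as a
reconstruction rather than the patch text:

	struct pci_p2pdma_map_state {
		struct dev_pagemap *pgmap;	/* pgmap of the last segment seen */
		enum pci_p2pdma_map_type map;	/* cached mapping type for that pgmap */
		u64 bus_off;			/* bus offset used for bus-address mappings */
	};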

[PATCH v8 06/13] dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support

2022-07-10 Thread Logan Gunthorpe
Add a flags member to the dma_map_ops structure with one flag to
indicate support for PCI P2PDMA.

Also, add a helper to check if a device supports PCI P2PDMA.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Christoph Hellwig 
---
 include/linux/dma-map-ops.h | 10 ++
 include/linux/dma-mapping.h |  5 +
 kernel/dma/mapping.c| 18 ++
 3 files changed, 33 insertions(+)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index df27ee3c9afc..a349dd761189 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -11,7 +11,17 @@
 
 struct cma;
 
+/*
+ * Values for struct dma_map_ops.flags:
+ *
+ * DMA_F_PCI_P2PDMA_SUPPORTED: Indicates the dma_map_ops implementation can
+ * handle PCI P2PDMA pages in the map_sg/unmap_sg operation.
+ */
+#define DMA_F_PCI_P2PDMA_SUPPORTED (1 << 0)
+
 struct dma_map_ops {
+   unsigned int flags;
+
void *(*alloc)(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t gfp,
unsigned long attrs);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..f7c61b2b4b5e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -140,6 +140,7 @@ int dma_mmap_attrs(struct device *dev, struct 
vm_area_struct *vma,
unsigned long attrs);
 bool dma_can_mmap(struct device *dev);
 int dma_supported(struct device *dev, u64 mask);
+bool dma_pci_p2pdma_supported(struct device *dev);
 int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
@@ -250,6 +251,10 @@ static inline int dma_supported(struct device *dev, u64 
mask)
 {
return 0;
 }
+static inline bool dma_pci_p2pdma_supported(struct device *dev)
+{
+   return false;
+}
 static inline int dma_set_mask(struct device *dev, u64 mask)
 {
return -EIO;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 20e70fa71091..147357586f90 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -723,6 +723,24 @@ int dma_supported(struct device *dev, u64 mask)
 }
 EXPORT_SYMBOL(dma_supported);
 
+bool dma_pci_p2pdma_supported(struct device *dev)
+{
+   const struct dma_map_ops *ops = get_dma_ops(dev);
+
+   /* if ops is not set, dma direct will be used which supports P2PDMA */
+   if (!ops)
+   return true;
+
+   /*
+* Note: dma_ops_bypass is not checked here because P2PDMA should
+* not be used with dma mapping ops that do not have support even
+* if the specific device is bypassing them.
+*/
+
+   return ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;
+}
+EXPORT_SYMBOL_GPL(dma_pci_p2pdma_supported);
+
 #ifdef CONFIG_ARCH_HAS_DMA_SET_MASK
 void arch_dma_set_mask(struct device *dev, u64 mask);
 #else
-- 
2.30.2

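As a usage sketch (mirroring the nvme change later in this series, with
illustrative variable names), a driver gates P2PDMA on this helper instead
of a hard-coded capability flag:

	/* Only advertise P2PDMA if the device's DMA ops can handle it */
	if (dma_pci_p2pdma_supported(&pdev->dev))
		blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);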


[PATCH v8 04/13] dma-mapping: allow EREMOTEIO return code for P2PDMA transfers

2022-07-10 Thread Logan Gunthorpe
Add EREMOTEIO error return to dma_map_sgtable() which will be used
by .map_sg() implementations that detect P2PDMA pages that the
underlying DMA device cannot access.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Christoph Hellwig 
---
 kernel/dma/mapping.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index db7244291b74..20e70fa71091 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -197,7 +197,7 @@ static int __dma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg,
if (ents > 0)
debug_dma_map_sg(dev, sg, nents, ents, dir, attrs);
else if (WARN_ON_ONCE(ents != -EINVAL && ents != -ENOMEM &&
- ents != -EIO))
+ ents != -EIO && ents != -EREMOTEIO))
return -EIO;
 
return ents;
@@ -255,6 +255,9 @@ EXPORT_SYMBOL(dma_map_sg_attrs);
  * complete the mapping. Should succeed if retried later.
  *   -EIO  Legacy error code with an unknown meaning. eg. this is
  * returned if a lower level call returned DMA_MAPPING_ERROR.
+ *   -EREMOTEIOThe DMA device cannot access P2PDMA memory specified in
+ * the sg_table. This will not succeed if retried.
+ *
  */
 int dma_map_sgtable(struct device *dev, struct sg_table *sgt,
enum dma_data_direction dir, unsigned long attrs)
-- 
2.30.2



[PATCH v8 10/13] nvme-pci: convert to using dma_map_sgtable()

2022-07-10 Thread Logan Gunthorpe
The dma_map operations now support P2PDMA pages directly. So remove
the calls to pci_p2pdma_[un]map_sg_attrs() and replace them with calls
to dma_map_sgtable().

dma_map_sgtable() returns more complete error codes than dma_map_sg()
and allows differentiating EREMOTEIO errors in case an unsupported
P2PDMA transfer is requested. When this happens, return BLK_STS_TARGET
so the request isn't retried.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Max Gurtovoy 
Reviewed-by: Chaitanya Kulkarni 
Reviewed-by: Christoph Hellwig 
---
 drivers/nvme/host/pci.c | 69 +
 1 file changed, 29 insertions(+), 40 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 0f556e954ffc..696434658e53 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -230,11 +230,10 @@ struct nvme_iod {
bool use_sgl;
int aborted;
int npages; /* In the PRP list. 0 means small pool in use */
-   int nents;  /* Used in scatterlist */
dma_addr_t first_dma;
unsigned int dma_len;   /* length of single DMA segment mapping */
dma_addr_t meta_dma;
-   struct scatterlist *sg;
+   struct sg_table sgt;
 };
 
 static inline unsigned int nvme_dbbuf_size(struct nvme_dev *dev)
@@ -524,7 +523,7 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
 static void **nvme_pci_iod_list(struct request *req)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-   return (void **)(iod->sg + blk_rq_nr_phys_segments(req));
+   return (void **)(iod->sgt.sgl + blk_rq_nr_phys_segments(req));
 }
 
 static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req)
@@ -576,17 +575,6 @@ static void nvme_free_sgls(struct nvme_dev *dev, struct 
request *req)
}
 }
 
-static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req)
-{
-   struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-
-   if (is_pci_p2pdma_page(sg_page(iod->sg)))
-   pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
-   rq_dma_dir(req));
-   else
-   dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
-}
-
 static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -597,9 +585,10 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct 
request *req)
return;
}
 
-   WARN_ON_ONCE(!iod->nents);
+   WARN_ON_ONCE(!iod->sgt.nents);
+
+   dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
 
-   nvme_unmap_sg(dev, req);
if (iod->npages == 0)
dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
  iod->first_dma);
@@ -607,7 +596,7 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct 
request *req)
nvme_free_sgls(dev, req);
else
nvme_free_prps(dev, req);
-   mempool_free(iod->sg, dev->iod_mempool);
+   mempool_free(iod->sgt.sgl, dev->iod_mempool);
 }
 
 static void nvme_print_sgl(struct scatterlist *sgl, int nents)
@@ -630,7 +619,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev 
*dev,
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
int length = blk_rq_payload_bytes(req);
-   struct scatterlist *sg = iod->sg;
+   struct scatterlist *sg = iod->sgt.sgl;
int dma_len = sg_dma_len(sg);
u64 dma_addr = sg_dma_address(sg);
int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
@@ -703,16 +692,16 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev 
*dev,
dma_len = sg_dma_len(sg);
}
 done:
-   cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
+   cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
return BLK_STS_OK;
 free_prps:
nvme_free_prps(dev, req);
return BLK_STS_RESOURCE;
 bad_sgl:
-   WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
+   WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
"Invalid SGL for payload:%d nents:%d\n",
-   blk_rq_payload_bytes(req), iod->nents);
+   blk_rq_payload_bytes(req), iod->sgt.nents);
return BLK_STS_IOERR;
 }
 
@@ -738,12 +727,13 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc 
*sge,
 }
 
 static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
-   struct request *req, struct nvme_rw_command *cmd, int entries)
+   struct request *req, struct nvme_rw_command *cmd)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
struct nvme_sgl_desc *sg_list;
-   struct scatterlist *sg = iod-&
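
The diff is truncated above. The error translation described in the commit
message amounts to something like the following sketch in nvme_map_data()
(not the literal hunk; the label name is illustrative):

	blk_status_t ret = BLK_STS_RESOURCE;	/* default: retryable */
	int rc;

	rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req),
			     DMA_ATTR_NO_WARN);
	if (rc) {
		if (rc == -EREMOTEIO)
			ret = BLK_STS_TARGET;	/* unsupported P2PDMA path, don't retry */
		goto out_free_sg;
	}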

[PATCH v8 05/13] dma-direct: support PCI P2PDMA pages in dma-direct map_sg

2022-07-10 Thread Logan Gunthorpe
Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
PCI P2PDMA pages directly without a hack in the callers. This allows
for heterogeneous SGLs that contain both P2PDMA and regular pages.

A P2PDMA page may have three possible outcomes when being mapped:
  1) If the data path between the two devices doesn't go through the
 root port, then it should be mapped with a PCI bus address
  2) If the data path goes through the host bridge, it should be mapped
 normally, as though it were a CPU physical address
  3) It is not possible for the two devices to communicate and thus
 the mapping operation should fail (and it will return -EREMOTEIO).

SGL segments that contain PCI bus addresses are marked with
sg_dma_mark_bus_address() and are ignored when unmapped.

P2PDMA mappings are also failed if swiotlb needs to be used on the
mapping.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Christoph Hellwig 
---
 kernel/dma/direct.c | 43 +--
 kernel/dma/direct.h |  8 +++-
 2 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 8d0b68a17042..63859a101ed8 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -453,29 +453,60 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
arch_sync_dma_for_cpu_all();
 }
 
+/*
+ * Unmaps segments, except for ones marked as pci_p2pdma which do not
+ * require any further action as they contain a bus address.
+ */
 void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
int nents, enum dma_data_direction dir, unsigned long attrs)
 {
struct scatterlist *sg;
int i;
 
-   for_each_sg(sgl, sg, nents, i)
-   dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
-attrs);
+   for_each_sg(sgl, sg, nents, i) {
+   if (sg_is_dma_bus_address(sg))
+   sg_dma_unmark_bus_address(sg);
+   else
+   dma_direct_unmap_page(dev, sg->dma_address,
+ sg_dma_len(sg), dir, attrs);
+   }
 }
 #endif
 
 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
enum dma_data_direction dir, unsigned long attrs)
 {
-   int i;
+   struct pci_p2pdma_map_state p2pdma_state = {};
+   enum pci_p2pdma_map_type map;
struct scatterlist *sg;
+   int i, ret;
 
for_each_sg(sgl, sg, nents, i) {
+   if (is_pci_p2pdma_page(sg_page(sg))) {
+   map = pci_p2pdma_map_segment(&p2pdma_state, dev, sg);
+   switch (map) {
+   case PCI_P2PDMA_MAP_BUS_ADDR:
+   continue;
+   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+   /*
+* Any P2P mapping that traverses the PCI
+* host bridge must be mapped with CPU physical
+* address and not PCI bus addresses. This is
+* done with dma_direct_map_page() below.
+*/
+   break;
+   default:
+   ret = -EREMOTEIO;
+   goto out_unmap;
+   }
+   }
+
sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
sg->offset, sg->length, dir, attrs);
-   if (sg->dma_address == DMA_MAPPING_ERROR)
+   if (sg->dma_address == DMA_MAPPING_ERROR) {
+   ret = -EIO;
goto out_unmap;
+   }
sg_dma_len(sg) = sg->length;
}
 
@@ -483,7 +514,7 @@ int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl, int nents,
 
 out_unmap:
dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
-   return -EIO;
+   return ret;
 }
 
 dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index a78c0ba70645..e38ffc5e6bdd 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -8,6 +8,7 @@
 #define _KERNEL_DMA_DIRECT_H
 
 #include 
+#include 
 
 int dma_direct_get_sgtable(struct device *dev, struct sg_table *sgt,
void *cpu_addr, dma_addr_t dma_addr, size_t size,
@@ -87,10 +88,15 @@ static inline dma_addr_t dma_direct_map_page(struct device 
*dev,
phys_addr_t phys = page_to_phys(page) + offset;
dma_addr_t dma_addr = phys_to_dma(dev, phys);
 
-   if (is_swiotlb_force_bounce(dev))
+   if (is_swiotlb_force_bounce(dev)) {
+   if (is_pci_p2pdma_page(page))
+   return DMA_MAPPING_ERROR;
return swiotlb_map(dev

[PATCH v8 09/13] nvme-pci: check DMA ops when indicating support for PCI P2PDMA

2022-07-10 Thread Logan Gunthorpe
Introduce a supports_pci_p2pdma() operation in nvme_ctrl_ops to
replace the fixed NVME_F_PCI_P2PDMA flag such that the dma_map_ops
flags can be checked for PCI P2PDMA support.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
Reviewed-by: Christoph Hellwig 
---
 drivers/nvme/host/core.c |  3 ++-
 drivers/nvme/host/nvme.h |  2 +-
 drivers/nvme/host/pci.c  | 12 +---
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ec6ac298d8de..2831e248dd71 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3996,7 +3996,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, 
unsigned nsid,
blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
 
blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
-   if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
+   if (ctrl->ops->supports_pci_p2pdma &&
+   ctrl->ops->supports_pci_p2pdma(ctrl))
blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
 
ns->ctrl = ctrl;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 5558f8812157..ea097bb1edde 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -495,7 +495,6 @@ struct nvme_ctrl_ops {
unsigned int flags;
 #define NVME_F_FABRICS (1 << 0)
 #define NVME_F_METADATA_SUPPORTED  (1 << 1)
-#define NVME_F_PCI_P2PDMA  (1 << 2)
int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
@@ -505,6 +504,7 @@ struct nvme_ctrl_ops {
void (*stop_ctrl)(struct nvme_ctrl *ctrl);
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
void (*print_device_info)(struct nvme_ctrl *ctrl);
+   bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e7af2234e53b..0f556e954ffc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2984,7 +2984,6 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, 
char *buf, int size)
	return snprintf(buf, size, "%s\n", dev_name(&pdev->dev));
 }
 
-
 static void nvme_pci_print_device_info(struct nvme_ctrl *ctrl)
 {
struct pci_dev *pdev = to_pci_dev(to_nvme_dev(ctrl)->dev);
@@ -2999,11 +2998,17 @@ static void nvme_pci_print_device_info(struct nvme_ctrl 
*ctrl)
subsys->firmware_rev);
 }
 
+static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
+{
+   struct nvme_dev *dev = to_nvme_dev(ctrl);
+
+   return dma_pci_p2pdma_supported(dev->dev);
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name   = "pcie",
.module = THIS_MODULE,
-   .flags  = NVME_F_METADATA_SUPPORTED |
- NVME_F_PCI_P2PDMA,
+   .flags  = NVME_F_METADATA_SUPPORTED,
.reg_read32 = nvme_pci_reg_read32,
.reg_write32= nvme_pci_reg_write32,
.reg_read64 = nvme_pci_reg_read64,
@@ -3011,6 +3016,7 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.submit_async_event = nvme_pci_submit_async_event,
.get_address= nvme_pci_get_address,
.print_device_info  = nvme_pci_print_device_info,
+   .supports_pci_p2pdma= nvme_pci_supports_pci_p2pdma,
 };
 
 static int nvme_dev_map(struct nvme_dev *dev)
-- 
2.30.2



[PATCH v8 12/13] RDMA/rw: drop pci_p2pdma_[un]map_sg()

2022-07-10 Thread Logan Gunthorpe
dma_map_sg() now supports the use of P2PDMA pages so pci_p2pdma_map_sg()
is no longer necessary and may be dropped. This also makes the
rdma_rw_[un]map_sg() helpers unnecessary, so remove them.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Christoph Hellwig 
---
 drivers/infiniband/core/rw.c | 45 
 1 file changed, 9 insertions(+), 36 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 4d98f931a13d..8367974b7998 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -274,33 +274,6 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
return 1;
 }
 
-static void rdma_rw_unmap_sg(struct ib_device *dev, struct scatterlist *sg,
-u32 sg_cnt, enum dma_data_direction dir)
-{
-   if (is_pci_p2pdma_page(sg_page(sg)))
-   pci_p2pdma_unmap_sg(dev->dma_device, sg, sg_cnt, dir);
-   else
-   ib_dma_unmap_sg(dev, sg, sg_cnt, dir);
-}
-
-static int rdma_rw_map_sgtable(struct ib_device *dev, struct sg_table *sgt,
-  enum dma_data_direction dir)
-{
-   int nents;
-
-   if (is_pci_p2pdma_page(sg_page(sgt->sgl))) {
-   if (WARN_ON_ONCE(ib_uses_virt_dma(dev)))
-   return 0;
-   nents = pci_p2pdma_map_sg(dev->dma_device, sgt->sgl,
- sgt->orig_nents, dir);
-   if (!nents)
-   return -EIO;
-   sgt->nents = nents;
-   return 0;
-   }
-   return ib_dma_map_sgtable_attrs(dev, sgt, dir, 0);
-}
-
 /**
  * rdma_rw_ctx_init - initialize a RDMA READ/WRITE context
  * @ctx:   context to initialize
@@ -327,7 +300,7 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp 
*qp, u32 port_num,
};
int ret;
 
-   ret = rdma_rw_map_sgtable(dev, &sgt, dir);
+   ret = ib_dma_map_sgtable_attrs(dev, &sgt, dir, 0);
if (ret)
return ret;
sg_cnt = sgt.nents;
@@ -366,7 +339,7 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp 
*qp, u32 port_num,
return ret;
 
 out_unmap_sg:
-   rdma_rw_unmap_sg(dev, sgt.sgl, sgt.orig_nents, dir);
+   ib_dma_unmap_sgtable_attrs(dev, &sgt, dir, 0);
return ret;
 }
 EXPORT_SYMBOL(rdma_rw_ctx_init);
@@ -414,12 +387,12 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
return -EINVAL;
}
 
-   ret = rdma_rw_map_sgtable(dev, &sgt, dir);
+   ret = ib_dma_map_sgtable_attrs(dev, &sgt, dir, 0);
if (ret)
return ret;
 
if (prot_sg_cnt) {
-   ret = rdma_rw_map_sgtable(dev, &prot_sgt, dir);
+   ret = ib_dma_map_sgtable_attrs(dev, &prot_sgt, dir, 0);
if (ret)
goto out_unmap_sg;
}
@@ -486,9 +459,9 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
kfree(ctx->reg);
 out_unmap_prot_sg:
if (prot_sgt.nents)
-   rdma_rw_unmap_sg(dev, prot_sgt.sgl, prot_sgt.orig_nents, dir);
+   ib_dma_unmap_sgtable_attrs(dev, &prot_sgt, dir, 0);
 out_unmap_sg:
-   rdma_rw_unmap_sg(dev, sgt.sgl, sgt.orig_nents, dir);
+   ib_dma_unmap_sgtable_attrs(dev, &sgt, dir, 0);
return ret;
 }
 EXPORT_SYMBOL(rdma_rw_ctx_signature_init);
@@ -621,7 +594,7 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct 
ib_qp *qp,
break;
}
 
-   rdma_rw_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+   ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy);
 
@@ -649,8 +622,8 @@ void rdma_rw_ctx_destroy_signature(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
kfree(ctx->reg);
 
if (prot_sg_cnt)
-   rdma_rw_unmap_sg(qp->pd->device, prot_sg, prot_sg_cnt, dir);
-   rdma_rw_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+   ib_dma_unmap_sg(qp->pd->device, prot_sg, prot_sg_cnt, dir);
+   ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy_signature);
 
-- 
2.30.2



[PATCH v8 01/13] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL

2022-07-10 Thread Logan Gunthorpe
Introduce a dma_flags field in struct scatterlist. These flags will be
used by dma_[un]map_sg_p2pdma() to determine when a given SGL segment's
dma_address points to a PCI bus address. dma_unmap_sg_p2pdma() will need
to perform different cleanup when a segment is marked as a bus address.

The dma_flags field will fit in the existing padding on 64BIT systems
(assuming CONFIG_NEED_SG_DMA_LENGTH is also set).

The new bit will only be used when CONFIG_PCI_P2PDMA is set; this means
PCI P2PDMA will require CONFIG_64BIT. This should be acceptable as the
majority of P2PDMA use cases are restricted to newer root complexes and
roughly require the extra address space for memory BARs used in the
transactions.

Signed-off-by: Logan Gunthorpe 
---
 drivers/pci/Kconfig |  5 +++
 include/linux/scatterlist.h | 69 +
 2 files changed, 74 insertions(+)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 133c73207782..5cc7cba1941f 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -164,6 +164,11 @@ config PCI_PASID
 config PCI_P2PDMA
bool "PCI peer-to-peer transfer support"
depends on ZONE_DEVICE
+   #
+   # The need for the scatterlist DMA bus address flag means PCI P2PDMA
+   # requires 64bit
+   #
+   depends on 64BIT
select GENERIC_ALLOCATOR
help
	  Enables drivers to do PCI peer-to-peer transactions to and from
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 7ff9d6386c12..375a5e90d86a 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -16,6 +16,9 @@ struct scatterlist {
 #ifdef CONFIG_NEED_SG_DMA_LENGTH
unsigned intdma_length;
 #endif
+#ifdef CONFIG_PCI_P2PDMA
+   unsigned intdma_flags;
+#endif
 };
 
 /*
@@ -245,6 +248,72 @@ static inline void sg_unmark_end(struct scatterlist *sg)
sg->page_link &= ~SG_END;
 }
 
+/*
+ * CONFIG_PCI_P2PDMA depends on CONFIG_64BIT which means there are 4 bytes
+ * in struct scatterlist (assuming also CONFIG_NEED_SG_DMA_LENGTH is set).
+ * Use this padding for DMA flags bits to indicate when a specific
+ * dma address is a bus address.
+ */
+#ifdef CONFIG_PCI_P2PDMA
+
+#define SG_DMA_BUS_ADDRESS (1 << 0)
+
+/**
+ * sg_is_dma_bus_address - Return whether a given segment was marked
+ *as a bus address
+ * @sg: SG entry
+ *
+ * Description:
+ *   Returns true if sg_dma_mark_bus_address() has been called on
+ *   this segment.
+ **/
+static inline bool sg_is_dma_bus_address(struct scatterlist *sg)
+{
+   return sg->dma_flags & SG_DMA_BUS_ADDRESS;
+}
+
+/**
+ * sg_dma_mark_bus_address - Mark the scatterlist entry as a bus address
+ * @sg: SG entry
+ *
+ * Description:
+ *   Marks the passed in sg entry to indicate that the dma_address is
+ *   a bus address and doesn't need to be unmapped. This should only be
+ *   used by dma_map_sg() implementations to mark bus addresses
+ *   so they can be properly cleaned up in dma_unmap_sg().
+ **/
+static inline void sg_dma_mark_bus_address(struct scatterlist *sg)
+{
+   sg->dma_flags |= SG_DMA_BUS_ADDRESS;
+}
+
+/**
+ * sg_dma_unmark_bus_address - Unmark the scatterlist entry as a bus address
+ * @sg: SG entry
+ *
+ * Description:
+ *   Clears the bus address mark.
+ **/
+static inline void sg_dma_unmark_bus_address(struct scatterlist *sg)
+{
+   sg->dma_flags &= ~SG_DMA_BUS_ADDRESS;
+}
+
+#else
+
+static inline bool sg_is_dma_bus_address(struct scatterlist *sg)
+{
+   return false;
+}
+static inline void sg_dma_mark_bus_address(struct scatterlist *sg)
+{
+}
+static inline void sg_dma_unmark_bus_address(struct scatterlist *sg)
+{
+}
+
+#endif
+
 /**
  * sg_phys - Return physical address of an sg entry
  * @sg: SG entry
-- 
2.30.2

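As a quick illustration of how the new flag is consumed on the unmap side
(this is essentially the dma-direct unmap conversion from patch 5), bus
address segments are unmarked and otherwise left alone:

	for_each_sg(sgl, sg, nents, i) {
		if (sg_is_dma_bus_address(sg))
			sg_dma_unmark_bus_address(sg);	/* bus address: nothing to unmap */
		else
			dma_direct_unmap_page(dev, sg->dma_address,
					      sg_dma_len(sg), dir, attrs);
	}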

[PATCH v8 07/13] iommu: Explicitly skip bus address marked segments in __iommu_map_sg()

2022-07-10 Thread Logan Gunthorpe
In order to support PCI P2PDMA mappings with dma-iommu, explicitly skip
any segments marked with sg_dma_mark_bus_address() in __iommu_map_sg().

These segments should not be mapped into the IOVA and will be handled
separately by a subsequent patch for dma-iommu.

Signed-off-by: Logan Gunthorpe 
---
 drivers/iommu/iommu.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 847ad47a2dfd..2844a3e02a89 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2457,6 +2457,9 @@ static ssize_t __iommu_map_sg(struct iommu_domain 
*domain, unsigned long iova,
len = 0;
}
 
+   if (sg_is_dma_bus_address(sg))
+   goto next;
+
if (len) {
len += sg->length;
} else {
@@ -2464,6 +2467,7 @@ static ssize_t __iommu_map_sg(struct iommu_domain 
*domain, unsigned long iova,
start = s_phys;
}
 
+next:
if (++i < nents)
sg = sg_next(sg);
}
-- 
2.30.2



[PATCH v8 13/13] PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()

2022-07-10 Thread Logan Gunthorpe
This interface is superseded by support in dma_map_sg() which now supports
heterogeneous scatterlists. There are no longer any users, so remove it.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Max Gurtovoy 
Reviewed-by: Christoph Hellwig 
---
 drivers/pci/p2pdma.c   | 66 --
 include/linux/pci-p2pdma.h | 27 
 2 files changed, 93 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 5d2538aa0778..4496a7c5c478 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -872,72 +872,6 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct 
dev_pagemap *pgmap,
return type;
 }
 
-static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap,
-   struct device *dev, struct scatterlist *sg, int nents)
-{
-   struct scatterlist *s;
-   int i;
-
-   for_each_sg(sg, s, nents, i) {
-   s->dma_address = sg_phys(s) + p2p_pgmap->bus_offset;
-   sg_dma_len(s) = s->length;
-   }
-
-   return nents;
-}
-
-/**
- * pci_p2pdma_map_sg_attrs - map a PCI peer-to-peer scatterlist for DMA
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: elements in the scatterlist
- * @dir: DMA direction
- * @attrs: DMA attributes passed to dma_map_sg() (if called)
- *
- * Scatterlists mapped with this function should be unmapped using
- * pci_p2pdma_unmap_sg_attrs().
- *
- * Returns the number of SG entries mapped or 0 on error.
- */
-int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs)
-{
-   struct pci_p2pdma_pagemap *p2p_pgmap =
-   to_p2p_pgmap(sg_page(sg)->pgmap);
-
-   switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev)) {
-   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
-   return dma_map_sg_attrs(dev, sg, nents, dir, attrs);
-   case PCI_P2PDMA_MAP_BUS_ADDR:
-   return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
-   default:
-   /* Mapping is not Supported */
-   return 0;
-   }
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg_attrs);
-
-/**
- * pci_p2pdma_unmap_sg_attrs - unmap a PCI peer-to-peer scatterlist that was
- * mapped with pci_p2pdma_map_sg()
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: number of elements returned by pci_p2pdma_map_sg()
- * @dir: DMA direction
- * @attrs: DMA attributes passed to dma_unmap_sg() (if called)
- */
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs)
-{
-   enum pci_p2pdma_map_type map_type;
-
-   map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev);
-
-   if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
-   dma_unmap_sg_attrs(dev, sg, nents, dir, attrs);
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
-
 /**
  * pci_p2pdma_map_segment - map an sg segment determining the mapping type
  * @state: State structure that should be declared outside of the for_each_sg()
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 8318a97c9c61..2c07aa6b7665 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -30,10 +30,6 @@ struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev 
*pdev,
 unsigned int *nents, u32 length);
 void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
 void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
-int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs);
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs);
 int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
bool *use_p2pdma);
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
@@ -83,17 +79,6 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
 static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 {
 }
-static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
-   struct scatterlist *sg, int nents, enum dma_data_direction dir,
-   unsigned long attrs)
-{
-   return 0;
-}
-static inline void pci_p2pdma_unmap_sg_attrs(struct device *dev,
-   struct scatterlist *sg, int nents, enum dma_data_direction dir,
-   unsigned long attrs)
-{
-}
 static inline int pci_p2pdma_enable_store(const char *page,
struct pci_dev **p2p_dev, bool *use_p2pdma)
 {
@@ -119,16 +104,4 @@ static inline struct pci_dev *pci_p2pmem_find(struct 
device *client)
	return pci_p2pmem_find_many(&client, 1);
 }
 

[PATCH v8 02/13] PCI/P2PDMA: Attempt to set map_type if it has not been set

2022-07-10 Thread Logan Gunthorpe
Attempt to find the mapping type for P2PDMA pages on the first
DMA map attempt if it has not been done ahead of time.

Previously, the mapping type was expected to be calculated ahead of
time, but if pages are to come from userspace then there's no
way to ensure the path was checked ahead of time.

With this change, the mapping type is calculated on first use if it hasn't
been pre-calculated. It is therefore no longer invalid to call
pci_p2pdma_map_sg() before the mapping type is known, so drop the WARN_ON
for that case.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
Reviewed-by: Chaitanya Kulkarni 
Reviewed-by: Christoph Hellwig 
---
 drivers/pci/p2pdma.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 462b429ad243..4e8bc457e29a 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -854,6 +854,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct 
dev_pagemap *pgmap,
struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
struct pci_dev *client;
struct pci_p2pdma *p2pdma;
+   int dist;
 
if (!provider->p2pdma)
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
@@ -870,6 +871,10 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct 
dev_pagemap *pgmap,
	type = xa_to_value(xa_load(&p2pdma->map_types,
   map_types_idx(client)));
rcu_read_unlock();
+
+   if (type == PCI_P2PDMA_MAP_UNKNOWN)
+   return calc_map_type_and_dist(provider, client, &dist, true);
+
return type;
 }
 
@@ -912,7 +917,7 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg,
case PCI_P2PDMA_MAP_BUS_ADDR:
return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
default:
-   WARN_ON_ONCE(1);
+   /* Mapping is not Supported */
return 0;
}
 }
-- 
2.30.2



[PATCH v8 11/13] RDMA/core: introduce ib_dma_pci_p2p_dma_supported()

2022-07-10 Thread Logan Gunthorpe
Introduce the helper function ib_dma_pci_p2p_dma_supported() to check
if a given ib_device can be used in P2PDMA transfers. This ensures
the ib_device is not using virt_dma and also that the underlying
dma_device supports P2PDMA.

Use the new helper in nvmet-rdma to replace the existing check for
ib_uses_virt_dma(). Adding the dma_pci_p2pdma_supported() check allows
switching away from pci_p2pdma_[un]map_sg().

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Max Gurtovoy 
Reviewed-by: Christoph Hellwig 
---
 drivers/nvme/target/rdma.c |  2 +-
 include/rdma/ib_verbs.h| 11 +++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 09fdcac87d17..4597bca43a6d 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -415,7 +415,7 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device 
*ndev,
if (ib_dma_mapping_error(ndev->device, r->send_sge.addr))
goto out_free_rsp;
 
-   if (!ib_uses_virt_dma(ndev->device))
+   if (ib_dma_pci_p2p_dma_supported(ndev->device))
		r->req.p2p_client = &ndev->device->dev;
r->send_sge.length = sizeof(*r->req.cqe);
r->send_sge.lkey = ndev->pd->local_dma_lkey;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9c6317cf80d5..523843d9ed6c 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -4013,6 +4013,17 @@ static inline bool ib_uses_virt_dma(struct ib_device 
*dev)
return IS_ENABLED(CONFIG_INFINIBAND_VIRT_DMA) && !dev->dma_device;
 }
 
+/*
+ * Check if an IB device's underlying DMA mapping supports P2PDMA transfers.
+ */
+static inline bool ib_dma_pci_p2p_dma_supported(struct ib_device *dev)
+{
+   if (ib_uses_virt_dma(dev))
+   return false;
+
+   return dma_pci_p2pdma_supported(dev->dma_device);
+}
+
 /**
  * ib_dma_mapping_error - check a DMA addr for error
  * @dev: The device for which the dma_addr was created
-- 
2.30.2



Re: [PATCH v7 20/21] PCI/P2PDMA: Introduce pci_mmap_p2pmem()

2022-07-06 Thread Logan Gunthorpe



On 2022-07-06 01:04, Greg Kroah-Hartman wrote:
> On Wed, Jul 06, 2022 at 08:51:27AM +0200, Christoph Hellwig wrote:
>> On Tue, Jul 05, 2022 at 12:16:45PM -0600, Logan Gunthorpe wrote:
>>> The current version does it through a char device, but that requires
>>> creating a simple_fs and anon_inode for teardown on driver removal, plus
>>> a bunch of hooks through the driver that exposes it (NVMe, in this case)
>>> to set this all up.
>>>
>>> Christoph is suggesting a sysfs interface which could potentially avoid
>>> the anon_inode and all of the extra hooks. It has some significant
>>> benefits and maybe some small downsides, but I wouldn't describe it as
>>> horrid.
>>
>> Yeah, I don't think is is horrible, it fits in with the resource files
>> for the BARs, and solves a lot of problems.  Greg, can you explain
>> what would be so bad about it?
> 
> As you mention, you will have to pass different things down into sysfs
> in order for that to be possible.  If it matches the resource files like
> we currently have today, that might not be that bad, but it still feels
> odd to me.  Let's see an implementation and a Documentation/ABI/ entry
> first though.

I'll work something up in the coming weeks.

Thanks,

Logan


Re: [PATCH v7 20/21] PCI/P2PDMA: Introduce pci_mmap_p2pmem()

2022-07-05 Thread Logan Gunthorpe



On 2022-07-05 11:42, Greg Kroah-Hartman wrote:
> On Tue, Jul 05, 2022 at 11:32:23AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2022-07-05 11:21, Greg Kroah-Hartman wrote:
>>> On Tue, Jul 05, 2022 at 06:50:39PM +0200, Christoph Hellwig wrote:
>>>> [note for the newcomers, this is about allowing mmap()ing the PCIe
>>>> P2P memory from the generic PCI P2P code through sysfs, and more
>>>> importantly how to revoke it on device removal]
>>>
>>> We allow mmap on PCIe config space today, right?  Why is this different
>>> from what pci_create_legacy_files() does today?
>>>
>>>> On Tue, Jul 05, 2022 at 10:44:49AM -0600, Logan Gunthorpe wrote:
>>>>> We might be able to. I'm not sure. I'll have to figure out how to find
>>>>> that inode from the p2pdma code. I haven't found an obvious interface to
>>>>> do that.
>>>>
>>>> I think the right way to approach this would be a new sysfs API
>>>> that internally calls unmap_mapping_range internally instead of
>>>> exposing the inode. I suspect that might actually be the right thing
>>>> to do for iomem_inode as well.
>>>
>>> Why do we need something new and how is this any different from the PCI
>>> binary files I mention above?  We have supported PCI hotplug for a very
>>> long time, do the current PCI binary sysfs files not work properly with
>>> mmap and removing a device?
>>
>> The P2PDMA code allocates and hands out struct pages to userspace that
>> are backed with ZONE_DEVICE memory from a device's BAR. This is quite
>> different from the existing binary files mentioned above which neither
>> support struct pages nor allocation.
> 
> Why would you want to do this through a sysfs interface?  that feels
> horrid...

The current version does it through a char device, but that requires
creating a simple_fs and anon_inode for teardown on driver removal, plus
a bunch of hooks through the driver that exposes it (NVMe, in this case)
to set this all up.

Christoph is suggesting a sysfs interface which could potentially avoid
the anon_inode and all of the extra hooks. It has some significant
benefits and maybe some small downsides, but I wouldn't describe it as
horrid.

Logan


Re: [PATCH v7 20/21] PCI/P2PDMA: Introduce pci_mmap_p2pmem()

2022-07-05 Thread Logan Gunthorpe



On 2022-07-05 11:21, Greg Kroah-Hartman wrote:
> On Tue, Jul 05, 2022 at 06:50:39PM +0200, Christoph Hellwig wrote:
>> [note for the newcomers, this is about allowing mmap()ing the PCIe
>> P2P memory from the generic PCI P2P code through sysfs, and more
>> importantly how to revoke it on device removal]
> 
> We allow mmap on PCIe config space today, right?  Why is this different
> from what pci_create_legacy_files() does today?
> 
>> On Tue, Jul 05, 2022 at 10:44:49AM -0600, Logan Gunthorpe wrote:
>>> We might be able to. I'm not sure. I'll have to figure out how to find
>>> that inode from the p2pdma code. I haven't found an obvious interface to
>>> do that.
>>
>> I think the right way to approach this would be a new sysfs API
>> that internally calls unmap_mapping_range internally instead of
>> exposing the inode. I suspect that might actually be the right thing
>> to do for iomem_inode as well.
> 
> Why do we need something new and how is this any different from the PCI
> binary files I mention above?  We have supported PCI hotplug for a very
> long time, do the current PCI binary sysfs files not work properly with
> mmap and removing a device?

The P2PDMA code allocates and hands out struct pages to userspace that
are backed with ZONE_DEVICE memory from a device's BAR. This is quite
different from the existing binary files mentioned above which neither
support struct pages nor allocation.

Logan


Re: [PATCH v7 20/21] PCI/P2PDMA: Introduce pci_mmap_p2pmem()

2022-07-05 Thread Logan Gunthorpe



On 2022-07-05 10:43, Christoph Hellwig wrote:
> On Tue, Jul 05, 2022 at 10:41:52AM -0600, Logan Gunthorpe wrote:
>> Using sysfs means we don't need all the messy callbacks from the nvme
>> driver, which is a plus. But I'm not sure how we'd get or unmap the
>> mapping of a sysfs file or avoid the anonymous inode. Seems with the
>> existing PCI resources, it uses an bin_attribute->f_mapping() callback
>> to pass back the iomem_get_mapping() mapping on file open.
>> revoke_iomem() is then used to nuke the VMAs. I don't think we can use
>> the same infrastructure here as that would add a dependency on
>> CONFIG_IO_STRICT_DEVMEM; which would be odd. And I'm not sure whether
>> there is a better way.
> 
> Why can't we do the revoke on the actual sysfs inode?

We might be able to. I'm not sure. I'll have to figure out how to find
that inode from the p2pdma code. I haven't found an obvious interface to
do that.

Logan


Re: [PATCH v7 20/21] PCI/P2PDMA: Introduce pci_mmap_p2pmem()

2022-07-05 Thread Logan Gunthorpe



On 2022-07-05 10:12, Christoph Hellwig wrote:
> On Tue, Jul 05, 2022 at 10:51:02AM -0300, Jason Gunthorpe wrote:
>>> In fact I'm not even sure this should be a character device, it seems
>>> to fit it way better with the PCI sysfs hierchacy, just like how we
>>> map MMIO resources, which these are anyway.  And once it is on sysfs
>>> we do have a uniqueue inode and need none of the pseudofs stuff, and
>>> don't need all the glue code in nvme either.
>>
>> Shouldn't there be an allocator here? It feels a bit weird that the
>> entire CMB is given to a single process, it is a sharable resource,
>> isn't it?
> 
> Making the entire area given by the device to the p2p allocator available
> to user space seems sensible to me.  That is what the current series does,
> and what a sysfs interface would do as well.

Yes, I think Jason is assuming the sysfs file would behave like the
existing mmio resource files where the process doing the mapping
specifies the offset and length into the BAR. That is not what we want
here, but I don't see why I don't see why we can't do the same thing in
sysfs as we do with the char device with a bin_attribute->mmap() callback.

mmapping the char device was convenient in user space, but it's not much
more work to dig through sysfs and mmap an attribute from there.

Using sysfs means we don't need all the messy callbacks from the nvme
driver, which is a plus. But I'm not sure how we'd get or unmap the
mapping of a sysfs file or avoid the anonymous inode. Seems with the
existing PCI resources, it uses a bin_attribute->f_mapping() callback
to pass back the iomem_get_mapping() mapping on file open.
revoke_iomem() is then used to nuke the VMAs. I don't think we can use
the same infrastructure here as that would add a dependency on
CONFIG_IO_STRICT_DEVMEM; which would be odd. And I'm not sure whether
there is a better way.
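
For reference, whichever interface ends up owning the revoke, it ultimately
comes down to tearing down every userspace VMA of the backing
address_space; assuming the p2pdma code can get at the file's mapping, that
is roughly:

	/* Sketch: invalidate all existing mmaps of the BAR-backed memory */
	unmap_mapping_range(mapping, 0, 0, 1);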

Logan


Re: [PATCH v7 16/21] block: add check when merging zone device pages

2022-06-30 Thread Logan Gunthorpe




On 2022-06-29 10:06, Logan Gunthorpe wrote:
> 
> 
> 
> On 2022-06-29 00:46, Christoph Hellwig wrote:
>> On Wed, Jun 15, 2022 at 10:12:28AM -0600, Logan Gunthorpe wrote:
>>> Consecutive zone device pages should not be merged into the same sgl
>>> or bvec segment with other types of pages or if they belong to different
>>> pgmaps. Otherwise getting the pgmap of a given segment is not possible
>>> without scanning the entire segment. This helper returns true either if
>>> both pages are not zone device pages or both pages are zone device
>>> pages with the same pgmap.
>>>
>>> Add a helper to determine if zone device pages are mergeable and use
>>> this helper in page_is_mergeable().
>>
>> Any reason not to simply set REQ_NOMERGE for these requests?  We
>> can't merge for passthrough requests anyway, and generally don't merge
>> for direct I/O either, so adding all this overhead seems a bit pointless.
> 
> Hmm, I suppose we could also ensure that REQ_NOMERGE is set in a bio
> before setting FOLL_PCI_P2PDMA in bio_map_user_iov() and
> __bio_iov_iter_get_pages(). Assuming it's always set for any direct I/O.
> 

Oh, it turns out this code has nothing to do with REQ_NOMERGE. It's used
indirectly in bio_map_user_iov() and __bio_iov_iter_get_pages() when
adding pages to the bio via page_is_mergeable(). So it's not about
requests being merged; it's about pages being merged.

So I'm not sure how we can avoid this entirely. It only happens when two
adjacent pages are added to the same bio in a row, so I don't think it's
that common, but the check can probably be moved down so it happens
after the same_page check to make it a little less common.
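
Roughly, that reordering would be (an excerpt from the tail of
page_is_mergeable(), untested sketch):

	*same_page = ((vec_end_addr & PAGE_MASK) == page_addr);
	if (*same_page)
		return true;

	/*
	 * Moved down: only reached when a genuinely new page is being
	 * appended, so the pgmap comparison runs less often.
	 */
	if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
		return false;
	/* ... rest of page_is_mergeable() unchanged ... */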

Logan


Re: [PATCH v7 08/21] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg

2022-06-30 Thread Logan Gunthorpe


On 2022-06-30 08:56, Robin Murphy wrote:
> On 2022-06-29 23:41, Logan Gunthorpe wrote:
>>
>>
>> On 2022-06-29 13:15, Robin Murphy wrote:
>>> On 2022-06-29 16:57, Logan Gunthorpe wrote:
>>>>
>>>>
>>>>
>>>> On 2022-06-29 06:07, Robin Murphy wrote:
>>>>> On 2022-06-15 17:12, Logan Gunthorpe wrote:
>>>>>> When a PCI P2PDMA page is seen, set the IOVA length of the segment
>>>>>> to zero so that it is not mapped into the IOVA. Then, in
>>>>>> finalise_sg(),
>>>>>> apply the appropriate bus address to the segment. The IOVA is not
>>>>>> created if the scatterlist only consists of P2PDMA pages.
>>>>>>
>>>>>> A P2PDMA page may have three possible outcomes when being mapped:
>>>>>>  1) If the data path between the two devices doesn't go through
>>>>>>     the root port, then it should be mapped with a PCI bus
>>>>>> address
>>>>>>  2) If the data path goes through the host bridge, it should be
>>>>>> mapped
>>>>>>     normally with an IOMMU IOVA.
>>>>>>  3) It is not possible for the two devices to communicate and
>>>>>> thus
>>>>>>     the mapping operation should fail (and it will return
>>>>>> -EREMOTEIO).
>>>>>>
>>>>>> Similar to dma-direct, the sg_dma_mark_pci_p2pdma() flag is used to
>>>>>> indicate bus address segments. On unmap, P2PDMA segments are skipped
>>>>>> over when determining the start and end IOVA addresses.
>>>>>>
>>>>>> With this change, the flags variable in the dma_map_ops is set to
>>>>>> DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.
>>>>>>
>>>>>> Signed-off-by: Logan Gunthorpe 
>>>>>> Reviewed-by: Jason Gunthorpe 
>>>>>> ---
>>>>>>     drivers/iommu/dma-iommu.c | 68
>>>>>> +++
>>>>>>     1 file changed, 61 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>>>>>> index f90251572a5d..b01ca0c6a7ab 100644
>>>>>> --- a/drivers/iommu/dma-iommu.c
>>>>>> +++ b/drivers/iommu/dma-iommu.c
>>>>>> @@ -21,6 +21,7 @@
>>>>>>     #include 
>>>>>>     #include 
>>>>>>     #include 
>>>>>> +#include 
>>>>>>     #include 
>>>>>>     #include 
>>>>>>     #include 
>>>>>> @@ -1062,6 +1063,16 @@ static int __finalise_sg(struct device *dev,
>>>>>> struct scatterlist *sg, int nents,
>>>>>>     sg_dma_address(s) = DMA_MAPPING_ERROR;
>>>>>>     sg_dma_len(s) = 0;
>>>>>>     +    if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
>>>>>
>>>>> Logically, should we not be able to use sg_is_dma_bus_address()
>>>>> here? I
>>>>> think it should be feasible, and simpler, to prepare the p2p segments
>>>>> up-front, such that at this point all we need to do is restore the
>>>>> original length (if even that, see below).
>>>>
>>>> Per my previous email, no, because sg_is_dma_bus_address() is not set
>>>> yet and not meant to tell you something about the page. That flag will
>>>> be set below by pci_p2pdma_map_bus_segment() and then checked in
>>>> iommu_dma_unmap_sg() to determine if the dma_address in the segment
>>>> needs to be unmapped.
>>>
>>> I know it's not set yet as-is; I'm suggesting things should be
>>> restructured so that it *would be*. In the logical design of this code,
>>> the DMA addresses are effectively determined in iommu_dma_map_sg(), and
>>> __finalise_sg() merely converts them from a relative to an absolute form
>>> (along with undoing the other trickery). Thus the call to
>>> pci_p2pdma_map_bus_segment() absolutely belongs in the main
>>> iommu_map_sg() loop.
>>
>> I don't see how that can work: __finalise_sg() does more than convert
>> them from relative to absolute, it also figures out which SG entry will
>> contain which dma_address segment. Which segment a P2PDMA address needs
>> to be programmed into depends on 

Re: [PATCH v7 08/21] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg

2022-06-29 Thread Logan Gunthorpe


On 2022-06-29 13:15, Robin Murphy wrote:
> On 2022-06-29 16:57, Logan Gunthorpe wrote:
>>
>>
>>
>> On 2022-06-29 06:07, Robin Murphy wrote:
>>> On 2022-06-15 17:12, Logan Gunthorpe wrote:
>>>> When a PCI P2PDMA page is seen, set the IOVA length of the segment
>>>> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
>>>> apply the appropriate bus address to the segment. The IOVA is not
>>>> created if the scatterlist only consists of P2PDMA pages.
>>>>
>>>> A P2PDMA page may have three possible outcomes when being mapped:
>>>>     1) If the data path between the two devices doesn't go through
>>>>    the root port, then it should be mapped with a PCI bus address
>>>>     2) If the data path goes through the host bridge, it should be
>>>> mapped
>>>>    normally with an IOMMU IOVA.
>>>>     3) It is not possible for the two devices to communicate and thus
>>>>    the mapping operation should fail (and it will return
>>>> -EREMOTEIO).
>>>>
>>>> Similar to dma-direct, the sg_dma_mark_pci_p2pdma() flag is used to
>>>> indicate bus address segments. On unmap, P2PDMA segments are skipped
>>>> over when determining the start and end IOVA addresses.
>>>>
>>>> With this change, the flags variable in the dma_map_ops is set to
>>>> DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.
>>>>
>>>> Signed-off-by: Logan Gunthorpe 
>>>> Reviewed-by: Jason Gunthorpe 
>>>> ---
>>>>    drivers/iommu/dma-iommu.c | 68
>>>> +++
>>>>    1 file changed, 61 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>>>> index f90251572a5d..b01ca0c6a7ab 100644
>>>> --- a/drivers/iommu/dma-iommu.c
>>>> +++ b/drivers/iommu/dma-iommu.c
>>>> @@ -21,6 +21,7 @@
>>>>    #include 
>>>>    #include 
>>>>    #include 
>>>> +#include 
>>>>    #include 
>>>>    #include 
>>>>    #include 
>>>> @@ -1062,6 +1063,16 @@ static int __finalise_sg(struct device *dev,
>>>> struct scatterlist *sg, int nents,
>>>>    sg_dma_address(s) = DMA_MAPPING_ERROR;
>>>>    sg_dma_len(s) = 0;
>>>>    +    if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
>>>
>>> Logically, should we not be able to use sg_is_dma_bus_address() here? I
>>> think it should be feasible, and simpler, to prepare the p2p segments
>>> up-front, such that at this point all we need to do is restore the
>>> original length (if even that, see below).
>>
>> Per my previous email, no, because sg_is_dma_bus_address() is not set
>> yet and not meant to tell you something about the page. That flag will
>> be set below by pci_p2pdma_map_bus_segment() and then checked in
>> iommu_dma_unmap_sg() to determine if the dma_address in the segment
>> needs to be unmapped.
> 
> I know it's not set yet as-is; I'm suggesting things should be
> restructured so that it *would be*. In the logical design of this code,
> the DMA addresses are effectively determined in iommu_dma_map_sg(), and
> __finalise_sg() merely converts them from a relative to an absolute form
> (along with undoing the other trickery). Thus the call to
> pci_p2pdma_map_bus_segment() absolutely belongs in the main
> iommu_map_sg() loop.

I don't see how that can work: __finalise_sg() does more than convert
them from relative to absolute; it also figures out which SG entry will
contain which dma_address segment. Which segment a P2PDMA address needs
to be programmed into depends on how 'cur' is calculated, which in
turn depends on things like seg_mask and max_len. This calculation is
not done in iommu_dma_map_sg(), so I don't see how there's any hope of
assigning the bus address for the P2P segments in that function.

If there's a way to restructure things to make that possible that I'm
not seeing, I'm open to it, but it's certainly not immediately obvious.

>>>> +
>>>> +    switch (map_type) {
>>>> +    case PCI_P2PDMA_MAP_BUS_ADDR:
>>>> +    /*
>>>> + * A zero length will be ignored by
>>>> + * iommu_map_sg() and then can be detected
>>>
>>> If that is required behaviour then it needs an explicit check in
>>> iommu_map_sg() to guarantee (and document) it.

Re: [PATCH v7 01/21] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL

2022-06-29 Thread Logan Gunthorpe



On 2022-06-29 12:02, Robin Murphy wrote:
> On 2022-06-29 16:39, Logan Gunthorpe wrote:
>> On 2022-06-29 03:05, Robin Murphy wrote:
>>> On 2022-06-15 17:12, Logan Gunthorpe wrote:
>>> Does this serve any useful purpose? If a page is determined to be device
>>> memory, it's not going to suddenly stop being device memory, and if the
>>> underlying sg is recycled to point elsewhere then sg_assign_page() will
>>> still (correctly) clear this flag anyway. Trying to reason about this
>>> beyond superficial API symmetry - i.e. why exactly would a caller need
>>> to call it, and what would the implications be of failing to do so -
>>> seems to lead straight to confusion.
>>>
>>> In fact I'd be inclined to have sg_assign_page() be responsible for
>>> setting the flag automatically as well, and thus not need
>>> sg_dma_mark_bus_address() either, however I can see the argument for
>>> doing it this way round to not entangle the APIs too much, so I don't
>>> have any great objection to that.
>>
>> Yes, I think you misunderstand what this is for. The SG_DMA_BUS_ADDRESS
>> flag doesn't mark the segment for the page, but for the dma address. It
>> cannot be set in sg_assign_page() since it's not a property of the page
>> but a property of the dma_address in the sgl.
>>
>> It's not meant for use by regular SG users; it's only meant for use
>> inside DMA mapping implementations. The purpose is to know whether a
>> given dma_address in the SGL is a bus address or regular memory, because
>> the two different types must be unmapped differently. We can't rely on
>> the page because, as you know, in many dma_map_sg() implementations the
>> dma_address entry in the sgl does not map to the same memory as the
>> page. Or to put it another way: is_pci_p2pdma_page(sg->page) does not
>> imply that sg->dma_address points to a bus address.
>>
>> Does that make sense?
> 
> Ah, you're quite right, in trying to take in the whole series at once
> first thing in the morning I did fail to properly grasp that detail, so
> indeed the sg_assign_page() thing couldn't possibly work, but as I said
> that's fine anyway. I still think the lifecycle management is a bit off
> though - equivalently, a bus address doesn't stop being a bus address,
> so it would seem appropriate to update this flag appropriately whenever
> sg_dma_address() is assigned to, and not when it isn't.

Yes, that's pretty much the way the code is now. The only two places
sg_dma_mark_bus_address() is called are in the two pci_p2pdma helpers
that set the dma address to the bus address. The lines before both calls
set the dma_address and dma_len.
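
For reference, the pattern in those helpers (see patch 4 of the series)
boils down to:

	sg->dma_address = sg_phys(sg) + state->bus_off;
	sg_dma_len(sg) = sg->length;
	sg_dma_mark_bus_address(sg);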

Logan



Re: [PATCH v7 16/21] block: add check when merging zone device pages

2022-06-29 Thread Logan Gunthorpe




On 2022-06-29 00:46, Christoph Hellwig wrote:
> On Wed, Jun 15, 2022 at 10:12:28AM -0600, Logan Gunthorpe wrote:
>> Consecutive zone device pages should not be merged into the same sgl
>> or bvec segment with other types of pages or if they belong to different
>> pgmaps. Otherwise getting the pgmap of a given segment is not possible
>> without scanning the entire segment. This helper returns true either if
>> both pages are not zone device pages or both pages are zone device
>> pages with the same pgmap.
>>
>> Add a helper to determine if zone device pages are mergeable and use
>> this helper in page_is_mergeable().
> 
> Any reason not to simply set REQ_NOMERGE for these requests?  We
> can't merge for passthrough requests anyway, and generally don't merge
> for direct I/O either, so adding all this overhead seems a bit pointless.

Hmm, I suppose we could also ensure that REQ_NOMERGE is set in a bio
before setting FOLL_PCI_P2PDMA in bio_map_user_iov() and
__bio_iov_iter_get_pages(). Assuming it's always set for any direct I/O.

I'll look into it.

Logan


Re: [PATCH v7 20/21] PCI/P2PDMA: Introduce pci_mmap_p2pmem()

2022-06-29 Thread Logan Gunthorpe




On 2022-06-29 00:48, Christoph Hellwig wrote:
> On Wed, Jun 15, 2022 at 10:12:32AM -0600, Logan Gunthorpe wrote:
>> A pseudo mount is used to allocate an inode for each PCI device. The
>> inode's address_space is used in the file doing the mmap so that all
>> VMAs are collected and can be unmapped if the PCI device is unbound.
>> After unmapping, the VMAs are iterated through and their pages are
>> put so the device can continue to be unbound. An active flag is used
>> to signal to VMAs not to allocate any further P2P memory once the
>> removal process starts. The flag is synchronized with concurrent
>> access with an RCU lock.
> 
> Can't we come up with a way of doing this without all the pseudo-fs
> garbage?  I really hate all the overhead for that in the next
> nvme patch as well.

I assume you still want to be able to unmap the VMAs on unbind and not
just hang?

I'll see if I can come up with something that does a similar thing using
vm_private_data or some such.

I was not a fan of the extra code for this either, but I was given to
understand that it was the standard way to collect and clean up VMAs.
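
The core of that standard approach is just unmap_mapping_range() on the
shared address_space; a minimal sketch (function name made up for
illustration):

#include <linux/mm.h>

static void p2pdma_revoke_mappings(struct address_space *mapping)
{
	/* zap every VMA currently mapping this inode's pages */
	unmap_mapping_range(mapping, 0, 0, 1);
}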

Thanks for the reviews,

Logan


Re: [PATCH v7 08/21] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg

2022-06-29 Thread Logan Gunthorpe



On 2022-06-29 06:07, Robin Murphy wrote:
> On 2022-06-15 17:12, Logan Gunthorpe wrote:
>> When a PCI P2PDMA page is seen, set the IOVA length of the segment
>> to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
>> apply the appropriate bus address to the segment. The IOVA is not
>> created if the scatterlist only consists of P2PDMA pages.
>>
>> A P2PDMA page may have three possible outcomes when being mapped:
>>    1) If the data path between the two devices doesn't go through
>>   the root port, then it should be mapped with a PCI bus address
>>    2) If the data path goes through the host bridge, it should be mapped
>>   normally with an IOMMU IOVA.
>>    3) It is not possible for the two devices to communicate and thus
>>   the mapping operation should fail (and it will return -EREMOTEIO).
>>
>> Similar to dma-direct, the sg_dma_mark_pci_p2pdma() flag is used to
>> indicate bus address segments. On unmap, P2PDMA segments are skipped
>> over when determining the start and end IOVA addresses.
>>
>> With this change, the flags variable in the dma_map_ops is set to
>> DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.
>>
>> Signed-off-by: Logan Gunthorpe 
>> Reviewed-by: Jason Gunthorpe 
>> ---
>>   drivers/iommu/dma-iommu.c | 68 +++
>>   1 file changed, 61 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
>> index f90251572a5d..b01ca0c6a7ab 100644
>> --- a/drivers/iommu/dma-iommu.c
>> +++ b/drivers/iommu/dma-iommu.c
>> @@ -21,6 +21,7 @@
>>   #include 
>>   #include 
>>   #include 
>> +#include 
>>   #include 
>>   #include 
>>   #include 
>> @@ -1062,6 +1063,16 @@ static int __finalise_sg(struct device *dev,
>> struct scatterlist *sg, int nents,
>>   sg_dma_address(s) = DMA_MAPPING_ERROR;
>>   sg_dma_len(s) = 0;
>>   +    if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
> 
> Logically, should we not be able to use sg_is_dma_bus_address() here? I
> think it should be feasible, and simpler, to prepare the p2p segments
> up-front, such that at this point all we need to do is restore the
> original length (if even that, see below).

Per my previous email, no, because sg_is_dma_bus_address() is not set
yet and not meant to tell you something about the page. That flag will
be set below by pci_p2pdma_map_bus_segment() and then checked in
iommu_dma_unmap_sg() to determine if the dma_address in the segment
needs to be unmapped.

> 
>> +    if (i > 0)
>> +    cur = sg_next(cur);
>> +
>> +    pci_p2pdma_map_bus_segment(s, cur);
>> +    count++;
>> +    cur_len = 0;
>> +    continue;
>> +    }
>> +
>>   /*
>>    * Now fill in the real DMA data. If...
>>    * - there is a valid output segment to append to
>> @@ -1158,6 +1169,8 @@ static int iommu_dma_map_sg(struct device *dev,
>> struct scatterlist *sg,
>>   struct iova_domain *iovad = &cookie->iovad;
>>   struct scatterlist *s, *prev = NULL;
>>   int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
>> +    struct dev_pagemap *pgmap = NULL;
>> +    enum pci_p2pdma_map_type map_type;
>>   dma_addr_t iova;
>>   size_t iova_len = 0;
>>   unsigned long mask = dma_get_seg_boundary(dev);
>> @@ -1193,6 +1206,35 @@ static int iommu_dma_map_sg(struct device *dev,
>> struct scatterlist *sg,
>>   s_length = iova_align(iovad, s_length + s_iova_off);
>>   s->length = s_length;
>>   +    if (is_pci_p2pdma_page(sg_page(s))) {
>> +    if (sg_page(s)->pgmap != pgmap) {
>> +    pgmap = sg_page(s)->pgmap;
>> +    map_type = pci_p2pdma_map_type(pgmap, dev);
>> +    }
> 
> There's a definite code smell here, but per above and below I think we
> *should* actually call the new helper instead of copy-pasting half of it.


> 
>> +
>> +    switch (map_type) {
>> +    case PCI_P2PDMA_MAP_BUS_ADDR:
>> +    /*
>> + * A zero length will be ignored by
>> + * iommu_map_sg() and then can be detected
> 
> If that is required behaviour then it needs an explicit check in
> iommu_map_sg() to guarantee (and document) it. It's only by chance that
> __iommu_map() happens to return success for size == 0 *if* all the other
> arguments still line up, which is a far cry from a safe no-

Re: [PATCH v7 01/21] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL

2022-06-29 Thread Logan Gunthorpe



On 2022-06-29 03:05, Robin Murphy wrote:
> On 2022-06-15 17:12, Logan Gunthorpe wrote:
>> Make use of the third free LSB in scatterlist's page_link on 64bit
>> systems.
>>
>> The extra bit will be used by dma_[un]map_sg_p2pdma() to determine when a
>> given SGL segments dma_address points to a PCI bus address.
>> dma_unmap_sg_p2pdma() will need to perform different cleanup when a
>> segment is marked as a bus address.
>>
>> The new bit will only be used when CONFIG_PCI_P2PDMA is set; this means
>> PCI P2PDMA will require CONFIG_64BIT. This should be acceptable as the
>> majority of P2PDMA use cases are restricted to newer root complexes and
>> roughly require the extra address space for memory BARs used in the
>> transactions.
>>
>> Signed-off-by: Logan Gunthorpe 
>> Reviewed-by: Chaitanya Kulkarni 
>> ---
>>   drivers/pci/Kconfig |  5 +
>>   include/linux/scatterlist.h | 44 -
>>   2 files changed, 48 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
>> index 133c73207782..5cc7cba1941f 100644
>> --- a/drivers/pci/Kconfig
>> +++ b/drivers/pci/Kconfig
>> @@ -164,6 +164,11 @@ config PCI_PASID
>>   config PCI_P2PDMA
>>   bool "PCI peer-to-peer transfer support"
>>   depends on ZONE_DEVICE
>> +    #
>> +    # The need for the scatterlist DMA bus address flag means PCI P2PDMA
>> +    # requires 64bit
>> +    #
>> +    depends on 64BIT
>>   select GENERIC_ALLOCATOR
>>   help
>>     Enables drivers to do PCI peer-to-peer transactions to and from
>> diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
>> index 7ff9d6386c12..6561ca8aead8 100644
>> --- a/include/linux/scatterlist.h
>> +++ b/include/linux/scatterlist.h
>> @@ -64,12 +64,24 @@ struct sg_append_table {
>>   #define SG_CHAIN    0x01UL
>>   #define SG_END    0x02UL
>>   +/*
>> + * bit 2 is the third free bit in the page_link on 64bit systems which
>> + * is used by dma_unmap_sg() to determine if the dma_address is a
>> + * bus address when doing P2PDMA.
>> + */
>> +#ifdef CONFIG_PCI_P2PDMA
>> +#define SG_DMA_BUS_ADDRESS    0x04UL
>> +static_assert(__alignof__(struct page) >= 8);
>> +#else
>> +#define SG_DMA_BUS_ADDRESS    0x00UL
>> +#endif
>> +
>>   /*
>>    * We overload the LSB of the page pointer to indicate whether it's
>>    * a valid sg entry, or whether it points to the start of a new
>> scatterlist.
>>    * Those low bits are there for everyone! (thanks mason :-)
>>    */
>> -#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END)
>> +#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_BUS_ADDRESS)
>>     static inline unsigned int __sg_flags(struct scatterlist *sg)
>>   {
>> @@ -91,6 +103,11 @@ static inline bool sg_is_last(struct scatterlist *sg)
>>   return __sg_flags(sg) & SG_END;
>>   }
>>   +static inline bool sg_is_dma_bus_address(struct scatterlist *sg)
>> +{
>> +    return __sg_flags(sg) & SG_DMA_BUS_ADDRESS;
>> +}
>> +
>>   /**
>>    * sg_assign_page - Assign a given page to an SG entry
>>    * @sg:    SG entry
>> @@ -245,6 +262,31 @@ static inline void sg_unmark_end(struct
>> scatterlist *sg)
>>   sg->page_link &= ~SG_END;
>>   }
>>   +/**
>> + * sg_dma_mark_bus address - Mark the scatterlist entry as a bus address
>> + * @sg: SG entryScatterlist
> 
> entryScatterlist?
> 
>> + *
>> + * Description:
>> + *   Marks the passed in sg entry to indicate that the dma_address is
>> + *   a bus address and doesn't need to be unmapped.
>> + **/
>> +static inline void sg_dma_mark_bus_address(struct scatterlist *sg)
>> +{
>> +    sg->page_link |= SG_DMA_BUS_ADDRESS;
>> +}
>> +
>> +/**
>> + * sg_unmark_pci_p2pdma - Unmark the scatterlist entry as a bus address
>> + * @sg: SG entryScatterlist
>> + *
>> + * Description:
>> + *   Clears the bus address mark.
>> + **/
>> +static inline void sg_dma_unmark_bus_address(struct scatterlist *sg)
>> +{
>> +    sg->page_link &= ~SG_DMA_BUS_ADDRESS;
>> +}
> 
> Does this serve any useful purpose? If a page is determined to be device
> memory, it's not going to suddenly stop being device memory, and if the
> underlying sg is recycled to point elsewhere then sg_assign_page() will
> still (correctly) clear this flag anyway. Trying to reason about this
> beyond superf

[PATCH v7 05/21] dma-mapping: allow EREMOTEIO return code for P2PDMA transfers

2022-06-15 Thread Logan Gunthorpe
Add EREMOTEIO error return to dma_map_sgtable() which will be used
by .map_sg() implementations that detect P2PDMA pages that the
underlying DMA device cannot access.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
---
 kernel/dma/mapping.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index db7244291b74..9f65d1041638 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -197,7 +197,7 @@ static int __dma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg,
if (ents > 0)
debug_dma_map_sg(dev, sg, nents, ents, dir, attrs);
else if (WARN_ON_ONCE(ents != -EINVAL && ents != -ENOMEM &&
- ents != -EIO))
+ ents != -EIO && ents != -EREMOTEIO))
return -EIO;
 
return ents;
@@ -255,6 +255,8 @@ EXPORT_SYMBOL(dma_map_sg_attrs);
  * complete the mapping. Should succeed if retried later.
  *   -EIO  Legacy error code with an unknown meaning. eg. this is
  * returned if a lower level call returned DMA_MAPPING_ERROR.
+ *   -EREMOTEIO  The DMA device cannot access P2PDMA memory specified in
+ * the sg_table. This will not succeed if retried.
  */
 int dma_map_sgtable(struct device *dev, struct sg_table *sgt,
enum dma_data_direction dir, unsigned long attrs)
-- 
2.30.2
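
As a usage sketch (not part of this patch), a caller that understands the
new error code might translate it like this; the block-layer mapping below
mirrors what the nvme conversion later in the series does:

#include <linux/blk_types.h>
#include <linux/dma-mapping.h>

static blk_status_t map_data_sgtable(struct device *dev, struct sg_table *sgt,
				     enum dma_data_direction dir)
{
	int rc = dma_map_sgtable(dev, sgt, dir, 0);

	if (rc == -EREMOTEIO)		/* unsupported P2PDMA: do not retry */
		return BLK_STS_TARGET;
	if (rc)				/* e.g. -ENOMEM: may succeed if retried */
		return BLK_STS_RESOURCE;
	return BLK_STS_OK;
}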



[PATCH v7 15/21] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()

2022-06-15 Thread Logan Gunthorpe
Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
which take a flags argument that is passed to get_user_pages_fast().

This is so that FOLL_PCI_P2PDMA can be passed when appropriate.

Signed-off-by: Logan Gunthorpe 
---
 include/linux/uio.h |  6 ++
 lib/iov_iter.c  | 25 +++--
 2 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 739285fe5a2f..ddf9e4cf4a59 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -232,8 +232,14 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int 
direction, struct pipe_inode
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t 
count);
 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray 
*xarray,
 loff_t start, size_t count);
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i, struct page **pages,
+   size_t maxsize, unsigned maxpages, size_t *start,
+   unsigned int gup_flags);
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
size_t maxsize, unsigned maxpages, size_t *start);
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
+   struct page ***pages, size_t maxsize, size_t *start,
+   unsigned int gup_flags);
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 6dd5330f7a99..9bf6e3af5120 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1515,9 +1515,9 @@ static struct page *first_bvec_segment(const struct 
iov_iter *i,
return page;
 }
 
-ssize_t iov_iter_get_pages(struct iov_iter *i,
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i,
   struct page **pages, size_t maxsize, unsigned maxpages,
-  size_t *start)
+  size_t *start, unsigned int gup_flags)
 {
size_t len;
int n, res;
@@ -1528,7 +1528,6 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
return 0;
 
if (likely(iter_is_iovec(i))) {
-   unsigned int gup_flags = 0;
unsigned long addr;
 
if (iov_iter_rw(i) != WRITE)
@@ -1558,6 +1557,13 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
return iter_xarray_get_pages(i, pages, maxsize, maxpages, 
start);
return -EFAULT;
 }
+EXPORT_SYMBOL_GPL(iov_iter_get_pages_flags);
+
+ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
+  size_t maxsize, unsigned maxpages, size_t *start)
+{
+   return iov_iter_get_pages_flags(i, pages, maxsize, maxpages, start, 0);
+}
 EXPORT_SYMBOL(iov_iter_get_pages);
 
 static struct page **get_pages_array(size_t n)
@@ -1640,9 +1646,9 @@ static ssize_t iter_xarray_get_pages_alloc(struct 
iov_iter *i,
return actual;
 }
 
-ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
   struct page ***pages, size_t maxsize,
-  size_t *start)
+  size_t *start, unsigned int gup_flags)
 {
struct page **p;
size_t len;
@@ -1654,7 +1660,6 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
return 0;
 
if (likely(iter_is_iovec(i))) {
-   unsigned int gup_flags = 0;
unsigned long addr;
 
if (iov_iter_rw(i) != WRITE)
@@ -1667,6 +1672,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
p = get_pages_array(n);
if (!p)
return -ENOMEM;
+
res = get_user_pages_fast(addr, n, gup_flags, p);
if (unlikely(res <= 0)) {
kvfree(p);
@@ -1694,6 +1700,13 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
return iter_xarray_get_pages_alloc(i, pages, maxsize, start);
return -EFAULT;
 }
+EXPORT_SYMBOL_GPL(iov_iter_get_pages_alloc_flags);
+
+ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
+size_t maxsize, size_t *start)
+{
+   return iov_iter_get_pages_alloc_flags(i, pages, maxsize, start, 0);
+}
 EXPORT_SYMBOL(iov_iter_get_pages_alloc);
 
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
-- 
2.30.2
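
A usage sketch (not from this patch): a caller that wants P2PDMA pages
pinned would pass FOLL_PCI_P2PDMA (added later in the series), and 0
otherwise:

static ssize_t pin_user_buffer(struct iov_iter *iter, struct page **pages,
			       size_t maxsize, unsigned int maxpages,
			       size_t *start, bool allow_p2pdma)
{
	unsigned int gup_flags = allow_p2pdma ? FOLL_PCI_P2PDMA : 0;

	return iov_iter_get_pages_flags(iter, pages, maxsize, maxpages,
					start, gup_flags);
}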



[PATCH v7 03/21] PCI/P2PDMA: Expose pci_p2pdma_map_type()

2022-06-15 Thread Logan Gunthorpe
pci_p2pdma_map_type() will be needed by the dma-iommu map_sg
implementation because it will need to determine the mapping type
ahead of actually doing the mapping to create the actual IOMMU mapping.

Prototypes for this helper are added to dma-map-ops.h as they are only
useful to dma map implementations and don't need to pollute the public
pci-p2pdma header.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/pci/p2pdma.c| 25 +
 include/linux/dma-map-ops.h | 45 +
 2 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4e8bc457e29a..10b1d5c2b5de 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -10,6 +10,7 @@
 
 #define pr_fmt(fmt) "pci-p2pdma: " fmt
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -20,13 +21,6 @@
 #include 
 #include 
 
-enum pci_p2pdma_map_type {
-   PCI_P2PDMA_MAP_UNKNOWN = 0,
-   PCI_P2PDMA_MAP_NOT_SUPPORTED,
-   PCI_P2PDMA_MAP_BUS_ADDR,
-   PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
-};
-
 struct pci_p2pdma {
struct gen_pool *pool;
bool p2pmem_published;
@@ -847,8 +841,21 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 
-static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
-   struct device *dev)
+/**
+ * pci_p2pdma_map_type - return the type of mapping that should be used for
+ * a given device and pgmap
+ * @pgmap: the pagemap of a page to determine the mapping type for
+ * @dev: device that is mapping the page
+ *
+ * Returns one of:
+ * PCI_P2PDMA_MAP_NOT_SUPPORTED - The mapping should not be done
+ * PCI_P2PDMA_MAP_BUS_ADDR - The mapping should use the PCI bus address
+ * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE - The mapping should be done normally
+ * using the CPU physical address (in dma-direct) or an IOVA
+ * mapping for the IOMMU.
+ */
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+struct device *dev)
 {
enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d5b06b3a4a6..d693a0e33bac 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -379,4 +379,49 @@ static inline void debug_dma_dump_mappings(struct device 
*dev)
 
 extern const struct dma_map_ops dma_dummy_ops;
 
+enum pci_p2pdma_map_type {
+   /*
+* PCI_P2PDMA_MAP_UNKNOWN: Used internally for indicating the mapping
+* type hasn't been calculated yet. Functions that return this enum
+* never return this value.
+*/
+   PCI_P2PDMA_MAP_UNKNOWN = 0,
+
+   /*
+* PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
+* traverse the host bridge and the host bridge is not in the
+* allowlist. DMA Mapping routines should return an error when
+* this is returned.
+*/
+   PCI_P2PDMA_MAP_NOT_SUPPORTED,
+
+   /*
+* PCI_P2PDMA_BUS_ADDR: Indicates that two devices can talk to
+* each other directly through a PCI switch and the transaction will
+* not traverse the host bridge. Such a mapping should program
+* the DMA engine with PCI bus addresses.
+*/
+   PCI_P2PDMA_MAP_BUS_ADDR,
+
+   /*
+* PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
+* to each other, but the transaction traverses a host bridge on the
+* allowlist. In this case, a normal mapping either with CPU physical
+* addresses (in the case of dma-direct) or IOVA addresses (in the
+* case of IOMMUs) should be used to program the DMA engine.
+*/
+   PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
+};
+
+#ifdef CONFIG_PCI_P2PDMA
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+struct device *dev);
+#else /* CONFIG_PCI_P2PDMA */
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev)
+{
+   return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+}
+#endif /* CONFIG_PCI_P2PDMA */
+
 #endif /* _LINUX_DMA_MAP_OPS_H */
-- 
2.30.2



[PATCH v7 09/21] nvme-pci: check DMA ops when indicating support for PCI P2PDMA

2022-06-15 Thread Logan Gunthorpe
Introduce a supports_pci_p2pdma() operation in nvme_ctrl_ops to
replace the fixed NVME_F_PCI_P2PDMA flag such that the dma_map_ops
flags can be checked for PCI P2PDMA support.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/nvme/host/core.c |  3 ++-
 drivers/nvme/host/nvme.h |  2 +-
 drivers/nvme/host/pci.c  | 11 +--
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 24165daee3c8..d6e76f2dc293 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3981,7 +3981,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, 
unsigned nsid,
blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
 
blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
-   if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
+   if (ctrl->ops->supports_pci_p2pdma &&
+   ctrl->ops->supports_pci_p2pdma(ctrl))
blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
 
ns->ctrl = ctrl;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9b72b6ecf33c..957f79420cf3 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -495,7 +495,6 @@ struct nvme_ctrl_ops {
unsigned int flags;
 #define NVME_F_FABRICS (1 << 0)
 #define NVME_F_METADATA_SUPPORTED  (1 << 1)
-#define NVME_F_PCI_P2PDMA  (1 << 2)
int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
@@ -503,6 +502,7 @@ struct nvme_ctrl_ops {
void (*submit_async_event)(struct nvme_ctrl *ctrl);
void (*delete_ctrl)(struct nvme_ctrl *ctrl);
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
+   bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 48f4f6eb877b..e5e032ab1c71 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2976,17 +2976,24 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, 
char *buf, int size)
return snprintf(buf, size, "%s\n", dev_name(>dev));
 }
 
+static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
+{
+   struct nvme_dev *dev = to_nvme_dev(ctrl);
+
+   return dma_pci_p2pdma_supported(dev->dev);
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name   = "pcie",
.module = THIS_MODULE,
-   .flags  = NVME_F_METADATA_SUPPORTED |
- NVME_F_PCI_P2PDMA,
+   .flags  = NVME_F_METADATA_SUPPORTED,
.reg_read32 = nvme_pci_reg_read32,
.reg_write32= nvme_pci_reg_write32,
.reg_read64 = nvme_pci_reg_read64,
.free_ctrl  = nvme_pci_free_ctrl,
.submit_async_event = nvme_pci_submit_async_event,
.get_address= nvme_pci_get_address,
+   .supports_pci_p2pdma= nvme_pci_supports_pci_p2pdma,
 };
 
 static int nvme_dev_map(struct nvme_dev *dev)
-- 
2.30.2



[PATCH v7 10/21] nvme-pci: convert to using dma_map_sgtable()

2022-06-15 Thread Logan Gunthorpe
The dma_map operations now support P2PDMA pages directly. So remove
the calls to pci_p2pdma_[un]map_sg_attrs() and replace them with calls
to dma_map_sgtable().

dma_map_sgtable() returns more complete error codes than dma_map_sg()
and allows differentiating EREMOTEIO errors in case an unsupported
P2PDMA transfer is requested. When this happens, return BLK_STS_TARGET
so the request isn't retried.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Max Gurtovoy 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/nvme/host/pci.c | 69 +
 1 file changed, 29 insertions(+), 40 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e5e032ab1c71..52b52a7efa9a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -230,11 +230,10 @@ struct nvme_iod {
bool use_sgl;
int aborted;
int npages; /* In the PRP list. 0 means small pool in use */
-   int nents;  /* Used in scatterlist */
dma_addr_t first_dma;
unsigned int dma_len;   /* length of single DMA segment mapping */
dma_addr_t meta_dma;
-   struct scatterlist *sg;
+   struct sg_table sgt;
 };
 
 static inline unsigned int nvme_dbbuf_size(struct nvme_dev *dev)
@@ -524,7 +523,7 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
 static void **nvme_pci_iod_list(struct request *req)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-   return (void **)(iod->sg + blk_rq_nr_phys_segments(req));
+   return (void **)(iod->sgt.sgl + blk_rq_nr_phys_segments(req));
 }
 
 static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req)
@@ -576,17 +575,6 @@ static void nvme_free_sgls(struct nvme_dev *dev, struct 
request *req)
}
 }
 
-static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req)
-{
-   struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-
-   if (is_pci_p2pdma_page(sg_page(iod->sg)))
-   pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
-   rq_dma_dir(req));
-   else
-   dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
-}
-
 static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -597,9 +585,10 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct 
request *req)
return;
}
 
-   WARN_ON_ONCE(!iod->nents);
+   WARN_ON_ONCE(!iod->sgt.nents);
+
+   dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
 
-   nvme_unmap_sg(dev, req);
if (iod->npages == 0)
dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
  iod->first_dma);
@@ -607,7 +596,7 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct 
request *req)
nvme_free_sgls(dev, req);
else
nvme_free_prps(dev, req);
-   mempool_free(iod->sg, dev->iod_mempool);
+   mempool_free(iod->sgt.sgl, dev->iod_mempool);
 }
 
 static void nvme_print_sgl(struct scatterlist *sgl, int nents)
@@ -630,7 +619,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev 
*dev,
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
int length = blk_rq_payload_bytes(req);
-   struct scatterlist *sg = iod->sg;
+   struct scatterlist *sg = iod->sgt.sgl;
int dma_len = sg_dma_len(sg);
u64 dma_addr = sg_dma_address(sg);
int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
@@ -703,16 +692,16 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev 
*dev,
dma_len = sg_dma_len(sg);
}
 done:
-   cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
+   cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
return BLK_STS_OK;
 free_prps:
nvme_free_prps(dev, req);
return BLK_STS_RESOURCE;
 bad_sgl:
-   WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
+   WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
"Invalid SGL for payload:%d nents:%d\n",
-   blk_rq_payload_bytes(req), iod->nents);
+   blk_rq_payload_bytes(req), iod->sgt.nents);
return BLK_STS_IOERR;
 }
 
@@ -738,12 +727,13 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc 
*sge,
 }
 
 static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
-   struct request *req, struct nvme_rw_command *cmd, int entries)
+   struct request *req, struct nvme_rw_command *cmd)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
struct nvme_sgl_desc *sg_list;
-   struct scatterlist *sg = iod->sg;
+   struct scatt
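
Roughly, the new dma_map_sgtable() call and error handling described in
the commit message look like this (illustrative sketch, not the literal
hunk):

	rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req),
			     DMA_ATTR_NO_WARN);
	if (rc) {
		if (rc == -EREMOTEIO)
			ret = BLK_STS_TARGET;	/* don't retry unsupported P2P */
		else
			ret = BLK_STS_RESOURCE;
		goto out_free_sg;		/* label illustrative */
	}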

[PATCH v7 08/21] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg

2022-06-15 Thread Logan Gunthorpe
When a PCI P2PDMA page is seen, set the IOVA length of the segment
to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
apply the appropriate bus address to the segment. The IOVA is not
created if the scatterlist only consists of P2PDMA pages.

A P2PDMA page may have three possible outcomes when being mapped:
  1) If the data path between the two devices doesn't go through
 the root port, then it should be mapped with a PCI bus address
  2) If the data path goes through the host bridge, it should be mapped
 normally with an IOMMU IOVA.
  3) It is not possible for the two devices to communicate and thus
 the mapping operation should fail (and it will return -EREMOTEIO).

Similar to dma-direct, the sg_dma_mark_pci_p2pdma() flag is used to
indicate bus address segments. On unmap, P2PDMA segments are skipped
over when determining the start and end IOVA addresses.

With this change, the flags variable in the dma_map_ops is set to
DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
---
 drivers/iommu/dma-iommu.c | 68 +++
 1 file changed, 61 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f90251572a5d..b01ca0c6a7ab 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1062,6 +1063,16 @@ static int __finalise_sg(struct device *dev, struct 
scatterlist *sg, int nents,
sg_dma_address(s) = DMA_MAPPING_ERROR;
sg_dma_len(s) = 0;
 
+   if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
+   if (i > 0)
+   cur = sg_next(cur);
+
+   pci_p2pdma_map_bus_segment(s, cur);
+   count++;
+   cur_len = 0;
+   continue;
+   }
+
/*
 * Now fill in the real DMA data. If...
 * - there is a valid output segment to append to
@@ -1158,6 +1169,8 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
struct iova_domain *iovad = &cookie->iovad;
struct scatterlist *s, *prev = NULL;
int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
+   struct dev_pagemap *pgmap = NULL;
+   enum pci_p2pdma_map_type map_type;
dma_addr_t iova;
size_t iova_len = 0;
unsigned long mask = dma_get_seg_boundary(dev);
@@ -1193,6 +1206,35 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
s_length = iova_align(iovad, s_length + s_iova_off);
s->length = s_length;
 
+   if (is_pci_p2pdma_page(sg_page(s))) {
+   if (sg_page(s)->pgmap != pgmap) {
+   pgmap = sg_page(s)->pgmap;
+   map_type = pci_p2pdma_map_type(pgmap, dev);
+   }
+
+   switch (map_type) {
+   case PCI_P2PDMA_MAP_BUS_ADDR:
+   /*
+* A zero length will be ignored by
+* iommu_map_sg() and then can be detected
+* in __finalise_sg() to actually map the
+* bus address.
+*/
+   s->length = 0;
+   continue;
+   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+   /*
+* Mapping through host bridge should be
+* mapped with regular IOVAs, thus we
+* do nothing here and continue below.
+*/
+   break;
+   default:
+   ret = -EREMOTEIO;
+   goto out_restore_sg;
+   }
+   }
+
/*
 * Due to the alignment of our single IOVA allocation, we can
 * depend on these assumptions about the segment boundary mask:
@@ -1215,6 +1257,9 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
prev = s;
}
 
+   if (!iova_len)
+   return __finalise_sg(dev, sg, nents, 0);
+
iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
if (!iova) {
ret = -ENOMEM;
@@ -1236,7 +1281,7 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
 out_restore_sg:
__invalidate_sg(sg, nents);
 out:
-   if (ret != -ENOMEM)
+   if (ret != -ENOMEM && ret != -EREMOTEIO)

[PATCH v7 17/21] lib/scatterlist: add check when merging zone device pages

2022-06-15 Thread Logan Gunthorpe
Consecutive zone device pages should not be merged into the same sgl
or bvec segment with other types of pages or if they belong to different
pgmaps. Otherwise getting the pgmap of a given segment is not possible
without scanning the entire segment. This helper returns true if both
pages are not zone device pages or if both pages are zone device pages
with the same pgmap.

Factor out the check for page mergeability into a pages_are_mergeable()
helper and add a check with zone_device_pages_have_same_pgmap().

Signed-off-by: Logan Gunthorpe 
---
 lib/scatterlist.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index d5e82e4a57ad..af53a0984f76 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -410,6 +410,15 @@ static struct scatterlist *get_next_sg(struct 
sg_append_table *table,
return new_sg;
 }
 
+static bool pages_are_mergeable(struct page *a, struct page *b)
+{
+   if (page_to_pfn(a) != page_to_pfn(b) + 1)
+   return false;
+   if (!zone_device_pages_have_same_pgmap(a, b))
+   return false;
+   return true;
+}
+
 /**
  * sg_alloc_append_table_from_pages - Allocate and initialize an append sg
  *table from an array of pages
@@ -447,6 +456,7 @@ int sg_alloc_append_table_from_pages(struct sg_append_table 
*sgt_append,
unsigned int chunks, cur_page, seg_len, i, prv_len = 0;
unsigned int added_nents = 0;
struct scatterlist *s = sgt_append->prv;
+   struct page *last_pg;
 
/*
 * The algorithm below requires max_segment to be aligned to PAGE_SIZE
@@ -460,21 +470,17 @@ int sg_alloc_append_table_from_pages(struct 
sg_append_table *sgt_append,
return -EOPNOTSUPP;
 
if (sgt_append->prv) {
-   unsigned long paddr =
-   (page_to_pfn(sg_page(sgt_append->prv)) * PAGE_SIZE +
-sgt_append->prv->offset + sgt_append->prv->length) /
-   PAGE_SIZE;
-
if (WARN_ON(offset))
return -EINVAL;
 
/* Merge contiguous pages into the last SG */
prv_len = sgt_append->prv->length;
-   while (n_pages && page_to_pfn(pages[0]) == paddr) {
+   last_pg = sg_page(sgt_append->prv);
+   while (n_pages && pages_are_mergeable(last_pg, pages[0])) {
if (sgt_append->prv->length + PAGE_SIZE > max_segment)
break;
sgt_append->prv->length += PAGE_SIZE;
-   paddr++;
+   last_pg = pages[0];
pages++;
n_pages--;
}
@@ -488,7 +494,7 @@ int sg_alloc_append_table_from_pages(struct sg_append_table 
*sgt_append,
for (i = 1; i < n_pages; i++) {
seg_len += PAGE_SIZE;
if (seg_len >= max_segment ||
-   page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1) {
+   !pages_are_mergeable(pages[i], pages[i - 1])) {
chunks++;
seg_len = 0;
}
@@ -504,8 +510,7 @@ int sg_alloc_append_table_from_pages(struct sg_append_table 
*sgt_append,
for (j = cur_page + 1; j < n_pages; j++) {
seg_len += PAGE_SIZE;
if (seg_len >= max_segment ||
-   page_to_pfn(pages[j]) !=
-   page_to_pfn(pages[j - 1]) + 1)
+   !pages_are_mergeable(pages[j], pages[j - 1]))
break;
}
 
-- 
2.30.2



[PATCH v7 06/21] dma-direct: support PCI P2PDMA pages in dma-direct map_sg

2022-06-15 Thread Logan Gunthorpe
Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
PCI P2PDMA pages directly without a hack in the callers. This allows
for heterogeneous SGLs that contain both P2PDMA and regular pages.

A P2PDMA page may have three possible outcomes when being mapped:
  1) If the data path between the two devices doesn't go through the
 root port, then it should be mapped with a PCI bus address
  2) If the data path goes through the host bridge, it should be mapped
 normally, as though it were a CPU physical address
  3) It is not possible for the two devices to communicate and thus
 the mapping operation should fail (and it will return -EREMOTEIO).

SGL segments that contain PCI bus addresses are marked with
sg_dma_mark_pci_p2pdma() and are ignored when unmapped.

P2PDMA mappings are also failed if swiotlb needs to be used on the
mapping.

Signed-off-by: Logan Gunthorpe 
---
 kernel/dma/direct.c | 43 +--
 kernel/dma/direct.h |  8 +++-
 2 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index e978f36e6be8..133a4be2d3e5 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -454,29 +454,60 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
arch_sync_dma_for_cpu_all();
 }
 
+/*
+ * Unmaps segments, except for ones marked as pci_p2pdma which do not
+ * require any further action as they contain a bus address.
+ */
 void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
int nents, enum dma_data_direction dir, unsigned long attrs)
 {
struct scatterlist *sg;
int i;
 
-   for_each_sg(sgl, sg, nents, i)
-   dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
-attrs);
+   for_each_sg(sgl,  sg, nents, i) {
+   if (sg_is_dma_bus_address(sg))
+   sg_dma_unmark_bus_address(sg);
+   else
+   dma_direct_unmap_page(dev, sg->dma_address,
+ sg_dma_len(sg), dir, attrs);
+   }
 }
 #endif
 
 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
enum dma_data_direction dir, unsigned long attrs)
 {
-   int i;
+   struct pci_p2pdma_map_state p2pdma_state = {};
+   enum pci_p2pdma_map_type map;
struct scatterlist *sg;
+   int i, ret;
 
for_each_sg(sgl, sg, nents, i) {
+   if (is_pci_p2pdma_page(sg_page(sg))) {
+   map = pci_p2pdma_map_segment(&p2pdma_state, dev, sg);
+   switch (map) {
+   case PCI_P2PDMA_MAP_BUS_ADDR:
+   continue;
+   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+   /*
+* Any P2P mapping that traverses the PCI
+* host bridge must be mapped with CPU physical
+* address and not PCI bus addresses. This is
+* done with dma_direct_map_page() below.
+*/
+   break;
+   default:
+   ret = -EREMOTEIO;
+   goto out_unmap;
+   }
+   }
+
sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
sg->offset, sg->length, dir, attrs);
-   if (sg->dma_address == DMA_MAPPING_ERROR)
+   if (sg->dma_address == DMA_MAPPING_ERROR) {
+   ret = -EIO;
goto out_unmap;
+   }
sg_dma_len(sg) = sg->length;
}
 
@@ -484,7 +515,7 @@ int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl, int nents,
 
 out_unmap:
dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
-   return -EIO;
+   return ret;
 }
 
 dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index a78c0ba70645..e38ffc5e6bdd 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -8,6 +8,7 @@
 #define _KERNEL_DMA_DIRECT_H
 
 #include 
+#include 
 
 int dma_direct_get_sgtable(struct device *dev, struct sg_table *sgt,
void *cpu_addr, dma_addr_t dma_addr, size_t size,
@@ -87,10 +88,15 @@ static inline dma_addr_t dma_direct_map_page(struct device 
*dev,
phys_addr_t phys = page_to_phys(page) + offset;
dma_addr_t dma_addr = phys_to_dma(dev, phys);
 
-   if (is_swiotlb_force_bounce(dev))
+   if (is_swiotlb_force_bounce(dev)) {
+   if (is_pci_p2pdma_page(page))
+   return DMA_MAPPING_ERROR;
return swiotlb_map(dev, phys, size, dir, attrs);
+   }
 
if (u

[PATCH v7 16/21] block: add check when merging zone device pages

2022-06-15 Thread Logan Gunthorpe
Consecutive zone device pages should not be merged into the same sgl
or bvec segment with other types of pages or if they belong to different
pgmaps. Otherwise getting the pgmap of a given segment is not possible
without scanning the entire segment. This helper returns true if both
pages are not zone device pages or if both pages are zone device pages
with the same pgmap.

Add a helper to determine if zone device pages are mergeable and use
this helper in page_is_mergeable().

Signed-off-by: Logan Gunthorpe 
---
 block/bio.c|  2 ++
 include/linux/mm.h | 23 +++
 2 files changed, 25 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index f92d0223247b..a402a4760457 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -865,6 +865,8 @@ static inline bool page_is_mergeable(const struct bio_vec 
*bv,
return false;
if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
return false;
+   if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
+   return false;
 
*same_page = ((vec_end_addr & PAGE_MASK) == page_addr);
if (*same_page)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0bcb54ea503c..33b2f4d9fd0a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1108,6 +1108,24 @@ static inline bool is_zone_device_page(const struct page 
*page)
 {
return page_zonenum(page) == ZONE_DEVICE;
 }
+
+/*
+ * Consecutive zone device pages should not be merged into the same sgl
+ * or bvec segment with other types of pages or if they belong to different
+ * pgmaps. Otherwise getting the pgmap of a given segment is not possible
+ * without scanning the entire segment. This helper returns true either if
+ * both pages are not zone device pages or both pages are zone device pages
+ * with the same pgmap.
+ */
+static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
+const struct page *b)
+{
+   if (is_zone_device_page(a) != is_zone_device_page(b))
+   return false;
+   if (!is_zone_device_page(a))
+   return true;
+   return a->pgmap == b->pgmap;
+}
 extern void memmap_init_zone_device(struct zone *, unsigned long,
unsigned long, struct dev_pagemap *);
 #else
@@ -1115,6 +1133,11 @@ static inline bool is_zone_device_page(const struct page 
*page)
 {
return false;
 }
+static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
+const struct page *b)
+{
+   return true;
+}
 #endif
 
 static inline bool folio_is_zone_device(const struct folio *folio)
-- 
2.30.2



[PATCH v7 01/21] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL

2022-06-15 Thread Logan Gunthorpe
Make use of the third free LSB in scatterlist's page_link on 64bit systems.

The extra bit will be used by dma_[un]map_sg_p2pdma() to determine when a
given SGL segment's dma_address points to a PCI bus address.
dma_unmap_sg_p2pdma() will need to perform different cleanup when a
segment is marked as a bus address.

The new bit will only be used when CONFIG_PCI_P2PDMA is set; this means
PCI P2PDMA will require CONFIG_64BIT. This should be acceptable as the
majority of P2PDMA use cases are restricted to newer root complexes and
roughly require the extra address space for memory BARs used in the
transactions.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/pci/Kconfig |  5 +
 include/linux/scatterlist.h | 44 -
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 133c73207782..5cc7cba1941f 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -164,6 +164,11 @@ config PCI_PASID
 config PCI_P2PDMA
bool "PCI peer-to-peer transfer support"
depends on ZONE_DEVICE
+   #
+   # The need for the scatterlist DMA bus address flag means PCI P2PDMA
+   # requires 64bit
+   #
+   depends on 64BIT
select GENERIC_ALLOCATOR
help
  Enables drivers to do PCI peer-to-peer transactions to and from
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 7ff9d6386c12..6561ca8aead8 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -64,12 +64,24 @@ struct sg_append_table {
 #define SG_CHAIN   0x01UL
 #define SG_END 0x02UL
 
+/*
+ * bit 2 is the third free bit in the page_link on 64bit systems which
+ * is used by dma_unmap_sg() to determine if the dma_address is a
+ * bus address when doing P2PDMA.
+ */
+#ifdef CONFIG_PCI_P2PDMA
+#define SG_DMA_BUS_ADDRESS 0x04UL
+static_assert(__alignof__(struct page) >= 8);
+#else
+#define SG_DMA_BUS_ADDRESS 0x00UL
+#endif
+
 /*
  * We overload the LSB of the page pointer to indicate whether it's
  * a valid sg entry, or whether it points to the start of a new scatterlist.
  * Those low bits are there for everyone! (thanks mason :-)
  */
-#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END)
+#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_BUS_ADDRESS)
 
 static inline unsigned int __sg_flags(struct scatterlist *sg)
 {
@@ -91,6 +103,11 @@ static inline bool sg_is_last(struct scatterlist *sg)
return __sg_flags(sg) & SG_END;
 }
 
+static inline bool sg_is_dma_bus_address(struct scatterlist *sg)
+{
+   return __sg_flags(sg) & SG_DMA_BUS_ADDRESS;
+}
+
 /**
  * sg_assign_page - Assign a given page to an SG entry
  * @sg:SG entry
@@ -245,6 +262,31 @@ static inline void sg_unmark_end(struct scatterlist *sg)
sg->page_link &= ~SG_END;
 }
 
+/**
+ * sg_dma_mark_bus_address - Mark the scatterlist entry as a bus address
+ * @sg:	 SG entry
+ *
+ * Description:
+ *   Marks the passed in sg entry to indicate that the dma_address is
+ *   a bus address and doesn't need to be unmapped.
+ **/
+static inline void sg_dma_mark_bus_address(struct scatterlist *sg)
+{
+   sg->page_link |= SG_DMA_BUS_ADDRESS;
+}
+
+/**
+ * sg_dma_unmark_bus_address - Unmark the scatterlist entry as a bus address
+ * @sg:	 SG entry
+ *
+ * Description:
+ *   Clears the bus address mark.
+ **/
+static inline void sg_dma_unmark_bus_address(struct scatterlist *sg)
+{
+   sg->page_link &= ~SG_DMA_BUS_ADDRESS;
+}
+
 /**
  * sg_phys - Return physical address of an sg entry
  * @sg: SG entry
-- 
2.30.2


[PATCH v7 04/21] PCI/P2PDMA: Introduce helpers for dma_map_sg implementations

2022-06-15 Thread Logan Gunthorpe
Add pci_p2pdma_map_segment() as a helper for simple dma_map_sg()
implementations. It takes a scatterlist segment that must point to a
pci_p2pdma struct page and will map it if the mapping requires a bus
address.

The return value indicates whether the mapping required a bus address
or whether the caller still needs to map the segment normally. If the
segment should not be mapped, -EREMOTEIO is returned.

This helper uses a state structure to track the changes to the
pgmap across calls and avoids needing to look up the xarray for
every page.

Also add pci_p2pdma_map_bus_segment() which is useful for IOMMU
dma_map_sg() implementations where the sg segment containing the page
differs from the sg segment containing the DMA address.

Prototypes for these helpers are added to dma-map-ops.h as they are only
useful to dma map implementations and don't need to pollute the public
pci-p2pdma header.
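
For illustration (a sketch, not code from this patch), a simple
dma_map_sg() implementation is expected to drive the helper roughly as
follows; error handling and the normal mapping path are elided:

    struct pci_p2pdma_map_state p2pdma_state = {};
    struct scatterlist *sg;
    int i;

    for_each_sg(sgl, sg, nents, i) {
            if (is_pci_p2pdma_page(sg_page(sg))) {
                    switch (pci_p2pdma_map_segment(&p2pdma_state, dev, sg)) {
                    case PCI_P2PDMA_MAP_BUS_ADDR:
                            continue;  /* dma_address already set and marked */
                    case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
                            break;     /* fall through to normal mapping */
                    default:
                            return -EREMOTEIO;
                    }
            }
            /* ... map this segment normally ... */
    }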

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
---
 drivers/pci/p2pdma.c| 59 +
 include/linux/dma-map-ops.h | 21 +
 2 files changed, 80 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 10b1d5c2b5de..2fc0f4750a2e 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -951,6 +951,65 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct 
scatterlist *sg,
 }
 EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
 
+/**
+ * pci_p2pdma_map_segment - map an sg segment determining the mapping type
+ * @state: State structure that should be declared outside of the for_each_sg()
+ * loop and initialized to zero.
+ * @dev: DMA device that's doing the mapping operation
+ * @sg: scatterlist segment to map
+ *
+ * This is a helper to be used by non-IOMMU dma_map_sg() implementations where
+ * the sg segment is the same for the page_link and the dma_address.
+ *
+ * Attempt to map a single segment in an SGL with the PCI bus address.
+ * The segment must point to a PCI P2PDMA page and thus must be
+ * wrapped in a is_pci_p2pdma_page(sg_page(sg)) check.
+ *
+ * Returns the type of mapping used and maps the page if the type is
+ * PCI_P2PDMA_MAP_BUS_ADDR.
+ */
+enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+  struct scatterlist *sg)
+{
+   if (state->pgmap != sg_page(sg)->pgmap) {
+   state->pgmap = sg_page(sg)->pgmap;
+   state->map = pci_p2pdma_map_type(state->pgmap, dev);
+   state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
+   }
+
+   if (state->map == PCI_P2PDMA_MAP_BUS_ADDR) {
+   sg->dma_address = sg_phys(sg) + state->bus_off;
+   sg_dma_len(sg) = sg->length;
+   sg_dma_mark_bus_address(sg);
+   }
+
+   return state->map;
+}
+
+/**
+ * pci_p2pdma_map_bus_segment - map an sg segment pre determined to
+ * be mapped with PCI_P2PDMA_MAP_BUS_ADDR
+ * @pg_sg: scatterlist segment with the page to map
+ * @dma_sg: scatterlist segment to assign a DMA address to
+ *
+ * This is a helper for iommu dma_map_sg() implementations when the
+ * segment for the DMA address differs from the segment containing the
+ * source page.
+ *
+ * pci_p2pdma_map_type() must have already been called on the pg_sg and
+ * returned PCI_P2PDMA_MAP_BUS_ADDR.
+ */
+void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+   struct scatterlist *dma_sg)
+{
+   struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(sg_page(pg_sg)->pgmap);
+
+   dma_sg->dma_address = sg_phys(pg_sg) + pgmap->bus_offset;
+   sg_dma_len(dma_sg) = pg_sg->length;
+   sg_dma_mark_bus_address(dma_sg);
+}
+
 /**
  * pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
  * to enable p2pdma
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index d693a0e33bac..752f91e5eb5d 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -413,15 +413,36 @@ enum pci_p2pdma_map_type {
PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
 };
 
+struct pci_p2pdma_map_state {
+   struct dev_pagemap *pgmap;
+   int map;
+   u64 bus_off;
+};
+
 #ifdef CONFIG_PCI_P2PDMA
 enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 struct device *dev);
+enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+  struct scatterlist *sg);
+void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+   struct scatterlist *dma_sg);
 #else /* CONFIG_PCI_P2PDMA */
 static inline enum pci_p2pdma_map_type
 pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev)
 {
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
 }
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct 

[PATCH v7 19/21] block: set FOLL_PCI_P2PDMA in bio_map_user_iov()

2022-06-15 Thread Logan Gunthorpe
When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be
passed from userspace and enables NVMe passthru requests to
use P2PDMA pages.

Signed-off-by: Logan Gunthorpe 
---
 block/blk-map.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index df8b066cd548..1d6bcf193a42 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -236,6 +236,7 @@ static int bio_map_user_iov(struct request *rq, struct 
iov_iter *iter,
 {
unsigned int max_sectors = queue_max_hw_sectors(rq->q);
unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS);
+   unsigned int flags = 0;
struct bio *bio;
int ret;
int j;
@@ -248,13 +249,17 @@ static int bio_map_user_iov(struct request *rq, struct 
iov_iter *iter,
return -ENOMEM;
bio_init(bio, NULL, bio->bi_inline_vecs, nr_vecs, req_op(rq));
 
+   if (blk_queue_pci_p2pdma(rq->q))
+   flags |= FOLL_PCI_P2PDMA;
+
while (iov_iter_count(iter)) {
struct page **pages;
ssize_t bytes;
size_t offs, added = 0;
int npages;
 
-   bytes = iov_iter_get_pages_alloc(iter, , LONG_MAX, );
+   bytes = iov_iter_get_pages_alloc_flags(iter, , LONG_MAX,
+  , flags);
if (unlikely(bytes <= 0)) {
ret = bytes ? bytes : -EFAULT;
goto out_unmap;
-- 
2.30.2



[PATCH v7 14/21] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages

2022-06-15 Thread Logan Gunthorpe
GUP callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
allow obtaining P2PDMA pages. If GUP is called without the flag and a
P2PDMA page is found, it will return an error.

FOLL_PCI_P2PDMA cannot be set if FOLL_LONGTERM is set.
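
As a hedged example (not taken from this series), a driver whose DMA
path can handle P2PDMA pages would opt in roughly like this; user_addr
and the page count are placeholders:

    struct page *pages[16];
    long pinned;

    /*
     * Without FOLL_PCI_P2PDMA, hitting a P2PDMA page would fail the
     * GUP call; FOLL_LONGTERM must not be combined with this flag.
     */
    pinned = pin_user_pages(user_addr, ARRAY_SIZE(pages),
                            FOLL_WRITE | FOLL_PCI_P2PDMA, pages, NULL);
    if (pinned < 0)
            return pinned;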

Signed-off-by: Logan Gunthorpe 
---
 include/linux/mm.h |  1 +
 mm/gup.c   | 22 +-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bc8f326be0ce..0bcb54ea503c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2941,6 +2941,7 @@ struct page *follow_page(struct vm_area_struct *vma, 
unsigned long address,
 #define FOLL_SPLIT_PMD 0x2 /* split huge pmd before returning */
 #define FOLL_PIN   0x4 /* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY 0x8 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_PCI_P2PDMA0x10 /* allow returning PCI P2PDMA pages */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index 551264407624..f15f01d06a09 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -564,6 +564,12 @@ static struct page *follow_page_pte(struct vm_area_struct 
*vma,
goto out;
}
 
+   if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
+is_pci_p2pdma_page(page))) {
+   page = ERR_PTR(-EREMOTEIO);
+   goto out;
+   }
+
VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
   !PageAnonExclusive(page), page);
 
@@ -994,6 +1000,9 @@ static int check_vma_flags(struct vm_area_struct *vma, 
unsigned long gup_flags)
if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
return -EOPNOTSUPP;
 
+   if ((gup_flags & FOLL_LONGTERM) && (gup_flags & FOLL_PCI_P2PDMA))
+   return -EOPNOTSUPP;
+
if (vma_is_secretmem(vma))
return -EFAULT;
 
@@ -2289,6 +2298,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
unsigned long end,
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);
 
+   if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
+is_pci_p2pdma_page(page)))
+   goto pte_unmap;
+
folio = try_grab_folio(page, 1, flags);
if (!folio)
goto pte_unmap;
@@ -2368,6 +2381,12 @@ static int __gup_device_huge(unsigned long pfn, unsigned 
long addr,
undo_dev_pagemap(nr, nr_start, flags, pages);
break;
}
+
+   if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
+   undo_dev_pagemap(nr, nr_start, flags, pages);
+   break;
+   }
+
SetPageReferenced(page);
pages[*nr] = page;
if (unlikely(!try_grab_page(page, flags))) {
@@ -2856,7 +2875,8 @@ static int internal_get_user_pages_fast(unsigned long 
start,
 
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
   FOLL_FORCE | FOLL_PIN | FOLL_GET |
-  FOLL_FAST_ONLY | FOLL_NOFAULT)))
+  FOLL_FAST_ONLY | FOLL_NOFAULT |
+  FOLL_PCI_P2PDMA)))
return -EINVAL;
 
if (gup_flags & FOLL_PIN)
-- 
2.30.2



[PATCH v7 13/21] PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()

2022-06-15 Thread Logan Gunthorpe
This interface is superseded by support in dma_map_sg() which now supports
heterogeneous scatterlists. There are no longer any users, so remove it.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Max Gurtovoy 
---
 drivers/pci/p2pdma.c   | 66 --
 include/linux/pci-p2pdma.h | 27 
 2 files changed, 93 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 2fc0f4750a2e..d4e635012ffe 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -885,72 +885,6 @@ enum pci_p2pdma_map_type pci_p2pdma_map_type(struct 
dev_pagemap *pgmap,
return type;
 }
 
-static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap,
-   struct device *dev, struct scatterlist *sg, int nents)
-{
-   struct scatterlist *s;
-   int i;
-
-   for_each_sg(sg, s, nents, i) {
-   s->dma_address = sg_phys(s) + p2p_pgmap->bus_offset;
-   sg_dma_len(s) = s->length;
-   }
-
-   return nents;
-}
-
-/**
- * pci_p2pdma_map_sg_attrs - map a PCI peer-to-peer scatterlist for DMA
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: elements in the scatterlist
- * @dir: DMA direction
- * @attrs: DMA attributes passed to dma_map_sg() (if called)
- *
- * Scatterlists mapped with this function should be unmapped using
- * pci_p2pdma_unmap_sg_attrs().
- *
- * Returns the number of SG entries mapped or 0 on error.
- */
-int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs)
-{
-   struct pci_p2pdma_pagemap *p2p_pgmap =
-   to_p2p_pgmap(sg_page(sg)->pgmap);
-
-   switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev)) {
-   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
-   return dma_map_sg_attrs(dev, sg, nents, dir, attrs);
-   case PCI_P2PDMA_MAP_BUS_ADDR:
-   return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
-   default:
-   /* Mapping is not Supported */
-   return 0;
-   }
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg_attrs);
-
-/**
- * pci_p2pdma_unmap_sg_attrs - unmap a PCI peer-to-peer scatterlist that was
- * mapped with pci_p2pdma_map_sg()
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: number of elements returned by pci_p2pdma_map_sg()
- * @dir: DMA direction
- * @attrs: DMA attributes passed to dma_unmap_sg() (if called)
- */
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs)
-{
-   enum pci_p2pdma_map_type map_type;
-
-   map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev);
-
-   if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
-   dma_unmap_sg_attrs(dev, sg, nents, dir, attrs);
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
-
 /**
  * pci_p2pdma_map_segment - map an sg segment determining the mapping type
  * @state: State structure that should be declared outside of the for_each_sg()
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 8318a97c9c61..2c07aa6b7665 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -30,10 +30,6 @@ struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev 
*pdev,
 unsigned int *nents, u32 length);
 void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
 void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
-int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs);
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs);
 int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
bool *use_p2pdma);
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
@@ -83,17 +79,6 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
 static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 {
 }
-static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
-   struct scatterlist *sg, int nents, enum dma_data_direction dir,
-   unsigned long attrs)
-{
-   return 0;
-}
-static inline void pci_p2pdma_unmap_sg_attrs(struct device *dev,
-   struct scatterlist *sg, int nents, enum dma_data_direction dir,
-   unsigned long attrs)
-{
-}
 static inline int pci_p2pdma_enable_store(const char *page,
struct pci_dev **p2p_dev, bool *use_p2pdma)
 {
@@ -119,16 +104,4 @@ static inline struct pci_dev *pci_p2pmem_find(struct 
device *client)
return pci_p2pmem_find_many(, 1);
 }
 
-static inline int pci_p2pdma_map_sg(

[PATCH v7 20/21] PCI/P2PDMA: Introduce pci_mmap_p2pmem()

2022-06-15 Thread Logan Gunthorpe
Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
a hunk of p2pmem into userspace.

Pages are allocated from the genalloc in bulk with their reference
count set to one. They are returned to the genalloc when the page is put,
via p2pdma_page_free() (the reference count is once again set to 1
in free_zone_device_page()).

The VMA does not take a reference to the pages when they are inserted
with vmf_insert_mixed() (which is necessary for zone device pages) so
the backing P2P memory is stored in a structure in vm_private_data.

A pseudo mount is used to allocate an inode for each PCI device. The
inode's address_space is used in the file doing the mmap so that all
VMAs are collected and can be unmapped if the PCI device is unbound.
After unmapping, the VMAs are iterated through and their pages are
put so the device can continue to be unbound. An active flag is used
to signal to VMAs not to allocate any further P2P memory once the
removal process starts. The flag is synchronized against concurrent
access using RCU.

The VMAs and inode will survive after the unbind of the device, but no
pages will be present in the VMA and a subsequent access will result
in a SIGBUS error.
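
As a rough sketch of the allocation-side synchronization described
above (a hypothetical fault-path fragment, not the actual patch code):

    /* Only hand out P2P pages while the device is still active. */
    rcu_read_lock();
    p2pdma = rcu_dereference(pdev->p2pdma);
    if (!p2pdma || !p2pdma->active) {
            rcu_read_unlock();
            return VM_FAULT_SIGBUS;
    }
    vaddr = gen_pool_alloc(p2pdma->pool, PAGE_SIZE);
    rcu_read_unlock();
    if (!vaddr)
            return VM_FAULT_SIGBUS;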

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
---
 drivers/pci/p2pdma.c   | 210 -
 include/linux/pci-p2pdma.h |  16 +++
 include/uapi/linux/magic.h |   1 +
 3 files changed, 225 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index d4e635012ffe..a6572069008b 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -17,14 +17,19 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
+#include 
 
 struct pci_p2pdma {
struct gen_pool *pool;
bool p2pmem_published;
struct xarray map_types;
+   struct inode *inode;
+   bool active;
 };
 
 struct pci_p2pdma_pagemap {
@@ -101,6 +106,41 @@ static const struct attribute_group p2pmem_group = {
.name = "p2pmem",
 };
 
+/*
+ * P2PDMA internal mount
+ * Fake an internal VFS mount-point in order to allocate struct address_space
+ * mappings to remove VMAs on unbind events.
+ */
+static int pci_p2pdma_fs_cnt;
+static struct vfsmount *pci_p2pdma_fs_mnt;
+
+static int pci_p2pdma_fs_init_fs_context(struct fs_context *fc)
+{
+   return init_pseudo(fc, P2PDMA_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type pci_p2pdma_fs_type = {
+   .name = "p2dma",
+   .owner = THIS_MODULE,
+   .init_fs_context = pci_p2pdma_fs_init_fs_context,
+   .kill_sb = kill_anon_super,
+};
+
+static void p2pdma_page_free(struct page *page)
+{
+   struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+   struct percpu_ref *ref;
+
+   gen_pool_free_owner(pgmap->provider->p2pdma->pool,
+   (uintptr_t)page_to_virt(page), PAGE_SIZE,
+   (void **));
+   percpu_ref_put(ref);
+}
+
+static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
+   .page_free = p2pdma_page_free,
+};
+
 static void pci_p2pdma_release(void *data)
 {
struct pci_dev *pdev = data;
@@ -117,6 +157,9 @@ static void pci_p2pdma_release(void *data)
gen_pool_destroy(p2pdma->pool);
sysfs_remove_group(>dev.kobj, _group);
xa_destroy(>map_types);
+
+   iput(p2pdma->inode);
+   simple_release_fs(_p2pdma_fs_mnt, _p2pdma_fs_cnt);
 }
 
 static int pci_p2pdma_setup(struct pci_dev *pdev)
@@ -134,17 +177,32 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
if (!p2p->pool)
goto out;
 
-   error = devm_add_action_or_reset(>dev, pci_p2pdma_release, pdev);
+   error = simple_pin_fs(_p2pdma_fs_type, _p2pdma_fs_mnt,
+ _p2pdma_fs_cnt);
if (error)
goto out_pool_destroy;
 
+   p2p->inode = alloc_anon_inode(pci_p2pdma_fs_mnt->mnt_sb);
+   if (IS_ERR(p2p->inode)) {
+   error = -ENOMEM;
+   goto out_unpin_fs;
+   }
+
+   error = devm_add_action_or_reset(>dev, pci_p2pdma_release, pdev);
+   if (error)
+   goto out_put_inode;
+
error = sysfs_create_group(>dev.kobj, _group);
if (error)
-   goto out_pool_destroy;
+   goto out_put_inode;
 
rcu_assign_pointer(pdev->p2pdma, p2p);
return 0;
 
+out_put_inode:
+   iput(p2p->inode);
+out_unpin_fs:
+   simple_release_fs(_p2pdma_fs_mnt, _p2pdma_fs_cnt);
 out_pool_destroy:
gen_pool_destroy(p2p->pool);
 out:
@@ -152,6 +210,18 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
return error;
 }
 
+static void pci_p2pdma_unmap_mappings(void *data)
+{
+   struct pci_dev *pdev = data;
+   struct pci_p2pdma *p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
+
+   /* Ensure no new pages can be all

[PATCH v7 02/21] PCI/P2PDMA: Attempt to set map_type if it has not been set

2022-06-15 Thread Logan Gunthorpe
Attempt to find the mapping type for P2PDMA pages on the first
DMA map attempt if it has not been done ahead of time.

Previously, the mapping type was expected to be calculated ahead of
time, but if pages are to come from userspace then there's no
way to ensure the path was checked ahead of time.

With this change, the mapping type is calculated on demand if it hasn't
been pre-calculated, so it is no longer invalid to call
pci_p2pdma_map_sg() before the mapping type is known; drop the WARN_ON
for that case.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/pci/p2pdma.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 462b429ad243..4e8bc457e29a 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -854,6 +854,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct 
dev_pagemap *pgmap,
struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
struct pci_dev *client;
struct pci_p2pdma *p2pdma;
+   int dist;
 
if (!provider->p2pdma)
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
@@ -870,6 +871,10 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct 
dev_pagemap *pgmap,
type = xa_to_value(xa_load(>map_types,
   map_types_idx(client)));
rcu_read_unlock();
+
+   if (type == PCI_P2PDMA_MAP_UNKNOWN)
+   return calc_map_type_and_dist(provider, client, , true);
+
return type;
 }
 
@@ -912,7 +917,7 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg,
case PCI_P2PDMA_MAP_BUS_ADDR:
return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
default:
-   WARN_ON_ONCE(1);
+   /* Mapping is not Supported */
return 0;
}
 }
-- 
2.30.2



[PATCH v7 00/21] Userspace P2PDMA with O_DIRECT NVMe devices

2022-06-15 Thread Logan Gunthorpe
, and that synchronize_rcu() was likely
too slow to use in the vma close operation.
  - Collected Acks and Reviews by Bjorn, Jason and Max.

--

Logan Gunthorpe (21):
  lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
  PCI/P2PDMA: Attempt to set map_type if it has not been set
  PCI/P2PDMA: Expose pci_p2pdma_map_type()
  PCI/P2PDMA: Introduce helpers for dma_map_sg implementations
  dma-mapping: allow EREMOTEIO return code for P2PDMA transfers
  dma-direct: support PCI P2PDMA pages in dma-direct map_sg
  dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support
  iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg
  nvme-pci: check DMA ops when indicating support for PCI P2PDMA
  nvme-pci: convert to using dma_map_sgtable()
  RDMA/core: introduce ib_dma_pci_p2p_dma_supported()
  RDMA/rw: drop pci_p2pdma_[un]map_sg()
  PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()
  mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  iov_iter: introduce iov_iter_get_pages_[alloc_]flags()
  block: add check when merging zone device pages
  lib/scatterlist: add check when merging zone device pages
  block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
  block: set FOLL_PCI_P2PDMA in bio_map_user_iov()
  PCI/P2PDMA: Introduce pci_mmap_p2pmem()
  nvme-pci: allow mmaping the CMB in userspace

 block/bio.c  |  10 +-
 block/blk-map.c  |   7 +-
 drivers/infiniband/core/rw.c |  45 +
 drivers/iommu/dma-iommu.c|  68 ++-
 drivers/nvme/host/core.c |  38 +++-
 drivers/nvme/host/nvme.h |   5 +-
 drivers/nvme/host/pci.c  | 103 ++-
 drivers/nvme/target/rdma.c   |   2 +-
 drivers/pci/Kconfig  |   5 +
 drivers/pci/p2pdma.c | 337 ---
 include/linux/dma-map-ops.h  |  76 
 include/linux/dma-mapping.h  |   5 +
 include/linux/mm.h   |  24 +++
 include/linux/pci-p2pdma.h   |  43 ++---
 include/linux/scatterlist.h  |  44 -
 include/linux/uio.h  |   6 +
 include/rdma/ib_verbs.h  |  11 ++
 include/uapi/linux/magic.h   |   1 +
 kernel/dma/direct.c  |  43 -
 kernel/dma/direct.h  |   8 +-
 kernel/dma/mapping.c |  22 ++-
 lib/iov_iter.c   |  25 ++-
 lib/scatterlist.c|  25 +--
 mm/gup.c |  22 ++-
 24 files changed, 765 insertions(+), 210 deletions(-)


base-commit: f2906aa863381afb0015a9eb7fefad885d4e5a56
--
2.30.2


[PATCH v7 07/21] dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support

2022-06-15 Thread Logan Gunthorpe
Add a flags member to the dma_map_ops structure with one flag to
indicate support for PCI P2PDMA.

Also, add a helper to check if a device supports PCI P2PDMA.
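
For illustration (a sketch; example_dma_ops, example_map_sg and pdev
are hypothetical names), a DMA backend advertises the capability and a
driver then gates P2PDMA use on the helper:

    /* A backend that can handle P2PDMA pages in map_sg/unmap_sg: */
    static const struct dma_map_ops example_dma_ops = {
            .flags  = DMA_F_PCI_P2PDMA_SUPPORTED,
            .map_sg = example_map_sg,
            /* ... */
    };

    /* In a driver that wants to use P2PDMA memory: */
    if (!dma_pci_p2pdma_supported(&pdev->dev))
            return -EOPNOTSUPP;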

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
---
 include/linux/dma-map-ops.h | 10 ++
 include/linux/dma-mapping.h |  5 +
 kernel/dma/mapping.c| 18 ++
 3 files changed, 33 insertions(+)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 752f91e5eb5d..4d4161d58ce0 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -11,7 +11,17 @@
 
 struct cma;
 
+/*
+ * Values for struct dma_map_ops.flags:
+ *
+ * DMA_F_PCI_P2PDMA_SUPPORTED: Indicates the dma_map_ops implementation can
+ * handle PCI P2PDMA pages in the map_sg/unmap_sg operation.
+ */
+#define DMA_F_PCI_P2PDMA_SUPPORTED (1 << 0)
+
 struct dma_map_ops {
+   unsigned int flags;
+
void *(*alloc)(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t gfp,
unsigned long attrs);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..f7c61b2b4b5e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -140,6 +140,7 @@ int dma_mmap_attrs(struct device *dev, struct 
vm_area_struct *vma,
unsigned long attrs);
 bool dma_can_mmap(struct device *dev);
 int dma_supported(struct device *dev, u64 mask);
+bool dma_pci_p2pdma_supported(struct device *dev);
 int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
@@ -250,6 +251,10 @@ static inline int dma_supported(struct device *dev, u64 
mask)
 {
return 0;
 }
+static inline bool dma_pci_p2pdma_supported(struct device *dev)
+{
+   return false;
+}
 static inline int dma_set_mask(struct device *dev, u64 mask)
 {
return -EIO;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 9f65d1041638..21793506fdb6 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -722,6 +722,24 @@ int dma_supported(struct device *dev, u64 mask)
 }
 EXPORT_SYMBOL(dma_supported);
 
+bool dma_pci_p2pdma_supported(struct device *dev)
+{
+   const struct dma_map_ops *ops = get_dma_ops(dev);
+
+   /* if ops is not set, dma direct will be used which supports P2PDMA */
+   if (!ops)
+   return true;
+
+   /*
+* Note: dma_ops_bypass is not checked here because P2PDMA should
+* not be used with dma mapping ops that do not have support even
+* if the specific device is bypassing them.
+*/
+
+   return ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;
+}
+EXPORT_SYMBOL_GPL(dma_pci_p2pdma_supported);
+
 #ifdef CONFIG_ARCH_HAS_DMA_SET_MASK
 void arch_dma_set_mask(struct device *dev, u64 mask);
 #else
-- 
2.30.2



[PATCH v7 11/21] RDMA/core: introduce ib_dma_pci_p2p_dma_supported()

2022-06-15 Thread Logan Gunthorpe
Introduce the helper function ib_dma_pci_p2p_dma_supported() to check
if a given ib_device can be used in P2PDMA transfers. This ensures
the ib_device is not using virt_dma and also that the underlying
dma_device supports P2PDMA.

Use the new helper in nvme-rdma to replace the existing check for
ib_uses_virt_dma(). Adding the dma_pci_p2pdma_supported() check allows
switching away from pci_p2pdma_[un]map_sg().

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Max Gurtovoy 
---
 drivers/nvme/target/rdma.c |  2 +-
 include/rdma/ib_verbs.h| 11 +++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 09fdcac87d17..4597bca43a6d 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -415,7 +415,7 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device 
*ndev,
if (ib_dma_mapping_error(ndev->device, r->send_sge.addr))
goto out_free_rsp;
 
-   if (!ib_uses_virt_dma(ndev->device))
+   if (ib_dma_pci_p2p_dma_supported(ndev->device))
r->req.p2p_client = >device->dev;
r->send_sge.length = sizeof(*r->req.cqe);
r->send_sge.lkey = ndev->pd->local_dma_lkey;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9c6317cf80d5..523843d9ed6c 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -4013,6 +4013,17 @@ static inline bool ib_uses_virt_dma(struct ib_device 
*dev)
return IS_ENABLED(CONFIG_INFINIBAND_VIRT_DMA) && !dev->dma_device;
 }
 
+/*
+ * Check if a IB device's underlying DMA mapping supports P2PDMA transfers.
+ */
+static inline bool ib_dma_pci_p2p_dma_supported(struct ib_device *dev)
+{
+   if (ib_uses_virt_dma(dev))
+   return false;
+
+   return dma_pci_p2pdma_supported(dev->dma_device);
+}
+
 /**
  * ib_dma_mapping_error - check a DMA addr for error
  * @dev: The device for which the dma_addr was created
-- 
2.30.2



[PATCH v7 21/21] nvme-pci: allow mmaping the CMB in userspace

2022-06-15 Thread Logan Gunthorpe
Allow userspace to obtain CMB memory by mmaping the controller's
char device. The mmap call allocates and returns a hunk of CMB memory
(the offset is ignored), so userspace does not have control over the
address within the CMB.

A VMA allocated in this way will only be usable by drivers that set
FOLL_PCI_P2PDMA when calling GUP. And inter-device support will be
checked the first time the pages are mapped for DMA.

Currently this is only supported by O_DIRECT to a PCI NVMe device
or through the NVMe passthrough IOCTL.
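
From userspace the flow looks roughly like the following sketch (the
device paths and sizes are assumptions, not part of this patch):

    int cmb_fd = open("/dev/nvme0", O_RDWR);
    /* Allocates a hunk of CMB; the mmap offset is ignored. */
    void *cmb = mmap(NULL, 2 * 1024 * 1024, PROT_READ | PROT_WRITE,
                     MAP_SHARED, cmb_fd, 0);

    int blk_fd = open("/dev/nvme0n1", O_RDWR | O_DIRECT);
    /* O_DIRECT I/O to/from the P2P buffer exercises the P2PDMA path. */
    ssize_t ret = pread(blk_fd, cmb, 4096, 0);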

Signed-off-by: Logan Gunthorpe 
---
 drivers/nvme/host/core.c | 35 +++
 drivers/nvme/host/nvme.h |  3 +++
 drivers/nvme/host/pci.c  | 23 +++
 3 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d6e76f2dc293..23fe4b544bf1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3166,6 +3166,7 @@ static int nvme_dev_open(struct inode *inode, struct file 
*file)
 {
struct nvme_ctrl *ctrl =
container_of(inode->i_cdev, struct nvme_ctrl, cdev);
+   int ret = -EINVAL;
 
switch (ctrl->state) {
case NVME_CTRL_LIVE:
@@ -3175,13 +3176,25 @@ static int nvme_dev_open(struct inode *inode, struct 
file *file)
}
 
nvme_get_ctrl(ctrl);
-   if (!try_module_get(ctrl->ops->module)) {
-   nvme_put_ctrl(ctrl);
-   return -EINVAL;
-   }
+   if (!try_module_get(ctrl->ops->module))
+   goto err_put_ctrl;
 
file->private_data = ctrl;
+
+   if (ctrl->ops->cdev_file_open) {
+   ret = ctrl->ops->cdev_file_open(ctrl, file);
+   if (ret)
+   goto err_put_mod;
+   }
+
return 0;
+
+err_put_mod:
+   module_put(ctrl->ops->module);
+err_put_ctrl:
+   nvme_put_ctrl(ctrl);
+   return ret;
+
 }
 
 static int nvme_dev_release(struct inode *inode, struct file *file)
@@ -3189,11 +3202,24 @@ static int nvme_dev_release(struct inode *inode, struct 
file *file)
struct nvme_ctrl *ctrl =
container_of(inode->i_cdev, struct nvme_ctrl, cdev);
 
+   if (ctrl->ops->cdev_file_release)
+   ctrl->ops->cdev_file_release(file);
+
module_put(ctrl->ops->module);
nvme_put_ctrl(ctrl);
return 0;
 }
 
+static int nvme_dev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+   struct nvme_ctrl *ctrl = file->private_data;
+
+   if (!ctrl->ops->mmap_cmb)
+   return -ENODEV;
+
+   return ctrl->ops->mmap_cmb(ctrl, vma);
+}
+
 static const struct file_operations nvme_dev_fops = {
.owner  = THIS_MODULE,
.open   = nvme_dev_open,
@@ -3201,6 +3227,7 @@ static const struct file_operations nvme_dev_fops = {
.unlocked_ioctl = nvme_dev_ioctl,
.compat_ioctl   = compat_ptr_ioctl,
.uring_cmd  = nvme_dev_uring_cmd,
+   .mmap   = nvme_dev_mmap,
 };
 
 static ssize_t nvme_sysfs_reset(struct device *dev,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 957f79420cf3..44ff05d8e24d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -503,6 +503,9 @@ struct nvme_ctrl_ops {
void (*delete_ctrl)(struct nvme_ctrl *ctrl);
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
+   int (*cdev_file_open)(struct nvme_ctrl *ctrl, struct file *file);
+   void (*cdev_file_release)(struct file *file);
+   int (*mmap_cmb)(struct nvme_ctrl *ctrl, struct vm_area_struct *vma);
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 52b52a7efa9a..8ef3752b7ddb 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2972,6 +2972,26 @@ static bool nvme_pci_supports_pci_p2pdma(struct 
nvme_ctrl *ctrl)
return dma_pci_p2pdma_supported(dev->dev);
 }
 
+static int nvme_pci_cdev_file_open(struct nvme_ctrl *ctrl, struct file *file)
+{
+   struct pci_dev *pdev = to_pci_dev(to_nvme_dev(ctrl)->dev);
+
+   return pci_p2pdma_file_open(pdev, file);
+}
+
+static void nvme_pci_cdev_file_release(struct file *file)
+{
+   pci_p2pdma_file_release(file);
+}
+
+static int nvme_pci_mmap_cmb(struct nvme_ctrl *ctrl,
+struct vm_area_struct *vma)
+{
+   struct pci_dev *pdev = to_pci_dev(to_nvme_dev(ctrl)->dev);
+
+   return pci_mmap_p2pmem(pdev, vma);
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name   = "pcie",
.module = THIS_MODULE,
@@ -2983,6 +3003,9 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.submit_async_event = nvme_pci_submit_async_event,
.get_address= nvme_pci_get_ad

[PATCH v7 18/21] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()

2022-06-15 Thread Logan Gunthorpe
When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be passed
from userspace and enables the O_DIRECT path in iomap based filesystems
and direct to block devices.

Signed-off-by: Logan Gunthorpe 
---
 block/bio.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index a402a4760457..0d152da8938d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1180,6 +1180,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, 
struct iov_iter *iter)
struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
struct page **pages = (struct page **)bv;
bool same_page = false;
+   unsigned int flags = 0;
ssize_t size, left;
unsigned len, i;
size_t offset;
@@ -1192,7 +1193,12 @@ static int __bio_iov_iter_get_pages(struct bio *bio, 
struct iov_iter *iter)
BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
-   size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, );
+   if (bio->bi_bdev && bio->bi_bdev->bd_disk &&
+   blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
+   flags |= FOLL_PCI_P2PDMA;
+
+   size = iov_iter_get_pages_flags(iter, pages, LONG_MAX, nr_pages,
+   , flags);
if (unlikely(size <= 0))
return size ? size : -EFAULT;
 
-- 
2.30.2



[PATCH v7 12/21] RDMA/rw: drop pci_p2pdma_[un]map_sg()

2022-06-15 Thread Logan Gunthorpe
dma_map_sg() now supports the use of P2PDMA pages so pci_p2pdma_map_sg()
is no longer necessary and may be dropped. This means the
rdma_rw_[un]map_sg() helpers are no longer necessary. Remove it all.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
---
 drivers/infiniband/core/rw.c | 45 
 1 file changed, 9 insertions(+), 36 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 4d98f931a13d..8367974b7998 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -274,33 +274,6 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
return 1;
 }
 
-static void rdma_rw_unmap_sg(struct ib_device *dev, struct scatterlist *sg,
-u32 sg_cnt, enum dma_data_direction dir)
-{
-   if (is_pci_p2pdma_page(sg_page(sg)))
-   pci_p2pdma_unmap_sg(dev->dma_device, sg, sg_cnt, dir);
-   else
-   ib_dma_unmap_sg(dev, sg, sg_cnt, dir);
-}
-
-static int rdma_rw_map_sgtable(struct ib_device *dev, struct sg_table *sgt,
-  enum dma_data_direction dir)
-{
-   int nents;
-
-   if (is_pci_p2pdma_page(sg_page(sgt->sgl))) {
-   if (WARN_ON_ONCE(ib_uses_virt_dma(dev)))
-   return 0;
-   nents = pci_p2pdma_map_sg(dev->dma_device, sgt->sgl,
- sgt->orig_nents, dir);
-   if (!nents)
-   return -EIO;
-   sgt->nents = nents;
-   return 0;
-   }
-   return ib_dma_map_sgtable_attrs(dev, sgt, dir, 0);
-}
-
 /**
  * rdma_rw_ctx_init - initialize a RDMA READ/WRITE context
  * @ctx:   context to initialize
@@ -327,7 +300,7 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp 
*qp, u32 port_num,
};
int ret;
 
-   ret = rdma_rw_map_sgtable(dev, , dir);
+   ret = ib_dma_map_sgtable_attrs(dev, , dir, 0);
if (ret)
return ret;
sg_cnt = sgt.nents;
@@ -366,7 +339,7 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp 
*qp, u32 port_num,
return ret;
 
 out_unmap_sg:
-   rdma_rw_unmap_sg(dev, sgt.sgl, sgt.orig_nents, dir);
+   ib_dma_unmap_sgtable_attrs(dev, , dir, 0);
return ret;
 }
 EXPORT_SYMBOL(rdma_rw_ctx_init);
@@ -414,12 +387,12 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
return -EINVAL;
}
 
-   ret = rdma_rw_map_sgtable(dev, , dir);
+   ret = ib_dma_map_sgtable_attrs(dev, , dir, 0);
if (ret)
return ret;
 
if (prot_sg_cnt) {
-   ret = rdma_rw_map_sgtable(dev, _sgt, dir);
+   ret = ib_dma_map_sgtable_attrs(dev, _sgt, dir, 0);
if (ret)
goto out_unmap_sg;
}
@@ -486,9 +459,9 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
kfree(ctx->reg);
 out_unmap_prot_sg:
if (prot_sgt.nents)
-   rdma_rw_unmap_sg(dev, prot_sgt.sgl, prot_sgt.orig_nents, dir);
+   ib_dma_unmap_sgtable_attrs(dev, _sgt, dir, 0);
 out_unmap_sg:
-   rdma_rw_unmap_sg(dev, sgt.sgl, sgt.orig_nents, dir);
+   ib_dma_unmap_sgtable_attrs(dev, , dir, 0);
return ret;
 }
 EXPORT_SYMBOL(rdma_rw_ctx_signature_init);
@@ -621,7 +594,7 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct 
ib_qp *qp,
break;
}
 
-   rdma_rw_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+   ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy);
 
@@ -649,8 +622,8 @@ void rdma_rw_ctx_destroy_signature(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
kfree(ctx->reg);
 
if (prot_sg_cnt)
-   rdma_rw_unmap_sg(qp->pd->device, prot_sg, prot_sg_cnt, dir);
-   rdma_rw_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+   ib_dma_unmap_sg(qp->pd->device, prot_sg, prot_sg_cnt, dir);
+   ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy_signature);
 
-- 
2.30.2



Re: [PATCH v6 20/21] PCI/P2PDMA: Introduce pci_mmap_p2pmem()

2022-06-02 Thread Logan Gunthorpe



On 2022-06-02 11:28, Jason Gunthorpe wrote:
> On Thu, Jun 02, 2022 at 10:49:15AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2022-06-02 10:30, Jason Gunthorpe wrote:
>>> On Thu, Jun 02, 2022 at 10:16:10AM -0600, Logan Gunthorpe wrote:
>>>
>>>>> Just stuff the pages into the mmap, and your driver unprobe will
>>>>> automatically block until all the mmaps are closed - no different than
>>>>> having an open file descriptor or something.
>>>>
>>>> Oh is that what we want?
>>>
>>> Yes, it is the typical case - eg if you have a sysfs file open unbind
>>> hangs indefinitely. Many drivers can't unbind while they have open file
>>> descriptors/etc.
>>>
>>> A couple drivers go out of their way to allow unbinding while a live
>>> userspace exists but this can get complicated. Usually there should be
>>> a good reason.
>>>
>>> The module will already be refcounted anyhow because the mmap points
>>> to a char file which holds a module reference - meaning a simple rmmod
>>> of the driver shouldn't work already..
>>
>> Also, I just tried it... If I open a sysfs file for an nvme device (ie.
>> /sys/class/nvme/nvme4/cntlid) and unbind the device, it does not block.
>> A subsequent read on that file descriptor returns ENODEV. Which is what
>> I would have expected.
> 
> Oh interesting, this has been changed since years ago when I last
> looked, the kernfs_get_active() is now more narrowed than it once
> was. So maybe sysfs isn't the same concern it used to be!

Yeah, so I really think *not* blocking unbind indefinitely is the better
approach here. It's what has always been done with device dax, etc.
mmaps in userspace processes get unmapped and will fault with SIGBUS on
next access and unbind will actually unbind the device relatively
promptly. Userspace processes can fail or try to handle the device going
away gracefully.

Logan


Re: [PATCH v6 20/21] PCI/P2PDMA: Introduce pci_mmap_p2pmem()

2022-06-02 Thread Logan Gunthorpe



On 2022-06-02 10:30, Jason Gunthorpe wrote:
> On Thu, Jun 02, 2022 at 10:16:10AM -0600, Logan Gunthorpe wrote:
> 
>>> Just stuff the pages into the mmap, and your driver unprobe will
>>> automatically block until all the mmaps are closed - no different than
>>> having an open file descriptor or something.
>>
>> Oh is that what we want?
> 
> Yes, it is the typical case - eg if you have a sysfs file open unbind
> hangs indefinitely. Many drivers can't unbind while they have open file
> descriptors/etc.
> 
> A couple drivers go out of their way to allow unbinding while a live
> userspace exists but this can get complicated. Usually there should be
> a good reason.
> 
> The module will already be refcounted anyhow because the mmap points
> to a char file which holds a module reference - meaning a simple rmmod
> of the driver shouldn't work already..

Also, I just tried it... If I open a sysfs file for an nvme device (ie.
/sys/class/nvme/nvme4/cntlid) and unbind the device, it does not block.
A subsequent read on that file descriptor returns ENODEV. Which is what
I would have expected.

Logan


Re: [PATCH v6 20/21] PCI/P2PDMA: Introduce pci_mmap_p2pmem()

2022-06-02 Thread Logan Gunthorpe




On 2022-06-02 10:30, Jason Gunthorpe wrote:
> On Thu, Jun 02, 2022 at 10:16:10AM -0600, Logan Gunthorpe wrote:
> 
>>> Just stuff the pages into the mmap, and your driver unprobe will
>>> automatically block until all the mmaps are closed - no different than
>>> having an open file descriptor or something.
>>
>> Oh is that what we want?
> 
> Yes, it is the typical case - eg if you have a sysfs file open unbind
> hangs indefinitely. Many drivers can't unbind while they have open file
> descriptors/etc.
> 
> A couple drivers go out of their way to allow unbinding while a live
> userspace exists but this can get complicated. Usually there should be
> a good reason.

This is not my experience. All the drivers I've worked with do not block
unbind with open file descriptors (at least for char devices). I know,
for example, that having a file descriptor open of /dev/nvmeX does not
cause unbinding to block. I figured this was the expectation as the
userspace process doing the unbind won't be able to be interrupted
seeing there's no way to fail on that path. Though, it certainly would
make things a lot easier if the unbind can block indefinitely as it
usually requires some complicated locking.

Do you have an example of this? What mechanisms are developers using to
block unbind with open file descriptors?

Logan


Re: [PATCH v6 20/21] PCI/P2PDMA: Introduce pci_mmap_p2pmem()

2022-06-02 Thread Logan Gunthorpe



On 2022-06-01 18:00, Jason Gunthorpe wrote:
> On Fri, May 27, 2022 at 04:41:08PM -0600, Logan Gunthorpe wrote:
>>>
>>> IIRC this is the last part:
>>>
>>> https://lore.kernel.org/linux-mm/20220524190632.3304-1-alex.sie...@amd.com/
>>>
>>> And the earlier bit with Christoph's pieces looks like it might get
>>> merged to v5.19..
>>>
>>> The general idea is once pte_devmap is not set then all the
>>> refcounting works the way it should. This is what all new ZONE_DEVICE
>>> users should do..
>>
>> Ok, I don't actually follow how those patches relate to this.
>>
>> Based on your description I guess I don't need to set PFN_DEV and
> 
> Yes
> 
>> perhaps not use vmf_insert_mixed()? And then just use vm_normal_page()?
> 
> I'm not sure ATM the best function to use, but yes, a function that
> doesn't set PFN_DEV is needed here.
>  
>> But the refcounting of the pages seemed like it was already sane to me,
>> unless you mean that the code no longer has to synchronize_rcu() before
>> returning the pages... 
> 
> Right. It also doesn't need to call unmap range or keep track of the
> inode, or do any of that stuff unless it really needs mmap revokation
> semantics (which I doubt this use case does)
> 
> unmap range was only necessary because the refcounting is wrong -
> since the pte's don't hold a ref on the page in PFN_DEV mode it is
> necessary to wipe all the PTE explicitly before going ahead to
> decrement the refcount on this path.
> 
> Just stuff the pages into the mmap, and your driver unprobe will
> automatically block until all the mmaps are closed - no different than
> having an open file descriptor or something.

Oh is that what we want? With the current method the mmaps are unmapped
on unbind so that it doesn't block indefinitely. It seems more typical
for resources to be dropped quickly on unbind and processes that are
using them will get an error on next use.

Logan


Re: [PATCH v6 20/21] PCI/P2PDMA: Introduce pci_mmap_p2pmem()

2022-05-27 Thread Logan Gunthorpe



On 2022-05-27 13:03, Jason Gunthorpe wrote:
> On Fri, May 27, 2022 at 09:35:07AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2022-05-27 06:55, Jason Gunthorpe wrote:
>>> On Thu, Apr 07, 2022 at 09:47:16AM -0600, Logan Gunthorpe wrote:
>>>> +static void pci_p2pdma_unmap_mappings(void *data)
>>>> +{
>>>> +  struct pci_dev *pdev = data;
>>>> +  struct pci_p2pdma *p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
>>>> +
>>>> +  /* Ensure no new pages can be allocated in mappings */
>>>> +  p2pdma->active = false;
>>>> +  synchronize_rcu();
>>>> +
>>>> +  unmap_mapping_range(p2pdma->inode->i_mapping, 0, 0, 1);
>>>> +
>>>> +  /*
>>>> +   * On some architectures, TLB flushes are done with call_rcu()
>>>> +   * so to ensure GUP fast is done with the pages, call synchronize_rcu()
>>>> +   * before freeing them.
>>>> +   */
>>>> +  synchronize_rcu();
>>>> +  pci_p2pdma_free_mappings(p2pdma->inode->i_mapping);
>>>
>>> With the series from Felix getting close this should get updated to
>>> not set pte_devmap and use proper natural refcounting without any of
>>> this stuff.
>>
>> Can you send a link? I'm not sure what you are referring to.
> 
> IIRC this is the last part:
> 
> https://lore.kernel.org/linux-mm/20220524190632.3304-1-alex.sie...@amd.com/
> 
> And the earlier bit with Christoph's pieces looks like it might get
> merged to v5.19..
> 
> The general idea is once pte_devmap is not set then all the
> refcounting works the way it should. This is what all new ZONE_DEVICE
> users should do..

Ok, I don't actually follow how those patches relate to this.

Based on your description I guess I don't need to set PFN_DEV and
perhaps not use vmf_insert_mixed()? And then just use vm_normal_page()?

But the refcounting of the pages seemed like it was already sane to me,
unless you mean that the code no longer has to synchronize_rcu() before
returning the pages... that would be spectacular and clean things up a
lot (plus fix an annoying issue where, if you use and then free all the
memory, you can't allocate new memory for an indeterminate amount of time).

Logan


Re: [PATCH v6 20/21] PCI/P2PDMA: Introduce pci_mmap_p2pmem()

2022-05-27 Thread Logan Gunthorpe



On 2022-05-27 06:55, Jason Gunthorpe wrote:
> On Thu, Apr 07, 2022 at 09:47:16AM -0600, Logan Gunthorpe wrote:
>> +static void pci_p2pdma_unmap_mappings(void *data)
>> +{
>> +struct pci_dev *pdev = data;
>> +struct pci_p2pdma *p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
>> +
>> +/* Ensure no new pages can be allocated in mappings */
>> +p2pdma->active = false;
>> +synchronize_rcu();
>> +
>> +unmap_mapping_range(p2pdma->inode->i_mapping, 0, 0, 1);
>> +
>> +/*
>> + * On some architectures, TLB flushes are done with call_rcu()
>> + * so to ensure GUP fast is done with the pages, call synchronize_rcu()
>> + * before freeing them.
>> + */
>> +synchronize_rcu();
>> +pci_p2pdma_free_mappings(p2pdma->inode->i_mapping);
> 
> With the series from Felix getting close this should get updated to
> not set pte_devmap and use proper natural refcounting without any of
> this stuff.

Can you send a link? I'm not sure what you are referring to.

Thanks,

Logan


Re: [PATCH v6 00/21] Userspace P2PDMA with O_DIRECT NVMe devices

2022-05-16 Thread Logan Gunthorpe



On 2022-05-16 16:31, Chaitanya Kulkarni wrote:
> Do you have any plans to re-spin this ?

I didn't get any feedback this cycle, so there haven't been any changes.
I'll probably do a rebase and resend after the merge window.

Logan


[PATCH v6 21/21] nvme-pci: allow mmaping the CMB in userspace

2022-04-07 Thread Logan Gunthorpe
Allow userspace to obtain CMB memory by mmaping the controller's
char device. The mmap call allocates and returns a hunk of CMB memory
(the offset is ignored), so userspace does not have control over the
address within the CMB.

A VMA allocated in this way will only be usable by drivers that set
FOLL_PCI_P2PDMA when calling GUP. And inter-device support will be
checked the first time the pages are mapped for DMA.

Currently this is only supported by O_DIRECT to a PCI NVMe device
or through the NVMe passthrough IOCTL.

Signed-off-by: Logan Gunthorpe 
---
 drivers/nvme/host/core.c | 15 +++
 drivers/nvme/host/nvme.h |  2 ++
 drivers/nvme/host/pci.c  | 17 +
 3 files changed, 34 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index bbc276dda49f..1fd3372c2c18 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3114,6 +3114,10 @@ static int nvme_dev_open(struct inode *inode, struct 
file *file)
}
 
file->private_data = ctrl;
+
+   if (ctrl->ops->cdev_file_open)
+   ctrl->ops->cdev_file_open(ctrl, file);
+
return 0;
 }
 
@@ -3127,12 +3131,23 @@ static int nvme_dev_release(struct inode *inode, struct 
file *file)
return 0;
 }
 
+static int nvme_dev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+   struct nvme_ctrl *ctrl = file->private_data;
+
+   if (!ctrl->ops->mmap_cmb)
+   return -ENODEV;
+
+   return ctrl->ops->mmap_cmb(ctrl, vma);
+}
+
 static const struct file_operations nvme_dev_fops = {
.owner  = THIS_MODULE,
.open   = nvme_dev_open,
.release= nvme_dev_release,
.unlocked_ioctl = nvme_dev_ioctl,
.compat_ioctl   = compat_ptr_ioctl,
+   .mmap   = nvme_dev_mmap,
 };
 
 static ssize_t nvme_sysfs_reset(struct device *dev,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 7d97bfb2a9e2..24fbcd274c64 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -497,6 +497,8 @@ struct nvme_ctrl_ops {
void (*delete_ctrl)(struct nvme_ctrl *ctrl);
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
+   void (*cdev_file_open)(struct nvme_ctrl *ctrl, struct file *file);
+   int (*mmap_cmb)(struct nvme_ctrl *ctrl, struct vm_area_struct *vma);
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 07412116d4d1..5946244e0295 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2965,6 +2965,21 @@ static bool nvme_pci_supports_pci_p2pdma(struct 
nvme_ctrl *ctrl)
return dma_pci_p2pdma_supported(dev->dev);
 }
 
+static void nvme_pci_cdev_file_open(struct nvme_ctrl *ctrl, struct file *file)
+{
+   struct pci_dev *pdev = to_pci_dev(to_nvme_dev(ctrl)->dev);
+
+   pci_p2pdma_file_open(pdev, file);
+}
+
+static int nvme_pci_mmap_cmb(struct nvme_ctrl *ctrl,
+struct vm_area_struct *vma)
+{
+   struct pci_dev *pdev = to_pci_dev(to_nvme_dev(ctrl)->dev);
+
+   return pci_mmap_p2pmem(pdev, vma);
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name   = "pcie",
.module = THIS_MODULE,
@@ -2976,6 +2991,8 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.submit_async_event = nvme_pci_submit_async_event,
.get_address= nvme_pci_get_address,
.supports_pci_p2pdma= nvme_pci_supports_pci_p2pdma,
+   .cdev_file_open = nvme_pci_cdev_file_open,
+   .mmap_cmb   = nvme_pci_mmap_cmb,
 };
 
 static int nvme_dev_map(struct nvme_dev *dev)
-- 
2.30.2



[PATCH v6 12/21] RDMA/rw: drop pci_p2pdma_[un]map_sg()

2022-04-07 Thread Logan Gunthorpe
dma_map_sg() now supports the use of P2PDMA pages so pci_p2pdma_map_sg()
is no longer necessary and may be dropped. This means the
rdma_rw_[un]map_sg() helpers are no longer necessary. Remove it all.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
---
 drivers/infiniband/core/rw.c | 45 
 1 file changed, 9 insertions(+), 36 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 4d98f931a13d..8367974b7998 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -274,33 +274,6 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
return 1;
 }
 
-static void rdma_rw_unmap_sg(struct ib_device *dev, struct scatterlist *sg,
-u32 sg_cnt, enum dma_data_direction dir)
-{
-   if (is_pci_p2pdma_page(sg_page(sg)))
-   pci_p2pdma_unmap_sg(dev->dma_device, sg, sg_cnt, dir);
-   else
-   ib_dma_unmap_sg(dev, sg, sg_cnt, dir);
-}
-
-static int rdma_rw_map_sgtable(struct ib_device *dev, struct sg_table *sgt,
-  enum dma_data_direction dir)
-{
-   int nents;
-
-   if (is_pci_p2pdma_page(sg_page(sgt->sgl))) {
-   if (WARN_ON_ONCE(ib_uses_virt_dma(dev)))
-   return 0;
-   nents = pci_p2pdma_map_sg(dev->dma_device, sgt->sgl,
- sgt->orig_nents, dir);
-   if (!nents)
-   return -EIO;
-   sgt->nents = nents;
-   return 0;
-   }
-   return ib_dma_map_sgtable_attrs(dev, sgt, dir, 0);
-}
-
 /**
  * rdma_rw_ctx_init - initialize a RDMA READ/WRITE context
  * @ctx:   context to initialize
@@ -327,7 +300,7 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp 
*qp, u32 port_num,
};
int ret;
 
-   ret = rdma_rw_map_sgtable(dev, , dir);
+   ret = ib_dma_map_sgtable_attrs(dev, , dir, 0);
if (ret)
return ret;
sg_cnt = sgt.nents;
@@ -366,7 +339,7 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp 
*qp, u32 port_num,
return ret;
 
 out_unmap_sg:
-   rdma_rw_unmap_sg(dev, sgt.sgl, sgt.orig_nents, dir);
+   ib_dma_unmap_sgtable_attrs(dev, , dir, 0);
return ret;
 }
 EXPORT_SYMBOL(rdma_rw_ctx_init);
@@ -414,12 +387,12 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
return -EINVAL;
}
 
-   ret = rdma_rw_map_sgtable(dev, , dir);
+   ret = ib_dma_map_sgtable_attrs(dev, , dir, 0);
if (ret)
return ret;
 
if (prot_sg_cnt) {
-   ret = rdma_rw_map_sgtable(dev, _sgt, dir);
+   ret = ib_dma_map_sgtable_attrs(dev, _sgt, dir, 0);
if (ret)
goto out_unmap_sg;
}
@@ -486,9 +459,9 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
kfree(ctx->reg);
 out_unmap_prot_sg:
if (prot_sgt.nents)
-   rdma_rw_unmap_sg(dev, prot_sgt.sgl, prot_sgt.orig_nents, dir);
+   ib_dma_unmap_sgtable_attrs(dev, _sgt, dir, 0);
 out_unmap_sg:
-   rdma_rw_unmap_sg(dev, sgt.sgl, sgt.orig_nents, dir);
+   ib_dma_unmap_sgtable_attrs(dev, , dir, 0);
return ret;
 }
 EXPORT_SYMBOL(rdma_rw_ctx_signature_init);
@@ -621,7 +594,7 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct 
ib_qp *qp,
break;
}
 
-   rdma_rw_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+   ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy);
 
@@ -649,8 +622,8 @@ void rdma_rw_ctx_destroy_signature(struct rdma_rw_ctx *ctx, 
struct ib_qp *qp,
kfree(ctx->reg);
 
if (prot_sg_cnt)
-   rdma_rw_unmap_sg(qp->pd->device, prot_sg, prot_sg_cnt, dir);
-   rdma_rw_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+   ib_dma_unmap_sg(qp->pd->device, prot_sg, prot_sg_cnt, dir);
+   ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy_signature);
 
-- 
2.30.2



[PATCH v6 16/21] block: add check when merging zone device pages

2022-04-07 Thread Logan Gunthorpe
Consecutive zone device pages should not be merged into the same sgl
or bvec segment with other types of pages or if they belong to different
pgmaps. Otherwise getting the pgmap of a given segment is not possible
without scanning the entire segment.

Add a helper to determine if zone device pages are mergeable and use
it in page_is_mergeable(). The helper returns true if both pages are
not zone device pages, or if both are zone device pages with the same
pgmap.

Signed-off-by: Logan Gunthorpe 
---
 block/bio.c|  2 ++
 include/linux/mm.h | 23 +++
 2 files changed, 25 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index cdd7b2915c53..3406c0450db3 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -834,6 +834,8 @@ static inline bool page_is_mergeable(const struct bio_vec 
*bv,
return false;
if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
return false;
+   if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
+   return false;
 
*same_page = ((vec_end_addr & PAGE_MASK) == page_addr);
if (*same_page)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 14ef41af8b77..fb2264a17e4a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1108,6 +1108,24 @@ static inline bool is_zone_device_page(const struct page 
*page)
 {
return page_zonenum(page) == ZONE_DEVICE;
 }
+
+/*
+ * Consecutive zone device pages should not be merged into the same sgl
+ * or bvec segment with other types of pages or if they belong to different
+ * pgmaps. Otherwise getting the pgmap of a given segment is not possible
+ * without scanning the entire segment. This helper returns true either if
+ * both pages are not zone device pages or both pages are zone device pages
+ * with the same pgmap.
+ */
+static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
+const struct page *b)
+{
+   if (is_zone_device_page(a) != is_zone_device_page(b))
+   return false;
+   if (!is_zone_device_page(a))
+   return true;
+   return a->pgmap == b->pgmap;
+}
 extern void memmap_init_zone_device(struct zone *, unsigned long,
unsigned long, struct dev_pagemap *);
 #else
@@ -1115,6 +1133,11 @@ static inline bool is_zone_device_page(const struct page 
*page)
 {
return false;
 }
+static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
+const struct page *b)
+{
+   return true;
+}
 #endif
 
 static inline bool folio_is_zone_device(const struct folio *folio)
-- 
2.30.2


[PATCH v6 00/21] Userspace P2PDMA with O_DIRECT NVMe devices

2022-04-07 Thread Logan Gunthorpe
Hi,

This patchset continues my work to add userspace P2PDMA access using
O_DIRECT NVMe devices. This posting contains some minor fixes and a
rebase onto v5.18-rc1 which contains cleanup from Christoph around
free_zone_device_page() that helps to enable this patchset. The
previous posting was here[1].

The patchset enables userspace P2PDMA by allowing userspace to mmap()
allocated chunks of the CMB. The resulting VMA can be passed only
to O_DIRECT IO on NVMe backed files or block devices. A flag is added
to GUP() in Patch <>, then Patches <> through <> wire this flag up based
on whether the block queue indicates P2PDMA support. Patches <>
through <> enable the CMB to be mapped into userspace by mmaping
the nvme char device.

This is relatively straightforward, however the one significant
problem is that, presently, pci_p2pdma_map_sg() requires a homogeneous
SGL with all P2PDMA pages or all regular pages. Enhancing GUP to
support enforcing this rule would require a huge hack that I don't
expect would be all that palatable. So the first 13 patches add
support for P2PDMA pages to dma_map_sg[table]() to the dma-direct
and dma-iommu implementations. Thus systems without an IOMMU plus
Intel and AMD IOMMUs are supported. (Other IOMMU implementations would
then be unsupported, notably ARM and PowerPC but support would be added
when they convert to dma-iommu).

dma_map_sgtable() is preferred when dealing with P2PDMA memory as it
will return -EREMOTEIO when the DMA device cannot map specific P2PDMA
pages based on the existing rules in calc_map_type_and_dist().
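
As a rough illustration (not part of this series), a caller of
dma_map_sgtable() can treat -EREMOTEIO as a permanent failure while other
errors remain retryable; dev, sgt and dir below are placeholders:

    int ret = dma_map_sgtable(dev, sgt, dir, 0);

    if (ret == -EREMOTEIO)
        return ret;    /* unsupported P2PDMA transfer; retrying won't help */
    if (ret)
        return ret;    /* e.g. -ENOMEM; may succeed if retried later */

    /* ... program the device with sgt->sgl / sgt->nents ... */
    dma_unmap_sgtable(dev, sgt, dir, 0);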

The other issue is dma_unmap_sg() needs a flag to determine whether a
given dma_addr_t was mapped regularly or as a PCI bus address. To allow
this, a third flag is added to the page_link field in struct
scatterlist. This effectively means support for P2PDMA will now depend
on CONFIG_64BIT.

Feedback welcome.

This series is based on v5.18-rc1. A git branch is available here:

  https://github.com/sbates130272/linux-p2pmem/  p2pdma_user_cmb_v6

Thanks,

Logan

[1] lkml.kernel.org/r/20220128002614.6136-1-log...@deltatee.com

--

Changes since v5:
  - Rebased onto v5.18-rc1 which includes Christoph's cleanup to
free_zone_device_page() (similar to Ralph's patch).
  - Fix bug with concurrent first calls to pci_p2pdma_vma_fault()
that caused a double allocation and lost p2p memory. Noticed
by Andrew Maier.
  - Collected a Reviewed-by tag from Chaitanya.
  - Numerous minor fixes to commit messages

Changes since v4:
  - Rebase onto v5.17-rc1.
  - Included Ralph Cambell's patches which remove the ZONE_DEVICE page
reference count offset. This is just to demonstrate that this
series is compatible with that direction.
  - Added a comment in pci_p2pdma_map_sg_attrs(), per Chaitanya and
included his Reviewed-by tags.
  - Patch 1 in the last series which cleaned up scatterlist.h
has been upstreamed.
  - Dropped NEED_SG_DMA_BUS_ADDR_FLAG since "depends on" doesn't
work with selected symbols, per Christoph.
  - Switched iov_iter_get_pages_[alloc_]flags to be exported with
EXPORT_SYMBOL_GPL, per Christoph.
  - Renamed zone_device_pages_are_mergeable() to
zone_device_pages_have_same_pgmap(), per Christoph.
  - Renamed .mmap_file_open operation in nvme_ctrl_ops to
cdev_file_open(), per Christoph.

Changes since v3:
  - Add some comment and commit message cleanups I had missed for v3,
also moved the prototypes for some of the p2pdma helpers to
dma-map-ops.h (which I missed in v3 and was suggested in v2).
  - Add separate cleanup patch for scatterlist.h and change the macros
to functions. (Suggested by Chaitanya and Jason, respectively)
  - Rename sg_dma_mark_pci_p2pdma() and sg_is_dma_pci_p2pdma() to
sg_dma_mark_bus_address() and sg_is_dma_bus_address() which
is a more generic name (As requested by Jason)
  - Fixes to some comments and commit messages as suggested by Bjorn
and Jason.
  - Ensure swiotlb is not used with P2PDMA pages. (Per Jason)
  - The sgtable conversion in RDMA was split out and sent upstream
separately; the new patch is only the removal. (Per Jason)
  - Moved the FOLL_PCI_P2PDMA check outside of get_dev_pagemap() as
Jason suggested this will be removed in the near term.
  - Add two patches to ensure that zone device pages with different
pgmaps are never merged in the block layer or
sg_alloc_append_table_from_pages() (Per Jason)
  - Ensure synchronize_rcu() or call_rcu() is used before returning
pages to the genalloc. (Jason pointed out that pages are not
guaranteed to be unused in all architectures until at least
after an RCU grace period, and that synchronize_rcu() was likely
too slow to use in the vma close operation.)
  - Collected Acks and Reviews by Bjorn, Jason and Max.

Logan Gunthorpe (21):
  lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
  PCI/P2PDMA: Attempt to set map_type if it has not

[PATCH v6 02/21] PCI/P2PDMA: Attempt to set map_type if it has not been set

2022-04-07 Thread Logan Gunthorpe
Attempt to find the mapping type for P2PDMA pages on the first
DMA map attempt if it has not been done ahead of time.

Previously, the mapping type was expected to be calculated ahead of
time, but if pages are to come from userspace then there's no
way to ensure the path was checked ahead of time.

With this change, the mapping type is calculated on first use if it hasn't
been pre-calculated, so it is no longer invalid to call pci_p2pdma_map_sg()
before the mapping type is known. Drop the WARN_ON for that case.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/pci/p2pdma.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 30b1df3c9d2f..c3a68e82cf36 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -849,6 +849,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct 
dev_pagemap *pgmap,
struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
struct pci_dev *client;
struct pci_p2pdma *p2pdma;
+   int dist;
 
if (!provider->p2pdma)
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
@@ -865,6 +866,10 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct 
dev_pagemap *pgmap,
type = xa_to_value(xa_load(&p2pdma->map_types,
   map_types_idx(client)));
rcu_read_unlock();
+
+   if (type == PCI_P2PDMA_MAP_UNKNOWN)
+   return calc_map_type_and_dist(provider, client, &dist, true);
+
return type;
 }
 
@@ -907,7 +912,7 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg,
case PCI_P2PDMA_MAP_BUS_ADDR:
return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
default:
-   WARN_ON_ONCE(1);
+   /* Mapping is not Supported */
return 0;
}
 }
-- 
2.30.2


[PATCH v6 14/21] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages

2022-04-07 Thread Logan Gunthorpe
GUP callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
allow obtaining P2PDMA pages. If GUP is called without the flag and a
P2PDMA page is found, it will return an error.

FOLL_PCI_P2PDMA cannot be set if FOLL_LONGTERM is set.
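
For illustration only (not part of this patch), a caller whose DMA path is
known to support P2PDMA would opt in roughly like this; addr, npages and
pages are placeholders:

    n = pin_user_pages_fast(addr, npages,
                            FOLL_WRITE | FOLL_PCI_P2PDMA, pages);
    if (n < 0)
        return n;
    /* pages[] may now contain PCI P2PDMA pages; without the flag,
     * hitting such a page would have made GUP return an error. */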

Signed-off-by: Logan Gunthorpe 
---
 include/linux/mm.h |  1 +
 mm/gup.c   | 22 +-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e34edb775334..14ef41af8b77 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2936,6 +2936,7 @@ struct page *follow_page(struct vm_area_struct *vma, 
unsigned long address,
 #define FOLL_SPLIT_PMD 0x2 /* split huge pmd before returning */
 #define FOLL_PIN   0x4 /* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY 0x8 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_PCI_P2PDMA0x10 /* allow returning PCI P2PDMA pages */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index f598a037eb04..0af6f802ca38 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -490,6 +490,12 @@ static struct page *follow_page_pte(struct vm_area_struct 
*vma,
page = pte_page(pte);
else
goto no_page;
+
+   if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
+is_pci_p2pdma_page(page))) {
+   page = ERR_PTR(-EREMOTEIO);
+   goto out;
+   }
} else if (unlikely(!page)) {
if (flags & FOLL_DUMP) {
/* Avoid special (like zero) pages in core dumps */
@@ -919,6 +925,9 @@ static int check_vma_flags(struct vm_area_struct *vma, 
unsigned long gup_flags)
if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
return -EOPNOTSUPP;
 
+   if ((gup_flags & FOLL_LONGTERM) && (gup_flags & FOLL_PCI_P2PDMA))
+   return -EOPNOTSUPP;
+
if (vma_is_secretmem(vma))
return -EFAULT;
 
@@ -2184,6 +2193,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
unsigned long end,
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);
 
+   if (unlikely(pte_devmap(pte) && !(flags & FOLL_PCI_P2PDMA) &&
+is_pci_p2pdma_page(page)))
+   goto pte_unmap;
+
folio = try_grab_folio(page, 1, flags);
if (!folio)
goto pte_unmap;
@@ -2258,6 +2271,12 @@ static int __gup_device_huge(unsigned long pfn, unsigned 
long addr,
undo_dev_pagemap(nr, nr_start, flags, pages);
break;
}
+
+   if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
+   undo_dev_pagemap(nr, nr_start, flags, pages);
+   break;
+   }
+
SetPageReferenced(page);
pages[*nr] = page;
if (unlikely(!try_grab_page(page, flags))) {
@@ -2729,7 +2748,8 @@ static int internal_get_user_pages_fast(unsigned long 
start,
 
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
   FOLL_FORCE | FOLL_PIN | FOLL_GET |
-  FOLL_FAST_ONLY | FOLL_NOFAULT)))
+  FOLL_FAST_ONLY | FOLL_NOFAULT |
+  FOLL_PCI_P2PDMA)))
return -EINVAL;
 
if (gup_flags & FOLL_PIN)
-- 
2.30.2


[PATCH v6 19/21] block: set FOLL_PCI_P2PDMA in bio_map_user_iov()

2022-04-07 Thread Logan Gunthorpe
When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_alloc_flags(). This allows PCI P2PDMA pages to be
passed from userspace and enables NVMe passthru requests to
use P2PDMA pages.

Signed-off-by: Logan Gunthorpe 
---
 block/blk-map.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index c7f71d83eff1..85baf922a0e8 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -234,6 +234,7 @@ static int bio_map_user_iov(struct request *rq, struct 
iov_iter *iter,
gfp_t gfp_mask)
 {
unsigned int max_sectors = queue_max_hw_sectors(rq->q);
+   unsigned int flags = 0;
struct bio *bio;
int ret;
int j;
@@ -246,13 +247,17 @@ static int bio_map_user_iov(struct request *rq, struct 
iov_iter *iter,
return -ENOMEM;
bio->bi_opf |= req_op(rq);
 
+   if (blk_queue_pci_p2pdma(rq->q))
+   flags |= FOLL_PCI_P2PDMA;
+
while (iov_iter_count(iter)) {
struct page **pages;
ssize_t bytes;
size_t offs, added = 0;
int npages;
 
-   bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX, &offs);
+   bytes = iov_iter_get_pages_alloc_flags(iter, &pages, LONG_MAX,
+  &offs, flags);
if (unlikely(bytes <= 0)) {
ret = bytes ? bytes : -EFAULT;
goto out_unmap;
-- 
2.30.2


[PATCH v6 04/21] PCI/P2PDMA: Introduce helpers for dma_map_sg implementations

2022-04-07 Thread Logan Gunthorpe
Add pci_p2pdma_map_segment() as a helper for simple dma_map_sg()
implementations. It takes a scatterlist segment that must point to a
pci_p2pdma struct page and will map it if the mapping requires a bus
address.

The return value indicates whether the mapping required a bus address
or whether the caller still needs to map the segment normally. If the
segment should not be mapped, -EREMOTEIO is returned.

This helper uses a state structure to track the changes to the
pgmap across calls and avoid needing to look up the xarray for
every page.

Also add pci_p2pdma_map_bus_segment() which is useful for IOMMU
dma_map_sg() implementations where the sg segment containing the page
differs from the sg segment containing the DMA address.

Prototypes for these helpers are added to dma-map-ops.h as they are only
useful to dma map implementations and don't need to pollute the public
pci-p2pdma header.
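
A rough sketch of the intended calling pattern for a dma_map_sg()
implementation, mirroring what dma-direct does later in this series
(error handling trimmed):

    struct pci_p2pdma_map_state p2pdma_state = {};
    struct scatterlist *s;
    int i;

    for_each_sg(sgl, s, nents, i) {
        if (is_pci_p2pdma_page(sg_page(s))) {
            switch (pci_p2pdma_map_segment(&p2pdma_state, dev, s)) {
            case PCI_P2PDMA_MAP_BUS_ADDR:
                continue;    /* dma_address already set to a bus address */
            case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
                break;       /* fall through and map normally */
            default:
                return -EREMOTEIO;
            }
        }
        /* ... map the segment with the usual mechanism ... */
    }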

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
---
 drivers/pci/p2pdma.c| 59 +
 include/linux/dma-map-ops.h | 21 +
 2 files changed, 80 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 8573bf9d651a..9032c2ed2cdf 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -946,6 +946,65 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct 
scatterlist *sg,
 }
 EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
 
+/**
+ * pci_p2pdma_map_segment - map an sg segment determining the mapping type
+ * @state: State structure that should be declared outside of the for_each_sg()
+ * loop and initialized to zero.
+ * @dev: DMA device that's doing the mapping operation
+ * @sg: scatterlist segment to map
+ *
+ * This is a helper to be used by non-IOMMU dma_map_sg() implementations where
+ * the sg segment is the same for the page_link and the dma_address.
+ *
+ * Attempt to map a single segment in an SGL with the PCI bus address.
+ * The segment must point to a PCI P2PDMA page and thus must be
+ * wrapped in a is_pci_p2pdma_page(sg_page(sg)) check.
+ *
+ * Returns the type of mapping used and maps the page if the type is
+ * PCI_P2PDMA_MAP_BUS_ADDR.
+ */
+enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+  struct scatterlist *sg)
+{
+   if (state->pgmap != sg_page(sg)->pgmap) {
+   state->pgmap = sg_page(sg)->pgmap;
+   state->map = pci_p2pdma_map_type(state->pgmap, dev);
+   state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
+   }
+
+   if (state->map == PCI_P2PDMA_MAP_BUS_ADDR) {
+   sg->dma_address = sg_phys(sg) + state->bus_off;
+   sg_dma_len(sg) = sg->length;
+   sg_dma_mark_bus_address(sg);
+   }
+
+   return state->map;
+}
+
+/**
+ * pci_p2pdma_map_bus_segment - map an sg segment predetermined to
+ * be mapped with PCI_P2PDMA_MAP_BUS_ADDR
+ * @pg_sg: scatterlist segment with the page to map
+ * @dma_sg: scatterlist segment to assign a DMA address to
+ *
+ * This is a helper for iommu dma_map_sg() implementations when the
+ * segment for the DMA address differs from the segment containing the
+ * source page.
+ *
+ * pci_p2pdma_map_type() must have already been called on the pg_sg and
+ * returned PCI_P2PDMA_MAP_BUS_ADDR.
+ */
+void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+   struct scatterlist *dma_sg)
+{
+   struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(sg_page(pg_sg)->pgmap);
+
+   dma_sg->dma_address = sg_phys(pg_sg) + pgmap->bus_offset;
+   sg_dma_len(dma_sg) = pg_sg->length;
+   sg_dma_mark_bus_address(dma_sg);
+}
+
 /**
  * pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
  * to enable p2pdma
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index d693a0e33bac..752f91e5eb5d 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -413,15 +413,36 @@ enum pci_p2pdma_map_type {
PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
 };
 
+struct pci_p2pdma_map_state {
+   struct dev_pagemap *pgmap;
+   int map;
+   u64 bus_off;
+};
+
 #ifdef CONFIG_PCI_P2PDMA
 enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 struct device *dev);
+enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+  struct scatterlist *sg);
+void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+   struct scatterlist *dma_sg);
 #else /* CONFIG_PCI_P2PDMA */
 static inline enum pci_p2pdma_map_type
 pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev)
 {
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
 }
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct 

[PATCH v6 01/21] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL

2022-04-07 Thread Logan Gunthorpe
Make use of the third free LSB in scatterlist's page_link on 64bit systems.

The extra bit will be used by dma_[un]map_sg_p2pdma() to determine when a
given SGL segment's dma_address points to a PCI bus address.
dma_unmap_sg_p2pdma() will need to perform different cleanup when a
segment is marked as a bus address.

The new bit will only be used when CONFIG_PCI_P2PDMA is set; this means
PCI P2PDMA will require CONFIG_64BIT. This should be acceptable as the
majority of P2PDMA use cases are restricted to newer root complexes and
roughly require the extra address space for memory BARs used in the
transactions.
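
Purely as an illustration of how the new helpers are meant to be used by an
unmap path (sgl, sg, nents and i are placeholders):

    for_each_sg(sgl, sg, nents, i) {
        if (sg_is_dma_bus_address(sg)) {
            sg_dma_unmark_bus_address(sg);  /* bus address: nothing to unmap */
            continue;
        }
        /* ... unmap a normally mapped segment ... */
    }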

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/pci/Kconfig |  5 +
 include/linux/scatterlist.h | 44 -
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 133c73207782..5cc7cba1941f 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -164,6 +164,11 @@ config PCI_PASID
 config PCI_P2PDMA
bool "PCI peer-to-peer transfer support"
depends on ZONE_DEVICE
+   #
+   # The need for the scatterlist DMA bus address flag means PCI P2PDMA
+   # requires 64bit
+   #
+   depends on 64BIT
select GENERIC_ALLOCATOR
help
  Enableѕ drivers to do PCI peer-to-peer transactions to and from
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 7ff9d6386c12..6561ca8aead8 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -64,12 +64,24 @@ struct sg_append_table {
 #define SG_CHAIN   0x01UL
 #define SG_END 0x02UL
 
+/*
+ * bit 2 is the third free bit in the page_link on 64bit systems which
+ * is used by dma_unmap_sg() to determine if the dma_address is a
+ * bus address when doing P2PDMA.
+ */
+#ifdef CONFIG_PCI_P2PDMA
+#define SG_DMA_BUS_ADDRESS 0x04UL
+static_assert(__alignof__(struct page) >= 8);
+#else
+#define SG_DMA_BUS_ADDRESS 0x00UL
+#endif
+
 /*
  * We overload the LSB of the page pointer to indicate whether it's
  * a valid sg entry, or whether it points to the start of a new scatterlist.
  * Those low bits are there for everyone! (thanks mason :-)
  */
-#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END)
+#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_BUS_ADDRESS)
 
 static inline unsigned int __sg_flags(struct scatterlist *sg)
 {
@@ -91,6 +103,11 @@ static inline bool sg_is_last(struct scatterlist *sg)
return __sg_flags(sg) & SG_END;
 }
 
+static inline bool sg_is_dma_bus_address(struct scatterlist *sg)
+{
+   return __sg_flags(sg) & SG_DMA_BUS_ADDRESS;
+}
+
 /**
  * sg_assign_page - Assign a given page to an SG entry
  * @sg:SG entry
@@ -245,6 +262,31 @@ static inline void sg_unmark_end(struct scatterlist *sg)
sg->page_link &= ~SG_END;
 }
 
+/**
+ * sg_dma_mark_bus_address - Mark the scatterlist entry as a bus address
+ * @sg: SG entry
+ *
+ * Description:
+ *   Marks the passed in sg entry to indicate that the dma_address is
+ *   a bus address and doesn't need to be unmapped.
+ **/
+static inline void sg_dma_mark_bus_address(struct scatterlist *sg)
+{
+   sg->page_link |= SG_DMA_BUS_ADDRESS;
+}
+
+/**
+ * sg_dma_unmark_bus_address - Unmark the scatterlist entry as a bus address
+ * @sg: SG entry
+ *
+ * Description:
+ *   Clears the bus address mark.
+ **/
+static inline void sg_dma_unmark_bus_address(struct scatterlist *sg)
+{
+   sg->page_link &= ~SG_DMA_BUS_ADDRESS;
+}
+
 /**
  * sg_phys - Return physical address of an sg entry
  * @sg: SG entry
-- 
2.30.2


[PATCH v6 07/21] dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support

2022-04-07 Thread Logan Gunthorpe
Add a flags member to the dma_map_ops structure with one flag to
indicate support for PCI P2PDMA.

Also, add a helper to check if a device supports PCI P2PDMA.
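
As an illustrative sketch (not part of this patch), a subsystem that hands
out P2P memory could check the helper before attempting any mapping; pdev is
a placeholder:

    if (!dma_pci_p2pdma_supported(&pdev->dev))
        return -EOPNOTSUPP;    /* DMA ops on this device can't map P2PDMA pages */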

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
---
 include/linux/dma-map-ops.h | 10 ++
 include/linux/dma-mapping.h |  5 +
 kernel/dma/mapping.c| 18 ++
 3 files changed, 33 insertions(+)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 752f91e5eb5d..4d4161d58ce0 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -11,7 +11,17 @@
 
 struct cma;
 
+/*
+ * Values for struct dma_map_ops.flags:
+ *
+ * DMA_F_PCI_P2PDMA_SUPPORTED: Indicates the dma_map_ops implementation can
+ * handle PCI P2PDMA pages in the map_sg/unmap_sg operation.
+ */
+#define DMA_F_PCI_P2PDMA_SUPPORTED (1 << 0)
+
 struct dma_map_ops {
+   unsigned int flags;
+
void *(*alloc)(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t gfp,
unsigned long attrs);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..f7c61b2b4b5e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -140,6 +140,7 @@ int dma_mmap_attrs(struct device *dev, struct 
vm_area_struct *vma,
unsigned long attrs);
 bool dma_can_mmap(struct device *dev);
 int dma_supported(struct device *dev, u64 mask);
+bool dma_pci_p2pdma_supported(struct device *dev);
 int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
@@ -250,6 +251,10 @@ static inline int dma_supported(struct device *dev, u64 
mask)
 {
return 0;
 }
+static inline bool dma_pci_p2pdma_supported(struct device *dev)
+{
+   return false;
+}
 static inline int dma_set_mask(struct device *dev, u64 mask)
 {
return -EIO;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 9f65d1041638..21793506fdb6 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -722,6 +722,24 @@ int dma_supported(struct device *dev, u64 mask)
 }
 EXPORT_SYMBOL(dma_supported);
 
+bool dma_pci_p2pdma_supported(struct device *dev)
+{
+   const struct dma_map_ops *ops = get_dma_ops(dev);
+
+   /* if ops is not set, dma direct will be used which supports P2PDMA */
+   if (!ops)
+   return true;
+
+   /*
+* Note: dma_ops_bypass is not checked here because P2PDMA should
+* not be used with dma mapping ops that do not have support even
+* if the specific device is bypassing them.
+*/
+
+   return ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;
+}
+EXPORT_SYMBOL_GPL(dma_pci_p2pdma_supported);
+
 #ifdef CONFIG_ARCH_HAS_DMA_SET_MASK
 void arch_dma_set_mask(struct device *dev, u64 mask);
 #else
-- 
2.30.2


[PATCH v6 08/21] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg

2022-04-07 Thread Logan Gunthorpe
When a PCI P2PDMA page is seen, set the IOVA length of the segment
to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
apply the appropriate bus address to the segment. The IOVA is not
created if the scatterlist only consists of P2PDMA pages.

A P2PDMA page may have three possible outcomes when being mapped:
  1) If the data path between the two devices doesn't go through
 the root port, then it should be mapped with a PCI bus address
  2) If the data path goes through the host bridge, it should be mapped
 normally with an IOMMU IOVA.
  3) It is not possible for the two devices to communicate and thus
 the mapping operation should fail (and it will return -EREMOTEIO).

Similar to dma-direct, the sg_dma_mark_bus_address() flag is used to
indicate bus address segments. On unmap, P2PDMA segments are skipped
over when determining the start and end IOVA addresses.

With this change, the flags variable in the dma_map_ops is set to
DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
---
 drivers/iommu/dma-iommu.c | 68 +++
 1 file changed, 61 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 09f6e1c0f9c0..ef86f2b573d1 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1045,6 +1046,16 @@ static int __finalise_sg(struct device *dev, struct 
scatterlist *sg, int nents,
sg_dma_address(s) = DMA_MAPPING_ERROR;
sg_dma_len(s) = 0;
 
+   if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
+   if (i > 0)
+   cur = sg_next(cur);
+
+   pci_p2pdma_map_bus_segment(s, cur);
+   count++;
+   cur_len = 0;
+   continue;
+   }
+
/*
 * Now fill in the real DMA data. If...
 * - there is a valid output segment to append to
@@ -1141,6 +1152,8 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
struct iova_domain *iovad = &cookie->iovad;
struct scatterlist *s, *prev = NULL;
int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
+   struct dev_pagemap *pgmap = NULL;
+   enum pci_p2pdma_map_type map_type;
dma_addr_t iova;
size_t iova_len = 0;
unsigned long mask = dma_get_seg_boundary(dev);
@@ -1176,6 +1189,35 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
s_length = iova_align(iovad, s_length + s_iova_off);
s->length = s_length;
 
+   if (is_pci_p2pdma_page(sg_page(s))) {
+   if (sg_page(s)->pgmap != pgmap) {
+   pgmap = sg_page(s)->pgmap;
+   map_type = pci_p2pdma_map_type(pgmap, dev);
+   }
+
+   switch (map_type) {
+   case PCI_P2PDMA_MAP_BUS_ADDR:
+   /*
+* A zero length will be ignored by
+* iommu_map_sg() and then can be detected
+* in __finalise_sg() to actually map the
+* bus address.
+*/
+   s->length = 0;
+   continue;
+   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+   /*
+* Mapping through host bridge should be
+* mapped with regular IOVAs, thus we
+* do nothing here and continue below.
+*/
+   break;
+   default:
+   ret = -EREMOTEIO;
+   goto out_restore_sg;
+   }
+   }
+
/*
 * Due to the alignment of our single IOVA allocation, we can
 * depend on these assumptions about the segment boundary mask:
@@ -1198,6 +1240,9 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
prev = s;
}
 
+   if (!iova_len)
+   return __finalise_sg(dev, sg, nents, 0);
+
iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
if (!iova) {
ret = -ENOMEM;
@@ -1219,7 +1264,7 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
 out_restore_sg:
__invalidate_sg(sg, nents);
 out:
-   if (ret != -ENOMEM)
+   if (ret != -ENOMEM && ret != -EREMOTEIO)

[PATCH v6 13/21] PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()

2022-04-07 Thread Logan Gunthorpe
This interface is superseded by support in dma_map_sg() which now supports
heterogeneous scatterlists. There are no longer any users, so remove it.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Max Gurtovoy 
---
 drivers/pci/p2pdma.c   | 66 --
 include/linux/pci-p2pdma.h | 27 
 2 files changed, 93 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 9032c2ed2cdf..4d3cab9da748 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -880,72 +880,6 @@ enum pci_p2pdma_map_type pci_p2pdma_map_type(struct 
dev_pagemap *pgmap,
return type;
 }
 
-static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap,
-   struct device *dev, struct scatterlist *sg, int nents)
-{
-   struct scatterlist *s;
-   int i;
-
-   for_each_sg(sg, s, nents, i) {
-   s->dma_address = sg_phys(s) + p2p_pgmap->bus_offset;
-   sg_dma_len(s) = s->length;
-   }
-
-   return nents;
-}
-
-/**
- * pci_p2pdma_map_sg_attrs - map a PCI peer-to-peer scatterlist for DMA
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: elements in the scatterlist
- * @dir: DMA direction
- * @attrs: DMA attributes passed to dma_map_sg() (if called)
- *
- * Scatterlists mapped with this function should be unmapped using
- * pci_p2pdma_unmap_sg_attrs().
- *
- * Returns the number of SG entries mapped or 0 on error.
- */
-int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs)
-{
-   struct pci_p2pdma_pagemap *p2p_pgmap =
-   to_p2p_pgmap(sg_page(sg)->pgmap);
-
-   switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev)) {
-   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
-   return dma_map_sg_attrs(dev, sg, nents, dir, attrs);
-   case PCI_P2PDMA_MAP_BUS_ADDR:
-   return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
-   default:
-   /* Mapping is not Supported */
-   return 0;
-   }
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg_attrs);
-
-/**
- * pci_p2pdma_unmap_sg_attrs - unmap a PCI peer-to-peer scatterlist that was
- * mapped with pci_p2pdma_map_sg()
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: number of elements returned by pci_p2pdma_map_sg()
- * @dir: DMA direction
- * @attrs: DMA attributes passed to dma_unmap_sg() (if called)
- */
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs)
-{
-   enum pci_p2pdma_map_type map_type;
-
-   map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev);
-
-   if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
-   dma_unmap_sg_attrs(dev, sg, nents, dir, attrs);
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
-
 /**
  * pci_p2pdma_map_segment - map an sg segment determining the mapping type
  * @state: State structure that should be declared outside of the for_each_sg()
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 8318a97c9c61..2c07aa6b7665 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -30,10 +30,6 @@ struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev 
*pdev,
 unsigned int *nents, u32 length);
 void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
 void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
-int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs);
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs);
 int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
bool *use_p2pdma);
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
@@ -83,17 +79,6 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
 static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 {
 }
-static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
-   struct scatterlist *sg, int nents, enum dma_data_direction dir,
-   unsigned long attrs)
-{
-   return 0;
-}
-static inline void pci_p2pdma_unmap_sg_attrs(struct device *dev,
-   struct scatterlist *sg, int nents, enum dma_data_direction dir,
-   unsigned long attrs)
-{
-}
 static inline int pci_p2pdma_enable_store(const char *page,
struct pci_dev **p2p_dev, bool *use_p2pdma)
 {
@@ -119,16 +104,4 @@ static inline struct pci_dev *pci_p2pmem_find(struct 
device *client)
return pci_p2pmem_find_many(&client, 1);
 }
 
-static inline int pci_p2pdma_map_sg(

[PATCH v6 20/21] PCI/P2PDMA: Introduce pci_mmap_p2pmem()

2022-04-07 Thread Logan Gunthorpe
Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
a hunk of p2pmem into userspace.

Pages are allocated from the genalloc in bulk with their reference
count set to one. They are returned to the genalloc when the page is put
through p2pdma_page_free() (the reference count is once again set to 1
in free_zone_device_page()).

The VMA does not take a reference to the pages when they are inserted
with vmf_insert_mixed() (which is necessary for zone device pages) so
the backing P2P memory is stored in a structure in vm_private_data.

A pseudo mount is used to allocate an inode for each PCI device. The
inode's address_space is used in the file doing the mmap so that all
VMAs are collected and can be unmapped if the PCI device is unbound.
After unmapping, the VMAs are iterated through and their pages are
put so the device can continue to be unbound. An active flag is used
to signal to VMAs not to allocate any further P2P memory once the
removal process starts. Access to the flag is synchronized with RCU.

The VMAs and inode will survive after the unbind of the device, but no
pages will be present in the VMA and a subsequent access will result
in a SIGBUS error.
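
A rough sketch of the fault-path synchronization described above; apart from
the RCU primitives, all names are illustrative and not taken from the patch:

    rcu_read_lock();
    p2pdma = rcu_dereference(pdev->p2pdma);
    if (!p2pdma || !p2pdma->active) {
        rcu_read_unlock();
        return VM_FAULT_SIGBUS;    /* unbind in progress: no new allocations */
    }
    /* ... allocate backing pages from the genalloc and insert them ... */
    rcu_read_unlock();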

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
---
 drivers/pci/p2pdma.c   | 340 -
 include/linux/pci-p2pdma.h |  11 ++
 include/uapi/linux/magic.h |   1 +
 3 files changed, 350 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4d3cab9da748..cce4c7b6dd75 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -17,14 +17,19 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
+#include 
 
 struct pci_p2pdma {
struct gen_pool *pool;
bool p2pmem_published;
struct xarray map_types;
+   struct inode *inode;
+   bool active;
 };
 
 struct pci_p2pdma_pagemap {
@@ -33,6 +38,17 @@ struct pci_p2pdma_pagemap {
u64 bus_offset;
 };
 
+struct pci_p2pdma_map {
+   struct kref ref;
+   struct rcu_head rcu;
+   struct pci_dev *pdev;
+   struct inode *inode;
+   size_t len;
+
+   spinlock_t kaddr_lock;
+   void *kaddr;
+};
+
 static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap)
 {
return container_of(pgmap, struct pci_p2pdma_pagemap, pgmap);
@@ -101,6 +117,38 @@ static const struct attribute_group p2pmem_group = {
.name = "p2pmem",
 };
 
+/*
+ * P2PDMA internal mount
+ * Fake an internal VFS mount-point in order to allocate struct address_space
+ * mappings to remove VMAs on unbind events.
+ */
+static int pci_p2pdma_fs_cnt;
+static struct vfsmount *pci_p2pdma_fs_mnt;
+
+static int pci_p2pdma_fs_init_fs_context(struct fs_context *fc)
+{
+   return init_pseudo(fc, P2PDMA_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type pci_p2pdma_fs_type = {
+   .name = "p2dma",
+   .owner = THIS_MODULE,
+   .init_fs_context = pci_p2pdma_fs_init_fs_context,
+   .kill_sb = kill_anon_super,
+};
+
+static void p2pdma_page_free(struct page *page)
+{
+   struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+
+   gen_pool_free(pgmap->provider->p2pdma->pool,
+ (uintptr_t)page_to_virt(page), PAGE_SIZE);
+}
+
+static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
+   .page_free = p2pdma_page_free,
+};
+
 static void pci_p2pdma_release(void *data)
 {
struct pci_dev *pdev = data;
@@ -117,6 +165,9 @@ static void pci_p2pdma_release(void *data)
gen_pool_destroy(p2pdma->pool);
sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
xa_destroy(&p2pdma->map_types);
+
+   iput(p2pdma->inode);
+   simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
 }
 
 static int pci_p2pdma_setup(struct pci_dev *pdev)
@@ -134,17 +185,32 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
if (!p2p->pool)
goto out;
 
-   error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+   error = simple_pin_fs(&pci_p2pdma_fs_type, &pci_p2pdma_fs_mnt,
+ &pci_p2pdma_fs_cnt);
if (error)
goto out_pool_destroy;
 
+   p2p->inode = alloc_anon_inode(pci_p2pdma_fs_mnt->mnt_sb);
+   if (IS_ERR(p2p->inode)) {
+   error = -ENOMEM;
+   goto out_unpin_fs;
+   }
+
+   error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+   if (error)
+   goto out_put_inode;
+
error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
if (error)
-   goto out_pool_destroy;
+   goto out_put_inode;
 
rcu_assign_pointer(pdev->p2pdma, p2p);
return 0;
 
+out_put_inode:
+   iput(p2p->inode);
+out_unpin_fs:
+   simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
 out_pool_destroy:
gen_pool_destroy(p2p->pool);
 out:
@@ -152,6 +

[PATCH v6 18/21] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()

2022-04-07 Thread Logan Gunthorpe
When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be passed
from userspace and enables the O_DIRECT path in iomap based filesystems
and direct to block devices.

Signed-off-by: Logan Gunthorpe 
---
 block/bio.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index 3406c0450db3..271a720a6dc1 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1149,6 +1149,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, 
struct iov_iter *iter)
struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
struct page **pages = (struct page **)bv;
bool same_page = false;
+   unsigned int flags = 0;
ssize_t size, left;
unsigned len, i;
size_t offset;
@@ -1161,7 +1162,12 @@ static int __bio_iov_iter_get_pages(struct bio *bio, 
struct iov_iter *iter)
BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
-   size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+   if (bio->bi_bdev && bio->bi_bdev->bd_disk &&
+   blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
+   flags |= FOLL_PCI_P2PDMA;
+
+   size = iov_iter_get_pages_flags(iter, pages, LONG_MAX, nr_pages,
+   &offset, flags);
if (unlikely(size <= 0))
return size ? size : -EFAULT;
 
-- 
2.30.2


[PATCH v6 15/21] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()

2022-04-07 Thread Logan Gunthorpe
Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
which take a flags argument that is passed to get_user_pages_fast().

This is so that FOLL_PCI_P2PDMA can be passed when appropriate.
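
For example (illustrative only; iter, pages and nr_pages are placeholders),
a caller that wants to accept P2PDMA pages simply passes the extra GUP flag
through:

    ssize_t bytes;
    size_t off;

    bytes = iov_iter_get_pages_flags(iter, pages, LONG_MAX, nr_pages,
                                     &off, FOLL_PCI_P2PDMA);
    if (bytes <= 0)
        return bytes ? bytes : -EFAULT;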

Signed-off-by: Logan Gunthorpe 
---
 include/linux/uio.h |  6 ++
 lib/iov_iter.c  | 25 +++--
 2 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 739285fe5a2f..ddf9e4cf4a59 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -232,8 +232,14 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int 
direction, struct pipe_inode
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t 
count);
 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray 
*xarray,
 loff_t start, size_t count);
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i, struct page **pages,
+   size_t maxsize, unsigned maxpages, size_t *start,
+   unsigned int gup_flags);
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
size_t maxsize, unsigned maxpages, size_t *start);
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
+   struct page ***pages, size_t maxsize, size_t *start,
+   unsigned int gup_flags);
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 6dd5330f7a99..9bf6e3af5120 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1515,9 +1515,9 @@ static struct page *first_bvec_segment(const struct 
iov_iter *i,
return page;
 }
 
-ssize_t iov_iter_get_pages(struct iov_iter *i,
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i,
   struct page **pages, size_t maxsize, unsigned maxpages,
-  size_t *start)
+  size_t *start, unsigned int gup_flags)
 {
size_t len;
int n, res;
@@ -1528,7 +1528,6 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
return 0;
 
if (likely(iter_is_iovec(i))) {
-   unsigned int gup_flags = 0;
unsigned long addr;
 
if (iov_iter_rw(i) != WRITE)
@@ -1558,6 +1557,13 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
return iter_xarray_get_pages(i, pages, maxsize, maxpages, 
start);
return -EFAULT;
 }
+EXPORT_SYMBOL_GPL(iov_iter_get_pages_flags);
+
+ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
+  size_t maxsize, unsigned maxpages, size_t *start)
+{
+   return iov_iter_get_pages_flags(i, pages, maxsize, maxpages, start, 0);
+}
 EXPORT_SYMBOL(iov_iter_get_pages);
 
 static struct page **get_pages_array(size_t n)
@@ -1640,9 +1646,9 @@ static ssize_t iter_xarray_get_pages_alloc(struct 
iov_iter *i,
return actual;
 }
 
-ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
   struct page ***pages, size_t maxsize,
-  size_t *start)
+  size_t *start, unsigned int gup_flags)
 {
struct page **p;
size_t len;
@@ -1654,7 +1660,6 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
return 0;
 
if (likely(iter_is_iovec(i))) {
-   unsigned int gup_flags = 0;
unsigned long addr;
 
if (iov_iter_rw(i) != WRITE)
@@ -1667,6 +1672,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
p = get_pages_array(n);
if (!p)
return -ENOMEM;
+
res = get_user_pages_fast(addr, n, gup_flags, p);
if (unlikely(res <= 0)) {
kvfree(p);
@@ -1694,6 +1700,13 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
return iter_xarray_get_pages_alloc(i, pages, maxsize, start);
return -EFAULT;
 }
+EXPORT_SYMBOL_GPL(iov_iter_get_pages_alloc_flags);
+
+ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
+size_t maxsize, size_t *start)
+{
+   return iov_iter_get_pages_alloc_flags(i, pages, maxsize, start, 0);
+}
 EXPORT_SYMBOL(iov_iter_get_pages_alloc);
 
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
-- 
2.30.2


[PATCH v6 03/21] PCI/P2PDMA: Expose pci_p2pdma_map_type()

2022-04-07 Thread Logan Gunthorpe
pci_p2pdma_map_type() will be needed by the dma-iommu map_sg
implementation because it will need to determine the mapping type
ahead of creating the actual IOMMU mapping.

The prototype for this helper is added to dma-map-ops.h as it is only
useful to dma map implementations and doesn't need to pollute the public
pci-p2pdma header.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/pci/p2pdma.c| 25 +
 include/linux/dma-map-ops.h | 45 +
 2 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index c3a68e82cf36..8573bf9d651a 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -10,6 +10,7 @@
 
 #define pr_fmt(fmt) "pci-p2pdma: " fmt
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -20,13 +21,6 @@
 #include 
 #include 
 
-enum pci_p2pdma_map_type {
-   PCI_P2PDMA_MAP_UNKNOWN = 0,
-   PCI_P2PDMA_MAP_NOT_SUPPORTED,
-   PCI_P2PDMA_MAP_BUS_ADDR,
-   PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
-};
-
 struct pci_p2pdma {
struct gen_pool *pool;
bool p2pmem_published;
@@ -842,8 +836,21 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 
-static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
-   struct device *dev)
+/**
+ * pci_p2pdma_map_type - return the type of mapping that should be used for
+ * a given device and pgmap
+ * @pgmap: the pagemap of a page to determine the mapping type for
+ * @dev: device that is mapping the page
+ *
+ * Returns one of:
+ * PCI_P2PDMA_MAP_NOT_SUPPORTED - The mapping should not be done
+ * PCI_P2PDMA_MAP_BUS_ADDR - The mapping should use the PCI bus address
+ * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE - The mapping should be done normally
+ * using the CPU physical address (in dma-direct) or an IOVA
+ * mapping for the IOMMU.
+ */
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+struct device *dev)
 {
enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d5b06b3a4a6..d693a0e33bac 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -379,4 +379,49 @@ static inline void debug_dma_dump_mappings(struct device 
*dev)
 
 extern const struct dma_map_ops dma_dummy_ops;
 
+enum pci_p2pdma_map_type {
+   /*
+* PCI_P2PDMA_MAP_UNKNOWN: Used internally for indicating the mapping
+* type hasn't been calculated yet. Functions that return this enum
+* never return this value.
+*/
+   PCI_P2PDMA_MAP_UNKNOWN = 0,
+
+   /*
+* PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
+* traverse the host bridge and the host bridge is not in the
+* allowlist. DMA Mapping routines should return an error when
+* this is returned.
+*/
+   PCI_P2PDMA_MAP_NOT_SUPPORTED,
+
+   /*
+* PCI_P2PDMA_BUS_ADDR: Indicates that two devices can talk to
+* each other directly through a PCI switch and the transaction will
+* not traverse the host bridge. Such a mapping should program
+* the DMA engine with PCI bus addresses.
+*/
+   PCI_P2PDMA_MAP_BUS_ADDR,
+
+   /*
+* PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
+* to each other, but the transaction traverses a host bridge on the
+* allowlist. In this case, a normal mapping either with CPU physical
+* addresses (in the case of dma-direct) or IOVA addresses (in the
+* case of IOMMUs) should be used to program the DMA engine.
+*/
+   PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
+};
+
+#ifdef CONFIG_PCI_P2PDMA
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+struct device *dev);
+#else /* CONFIG_PCI_P2PDMA */
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev)
+{
+   return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+}
+#endif /* CONFIG_PCI_P2PDMA */
+
 #endif /* _LINUX_DMA_MAP_OPS_H */
-- 
2.30.2


[PATCH v6 10/21] nvme-pci: convert to using dma_map_sgtable()

2022-04-07 Thread Logan Gunthorpe
The dma_map operations now support P2PDMA pages directly. So remove
the calls to pci_p2pdma_[un]map_sg_attrs() and replace them with calls
to dma_map_sgtable().

dma_map_sgtable() returns more complete error codes than dma_map_sg()
and allows differentiating EREMOTEIO errors in case an unsupported
P2PDMA transfer is requested. When this happens, return BLK_STS_TARGET
so the request isn't retried.
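
A condensed sketch of the resulting error handling in the driver's mapping
path (the full hunk is not reproduced here):

    rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req),
                         DMA_ATTR_NO_WARN);
    if (rc) {
        if (rc == -EREMOTEIO)
            return BLK_STS_TARGET;    /* unsupported P2PDMA transfer: don't retry */
        return BLK_STS_RESOURCE;
    }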

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Max Gurtovoy 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/nvme/host/pci.c | 69 +
 1 file changed, 29 insertions(+), 40 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fec4c7191310..07412116d4d1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -230,11 +230,10 @@ struct nvme_iod {
bool use_sgl;
int aborted;
int npages; /* In the PRP list. 0 means small pool in use */
-   int nents;  /* Used in scatterlist */
dma_addr_t first_dma;
unsigned int dma_len;   /* length of single DMA segment mapping */
dma_addr_t meta_dma;
-   struct scatterlist *sg;
+   struct sg_table sgt;
 };
 
 static inline unsigned int nvme_dbbuf_size(struct nvme_dev *dev)
@@ -524,7 +523,7 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
 static void **nvme_pci_iod_list(struct request *req)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-   return (void **)(iod->sg + blk_rq_nr_phys_segments(req));
+   return (void **)(iod->sgt.sgl + blk_rq_nr_phys_segments(req));
 }
 
 static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req)
@@ -576,17 +575,6 @@ static void nvme_free_sgls(struct nvme_dev *dev, struct 
request *req)
}
 }
 
-static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req)
-{
-   struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-
-   if (is_pci_p2pdma_page(sg_page(iod->sg)))
-   pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
-   rq_dma_dir(req));
-   else
-   dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
-}
-
 static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -597,9 +585,10 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct 
request *req)
return;
}
 
-   WARN_ON_ONCE(!iod->nents);
+   WARN_ON_ONCE(!iod->sgt.nents);
+
+   dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
 
-   nvme_unmap_sg(dev, req);
if (iod->npages == 0)
dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
  iod->first_dma);
@@ -607,7 +596,7 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct 
request *req)
nvme_free_sgls(dev, req);
else
nvme_free_prps(dev, req);
-   mempool_free(iod->sg, dev->iod_mempool);
+   mempool_free(iod->sgt.sgl, dev->iod_mempool);
 }
 
 static void nvme_print_sgl(struct scatterlist *sgl, int nents)
@@ -630,7 +619,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev 
*dev,
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
int length = blk_rq_payload_bytes(req);
-   struct scatterlist *sg = iod->sg;
+   struct scatterlist *sg = iod->sgt.sgl;
int dma_len = sg_dma_len(sg);
u64 dma_addr = sg_dma_address(sg);
int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
@@ -703,16 +692,16 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev 
*dev,
dma_len = sg_dma_len(sg);
}
 done:
-   cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
+   cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
return BLK_STS_OK;
 free_prps:
nvme_free_prps(dev, req);
return BLK_STS_RESOURCE;
 bad_sgl:
-   WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
+   WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
"Invalid SGL for payload:%d nents:%d\n",
-   blk_rq_payload_bytes(req), iod->nents);
+   blk_rq_payload_bytes(req), iod->sgt.nents);
return BLK_STS_IOERR;
 }
 
@@ -738,12 +727,13 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc 
*sge,
 }
 
 static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
-   struct request *req, struct nvme_rw_command *cmd, int entries)
+   struct request *req, struct nvme_rw_command *cmd)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
struct nvme_sgl_desc *sg_list;
-   struct scatterlist *sg = iod->sg;
+   struct scatt

[PATCH v6 05/21] dma-mapping: allow EREMOTEIO return code for P2PDMA transfers

2022-04-07 Thread Logan Gunthorpe
Add EREMOTEIO error return to dma_map_sgtable() which will be used
by .map_sg() implementations that detect P2PDMA pages that the
underlying DMA device cannot access.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
---
 kernel/dma/mapping.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index db7244291b74..9f65d1041638 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -197,7 +197,7 @@ static int __dma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg,
if (ents > 0)
debug_dma_map_sg(dev, sg, nents, ents, dir, attrs);
else if (WARN_ON_ONCE(ents != -EINVAL && ents != -ENOMEM &&
- ents != -EIO))
+ ents != -EIO && ents != -EREMOTEIO))
return -EIO;
 
return ents;
@@ -255,6 +255,8 @@ EXPORT_SYMBOL(dma_map_sg_attrs);
  * complete the mapping. Should succeed if retried later.
  *   -EIO  Legacy error code with an unknown meaning. eg. this is
  * returned if a lower level call returned DMA_MAPPING_ERROR.
+ *   -EREMOTEIOThe DMA device cannot access P2PDMA memory specified in
+ * the sg_table. This will not succeed if retried.
  */
 int dma_map_sgtable(struct device *dev, struct sg_table *sgt,
enum dma_data_direction dir, unsigned long attrs)
-- 
2.30.2


[PATCH v6 06/21] dma-direct: support PCI P2PDMA pages in dma-direct map_sg

2022-04-07 Thread Logan Gunthorpe
Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
PCI P2PDMA pages directly without a hack in the callers. This allows
for heterogeneous SGLs that contain both P2PDMA and regular pages.

A P2PDMA page may have three possible outcomes when being mapped:
  1) If the data path between the two devices doesn't go through the
 root port, then it should be mapped with a PCI bus address
  2) If the data path goes through the host bridge, it should be mapped
 normally, as though it were a CPU physical address
  3) It is not possible for the two devices to communicate and thus
 the mapping operation should fail (and it will return -EREMOTEIO).

SGL segments that contain PCI bus addresses are marked with
sg_dma_mark_bus_address() and are ignored when unmapped.

P2PDMA mappings are also failed if swiotlb needs to be used on the
mapping.

Signed-off-by: Logan Gunthorpe 
---
 kernel/dma/direct.c | 43 +--
 kernel/dma/direct.h |  8 +++-
 2 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 9743c6ccce1a..33b838a3ccb2 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -461,29 +461,60 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
arch_sync_dma_for_cpu_all();
 }
 
+/*
+ * Unmaps segments, except for ones marked as pci_p2pdma which do not
+ * require any further action as they contain a bus address.
+ */
 void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
int nents, enum dma_data_direction dir, unsigned long attrs)
 {
struct scatterlist *sg;
int i;
 
-   for_each_sg(sgl, sg, nents, i)
-   dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
-attrs);
+   for_each_sg(sgl,  sg, nents, i) {
+   if (sg_is_dma_bus_address(sg))
+   sg_dma_unmark_bus_address(sg);
+   else
+   dma_direct_unmap_page(dev, sg->dma_address,
+ sg_dma_len(sg), dir, attrs);
+   }
 }
 #endif
 
 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
enum dma_data_direction dir, unsigned long attrs)
 {
-   int i;
+   struct pci_p2pdma_map_state p2pdma_state = {};
+   enum pci_p2pdma_map_type map;
struct scatterlist *sg;
+   int i, ret;
 
for_each_sg(sgl, sg, nents, i) {
+   if (is_pci_p2pdma_page(sg_page(sg))) {
+   map = pci_p2pdma_map_segment(&p2pdma_state, dev, sg);
+   switch (map) {
+   case PCI_P2PDMA_MAP_BUS_ADDR:
+   continue;
+   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+   /*
+* Any P2P mapping that traverses the PCI
+* host bridge must be mapped with CPU physical
+* address and not PCI bus addresses. This is
+* done with dma_direct_map_page() below.
+*/
+   break;
+   default:
+   ret = -EREMOTEIO;
+   goto out_unmap;
+   }
+   }
+
sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
sg->offset, sg->length, dir, attrs);
-   if (sg->dma_address == DMA_MAPPING_ERROR)
+   if (sg->dma_address == DMA_MAPPING_ERROR) {
+   ret = -EIO;
goto out_unmap;
+   }
sg_dma_len(sg) = sg->length;
}
 
@@ -491,7 +522,7 @@ int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl, int nents,
 
 out_unmap:
dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
-   return -EIO;
+   return ret;
 }
 
 dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index 4632b0f4f72e..81b213409ce8 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -8,6 +8,7 @@
 #define _KERNEL_DMA_DIRECT_H
 
 #include 
+#include 
 
 int dma_direct_get_sgtable(struct device *dev, struct sg_table *sgt,
void *cpu_addr, dma_addr_t dma_addr, size_t size,
@@ -87,10 +88,15 @@ static inline dma_addr_t dma_direct_map_page(struct device 
*dev,
phys_addr_t phys = page_to_phys(page) + offset;
dma_addr_t dma_addr = phys_to_dma(dev, phys);
 
-   if (is_swiotlb_force_bounce(dev))
+   if (is_swiotlb_force_bounce(dev)) {
+   if (is_pci_p2pdma_page(page))
+   return DMA_MAPPING_ERROR;
return swiotlb_map(dev, phys, size, dir, attrs);
+   }
 
if (u

[PATCH v6 17/21] lib/scatterlist: add check when merging zone device pages

2022-04-07 Thread Logan Gunthorpe
Consecutive zone device pages should not be merged into the same sgl
or bvec segment with other types of pages or if they belong to different
pgmaps. Otherwise getting the pgmap of a given segment is not possible
without scanning the entire segment. The new helper returns true if both
pages are not zone device pages, or if both are zone device pages with
the same pgmap.

Factor out the check for page mergeability into a pages_are_mergeable()
helper and add a check using zone_device_pages_have_same_pgmap().
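
The guarantee this buys is that the first page of a segment describes the
whole segment. A hedged illustration (not from this patch):

#include <linux/mm.h>
#include <linux/memremap.h>
#include <linux/scatterlist.h>

/*
 * Only valid because pages with different pgmaps, or a mix of zone-device
 * and normal pages, are never merged into one segment.
 */
static struct dev_pagemap *example_segment_pgmap(struct scatterlist *sg)
{
	struct page *page = sg_page(sg);

	return is_zone_device_page(page) ? page->pgmap : NULL;
}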

Signed-off-by: Logan Gunthorpe 
---
 lib/scatterlist.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index d5e82e4a57ad..af53a0984f76 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -410,6 +410,15 @@ static struct scatterlist *get_next_sg(struct 
sg_append_table *table,
return new_sg;
 }
 
+static bool pages_are_mergeable(struct page *a, struct page *b)
+{
+   if (page_to_pfn(a) != page_to_pfn(b) + 1)
+   return false;
+   if (!zone_device_pages_have_same_pgmap(a, b))
+   return false;
+   return true;
+}
+
 /**
  * sg_alloc_append_table_from_pages - Allocate and initialize an append sg
  *table from an array of pages
@@ -447,6 +456,7 @@ int sg_alloc_append_table_from_pages(struct sg_append_table 
*sgt_append,
unsigned int chunks, cur_page, seg_len, i, prv_len = 0;
unsigned int added_nents = 0;
struct scatterlist *s = sgt_append->prv;
+   struct page *last_pg;
 
/*
 * The algorithm below requires max_segment to be aligned to PAGE_SIZE
@@ -460,21 +470,17 @@ int sg_alloc_append_table_from_pages(struct 
sg_append_table *sgt_append,
return -EOPNOTSUPP;
 
if (sgt_append->prv) {
-   unsigned long paddr =
-   (page_to_pfn(sg_page(sgt_append->prv)) * PAGE_SIZE +
-sgt_append->prv->offset + sgt_append->prv->length) /
-   PAGE_SIZE;
-
if (WARN_ON(offset))
return -EINVAL;
 
/* Merge contiguous pages into the last SG */
prv_len = sgt_append->prv->length;
-   while (n_pages && page_to_pfn(pages[0]) == paddr) {
+   last_pg = sg_page(sgt_append->prv);
+   while (n_pages && pages_are_mergeable(last_pg, pages[0])) {
if (sgt_append->prv->length + PAGE_SIZE > max_segment)
break;
sgt_append->prv->length += PAGE_SIZE;
-   paddr++;
+   last_pg = pages[0];
pages++;
n_pages--;
}
@@ -488,7 +494,7 @@ int sg_alloc_append_table_from_pages(struct sg_append_table 
*sgt_append,
for (i = 1; i < n_pages; i++) {
seg_len += PAGE_SIZE;
if (seg_len >= max_segment ||
-   page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1) {
+   !pages_are_mergeable(pages[i], pages[i - 1])) {
chunks++;
seg_len = 0;
}
@@ -504,8 +510,7 @@ int sg_alloc_append_table_from_pages(struct sg_append_table 
*sgt_append,
for (j = cur_page + 1; j < n_pages; j++) {
seg_len += PAGE_SIZE;
if (seg_len >= max_segment ||
-   page_to_pfn(pages[j]) !=
-   page_to_pfn(pages[j - 1]) + 1)
+   !pages_are_mergeable(pages[j], pages[j - 1]))
break;
}
 
-- 
2.30.2



[PATCH v6 11/21] RDMA/core: introduce ib_dma_pci_p2p_dma_supported()

2022-04-07 Thread Logan Gunthorpe
Introduce the helper function ib_dma_pci_p2p_dma_supported() to check
if a given ib_device can be used in P2PDMA transfers. This ensures
the ib_device is not using virt_dma and also that the underlying
dma_device supports P2PDMA.

Use the new helper in nvme-rdma to replace the existing check for
ib_uses_virt_dma(). Adding the dma_pci_p2pdma_supported() check allows
switching away from pci_p2pdma_[un]map_sg().
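
A hedged usage sketch for a ULP deciding whether to look for P2P memory at
all; the client device passed to pci_p2pmem_find() is illustrative:

#include <linux/pci-p2pdma.h>
#include <rdma/ib_verbs.h>

/* Hedged sketch: only search for a P2P provider when the device can use it. */
static struct pci_dev *example_pick_p2p_provider(struct ib_device *ibdev)
{
	if (!ib_dma_pci_p2p_dma_supported(ibdev))
		return NULL;		/* fall back to system memory */

	return pci_p2pmem_find(&ibdev->dev);
}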

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Max Gurtovoy 
---
 drivers/nvme/target/rdma.c |  2 +-
 include/rdma/ib_verbs.h| 11 +++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 2fab0b219b25..12258f87ccc8 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -415,7 +415,7 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device 
*ndev,
if (ib_dma_mapping_error(ndev->device, r->send_sge.addr))
goto out_free_rsp;
 
-   if (!ib_uses_virt_dma(ndev->device))
+   if (ib_dma_pci_p2p_dma_supported(ndev->device))
r->req.p2p_client = &ndev->device->dev;
r->send_sge.length = sizeof(*r->req.cqe);
r->send_sge.lkey = ndev->pd->local_dma_lkey;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 69d883f7fb41..79609ab73014 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -4003,6 +4003,17 @@ static inline bool ib_uses_virt_dma(struct ib_device 
*dev)
return IS_ENABLED(CONFIG_INFINIBAND_VIRT_DMA) && !dev->dma_device;
 }
 
+/*
+ * Check if an IB device's underlying DMA mapping supports P2PDMA transfers.
+ */
+static inline bool ib_dma_pci_p2p_dma_supported(struct ib_device *dev)
+{
+   if (ib_uses_virt_dma(dev))
+   return false;
+
+   return dma_pci_p2pdma_supported(dev->dma_device);
+}
+
 /**
  * ib_dma_mapping_error - check a DMA addr for error
  * @dev: The device for which the dma_addr was created
-- 
2.30.2



[PATCH v6 09/21] nvme-pci: check DMA ops when indicating support for PCI P2PDMA

2022-04-07 Thread Logan Gunthorpe
Introduce a supports_pci_p2pdma() operation in nvme_ctrl_ops to
replace the fixed NVME_F_PCI_P2PDMA flag such that the dma_map_ops
flags can be checked for PCI P2PDMA support.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/nvme/host/core.c |  3 ++-
 drivers/nvme/host/nvme.h |  2 +-
 drivers/nvme/host/pci.c  | 11 +--
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index efb85c6d8e2d..bbc276dda49f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3912,7 +3912,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, 
unsigned nsid,
blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
 
blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
-   if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
+   if (ctrl->ops->supports_pci_p2pdma &&
+   ctrl->ops->supports_pci_p2pdma(ctrl))
blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
 
ns->ctrl = ctrl;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 1393bbf82d71..7d97bfb2a9e2 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -489,7 +489,6 @@ struct nvme_ctrl_ops {
unsigned int flags;
 #define NVME_F_FABRICS (1 << 0)
 #define NVME_F_METADATA_SUPPORTED  (1 << 1)
-#define NVME_F_PCI_P2PDMA  (1 << 2)
int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
@@ -497,6 +496,7 @@ struct nvme_ctrl_ops {
void (*submit_async_event)(struct nvme_ctrl *ctrl);
void (*delete_ctrl)(struct nvme_ctrl *ctrl);
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
+   bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d817ca17463e..fec4c7191310 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2969,17 +2969,24 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, 
char *buf, int size)
return snprintf(buf, size, "%s\n", dev_name(&pdev->dev));
 }
 
+static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
+{
+   struct nvme_dev *dev = to_nvme_dev(ctrl);
+
+   return dma_pci_p2pdma_supported(dev->dev);
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name   = "pcie",
.module = THIS_MODULE,
-   .flags  = NVME_F_METADATA_SUPPORTED |
- NVME_F_PCI_P2PDMA,
+   .flags  = NVME_F_METADATA_SUPPORTED,
.reg_read32 = nvme_pci_reg_read32,
.reg_write32= nvme_pci_reg_write32,
.reg_read64 = nvme_pci_reg_read64,
.free_ctrl  = nvme_pci_free_ctrl,
.submit_async_event = nvme_pci_submit_async_event,
.get_address= nvme_pci_get_address,
+   .supports_pci_p2pdma= nvme_pci_supports_pci_p2pdma,
 };
 
 static int nvme_dev_map(struct nvme_dev *dev)
-- 
2.30.2



Re: [PATCH v5 08/24] dma-direct: support PCI P2PDMA pages in dma-direct map_sg

2022-02-01 Thread Logan Gunthorpe



On 2022-02-01 1:53 p.m., Jonathan Derrick wrote:
> 
> 
> On 1/27/2022 5:25 PM, Logan Gunthorpe wrote:
>> Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
>> PCI P2PDMA pages directly without a hack in the callers. This allows
>> for heterogeneous SGLs that contain both P2PDMA and regular pages.
>>
>> A P2PDMA page may have three possible outcomes when being mapped:
>>1) If the data path between the two devices doesn't go through the
>>   root port, then it should be mapped with a PCI bus address
>>2) If the data path goes through the host bridge, it should be mapped
>>   normally, as though it were a CPU physical address
>>3) It is not possible for the two devices to communicate and thus
>>   the mapping operation should fail (and it will return -EREMOTEIO).
>>
>> SGL segments that contain PCI bus addresses are marked with
>> sg_dma_mark_pci_p2pdma() and are ignored when unmapped.
>>
>> P2PDMA mappings are also failed if swiotlb needs to be used on the
>> mapping.
>>
>> Signed-off-by: Logan Gunthorpe 
>> ---
>>   kernel/dma/direct.c | 43 +--
>>   kernel/dma/direct.h |  7 ++-
>>   2 files changed, 43 insertions(+), 7 deletions(-)
>>
>> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
>> index 50f48e9e4598..975df5f3aaf9 100644
>> --- a/kernel/dma/direct.c
>> +++ b/kernel/dma/direct.c
>> @@ -461,29 +461,60 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
>>  arch_sync_dma_for_cpu_all();
>>   }
>>   
>> +/*
>> + * Unmaps segments, except for ones marked as pci_p2pdma which do not
>> + * require any further action as they contain a bus address.
>> + */
>>   void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
>>  int nents, enum dma_data_direction dir, unsigned long attrs)
>>   {
>>  struct scatterlist *sg;
>>  int i;
>>   
>> -for_each_sg(sgl, sg, nents, i)
>> -dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
>> - attrs);
>> +for_each_sg(sgl,  sg, nents, i) {
>> +if (sg_is_dma_bus_address(sg))
>> +sg_dma_unmark_bus_address(sg);
>> +else
>> +dma_direct_unmap_page(dev, sg->dma_address,
>> +  sg_dma_len(sg), dir, attrs);
>> +}
>>   }
>>   #endif
>>   
>>   int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int 
>> nents,
>>  enum dma_data_direction dir, unsigned long attrs)
>>   {
>> -int i;
>> +struct pci_p2pdma_map_state p2pdma_state = {};
>> +enum pci_p2pdma_map_type map;
>>  struct scatterlist *sg;
>> +int i, ret;
>>   
>>  for_each_sg(sgl, sg, nents, i) {
>> +if (is_pci_p2pdma_page(sg_page(sg))) {
>> +map = pci_p2pdma_map_segment(&p2pdma_state, dev, sg);
>> +switch (map) {
>> +case PCI_P2PDMA_MAP_BUS_ADDR:
>> +continue;
>> +case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
>> +/*
>> + * Any P2P mapping that traverses the PCI
>> + * host bridge must be mapped with CPU physical
>> + * address and not PCI bus addresses. This is
>> + * done with dma_direct_map_page() below.
>> + */
>> +break;
>> +default:
>> +ret = -EREMOTEIO;
>> +goto out_unmap;
>> +}
>> +}
> I'm a little confused about this code. Would there be a case where the 
> mapping needs
> to be checked for each sg in the list? And if some sg in the sgl can be mapped
> differently, would we want to continue checking the rest of the sg in the sgl 
> instead
> of breaking out of the loop completely?

Yes, the code supports heterogeneous SGLs with P2PDMA and regular
memory; it's also theoretically possible to mix P2PDMA memory for
different devices. So yes, the mapping must be checked for every SG in
the list. It can't just see one SG that points to P2PDMA memory and
assume the rest are all good.

Logan


Re: [PATCH v5 00/24] Userspace P2PDMA with O_DIRECT NVMe devices

2022-01-31 Thread Logan Gunthorpe



On 2022-01-31 11:56 a.m., Jonathan Derrick wrote:
>> This is relatively straightforward, however the one significant
>> problem is that, presently, pci_p2pdma_map_sg() requires a homogeneous
>> SGL with all P2PDMA pages or all regular pages. Enhancing GUP to
>> support enforcing this rule would require a huge hack that I don't
>> expect would be all that pallatable. So patches 3 to 16 add
>> support for P2PDMA pages to dma_map_sg[table]() to the dma-direct
>> and dma-iommu implementations. Thus systems without an IOMMU plus
>> Intel and AMD IOMMUs are supported. (Other IOMMU implementations would
>> then be unsupported, notably ARM and PowerPC but support would be added
>> when they convert to dma-iommu).
> Am I understanding that an IO may use a mix of p2pdma and system pages?
> Would that cause inconsistent latencies?

Yes, that certainly would be a possibility. People developing
applications that do such mixing would have to weigh that issue if
latency is something they care about.

But it's counterproductive and causes other difficulties for the kernel
to enforce only homogeneous IO.

Logan


Re: [PATCH v5 22/24] mm: use custom page_free for P2PDMA pages

2022-01-28 Thread Logan Gunthorpe



On 2022-01-28 7:22 a.m., Jason Gunthorpe wrote:
> On Thu, Jan 27, 2022 at 05:26:12PM -0700, Logan Gunthorpe wrote:
>> When P2PDMA pages are passed to userspace, they will need to be
>> reference counted properly and returned to their genalloc after their
>> reference count returns to 1. This is accomplished with the existing
> 
> It is reference count returns to 0 now, right?

Right, yes.

Thanks,

Logan


Re: [PATCH v5 02/24] mm: remove extra ZONE_DEVICE struct page refcount

2022-01-28 Thread Logan Gunthorpe



On 2022-01-28 7:21 a.m., Jason Gunthorpe wrote:
> On Thu, Jan 27, 2022 at 05:25:52PM -0700, Logan Gunthorpe wrote:
>> From: Ralph Campbell 
>>
>> ZONE_DEVICE struct pages have an extra reference count that complicates the
>> code for put_page() and several places in the kernel that need to check the
>> reference count to see that a page is not being used (gup, compaction,
>> migration, etc.). Clean up the code so the reference count doesn't need to
>> be treated specially for ZONE_DEVICE.
>>
>> [logang: dropped no longer used section from mm.h including
>>  page_is_devmap_managed, rebased on v5.17-rc1 (possibly poorly)]
>> Signed-off-by: Ralph Campbell 
>> Signed-off-by: Alex Sierra 
>> Signed-off-by: Logan Gunthorpe 
>> Reviewed-by: Christoph Hellwig 
>> ---
>>  arch/powerpc/kvm/book3s_hv_uvmem.c |  2 +-
>>  drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
>>  fs/dax.c   |  4 +-
>>  include/linux/dax.h|  2 +-
>>  include/linux/memremap.h   |  7 +--
>>  include/linux/mm.h | 44 
>>  lib/test_hmm.c |  2 +-
>>  mm/internal.h  |  8 +++
>>  mm/memcontrol.c|  6 +--
>>  mm/memremap.c  | 70 +++---
>>  mm/migrate.c   |  5 --
>>  mm/page_alloc.c|  3 ++
>>  mm/swap.c  | 45 ++---
>>  13 files changed, 46 insertions(+), 154 deletions(-)
> 
> This patch still can't be applied until the FSDAX issues are solved,
> right? See my remarks the last time it was posted..

Yes. As I mentioned in the cover, this is just to show that this
patchset is compatible with the direction this patch goes.

Logan


[PATCH v5 17/24] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()

2022-01-27 Thread Logan Gunthorpe
Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
which take a flags argument that is passed to get_user_pages_fast().

This is so that FOLL_PCI_P2PDMA can be passed when appropriate.
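
A hedged usage sketch, assuming the FOLL_PCI_P2PDMA patch from this series is
also applied (the wrapper function is hypothetical):

#include <linux/mm.h>
#include <linux/uio.h>

/* Hedged sketch: pin pages from an iov_iter while opting in to P2PDMA pages. */
static ssize_t example_pin_pages(struct iov_iter *iter, struct page **pages,
				 size_t maxsize, unsigned int maxpages,
				 size_t *start)
{
	return iov_iter_get_pages_flags(iter, pages, maxsize, maxpages,
					start, FOLL_PCI_P2PDMA);
}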

Signed-off-by: Logan Gunthorpe 
---
 include/linux/uio.h |  6 ++
 lib/iov_iter.c  | 25 +++--
 2 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 1198a2bfc9bf..22cf1db3a6c5 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -232,8 +232,14 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int 
direction, struct pipe_inode
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t 
count);
 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray 
*xarray,
 loff_t start, size_t count);
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i, struct page **pages,
+   size_t maxsize, unsigned maxpages, size_t *start,
+   unsigned int gup_flags);
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
size_t maxsize, unsigned maxpages, size_t *start);
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
+   struct page ***pages, size_t maxsize, size_t *start,
+   unsigned int gup_flags);
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index b0e0acdf96c1..9eeea74f85a7 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1513,9 +1513,9 @@ static struct page *first_bvec_segment(const struct 
iov_iter *i,
return page;
 }
 
-ssize_t iov_iter_get_pages(struct iov_iter *i,
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i,
   struct page **pages, size_t maxsize, unsigned maxpages,
-  size_t *start)
+  size_t *start, unsigned int gup_flags)
 {
size_t len;
int n, res;
@@ -1526,7 +1526,6 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
return 0;
 
if (likely(iter_is_iovec(i))) {
-   unsigned int gup_flags = 0;
unsigned long addr;
 
if (iov_iter_rw(i) != WRITE)
@@ -1556,6 +1555,13 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
return iter_xarray_get_pages(i, pages, maxsize, maxpages, 
start);
return -EFAULT;
 }
+EXPORT_SYMBOL_GPL(iov_iter_get_pages_flags);
+
+ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
+  size_t maxsize, unsigned maxpages, size_t *start)
+{
+   return iov_iter_get_pages_flags(i, pages, maxsize, maxpages, start, 0);
+}
 EXPORT_SYMBOL(iov_iter_get_pages);
 
 static struct page **get_pages_array(size_t n)
@@ -1638,9 +1644,9 @@ static ssize_t iter_xarray_get_pages_alloc(struct 
iov_iter *i,
return actual;
 }
 
-ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
   struct page ***pages, size_t maxsize,
-  size_t *start)
+  size_t *start, unsigned int gup_flags)
 {
struct page **p;
size_t len;
@@ -1652,7 +1658,6 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
return 0;
 
if (likely(iter_is_iovec(i))) {
-   unsigned int gup_flags = 0;
unsigned long addr;
 
if (iov_iter_rw(i) != WRITE)
@@ -1665,6 +1670,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
p = get_pages_array(n);
if (!p)
return -ENOMEM;
+
res = get_user_pages_fast(addr, n, gup_flags, p);
if (unlikely(res <= 0)) {
kvfree(p);
@@ -1692,6 +1698,13 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
return iter_xarray_get_pages_alloc(i, pages, maxsize, start);
return -EFAULT;
 }
+EXPORT_SYMBOL_GPL(iov_iter_get_pages_alloc_flags);
+
+ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
+size_t maxsize, size_t *start)
+{
+   return iov_iter_get_pages_alloc_flags(i, pages, maxsize, start, 0);
+}
 EXPORT_SYMBOL(iov_iter_get_pages_alloc);
 
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
-- 
2.30.2



[PATCH v5 13/24] RDMA/core: introduce ib_dma_pci_p2p_dma_supported()

2022-01-27 Thread Logan Gunthorpe
Introduce the helper function ib_dma_pci_p2p_dma_supported() to check
if a given ib_device can be used in P2PDMA transfers. This ensures
the ib_device is not using virt_dma and also that the underlying
dma_device supports P2PDMA.

Use the new helper in nvme-rdma to replace the existing check for
ib_uses_virt_dma(). Adding the dma_pci_p2pdma_supported() check allows
switching away from pci_p2pdma_[un]map_sg().

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Max Gurtovoy 
---
 drivers/nvme/target/rdma.c |  2 +-
 include/rdma/ib_verbs.h| 11 +++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 1deb4043e242..22519739a874 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -415,7 +415,7 @@ static int nvmet_rdma_alloc_rsp(struct nvmet_rdma_device 
*ndev,
if (ib_dma_mapping_error(ndev->device, r->send_sge.addr))
goto out_free_rsp;
 
-   if (!ib_uses_virt_dma(ndev->device))
+   if (ib_dma_pci_p2p_dma_supported(ndev->device))
r->req.p2p_client = >device->dev;
r->send_sge.length = sizeof(*r->req.cqe);
r->send_sge.lkey = ndev->pd->local_dma_lkey;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 69d883f7fb41..79609ab73014 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -4003,6 +4003,17 @@ static inline bool ib_uses_virt_dma(struct ib_device 
*dev)
return IS_ENABLED(CONFIG_INFINIBAND_VIRT_DMA) && !dev->dma_device;
 }
 
+/*
+ * Check if an IB device's underlying DMA mapping supports P2PDMA transfers.
+ */
+static inline bool ib_dma_pci_p2p_dma_supported(struct ib_device *dev)
+{
+   if (ib_uses_virt_dma(dev))
+   return false;
+
+   return dma_pci_p2pdma_supported(dev->dma_device);
+}
+
 /**
  * ib_dma_mapping_error - check a DMA addr for error
  * @dev: The device for which the dma_addr was created
-- 
2.30.2



[PATCH v5 08/24] dma-direct: support PCI P2PDMA pages in dma-direct map_sg

2022-01-27 Thread Logan Gunthorpe
Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
PCI P2PDMA pages directly without a hack in the callers. This allows
for heterogeneous SGLs that contain both P2PDMA and regular pages.

A P2PDMA page may have three possible outcomes when being mapped:
  1) If the data path between the two devices doesn't go through the
 root port, then it should be mapped with a PCI bus address
  2) If the data path goes through the host bridge, it should be mapped
 normally, as though it were a CPU physical address
  3) It is not possible for the two devices to communicate and thus
 the mapping operation should fail (and it will return -EREMOTEIO).

SGL segments that contain PCI bus addresses are marked with
sg_dma_mark_bus_address() and are ignored when unmapped.

P2PDMA mappings are also failed if swiotlb needs to be used on the
mapping.

Signed-off-by: Logan Gunthorpe 
---
 kernel/dma/direct.c | 43 +--
 kernel/dma/direct.h |  7 ++-
 2 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 50f48e9e4598..975df5f3aaf9 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -461,29 +461,60 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
arch_sync_dma_for_cpu_all();
 }
 
+/*
+ * Unmaps segments, except for ones marked as pci_p2pdma which do not
+ * require any further action as they contain a bus address.
+ */
 void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
int nents, enum dma_data_direction dir, unsigned long attrs)
 {
struct scatterlist *sg;
int i;
 
-   for_each_sg(sgl, sg, nents, i)
-   dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
-attrs);
+   for_each_sg(sgl,  sg, nents, i) {
+   if (sg_is_dma_bus_address(sg))
+   sg_dma_unmark_bus_address(sg);
+   else
+   dma_direct_unmap_page(dev, sg->dma_address,
+ sg_dma_len(sg), dir, attrs);
+   }
 }
 #endif
 
 int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
enum dma_data_direction dir, unsigned long attrs)
 {
-   int i;
+   struct pci_p2pdma_map_state p2pdma_state = {};
+   enum pci_p2pdma_map_type map;
struct scatterlist *sg;
+   int i, ret;
 
for_each_sg(sgl, sg, nents, i) {
+   if (is_pci_p2pdma_page(sg_page(sg))) {
+   map = pci_p2pdma_map_segment(&p2pdma_state, dev, sg);
+   switch (map) {
+   case PCI_P2PDMA_MAP_BUS_ADDR:
+   continue;
+   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+   /*
+* Any P2P mapping that traverses the PCI
+* host bridge must be mapped with CPU physical
+* address and not PCI bus addresses. This is
+* done with dma_direct_map_page() below.
+*/
+   break;
+   default:
+   ret = -EREMOTEIO;
+   goto out_unmap;
+   }
+   }
+
sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
sg->offset, sg->length, dir, attrs);
-   if (sg->dma_address == DMA_MAPPING_ERROR)
+   if (sg->dma_address == DMA_MAPPING_ERROR) {
+   ret = -EIO;
goto out_unmap;
+   }
sg_dma_len(sg) = sg->length;
}
 
@@ -491,7 +522,7 @@ int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl, int nents,
 
 out_unmap:
dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
-   return -EIO;
+   return ret;
 }
 
 dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index 4632b0f4f72e..a33152d79069 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -87,10 +87,15 @@ static inline dma_addr_t dma_direct_map_page(struct device 
*dev,
phys_addr_t phys = page_to_phys(page) + offset;
dma_addr_t dma_addr = phys_to_dma(dev, phys);
 
-   if (is_swiotlb_force_bounce(dev))
+   if (is_swiotlb_force_bounce(dev)) {
+   if (is_pci_p2pdma_page(page))
+   return DMA_MAPPING_ERROR;
return swiotlb_map(dev, phys, size, dir, attrs);
+   }
 
if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
+   if (is_pci_p2pdma_page(page))
+   return DMA_MAPPING_ERROR;
if (swiotlb_force != SWIOTLB_NO_FORCE)
  

[PATCH v5 10/24] iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg

2022-01-27 Thread Logan Gunthorpe
When a PCI P2PDMA page is seen, set the IOVA length of the segment
to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
apply the appropriate bus address to the segment. The IOVA is not
created if the scatterlist only consists of P2PDMA pages.

A P2PDMA page may have three possible outcomes when being mapped:
  1) If the data path between the two devices doesn't go through
 the root port, then it should be mapped with a PCI bus address
  2) If the data path goes through the host bridge, it should be mapped
 normally with an IOMMU IOVA.
  3) It is not possible for the two devices to communicate and thus
 the mapping operation should fail (and it will return -EREMOTEIO).

Similar to dma-direct, sg_dma_mark_bus_address() is used to
indicate bus address segments. On unmap, P2PDMA segments are skipped
over when determining the start and end IOVA addresses.

With this change, the flags variable in the dma_map_ops is set to
DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for P2PDMA pages.
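
Conceptually, the unmap bookkeeping looks roughly like the sketch below. This
is a simplified illustration, not the literal patch; __iommu_dma_unmap() is
dma-iommu's internal helper:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Simplified sketch: bus-address segments contribute nothing to the IOVA range. */
static void example_unmap_range(struct device *dev, struct scatterlist *sgl,
				int nents)
{
	struct scatterlist *sg;
	dma_addr_t start = 0, end = 0;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		if (sg_is_dma_bus_address(sg))
			continue;	/* no IOVA behind this segment */
		if (!end)
			start = sg_dma_address(sg);
		end = sg_dma_address(sg) + sg_dma_len(sg);
	}

	if (end)
		__iommu_dma_unmap(dev, start, end - start);
}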

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
---
 drivers/iommu/dma-iommu.c | 67 +++
 1 file changed, 60 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d85d54f2b549..434e70105180 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1043,6 +1043,16 @@ static int __finalise_sg(struct device *dev, struct 
scatterlist *sg, int nents,
sg_dma_address(s) = DMA_MAPPING_ERROR;
sg_dma_len(s) = 0;
 
+   if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
+   if (i > 0)
+   cur = sg_next(cur);
+
+   pci_p2pdma_map_bus_segment(s, cur);
+   count++;
+   cur_len = 0;
+   continue;
+   }
+
/*
 * Now fill in the real DMA data. If...
 * - there is a valid output segment to append to
@@ -1139,6 +1149,8 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
struct iova_domain *iovad = &cookie->iovad;
struct scatterlist *s, *prev = NULL;
int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
+   struct dev_pagemap *pgmap = NULL;
+   enum pci_p2pdma_map_type map_type;
dma_addr_t iova;
size_t iova_len = 0;
unsigned long mask = dma_get_seg_boundary(dev);
@@ -1174,6 +1186,35 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
s_length = iova_align(iovad, s_length + s_iova_off);
s->length = s_length;
 
+   if (is_pci_p2pdma_page(sg_page(s))) {
+   if (sg_page(s)->pgmap != pgmap) {
+   pgmap = sg_page(s)->pgmap;
+   map_type = pci_p2pdma_map_type(pgmap, dev);
+   }
+
+   switch (map_type) {
+   case PCI_P2PDMA_MAP_BUS_ADDR:
+   /*
+* A zero length will be ignored by
+* iommu_map_sg() and then can be detected
+* in __finalise_sg() to actually map the
+* bus address.
+*/
+   s->length = 0;
+   continue;
+   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+   /*
+* Mapping through host bridge should be
+* mapped with regular IOVAs, thus we
+* do nothing here and continue below.
+*/
+   break;
+   default:
+   ret = -EREMOTEIO;
+   goto out_restore_sg;
+   }
+   }
+
/*
 * Due to the alignment of our single IOVA allocation, we can
 * depend on these assumptions about the segment boundary mask:
@@ -1196,6 +1237,9 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
prev = s;
}
 
+   if (!iova_len)
+   return __finalise_sg(dev, sg, nents, 0);
+
iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
if (!iova) {
ret = -ENOMEM;
@@ -1217,7 +1261,7 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
 out_restore_sg:
__invalidate_sg(sg, nents);
 out:
-   if (ret != -ENOMEM)
+   if (ret != -ENOMEM && ret != -EREMOTEIO)
return -EINVAL;
return ret;
 }
@@ -1225,7 +1269,7 @@ static int iommu_dma_map_sg(struc

[PATCH v5 12/24] nvme-pci: convert to using dma_map_sgtable()

2022-01-27 Thread Logan Gunthorpe
The dma_map operations now support P2PDMA pages directly. So remove
the calls to pci_p2pdma_[un]map_sg_attrs() and replace them with calls
to dma_map_sgtable().

dma_map_sgtable() returns more complete error codes than dma_map_sg()
and allows differentiating EREMOTEIO errors in case an unsupported
P2PDMA transfer is requested. When this happens, return BLK_STS_TARGET
so the request isn't retried.
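
A condensed, hedged sketch of that translation, using the driver's local
nvme_dev/nvme_iod types from pci.c (not the literal hunk):

#include <linux/blk-mq.h>
#include <linux/dma-mapping.h>

/* Hedged sketch: turn dma_map_sgtable() errors into block status codes. */
static blk_status_t example_map_data(struct nvme_dev *dev, struct request *req)
{
	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
	int rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);

	if (rc == -EREMOTEIO)
		return BLK_STS_TARGET;		/* unsupported P2P path: don't retry */
	if (rc)
		return BLK_STS_RESOURCE;	/* transient: may be retried */
	return BLK_STS_OK;
}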

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Max Gurtovoy 
---
 drivers/nvme/host/pci.c | 69 +
 1 file changed, 29 insertions(+), 40 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 35e9b1e13d9f..330515886dfc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -229,11 +229,10 @@ struct nvme_iod {
bool use_sgl;
int aborted;
int npages; /* In the PRP list. 0 means small pool in use */
-   int nents;  /* Used in scatterlist */
dma_addr_t first_dma;
unsigned int dma_len;   /* length of single DMA segment mapping */
dma_addr_t meta_dma;
-   struct scatterlist *sg;
+   struct sg_table sgt;
 };
 
 static inline unsigned int nvme_dbbuf_size(struct nvme_dev *dev)
@@ -522,7 +521,7 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
 static void **nvme_pci_iod_list(struct request *req)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-   return (void **)(iod->sg + blk_rq_nr_phys_segments(req));
+   return (void **)(iod->sgt.sgl + blk_rq_nr_phys_segments(req));
 }
 
 static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req)
@@ -574,17 +573,6 @@ static void nvme_free_sgls(struct nvme_dev *dev, struct 
request *req)
}
 }
 
-static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req)
-{
-   struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-
-   if (is_pci_p2pdma_page(sg_page(iod->sg)))
-   pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
-   rq_dma_dir(req));
-   else
-   dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
-}
-
 static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -595,9 +583,10 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct 
request *req)
return;
}
 
-   WARN_ON_ONCE(!iod->nents);
+   WARN_ON_ONCE(!iod->sgt.nents);
+
+   dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
 
-   nvme_unmap_sg(dev, req);
if (iod->npages == 0)
dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
  iod->first_dma);
@@ -605,7 +594,7 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct 
request *req)
nvme_free_sgls(dev, req);
else
nvme_free_prps(dev, req);
-   mempool_free(iod->sg, dev->iod_mempool);
+   mempool_free(iod->sgt.sgl, dev->iod_mempool);
 }
 
 static void nvme_print_sgl(struct scatterlist *sgl, int nents)
@@ -628,7 +617,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev 
*dev,
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
int length = blk_rq_payload_bytes(req);
-   struct scatterlist *sg = iod->sg;
+   struct scatterlist *sg = iod->sgt.sgl;
int dma_len = sg_dma_len(sg);
u64 dma_addr = sg_dma_address(sg);
int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
@@ -701,16 +690,16 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev 
*dev,
dma_len = sg_dma_len(sg);
}
 done:
-   cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
+   cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
return BLK_STS_OK;
 free_prps:
nvme_free_prps(dev, req);
return BLK_STS_RESOURCE;
 bad_sgl:
-   WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
+   WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
"Invalid SGL for payload:%d nents:%d\n",
-   blk_rq_payload_bytes(req), iod->nents);
+   blk_rq_payload_bytes(req), iod->sgt.nents);
return BLK_STS_IOERR;
 }
 
@@ -736,12 +725,13 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc 
*sge,
 }
 
 static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
-   struct request *req, struct nvme_rw_command *cmd, int entries)
+   struct request *req, struct nvme_rw_command *cmd)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
struct dma_pool *pool;
struct nvme_sgl_desc *sg_list;
-   struct scatterlist *sg = iod->sg;
+   struct scatterlist *sg = iod->sgt.sgl;

[PATCH v5 06/24] PCI/P2PDMA: Introduce helpers for dma_map_sg implementations

2022-01-27 Thread Logan Gunthorpe
Add pci_p2pdma_map_segment() as a helper for simple dma_map_sg()
implementations. It takes a scatterlist segment that must point to a
pci_p2pdma struct page and will map it if the mapping requires a bus
address.

The return value indicates whether the mapping required a bus address
or whether the caller still needs to map the segment normally. If the
segment should not be mapped, PCI_P2PDMA_MAP_NOT_SUPPORTED is returned
and the caller should fail the mapping with -EREMOTEIO.

This helper uses a state structure to track the changes to the
pgmap across calls and avoid needing to lookup into the xarray for
every page.

Also add pci_p2pdma_map_bus_segment() which is useful for IOMMU
dma_map_sg() implementations where the sg segment containing the page
differs from the sg segment containing the DMA address.

Prototypes for these helpers are added to dma-map-ops.h as they are only
useful to dma map implementations and don't need to pollute the public
pci-p2pdma header.
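
A condensed caller-side sketch of how the state structure and the helper fit
together; dma-direct's map_sg in this series is the real user:

#include <linux/dma-map-ops.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

/* Hedged sketch: walk an SGL and let the helper deal with P2PDMA segments. */
static int example_map_p2p_segments(struct device *dev,
				    struct scatterlist *sgl, int nents)
{
	struct pci_p2pdma_map_state state = {};
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		if (!is_pci_p2pdma_page(sg_page(sg)))
			continue;	/* regular page: map through the usual path */

		switch (pci_p2pdma_map_segment(&state, dev, sg)) {
		case PCI_P2PDMA_MAP_BUS_ADDR:
			break;		/* dma_address already set and marked */
		case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
			break;		/* caller maps it like a normal page */
		default:
			return -EREMOTEIO;
		}
	}
	return 0;
}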

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
---
 drivers/pci/p2pdma.c| 59 +
 include/linux/dma-map-ops.h | 21 +
 2 files changed, 80 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 751d682e66ae..2b2cf00664e7 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -945,6 +945,65 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct 
scatterlist *sg,
 }
 EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
 
+/**
+ * pci_p2pdma_map_segment - map an sg segment determining the mapping type
+ * @state: State structure that should be declared outside of the for_each_sg()
+ * loop and initialized to zero.
+ * @dev: DMA device that's doing the mapping operation
+ * @sg: scatterlist segment to map
+ *
+ * This is a helper to be used by non-IOMMU dma_map_sg() implementations where
+ * the sg segment is the same for the page_link and the dma_address.
+ *
+ * Attempt to map a single segment in an SGL with the PCI bus address.
+ * The segment must point to a PCI P2PDMA page and thus must be
+ * wrapped in a is_pci_p2pdma_page(sg_page(sg)) check.
+ *
+ * Returns the type of mapping used and maps the page if the type is
+ * PCI_P2PDMA_MAP_BUS_ADDR.
+ */
+enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+  struct scatterlist *sg)
+{
+   if (state->pgmap != sg_page(sg)->pgmap) {
+   state->pgmap = sg_page(sg)->pgmap;
+   state->map = pci_p2pdma_map_type(state->pgmap, dev);
+   state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
+   }
+
+   if (state->map == PCI_P2PDMA_MAP_BUS_ADDR) {
+   sg->dma_address = sg_phys(sg) + state->bus_off;
+   sg_dma_len(sg) = sg->length;
+   sg_dma_mark_bus_address(sg);
+   }
+
+   return state->map;
+}
+
+/**
+ * pci_p2pdma_map_bus_segment - map an sg segment pre determined to
+ * be mapped with PCI_P2PDMA_MAP_BUS_ADDR
+ * @pg_sg: scatterlist segment with the page to map
+ * @dma_sg: scatterlist segment to assign a DMA address to
+ *
+ * This is a helper for iommu dma_map_sg() implementations when the
+ * segment for the DMA address differs from the segment containing the
+ * source page.
+ *
+ * pci_p2pdma_map_type() must have already been called on the pg_sg and
+ * returned PCI_P2PDMA_MAP_BUS_ADDR.
+ */
+void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+   struct scatterlist *dma_sg)
+{
+   struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(sg_page(pg_sg)->pgmap);
+
+   dma_sg->dma_address = sg_phys(pg_sg) + pgmap->bus_offset;
+   sg_dma_len(dma_sg) = pg_sg->length;
+   sg_dma_mark_bus_address(dma_sg);
+}
+
 /**
  * pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
  * to enable p2pdma
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index d693a0e33bac..752f91e5eb5d 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -413,15 +413,36 @@ enum pci_p2pdma_map_type {
PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
 };
 
+struct pci_p2pdma_map_state {
+   struct dev_pagemap *pgmap;
+   int map;
+   u64 bus_off;
+};
+
 #ifdef CONFIG_PCI_P2PDMA
 enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 struct device *dev);
+enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
+  struct scatterlist *sg);
+void pci_p2pdma_map_bus_segment(struct scatterlist *pg_sg,
+   struct scatterlist *dma_sg);
 #else /* CONFIG_PCI_P2PDMA */
 static inline enum pci_p2pdma_map_type
 pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev)
 {
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
 }
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct 

[PATCH v5 09/24] dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support

2022-01-27 Thread Logan Gunthorpe
Add a flags member to the dma_map_ops structure with one flag to
indicate support for PCI P2PDMA.

Also, add a helper to check if a device supports PCI P2PDMA.
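
A hedged sketch of what an opted-in implementation looks like; the function
bodies are placeholders only:

#include <linux/dma-map-ops.h>

static int example_map_sg(struct device *dev, struct scatterlist *sgl,
			  int nents, enum dma_data_direction dir,
			  unsigned long attrs)
{
	return -EINVAL;		/* placeholder: a real implementation maps here */
}

static void example_unmap_sg(struct device *dev, struct scatterlist *sgl,
			     int nents, enum dma_data_direction dir,
			     unsigned long attrs)
{
}

/* Setting the flag promises that map_sg/unmap_sg cope with P2PDMA pages. */
static const struct dma_map_ops example_dma_ops = {
	.flags		= DMA_F_PCI_P2PDMA_SUPPORTED,
	.map_sg		= example_map_sg,
	.unmap_sg	= example_unmap_sg,
};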

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
---
 include/linux/dma-map-ops.h | 10 ++
 include/linux/dma-mapping.h |  5 +
 kernel/dma/mapping.c| 18 ++
 3 files changed, 33 insertions(+)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 752f91e5eb5d..4d4161d58ce0 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -11,7 +11,17 @@
 
 struct cma;
 
+/*
+ * Values for struct dma_map_ops.flags:
+ *
+ * DMA_F_PCI_P2PDMA_SUPPORTED: Indicates the dma_map_ops implementation can
+ * handle PCI P2PDMA pages in the map_sg/unmap_sg operation.
+ */
+#define DMA_F_PCI_P2PDMA_SUPPORTED (1 << 0)
+
 struct dma_map_ops {
+   unsigned int flags;
+
void *(*alloc)(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t gfp,
unsigned long attrs);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..f7c61b2b4b5e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -140,6 +140,7 @@ int dma_mmap_attrs(struct device *dev, struct 
vm_area_struct *vma,
unsigned long attrs);
 bool dma_can_mmap(struct device *dev);
 int dma_supported(struct device *dev, u64 mask);
+bool dma_pci_p2pdma_supported(struct device *dev);
 int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
@@ -250,6 +251,10 @@ static inline int dma_supported(struct device *dev, u64 
mask)
 {
return 0;
 }
+static inline bool dma_pci_p2pdma_supported(struct device *dev)
+{
+   return false;
+}
 static inline int dma_set_mask(struct device *dev, u64 mask)
 {
return -EIO;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index c056a1468189..74858326ef94 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -724,6 +724,24 @@ int dma_supported(struct device *dev, u64 mask)
 }
 EXPORT_SYMBOL(dma_supported);
 
+bool dma_pci_p2pdma_supported(struct device *dev)
+{
+   const struct dma_map_ops *ops = get_dma_ops(dev);
+
+   /* if ops is not set, dma direct will be used which supports P2PDMA */
+   if (!ops)
+   return true;
+
+   /*
+* Note: dma_ops_bypass is not checked here because P2PDMA should
+* not be used with dma mapping ops that do not have support even
+* if the specific device is bypassing them.
+*/
+
+   return ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;
+}
+EXPORT_SYMBOL_GPL(dma_pci_p2pdma_supported);
+
 #ifdef CONFIG_ARCH_HAS_DMA_SET_MASK
 void arch_dma_set_mask(struct device *dev, u64 mask);
 #else
-- 
2.30.2



[PATCH v5 05/24] PCI/P2PDMA: Expose pci_p2pdma_map_type()

2022-01-27 Thread Logan Gunthorpe
pci_p2pdma_map_type() will be needed by the dma-iommu map_sg
implementation because it must determine the mapping type ahead of
actually creating the IOMMU mapping.

Prototypes for this helper are added to dma-map-ops.h as they are only
useful to dma map implementations and don't need to pollute the public
pci-p2pdma header.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/pci/p2pdma.c| 25 +
 include/linux/dma-map-ops.h | 45 +
 2 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index bb8900ce073a..751d682e66ae 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -10,6 +10,7 @@
 
 #define pr_fmt(fmt) "pci-p2pdma: " fmt
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -20,13 +21,6 @@
 #include 
 #include 
 
-enum pci_p2pdma_map_type {
-   PCI_P2PDMA_MAP_UNKNOWN = 0,
-   PCI_P2PDMA_MAP_NOT_SUPPORTED,
-   PCI_P2PDMA_MAP_BUS_ADDR,
-   PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
-};
-
 struct pci_p2pdma {
struct gen_pool *pool;
bool p2pmem_published;
@@ -841,8 +835,21 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 
-static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
-   struct device *dev)
+/**
+ * pci_p2pdma_map_type - return the type of mapping that should be used for
+ * a given device and pgmap
+ * @pgmap: the pagemap of a page to determine the mapping type for
+ * @dev: device that is mapping the page
+ *
+ * Returns one of:
+ * PCI_P2PDMA_MAP_NOT_SUPPORTED - The mapping should not be done
+ * PCI_P2PDMA_MAP_BUS_ADDR - The mapping should use the PCI bus address
+ * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE - The mapping should be done normally
+ * using the CPU physical address (in dma-direct) or an IOVA
+ * mapping for the IOMMU.
+ */
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+struct device *dev)
 {
enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d5b06b3a4a6..d693a0e33bac 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -379,4 +379,49 @@ static inline void debug_dma_dump_mappings(struct device 
*dev)
 
 extern const struct dma_map_ops dma_dummy_ops;
 
+enum pci_p2pdma_map_type {
+   /*
+* PCI_P2PDMA_MAP_UNKNOWN: Used internally for indicating the mapping
+* type hasn't been calculated yet. Functions that return this enum
+* never return this value.
+*/
+   PCI_P2PDMA_MAP_UNKNOWN = 0,
+
+   /*
+* PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
+* traverse the host bridge and the host bridge is not in the
+* allowlist. DMA Mapping routines should return an error when
+* this is returned.
+*/
+   PCI_P2PDMA_MAP_NOT_SUPPORTED,
+
+   /*
+* PCI_P2PDMA_BUS_ADDR: Indicates that two devices can talk to
+* each other directly through a PCI switch and the transaction will
+* not traverse the host bridge. Such a mapping should program
+* the DMA engine with PCI bus addresses.
+*/
+   PCI_P2PDMA_MAP_BUS_ADDR,
+
+   /*
+* PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
+* to each other, but the transaction traverses a host bridge on the
+* allowlist. In this case, a normal mapping either with CPU physical
+* addresses (in the case of dma-direct) or IOVA addresses (in the
+* case of IOMMUs) should be used to program the DMA engine.
+*/
+   PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
+};
+
+#ifdef CONFIG_PCI_P2PDMA
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+struct device *dev);
+#else /* CONFIG_PCI_P2PDMA */
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_type(struct dev_pagemap *pgmap, struct device *dev)
+{
+   return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+}
+#endif /* CONFIG_PCI_P2PDMA */
+
 #endif /* _LINUX_DMA_MAP_OPS_H */
-- 
2.30.2



[PATCH v5 16/24] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages

2022-01-27 Thread Logan Gunthorpe
Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
allow obtaining P2PDMA pages. If a caller does not set this flag
and tries to map P2PDMA pages it will fail.

This is implemented by failing if PCI P2PDMA pages are found when
FOLL_PCI_P2PDMA is not set. This check is only done if pte_devmap()
is set.

FOLL_PCI_P2PDMA cannot be set if FOLL_LONGTERM is set.
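
A hedged usage sketch; the caller is hypothetical and must be prepared to
DMA-map whatever P2PDMA pages it gets back:

#include <linux/mm.h>

/* Hedged sketch: a fast-GUP caller opting in to P2PDMA pages. */
static int example_pin_user_buffer(unsigned long addr, int npages,
				   struct page **pages)
{
	/* FOLL_LONGTERM must not be combined with FOLL_PCI_P2PDMA */
	return get_user_pages_fast(addr, npages,
				   FOLL_WRITE | FOLL_PCI_P2PDMA, pages);
}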

Signed-off-by: Logan Gunthorpe 
---
 include/linux/mm.h |  1 +
 mm/gup.c   | 22 +-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ab20ed73678..24f44230dcbf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2888,6 +2888,7 @@ struct page *follow_page(struct vm_area_struct *vma, 
unsigned long address,
 #define FOLL_SPLIT_PMD 0x2 /* split huge pmd before returning */
 #define FOLL_PIN   0x4 /* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY 0x8 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_PCI_P2PDMA0x10 /* allow returning PCI P2PDMA pages */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index f0af462ac1e2..66e8cbd168b6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -527,6 +527,12 @@ static struct page *follow_page_pte(struct vm_area_struct 
*vma,
page = pte_page(pte);
else
goto no_page;
+
+   if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
+is_pci_p2pdma_page(page))) {
+   page = ERR_PTR(-EREMOTEIO);
+   goto out;
+   }
} else if (unlikely(!page)) {
if (flags & FOLL_DUMP) {
/* Avoid special (like zero) pages in core dumps */
@@ -985,6 +991,9 @@ static int check_vma_flags(struct vm_area_struct *vma, 
unsigned long gup_flags)
if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
return -EOPNOTSUPP;
 
+   if ((gup_flags & FOLL_LONGTERM) && (gup_flags & FOLL_PCI_P2PDMA))
+   return -EOPNOTSUPP;
+
if (vma_is_secretmem(vma))
return -EFAULT;
 
@@ -2304,6 +2313,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
unsigned long end,
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);
 
+   if (unlikely(pte_devmap(pte) && !(flags & FOLL_PCI_P2PDMA) &&
+is_pci_p2pdma_page(page)))
+   goto pte_unmap;
+
head = try_grab_compound_head(page, 1, flags);
if (!head)
goto pte_unmap;
@@ -2381,6 +2394,12 @@ static int __gup_device_huge(unsigned long pfn, unsigned 
long addr,
undo_dev_pagemap(nr, nr_start, flags, pages);
break;
}
+
+   if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
+   undo_dev_pagemap(nr, nr_start, flags, pages);
+   break;
+   }
+
SetPageReferenced(page);
pages[*nr] = page;
if (unlikely(!try_grab_page(page, flags))) {
@@ -2849,7 +2868,8 @@ static int internal_get_user_pages_fast(unsigned long 
start,
 
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
   FOLL_FORCE | FOLL_PIN | FOLL_GET |
-  FOLL_FAST_ONLY | FOLL_NOFAULT)))
+  FOLL_FAST_ONLY | FOLL_NOFAULT |
+  FOLL_PCI_P2PDMA)))
return -EINVAL;
 
if (gup_flags & FOLL_PIN)
-- 
2.30.2



[PATCH v5 18/24] block: add check when merging zone device pages

2022-01-27 Thread Logan Gunthorpe
Consecutive zone device pages should not be merged into the same sgl
or bvec segment with other types of pages or if they belong to different
pgmaps. Otherwise getting the pgmap of a given segment is not possible
without scanning the entire segment. The new helper returns true if both
pages are not zone device pages, or if both are zone device pages with
the same pgmap.

Add a helper to determine if zone device pages are mergeable and use
this helper in page_is_mergeable().

Signed-off-by: Logan Gunthorpe 
---
 block/bio.c|  2 ++
 include/linux/mm.h | 23 +++
 2 files changed, 25 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 4312a8085396..055fcf159461 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -806,6 +806,8 @@ static inline bool page_is_mergeable(const struct bio_vec 
*bv,
return false;
if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
return false;
+   if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
+   return false;
 
*same_page = ((vec_end_addr & PAGE_MASK) == page_addr);
if (*same_page)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 24f44230dcbf..4a8e8cddd910 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1080,6 +1080,24 @@ static inline bool is_zone_device_page(const struct page 
*page)
 {
return page_zonenum(page) == ZONE_DEVICE;
 }
+
+/*
+ * Consecutive zone device pages should not be merged into the same sgl
+ * or bvec segment with other types of pages or if they belong to different
+ * pgmaps. Otherwise getting the pgmap of a given segment is not possible
+ * without scanning the entire segment. This helper returns true either if
+ * both pages are not zone device pages or both pages are zone device pages
+ * with the same pgmap.
+ */
+static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
+const struct page *b)
+{
+   if (is_zone_device_page(a) != is_zone_device_page(b))
+   return false;
+   if (!is_zone_device_page(a))
+   return true;
+   return a->pgmap == b->pgmap;
+}
 extern void memmap_init_zone_device(struct zone *, unsigned long,
unsigned long, struct dev_pagemap *);
 #else
@@ -1087,6 +1105,11 @@ static inline bool is_zone_device_page(const struct page 
*page)
 {
return false;
 }
+static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
+const struct page *b)
+{
+   return true;
+}
 #endif
 
 static inline bool is_zone_movable_page(const struct page *page)
-- 
2.30.2



[PATCH v5 15/24] PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()

2022-01-27 Thread Logan Gunthorpe
This interface is superseded by support in dma_map_sg() which now supports
heterogeneous scatterlists. There are no longer any users, so remove it.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Max Gurtovoy 
---
 drivers/pci/p2pdma.c   | 66 --
 include/linux/pci-p2pdma.h | 27 
 2 files changed, 93 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 2b2cf00664e7..e66694cc9e14 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -879,72 +879,6 @@ enum pci_p2pdma_map_type pci_p2pdma_map_type(struct 
dev_pagemap *pgmap,
return type;
 }
 
-static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap,
-   struct device *dev, struct scatterlist *sg, int nents)
-{
-   struct scatterlist *s;
-   int i;
-
-   for_each_sg(sg, s, nents, i) {
-   s->dma_address = sg_phys(s) + p2p_pgmap->bus_offset;
-   sg_dma_len(s) = s->length;
-   }
-
-   return nents;
-}
-
-/**
- * pci_p2pdma_map_sg_attrs - map a PCI peer-to-peer scatterlist for DMA
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: elements in the scatterlist
- * @dir: DMA direction
- * @attrs: DMA attributes passed to dma_map_sg() (if called)
- *
- * Scatterlists mapped with this function should be unmapped using
- * pci_p2pdma_unmap_sg_attrs().
- *
- * Returns the number of SG entries mapped or 0 on error.
- */
-int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs)
-{
-   struct pci_p2pdma_pagemap *p2p_pgmap =
-   to_p2p_pgmap(sg_page(sg)->pgmap);
-
-   switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev)) {
-   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
-   return dma_map_sg_attrs(dev, sg, nents, dir, attrs);
-   case PCI_P2PDMA_MAP_BUS_ADDR:
-   return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
-   default:
-   /* Mapping is not Supported */
-   return 0;
-   }
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_map_sg_attrs);
-
-/**
- * pci_p2pdma_unmap_sg_attrs - unmap a PCI peer-to-peer scatterlist that was
- * mapped with pci_p2pdma_map_sg()
- * @dev: device doing the DMA request
- * @sg: scatter list to map
- * @nents: number of elements returned by pci_p2pdma_map_sg()
- * @dir: DMA direction
- * @attrs: DMA attributes passed to dma_unmap_sg() (if called)
- */
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs)
-{
-   enum pci_p2pdma_map_type map_type;
-
-   map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev);
-
-   if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
-   dma_unmap_sg_attrs(dev, sg, nents, dir, attrs);
-}
-EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
-
 /**
  * pci_p2pdma_map_segment - map an sg segment determining the mapping type
  * @state: State structure that should be declared outside of the for_each_sg()
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 8318a97c9c61..2c07aa6b7665 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -30,10 +30,6 @@ struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev 
*pdev,
 unsigned int *nents, u32 length);
 void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
 void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
-int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs);
-void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
-   int nents, enum dma_data_direction dir, unsigned long attrs);
 int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
bool *use_p2pdma);
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
@@ -83,17 +79,6 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
 static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 {
 }
-static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
-   struct scatterlist *sg, int nents, enum dma_data_direction dir,
-   unsigned long attrs)
-{
-   return 0;
-}
-static inline void pci_p2pdma_unmap_sg_attrs(struct device *dev,
-   struct scatterlist *sg, int nents, enum dma_data_direction dir,
-   unsigned long attrs)
-{
-}
 static inline int pci_p2pdma_enable_store(const char *page,
struct pci_dev **p2p_dev, bool *use_p2pdma)
 {
@@ -119,16 +104,4 @@ static inline struct pci_dev *pci_p2pmem_find(struct device *client)
	return pci_p2pmem_find_many(&client, 1);
 }
 
-static inline int pci_p2pdma_map_sg(

[PATCH v5 14/24] RDMA/rw: drop pci_p2pdma_[un]map_sg()

2022-01-27 Thread Logan Gunthorpe
dma_map_sg() now supports the use of P2PDMA pages so pci_p2pdma_map_sg()
is no longer necessary and may be dropped. This means the
rdma_rw_[un]map_sg() helpers are no longer necessary. Remove them all.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
---
 drivers/infiniband/core/rw.c | 45 
 1 file changed, 9 insertions(+), 36 deletions(-)

diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 5a3bd41b331c..d4517b68d1ca 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -273,33 +273,6 @@ static int rdma_rw_init_single_wr(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
return 1;
 }
 
-static void rdma_rw_unmap_sg(struct ib_device *dev, struct scatterlist *sg,
-u32 sg_cnt, enum dma_data_direction dir)
-{
-   if (is_pci_p2pdma_page(sg_page(sg)))
-   pci_p2pdma_unmap_sg(dev->dma_device, sg, sg_cnt, dir);
-   else
-   ib_dma_unmap_sg(dev, sg, sg_cnt, dir);
-}
-
-static int rdma_rw_map_sgtable(struct ib_device *dev, struct sg_table *sgt,
-  enum dma_data_direction dir)
-{
-   int nents;
-
-   if (is_pci_p2pdma_page(sg_page(sgt->sgl))) {
-   if (WARN_ON_ONCE(ib_uses_virt_dma(dev)))
-   return 0;
-   nents = pci_p2pdma_map_sg(dev->dma_device, sgt->sgl,
- sgt->orig_nents, dir);
-   if (!nents)
-   return -EIO;
-   sgt->nents = nents;
-   return 0;
-   }
-   return ib_dma_map_sgtable_attrs(dev, sgt, dir, 0);
-}
-
 /**
  * rdma_rw_ctx_init - initialize a RDMA READ/WRITE context
  * @ctx:   context to initialize
@@ -326,7 +299,7 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u32 port_num,
};
int ret;
 
-   ret = rdma_rw_map_sgtable(dev, &sgt, dir);
+   ret = ib_dma_map_sgtable_attrs(dev, &sgt, dir, 0);
if (ret)
return ret;
sg_cnt = sgt.nents;
@@ -365,7 +338,7 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u32 port_num,
return ret;
 
 out_unmap_sg:
-   rdma_rw_unmap_sg(dev, sgt.sgl, sgt.orig_nents, dir);
+   ib_dma_unmap_sgtable_attrs(dev, &sgt, dir, 0);
return ret;
 }
 EXPORT_SYMBOL(rdma_rw_ctx_init);
@@ -413,12 +386,12 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
return -EINVAL;
}
 
-   ret = rdma_rw_map_sgtable(dev, &sgt, dir);
+   ret = ib_dma_map_sgtable_attrs(dev, &sgt, dir, 0);
if (ret)
return ret;
 
if (prot_sg_cnt) {
-   ret = rdma_rw_map_sgtable(dev, &prot_sgt, dir);
+   ret = ib_dma_map_sgtable_attrs(dev, &prot_sgt, dir, 0);
if (ret)
goto out_unmap_sg;
}
@@ -485,9 +458,9 @@ int rdma_rw_ctx_signature_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
kfree(ctx->reg);
 out_unmap_prot_sg:
if (prot_sgt.nents)
-   rdma_rw_unmap_sg(dev, prot_sgt.sgl, prot_sgt.orig_nents, dir);
+   ib_dma_unmap_sgtable_attrs(dev, &prot_sgt, dir, 0);
 out_unmap_sg:
-   rdma_rw_unmap_sg(dev, sgt.sgl, sgt.orig_nents, dir);
+   ib_dma_unmap_sgtable_attrs(dev, &sgt, dir, 0);
return ret;
 }
 EXPORT_SYMBOL(rdma_rw_ctx_signature_init);
@@ -620,7 +593,7 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
break;
}
 
-   rdma_rw_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+   ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy);
 
@@ -648,8 +621,8 @@ void rdma_rw_ctx_destroy_signature(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
kfree(ctx->reg);
 
if (prot_sg_cnt)
-   rdma_rw_unmap_sg(qp->pd->device, prot_sg, prot_sg_cnt, dir);
-   rdma_rw_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
+   ib_dma_unmap_sg(qp->pd->device, prot_sg, prot_sg_cnt, dir);
+   ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
 }
 EXPORT_SYMBOL(rdma_rw_ctx_destroy_signature);
 
-- 
2.30.2



[PATCH v5 07/24] dma-mapping: allow EREMOTEIO return code for P2PDMA transfers

2022-01-27 Thread Logan Gunthorpe
Add EREMOTEIO error return to dma_map_sgtable() which will be used
by .map_sg() implementations that detect P2PDMA pages that the
underlying DMA device cannot access.
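
For illustration, the provider-side pattern looks roughly like the sketch
below; example_map_sg() and example_device_can_reach_peer() are made-up
names for this sketch and are not functions added by this series:

static int example_map_sg(struct device *dev, struct scatterlist *sgl,
		int nents, enum dma_data_direction dir, unsigned long attrs)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		if (is_pci_p2pdma_page(sg_page(sg)) &&
		    !example_device_can_reach_peer(dev, sg_page(sg)))
			return -EREMOTEIO; /* permanent, not retryable */

		/* regular per-segment mapping elided for brevity */
	}

	return nents;
}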

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
---
 kernel/dma/mapping.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 9478eccd1c8e..c056a1468189 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -197,7 +197,7 @@ static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
if (ents > 0)
debug_dma_map_sg(dev, sg, nents, ents, dir, attrs);
else if (WARN_ON_ONCE(ents != -EINVAL && ents != -ENOMEM &&
- ents != -EIO))
+ ents != -EIO && ents != -EREMOTEIO))
return -EIO;
 
return ents;
@@ -255,6 +255,8 @@ EXPORT_SYMBOL(dma_map_sg_attrs);
  * complete the mapping. Should succeed if retried later.
  *   -EIO  Legacy error code with an unknown meaning. eg. this is
  * returned if a lower level call returned DMA_MAPPING_ERROR.
+ *   -EREMOTEIOThe DMA device cannot access P2PDMA memory specified in
+ * the sg_table. This will not succeed if retried.
  */
 int dma_map_sgtable(struct device *dev, struct sg_table *sgt,
enum dma_data_direction dir, unsigned long attrs)
-- 
2.30.2



[PATCH v5 11/24] nvme-pci: check DMA ops when indicating support for PCI P2PDMA

2022-01-27 Thread Logan Gunthorpe
Introduce a supports_pci_p2pdma() operation in nvme_ctrl_ops to
replace the fixed NVME_F_PCI_P2PDMA flag such that the dma_map_ops
flags can be checked for PCI P2PDMA support.
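
The core-side check in the hunk below treats the op as optional, so a
transport that leaves it NULL (e.g. the fabrics drivers) is simply treated
as not supporting P2PDMA. Expressed as a sketch, with
nvme_ctrl_supports_p2pdma() being an illustrative helper name rather than
something added by this patch:

static bool nvme_ctrl_supports_p2pdma(struct nvme_ctrl *ctrl)
{
	return ctrl->ops->supports_pci_p2pdma &&
	       ctrl->ops->supports_pci_p2pdma(ctrl);
}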

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/nvme/host/core.c |  3 ++-
 drivers/nvme/host/nvme.h |  2 +-
 drivers/nvme/host/pci.c  | 11 +--
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 5e0bfda04bd7..ecb01984b55d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3850,7 +3850,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid,
blk_queue_flag_set(QUEUE_FLAG_STABLE_WRITES, ns->queue);
 
blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
-   if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
+   if (ctrl->ops->supports_pci_p2pdma &&
+   ctrl->ops->supports_pci_p2pdma(ctrl))
blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
 
ns->ctrl = ctrl;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a162f6c6da6e..97efc6d0b146 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -486,7 +486,6 @@ struct nvme_ctrl_ops {
unsigned int flags;
 #define NVME_F_FABRICS (1 << 0)
 #define NVME_F_METADATA_SUPPORTED  (1 << 1)
-#define NVME_F_PCI_P2PDMA  (1 << 2)
int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
@@ -494,6 +493,7 @@ struct nvme_ctrl_ops {
void (*submit_async_event)(struct nvme_ctrl *ctrl);
void (*delete_ctrl)(struct nvme_ctrl *ctrl);
int (*get_address)(struct nvme_ctrl *ctrl, char *buf, int size);
+   bool (*supports_pci_p2pdma)(struct nvme_ctrl *ctrl);
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d8585df2c2fd..35e9b1e13d9f 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2965,17 +2965,24 @@ static int nvme_pci_get_address(struct nvme_ctrl *ctrl, char *buf, int size)
	return snprintf(buf, size, "%s\n", dev_name(&pdev->dev));
 }
 
+static bool nvme_pci_supports_pci_p2pdma(struct nvme_ctrl *ctrl)
+{
+   struct nvme_dev *dev = to_nvme_dev(ctrl);
+
+   return dma_pci_p2pdma_supported(dev->dev);
+}
+
 static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
.name   = "pcie",
.module = THIS_MODULE,
-   .flags  = NVME_F_METADATA_SUPPORTED |
- NVME_F_PCI_P2PDMA,
+   .flags  = NVME_F_METADATA_SUPPORTED,
.reg_read32 = nvme_pci_reg_read32,
.reg_write32= nvme_pci_reg_write32,
.reg_read64 = nvme_pci_reg_read64,
.free_ctrl  = nvme_pci_free_ctrl,
.submit_async_event = nvme_pci_submit_async_event,
.get_address= nvme_pci_get_address,
+   .supports_pci_p2pdma= nvme_pci_supports_pci_p2pdma,
 };
 
 static int nvme_dev_map(struct nvme_dev *dev)
-- 
2.30.2



[PATCH v5 02/24] mm: remove extra ZONE_DEVICE struct page refcount

2022-01-27 Thread Logan Gunthorpe
From: Ralph Campbell 

ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.
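
As a rough sketch of the resulting lifecycle (all names below are made up
for illustration and are not part of this patch): a driver recycles its
ZONE_DEVICE pages from a private free list and resets the count with
init_page_count() before handing a page out again, while the page_free()
callback now fires once the refcount reaches zero:

static struct page *my_alloc_device_page(struct my_device *d)
{
	struct page *page = my_take_from_free_list(d);

	if (!page)
		return NULL;
	init_page_count(page);	/* refcount was 0 while the page sat idle */
	return page;
}

/* dev_pagemap_ops.page_free: called when the refcount drops to 0 */
static void my_page_free(struct page *page)
{
	my_put_on_free_list(page->zone_device_data, page);
}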

[logang: dropped no longer used section from mm.h including
 page_is_devmap_managed, rebased on v5.17-rc1 (possibly poorly)]
Signed-off-by: Ralph Campbell 
Signed-off-by: Alex Sierra 
Signed-off-by: Logan Gunthorpe 
Reviewed-by: Christoph Hellwig 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c   |  4 +-
 include/linux/dax.h|  2 +-
 include/linux/memremap.h   |  7 +--
 include/linux/mm.h | 44 
 lib/test_hmm.c |  2 +-
 mm/internal.h  |  8 +++
 mm/memcontrol.c|  6 +--
 mm/memremap.c  | 70 +++---
 mm/migrate.c   |  5 --
 mm/page_alloc.c|  3 ++
 mm/swap.c  | 45 ++---
 13 files changed, 46 insertions(+), 154 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index e414ca44839f..ec9b6f08943b 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -712,7 +712,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
 
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
-   get_page(dpage);
+   init_page_count(dpage);
lock_page(dpage);
return dpage;
 out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 3828aafd3ac4..24cae839e5a4 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
 
-   get_page(page);
+   init_page_count(page);
lock_page(page);
return page;
 }
diff --git a/fs/dax.c b/fs/dax.c
index d9b856cf6436..565ebff24e6e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -572,14 +572,14 @@ static void *grab_mapping_entry(struct xa_state *xas,
 
 /**
  * dax_layout_busy_page_range - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
+ * @mapping: address space to scan for a page with ref count > 0
  * @start: Starting offset. Page containing 'start' is included.
  * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
  *   pages from 'start' till the end of file are included.
  *
  * DAX requires ZONE_DEVICE mapped pages. These pages are never
  * 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
+ * page->count == 0. A filesystem uses this interface to determine if
  * any page in the mapping is busy, i.e. for DMA, or other
  * get_user_pages() usages.
  *
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 479939b3be40..5e80f3092d72 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -203,7 +203,7 @@ static inline bool dax_mapping(struct address_space *mapping)
 
 static inline bool dax_page_unused(struct page *page)
 {
-   return page_ref_count(page) == 1;
+   return page_ref_count(page) == 0;
 }
 
 #define dax_wait_page(_inode, _page, _wait_cb) \
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1fafcc38acba..9965f6c6282a 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -66,9 +66,10 @@ enum memory_type {
 
 struct dev_pagemap_ops {
/*
-* Called once the page refcount reaches 1.  (ZONE_DEVICE pages never
-* reach 0 refcount unless there is a refcount bug. This allows the
-* device driver to implement its own memory management.)
+* Called once the page refcount reaches 0. The reference count
+* should be reset to one with init_page_count(page) before reusing
+* the page. This allows the device driver to implement its own
+* memory management.
 */
void (*page_free)(struct page *page);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e1a84b1e6787..1ab20ed73678 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1094,39 +1094,6 @@ static inline bool is_zone_movable_page(const struct page *page)
return page_zonenum(page) == ZONE_MOVABLE;
 }
 
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-static inline bool page_is_devmap_managed(st

[PATCH v5 00/24] Userspace P2PDMA with O_DIRECT NVMe devices

2022-01-27 Thread Logan Gunthorpe
Hi,

This patchset continues my work to add userspace P2PDMA access using
O_DIRECT NVMe devices. This posting fixes a lot of issues that were
addressed in the last posting, which is here[1].

The patchset is rebased onto v5.17-rc1 as well as a rebased version of
Ralph Campbell's patches to drop the ZONE_DEVICE page ref count offset.
My understanding is that these patches have some problems that are yet to
be resolved, but this is the direction taken going forward and they are
included here to show that they are compatible with this patchset.

The patchset enables userspace P2PDMA by allowing userspace to mmap()
allocated chunks of the CMB. The resulting VMA can be passed only
to O_DIRECT IO on NVMe backed files or block devices. A flag is added
to GUP() in Patch 16, then Patches 17 through 21 wire this flag up based
on whether the block queue indicates P2PDMA support. Patches 22
through 24 enable the CMB to be mapped into userspace by mmaping
the nvme char device.

This is relatively straightforward, however the one significant
problem is that, presently, pci_p2pdma_map_sg() requires a homogeneous
SGL with all P2PDMA pages or all regular pages. Enhancing GUP to
support enforcing this rule would require a huge hack that I don't
expect would be all that palatable. So patches 3 to 16 add
support for P2PDMA pages to dma_map_sg[table]() to the dma-direct
and dma-iommu implementations. Thus systems without an IOMMU plus
Intel and AMD IOMMUs are supported. (Other IOMMU implementations would
then be unsupported, notably ARM and PowerPC but support would be added
when they convert to dma-iommu).

dma_map_sgtable() is preferred when dealing with P2PDMA memory as it
will return -EREMOTEIO when the DMA device cannot map specific P2PDMA
pages based on the existing rules in calc_map_type_and_dist().
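
As a sketch of what a consumer does with that (illustrative code, not part
of the series):

static int example_map_for_dma(struct device *dev, struct sg_table *sgt)
{
	int ret = dma_map_sgtable(dev, sgt, DMA_BIDIRECTIONAL, 0);

	if (ret == -EREMOTEIO)
		return ret;	/* device cannot reach the P2PDMA memory */
	if (ret)
		return ret;	/* e.g. -ENOMEM, may succeed if retried */

	/* ... issue the transfer using sgt->sgl and sgt->nents ... */
	return 0;
}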

The other issue is dma_unmap_sg() needs a flag to determine whether a
given dma_addr_t was mapped regularly or as a PCI bus address. To allow
this, a third flag is added to the page_link field in struct
scatterlist. This effectively means support for P2PDMA will now depend
on CONFIG_64BIT.

Feedback welcome.

This series is based on v5.17-rc1. A git branch is available here:

  https://github.com/sbates130272/linux-p2pmem/  p2pdma_user_cmb_v5

Thanks,

Logan

[1] https://lore.kernel.org/all/2027215410.3695-1-log...@deltatee.com/T/#u

--

Changes since v4:
  - Rebase onto v5.17-rc1.
  - Included Ralph Campbell's patches which remove the ZONE_DEVICE page
reference count offset. This is just to demonstrate that this
series is compatible with that direction.
  - Added a comment in pci_p2pdma_map_sg_attrs(), per Chaitanya and
included his Reviewed-by tags.
  - Patch 1 in the last series which cleaned up scatterlist.h
has been upstreamed.
  - Dropped NEED_SG_DMA_BUS_ADDR_FLAG since 'depends on' doesn't
work with selected symbols, per Christoph.
  - Switched iov_iter_get_pages_[alloc_]flags to be exported with
EXPORT_SYMBOL_GPL, per Christoph.
  - Renamed zone_device_pages_are_mergeable() to
zone_device_pages_have_same_pgmap(), per Christoph.
  - Renamed .mmap_file_open operation in nvme_ctrl_ops to
cdev_file_open(), per Christoph.

Changes since v3:
  - Add some comment and commit message cleanups I had missed for v3,
also moved the prototypes for some of the p2pdma helpers to
dma-map-ops.h (which I missed in v3 and was suggested in v2).
  - Add separate cleanup patch for scatterlist.h and change the macros
to functions. (Suggested by Chaitanya and Jason, respectively)
  - Rename sg_dma_mark_pci_p2pdma() and sg_is_dma_pci_p2pdma() to
sg_dma_mark_bus_address() and sg_is_dma_bus_address() which
is a more generic name (As requested by Jason)
  - Fixes to some comments and commit messages as suggested by Bjorn
and Jason.
  - Ensure swiotlb is not used with P2PDMA pages. (Per Jason)
  - The sgtable conversion in RDMA was split out and sent upstream
separately, the new patch is only the removal. (Per Jason)
  - Moved the FOLL_PCI_P2PDMA check outside of get_dev_pagemap() as
Jason suggested this will be removed in the near term.
  - Add two patches to ensure that zone device pages with different
pgmaps are never merged in the block layer or
sg_alloc_append_table_from_pages() (Per Jason)
  - Ensure synchronize_rcu() or call_rcu() is used before returning
pages to the genalloc. (Jason pointed out that pages are not
guaranteed to be unused in all architectures until at least
after an RCU grace period, and that synchronize_rcu() was likely
too slow to use in the vma close operation.)
  - Collected Acks and Reviews by Bjorn, Jason and Max.

Logan Gunthorpe (22):
  lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
  PCI/P2PDMA: Attempt to set map_type if it has not been set
  PCI/P2PDMA: Expose pci_p2pdma_map_type()
  PCI/P2PDMA: Introduce helpers for dma_map_sg implementations
  dma-mapping: allow EREMOTEIO return code for P2PDMA transfers
  dma

[PATCH v5 03/24] lib/scatterlist: add flag for indicating P2PDMA segments in an SGL

2022-01-27 Thread Logan Gunthorpe
Make use of the third free LSB in scatterlist's page_link on 64bit systems.

The extra bit will be used by dma_[un]map_sg_p2pdma() to determine when a
given SGL segment's dma_address points to a PCI bus address.
dma_unmap_sg_p2pdma() will need to perform different cleanup when a
segment is marked as a bus address.

Create a CONFIG_NEED_SG_DMA_BUS_ADDR_FLAG bool which depends on
CONFIG_64BIT (so there is space in the page link for the new flag).
CONFIG_PCI_P2PDMA will then depend on this so this means PCI P2PDMA will
require CONFIG_64BIT. This should be acceptable as the majority of P2PDMA
use cases are restricted to newer root complexes and roughly require the
extra address space for memory BARs used in the transactions.
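
A simplified sketch of the unmap-side usage (the real dma-direct and
dma-iommu changes come later in this series): segments marked as bus
addresses were never mapped through the mapping path, so they are skipped
and only the mark is cleared:

static void example_unmap_sg(struct device *dev, struct scatterlist *sgl,
		int nents, enum dma_data_direction dir)
{
	struct scatterlist *sg;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		if (sg_is_dma_bus_address(sg))
			sg_dma_unmark_bus_address(sg);
		else
			dma_unmap_page_attrs(dev, sg->dma_address,
					     sg_dma_len(sg), dir, 0);
	}
}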

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/pci/Kconfig |  5 +
 include/linux/scatterlist.h | 44 -
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index d98fafdd0f99..3e837d9e1600 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -164,6 +164,11 @@ config PCI_PASID
 config PCI_P2PDMA
bool "PCI peer-to-peer transfer support"
depends on ZONE_DEVICE
+   #
+   # The need for the scatterlist DMA bus address flag means PCI P2PDMA
+   # requires 64bit
+   #
+   depends on 64BIT
select GENERIC_ALLOCATOR
help
  Enables drivers to do PCI peer-to-peer transactions to and from
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 7ff9d6386c12..6561ca8aead8 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -64,12 +64,24 @@ struct sg_append_table {
 #define SG_CHAIN   0x01UL
 #define SG_END 0x02UL
 
+/*
+ * bit 2 is the third free bit in the page_link on 64bit systems which
+ * is used by dma_unmap_sg() to determine if the dma_address is a
+ * bus address when doing P2PDMA.
+ */
+#ifdef CONFIG_PCI_P2PDMA
+#define SG_DMA_BUS_ADDRESS 0x04UL
+static_assert(__alignof__(struct page) >= 8);
+#else
+#define SG_DMA_BUS_ADDRESS 0x00UL
+#endif
+
 /*
  * We overload the LSB of the page pointer to indicate whether it's
  * a valid sg entry, or whether it points to the start of a new scatterlist.
  * Those low bits are there for everyone! (thanks mason :-)
  */
-#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END)
+#define SG_PAGE_LINK_MASK (SG_CHAIN | SG_END | SG_DMA_BUS_ADDRESS)
 
 static inline unsigned int __sg_flags(struct scatterlist *sg)
 {
@@ -91,6 +103,11 @@ static inline bool sg_is_last(struct scatterlist *sg)
return __sg_flags(sg) & SG_END;
 }
 
+static inline bool sg_is_dma_bus_address(struct scatterlist *sg)
+{
+   return __sg_flags(sg) & SG_DMA_BUS_ADDRESS;
+}
+
 /**
  * sg_assign_page - Assign a given page to an SG entry
  * @sg:SG entry
@@ -245,6 +262,31 @@ static inline void sg_unmark_end(struct scatterlist *sg)
sg->page_link &= ~SG_END;
 }
 
+/**
+ * sg_dma_mark_bus_address - Mark the scatterlist entry as a bus address
+ * @sg: SG entry
+ *
+ * Description:
+ *   Marks the passed in sg entry to indicate that the dma_address is
+ *   a bus address and doesn't need to be unmapped.
+ **/
+static inline void sg_dma_mark_bus_address(struct scatterlist *sg)
+{
+   sg->page_link |= SG_DMA_BUS_ADDRESS;
+}
+
+/**
+ * sg_dma_unmark_bus_address - Unmark the scatterlist entry as a bus address
+ * @sg: SG entry
+ *
+ * Description:
+ *   Clears the bus address mark.
+ **/
+static inline void sg_dma_unmark_bus_address(struct scatterlist *sg)
+{
+   sg->page_link &= ~SG_DMA_BUS_ADDRESS;
+}
+
 /**
  * sg_phys - Return physical address of an sg entry
  * @sg: SG entry
-- 
2.30.2


[PATCH v5 23/24] PCI/P2PDMA: Introduce pci_mmap_p2pmem()

2022-01-27 Thread Logan Gunthorpe
Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
a hunk of p2pmem into userspace.

Pages are allocated from the genalloc in bulk and their reference count
incremented. They are returned to the genalloc when the page is put.

The VMA does not take a reference to the pages when they are inserted
with vmf_insert_mixed() (which is necessary for zone device pages) so
the backing P2P memory is stored in a structure in vm_private_data.

A pseudo mount is used to allocate an inode for each PCI device. The
inode's address_space is used in the file doing the mmap so that all
VMAs are collected and can be unmapped if the PCI device is unbound.
After unmapping, the VMAs are iterated through and their pages are
put so the device can continue to be unbound. An active flag is used
to signal to VMAs not to allocate any further P2P memory once the
removal process starts. The flag is synchronized with concurrent
access with an RCU lock.

The VMAs and inode will survive after the unbind of the device, but no
pages will be present in the VMA and a subsequent access will result
in a SIGBUS error.
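
From userspace the flow looks roughly like the sketch below; the device
node, file path and length are assumptions for illustration only:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int nvme = open("/dev/nvme0", O_RDWR);
	int fd = open("/mnt/nvme/data", O_RDWR | O_DIRECT);
	size_t len = 2 * 1024 * 1024;
	void *buf;

	if (nvme < 0 || fd < 0)
		return 1;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, nvme, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* O_DIRECT I/O on the NVMe-backed file DMAs straight to/from the CMB */
	if (pread(fd, buf, len, 0) < 0)
		perror("pread");

	munmap(buf, len);
	close(fd);
	close(nvme);
	return 0;
}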

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
---
 drivers/pci/p2pdma.c   | 301 -
 include/linux/pci-p2pdma.h |  11 ++
 include/uapi/linux/magic.h |   1 +
 3 files changed, 311 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 3a24bf5099cf..d54068d6ce6a 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -17,14 +17,19 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
+#include 
 
 struct pci_p2pdma {
struct gen_pool *pool;
bool p2pmem_published;
struct xarray map_types;
+   struct inode *inode;
+   bool active;
 };
 
 struct pci_p2pdma_pagemap {
@@ -33,6 +38,15 @@ struct pci_p2pdma_pagemap {
u64 bus_offset;
 };
 
+struct pci_p2pdma_map {
+   struct kref ref;
+   struct rcu_head rcu;
+   struct pci_dev *pdev;
+   struct inode *inode;
+   void *kaddr;
+   size_t len;
+};
+
 static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap)
 {
return container_of(pgmap, struct pci_p2pdma_pagemap, pgmap);
@@ -101,6 +115,26 @@ static const struct attribute_group p2pmem_group = {
.name = "p2pmem",
 };
 
+/*
+ * P2PDMA internal mount
+ * Fake an internal VFS mount-point in order to allocate struct address_space
+ * mappings to remove VMAs on unbind events.
+ */
+static int pci_p2pdma_fs_cnt;
+static struct vfsmount *pci_p2pdma_fs_mnt;
+
+static int pci_p2pdma_fs_init_fs_context(struct fs_context *fc)
+{
+   return init_pseudo(fc, P2PDMA_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type pci_p2pdma_fs_type = {
+   .name = "p2dma",
+   .owner = THIS_MODULE,
+   .init_fs_context = pci_p2pdma_fs_init_fs_context,
+   .kill_sb = kill_anon_super,
+};
+
 static void p2pdma_page_free(struct page *page)
 {
struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
@@ -129,6 +163,9 @@ static void pci_p2pdma_release(void *data)
gen_pool_destroy(p2pdma->pool);
sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
xa_destroy(&p2pdma->map_types);
 
+   iput(p2pdma->inode);
+   simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
 }
 
 static int pci_p2pdma_setup(struct pci_dev *pdev)
@@ -146,17 +183,32 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
if (!p2p->pool)
goto out;
 
-   error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+   error = simple_pin_fs(&pci_p2pdma_fs_type, &pci_p2pdma_fs_mnt,
+ &pci_p2pdma_fs_cnt);
if (error)
goto out_pool_destroy;
 
+   p2p->inode = alloc_anon_inode(pci_p2pdma_fs_mnt->mnt_sb);
+   if (IS_ERR(p2p->inode)) {
+   error = -ENOMEM;
+   goto out_unpin_fs;
+   }
+
+   error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+   if (error)
+   goto out_put_inode;
+
error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
if (error)
-   goto out_pool_destroy;
+   goto out_put_inode;
 
rcu_assign_pointer(pdev->p2pdma, p2p);
return 0;
 
+out_put_inode:
+   iput(p2p->inode);
+out_unpin_fs:
+   simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
 out_pool_destroy:
gen_pool_destroy(p2p->pool);
 out:
@@ -164,6 +216,54 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
return error;
 }
 
+static void pci_p2pdma_map_free_pages(struct pci_p2pdma_map *pmap)
+{
+   int i;
+
+   if (!pmap->kaddr)
+   return;
+
+   for (i = 0; i < pmap->len; i += PAGE_SIZE)
+   put_page(virt_to_page(pmap->kaddr + i));
+
+   pmap->kaddr = NULL;
+}
+
+static void pci_p2pdma_free_mappings(struct address_space *map
