Re: [PATCH 3/3] RDMA/hfi1: Use RMW accessors for changing LNKCTL2

2024-05-05 Thread Leon Romanovsky
On Fri, May 03, 2024 at 10:04:16AM -0300, Jason Gunthorpe wrote:
> On Fri, May 03, 2024 at 01:18:35PM +0300, Ilpo Järvinen wrote:
> > On Thu, 15 Feb 2024, Ilpo Järvinen wrote:
> > 
> > > Convert open-coded RMW accesses for LNKCTL2 to use
> > > pcie_capability_clear_and_set_word(), which makes it easier to
> > > understand what the code tries to do.
> > > 
> > > LNKCTL2 is not really owned by any driver because it is a collection of
> > > control bits that PCI core might need to touch. RMW accessors already
> > > have support for proper locking for a selected set of registers
> > > (LNKCTL2 is not yet among them but likely will be in the future) to
> > > avoid losing concurrent updates.
> > > 
> > > Suggested-by: Lukas Wunner 
> > > Signed-off-by: Ilpo Järvinen 
> > > Reviewed-by: Dean Luick 
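For illustration, the conversion pattern described above looks roughly like
this (a sketch with assumed variable names and an arbitrary target-speed
field, not the actual hfi1 hunk):

        u16 lnkctl2;

        /* Before: open-coded RMW that can lose a concurrent update */
        pcie_capability_read_word(pdev, PCI_EXP_LNKCTL2, &lnkctl2);
        lnkctl2 &= ~PCI_EXP_LNKCTL2_TLS;
        lnkctl2 |= PCI_EXP_LNKCTL2_TLS_8_0GT;
        pcie_capability_write_word(pdev, PCI_EXP_LNKCTL2, lnkctl2);

        /* After: one helper that states the intent and can take the
         * core's RMW lock once LNKCTL2 is added to the protected set */
        pcie_capability_clear_and_set_word(pdev, PCI_EXP_LNKCTL2,
                                           PCI_EXP_LNKCTL2_TLS,
                                           PCI_EXP_LNKCTL2_TLS_8_0GT);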
> > 
> > I found out from Linux RDMA and InfiniBand patchwork that this patch had 
> > been silently closed as "Not Applicable". Is there some reason for
> > that?
> 
> It is part of a series that crosses subsystems; series like that
> usually go through some other trees.

Exactly, this is why I marked it as "Not Applicable".

> 
> If you want single patches applied, then please send single
> patches. It is hard to understand intent from a mixed series.
> 
> Jason


Re: [PATCH v1 2/2] vfio/pci: Allow MMIO regions to be exported through dma-buf

2024-05-02 Thread Leon Romanovsky
On Thu, May 02, 2024 at 07:50:36AM +, Kasireddy, Vivek wrote:
> Hi Jason,

<...>

> > I'd rather we stick with the original design. Leon is working on DMA
> > API changes that should address half the issue.
> Ok, I'll keep an eye out for Leon's work.

The code for v1 is here:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=dma-split-v1
It is constantly rebased until we are ready to submit it.

v0 is here:
https://lore.kernel.org/linux-rdma/cover.1709635535.git.l...@kernel.org/

Thanks

> 
> Thanks,
> Vivek
> 
> > 
> > Jason
> 


Re: [PATCH v0 01/14] IB/hfi1, IB/qib: Make I2C terminology more inclusive

2024-04-03 Thread Leon Romanovsky
On Fri, Mar 29, 2024 at 05:00:25PM +, Easwar Hariharan wrote:
> I2C v7, SMBus 3.2, and I3C specifications have replaced "master/slave"
> with more appropriate terms. Inspired by and following on from Wolfram's
> series to fix drivers/i2c[1], fix the terminology where I had a role to
> play, now that the approved verbiage exists in the specification.
> 
> Compile tested, no functionality changes intended
> 
> [1]: 
> https://lore.kernel.org/all/20240322132619.6389-1-wsa+rene...@sang-engineering.com/
> 
> Signed-off-by: Easwar Hariharan 
> ---
>  drivers/infiniband/hw/hfi1/chip.c   |  6 ++--
>  drivers/infiniband/hw/hfi1/chip.h   |  2 +-
>  drivers/infiniband/hw/hfi1/chip_registers.h |  2 +-
>  drivers/infiniband/hw/hfi1/file_ops.c   |  2 +-
>  drivers/infiniband/hw/hfi1/firmware.c   | 22 ++---
>  drivers/infiniband/hw/hfi1/pcie.c   |  2 +-
>  drivers/infiniband/hw/hfi1/qsfp.c   | 36 ++---
>  drivers/infiniband/hw/hfi1/user_exp_rcv.c   |  2 +-
>  drivers/infiniband/hw/qib/qib_twsi.c|  6 ++--
>  9 files changed, 40 insertions(+), 40 deletions(-)

hfi1 and qib work perfectly fine with the current terminology. There is
no need to change old code just for the sake of change.

Let's drop this patch.

Thanks


Re: [RFC PATCH v2 02/11] netdev: implement netlink api to bind dma-buf to netdevice

2023-08-13 Thread Leon Romanovsky
On Wed, Aug 09, 2023 at 06:57:38PM -0700, Mina Almasry wrote:
> Add a netdev_dmabuf_binding struct which represents the
> dma-buf-to-netdevice binding. The netlink API will bind the dma-buf to
> an rx queue on the netdevice. At binding time, dma_buf_attach
> & dma_buf_map_attachment will occur. The entries in the sg_table from
> the mapping will be inserted into a genpool to make them ready
> for allocation.
> 
> The chunks in the genpool are owned by a dmabuf_chunk_owner struct which
> holds the dma-buf offset of the base of the chunk and the dma_addr of
> the chunk. Both are needed to use allocations that come from this chunk.
> 
> We create a new type that represents an allocation from the genpool:
> page_pool_iov. We set up the page_pool_iov allocation size in the
> genpool to PAGE_SIZE for simplicity: to match the PAGE_SIZE normally
> allocated by the page pool and given to the drivers.
> 
> The user can unbind the dmabuf from the netdevice by closing the netlink
> socket that established the binding. We do this so that the binding is
> automatically unbound even if the userspace process crashes.
> 
> The binding and unbinding leave an indicator in struct netdev_rx_queue
> that the given queue is bound, but the binding doesn't take effect until
> the driver actually reconfigures its queues, and re-initializes its page
> pool. This issue/weirdness is highlighted in the memory provider
> proposal[1], and I'm hoping that some generic solution for all
> memory providers will be discussed; this patch doesn't address that
> weirdness again.
> 
> The netdev_dmabuf_binding struct is refcounted, and releases its
> resources only when all the refs are released.
> 
> [1] https://lore.kernel.org/netdev/20230707183935.997267-1-k...@kernel.org/
> 
> Signed-off-by: Willem de Bruijn 
> Signed-off-by: Kaiyuan Zhang 
> 
> Signed-off-by: Mina Almasry 
> ---
>  include/linux/netdevice.h |  57 
>  include/net/page_pool.h   |  27 ++
>  net/core/dev.c| 178 ++
>  net/core/netdev-genl.c| 101 -
>  4 files changed, 361 insertions(+), 2 deletions(-)

<...>

> +void __netdev_devmem_binding_free(struct netdev_dmabuf_binding *binding);
> +
> +static inline void
> +netdev_devmem_binding_get(struct netdev_dmabuf_binding *binding)
> +{
> + refcount_inc(&binding->ref);
> +}
> +
> +static inline void
> +netdev_devmem_binding_put(struct netdev_dmabuf_binding *binding)
> +{
> + if (!refcount_dec_and_test(&binding->ref))
> + return;
> +
> + __netdev_devmem_binding_free(binding);
> +}

Not a big deal, but it looks like a reimplemented version of
kref_get()/kref_put() to me.
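For comparison, a minimal sketch of the same lifecycle expressed with kref,
assuming the binding embedded a struct kref named "ref" instead of a
refcount_t:

        #include <linux/kref.h>

        static void netdev_devmem_binding_release(struct kref *kref)
        {
                struct netdev_dmabuf_binding *binding =
                        container_of(kref, struct netdev_dmabuf_binding, ref);

                __netdev_devmem_binding_free(binding);
        }

        static inline void
        netdev_devmem_binding_get(struct netdev_dmabuf_binding *binding)
        {
                kref_get(&binding->ref);
        }

        static inline void
        netdev_devmem_binding_put(struct netdev_dmabuf_binding *binding)
        {
                /* invokes the release function once the last ref drops */
                kref_put(&binding->ref, netdev_devmem_binding_release);
        }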

Thanks


Re: [PATCH] net: ksz884x: Remove the unused function port_cfg_force_flow_ctrl()

2022-12-12 Thread Leon Romanovsky
On Mon, Dec 12, 2022 at 11:53:09AM +0800, Jiapeng Chong wrote:
> The function port_cfg_force_flow_ctrl() is defined in the ksz884x.c file,
> but not called elsewhere, so remove this unused function.
> 
> drivers/net/ethernet/micrel/ksz884x.c:2212:20: warning: unused function 
> 'port_cfg_force_flow_ctrl'.
> 
> Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3418
> Reported-by: Abaci Robot 
> Signed-off-by: Jiapeng Chong 
> ---
>  drivers/net/ethernet/micrel/ksz884x.c | 6 --
>  1 file changed, 6 deletions(-)
> 
> diff --git a/drivers/net/ethernet/micrel/ksz884x.c 
> b/drivers/net/ethernet/micrel/ksz884x.c
> index e6acd1e7b263..46f1fbf58b5a 100644
> --- a/drivers/net/ethernet/micrel/ksz884x.c
> +++ b/drivers/net/ethernet/micrel/ksz884x.c
> @@ -2209,12 +2209,6 @@ static inline void port_cfg_back_pressure(struct 
> ksz_hw *hw, int p, int set)
>   KS8842_PORT_CTRL_2_OFFSET, PORT_BACK_PRESSURE, set);
>  }
>  
> -static inline void port_cfg_force_flow_ctrl(struct ksz_hw *hw, int p, int 
> set)
> -{
> - port_cfg(hw, p,
> - KS8842_PORT_CTRL_2_OFFSET, PORT_FORCE_FLOW_CTRL, set);
> -}
> -
>  static inline int port_chk_back_pressure(struct ksz_hw *hw, int p)

This function is not called either. Many functions in that file can be
removed. Please remove them all in one patch.

Thanks

>  {
>   return port_chk(hw, p,
> -- 
> 2.20.1.7.g153144c
> 


Re: [PATCH RFC 10/19] RDMA/umem: remove FOLL_FORCE usage

2022-11-14 Thread Leon Romanovsky
On Mon, Nov 07, 2022 at 05:17:31PM +0100, David Hildenbrand wrote:
> GUP now supports reliable R/O long-term pinning in COW mappings, such
> that we break COW early. MAP_SHARED VMAs only use the shared zeropage so
> far in one corner case (DAXFS file with holes), which can be ignored
> because GUP does not support long-term pinning in fsdax (see
> check_vma_flags()).
> 
> Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required
> for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop
> using FOLL_FORCE, which is really only for debugger access.
> 
> Cc: Jason Gunthorpe 
> Cc: Leon Romanovsky 
> Signed-off-by: David Hildenbrand 
> ---
>  drivers/infiniband/core/umem.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
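For context, the umem change amounts to dropping FOLL_FORCE from the gup
flags used for long-term pins; roughly this shape (a sketch of the idea
with assumed variable names, not the exact hunk):

        /* Before: R/O long-term pins needed FOLL_FORCE | FOLL_WRITE;
         * after this series, FOLL_LONGTERM alone is sufficient. */
        unsigned int gup_flags = umem->writable ? FOLL_WRITE : 0;

        pinned = pin_user_pages_fast(cur_base, npages,
                                     gup_flags | FOLL_LONGTERM, page_list);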

Thanks,
Tested-by: Leon Romanovsky  # Over mlx4 and mlx5.


Re: [PATCH][next] treewide: Replace zero-length arrays with flexible-array members

2022-02-15 Thread Leon Romanovsky
On Tue, Feb 15, 2022 at 01:21:10PM -0600, Gustavo A. R. Silva wrote:
> On Tue, Feb 15, 2022 at 10:17:40AM -0800, Kees Cook wrote:
> > On Tue, Feb 15, 2022 at 11:47:43AM -0600, Gustavo A. R. Silva wrote:
> > > There is a regular need in the kernel to provide a way to declare
> > > having a dynamically sized set of trailing elements in a structure.
> > > Kernel code should always use “flexible array members”[1] for these
> > > cases. The older style of one-element or zero-length arrays should
> > > no longer be used[2].
> > > 
> > > This code was transformed with the help of Coccinelle:
> > > (next-20220214$ spatch --jobs $(getconf _NPROCESSORS_ONLN) --sp-file 
> > > script.cocci --include-headers --dir . > output.patch)
> > > 
> > > @@
> > > identifier S, member, array;
> > > type T1, T2;
> > > @@
> > > 
> > > struct S {
> > >   ...
> > >   T1 member;
> > >   T2 array[
> > > - 0
> > >   ];
> > > };
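Concretely, the semantic patch above turns a trailing zero-length array into
a flexible array member, e.g. (illustrative struct, not taken from the
patch):

        /* Before: zero-length trailing array */
        struct example {
                size_t count;
                int data[0];
        };

        /* After: flexible array member */
        struct example {
                size_t count;
                int data[];
        };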
> > 
> > These all look trivially correct to me. Only two didn't have the end of
> > the struct visible in the patch, and checking those showed them to be
> > trailing members as well, so:
> > 
> > Reviewed-by: Kees Cook 
> 
> I'll add this to my -next tree.

I would like to ask you to send the mlx5 patch separately to netdev. We are
working to delete that file completely and prefer to avoid unnecessary merge
conflicts.

Thanks

> 
> Thanks!
> --
> Gustavo


Re: [PATCH rdma-next v4 0/3] SG fix together with update to RDMA umem

2021-08-30 Thread Leon Romanovsky
On Tue, Aug 24, 2021 at 05:25:28PM +0300, Maor Gottlieb wrote:
> From: Maor Gottlieb 
> 
> Changelog:
> v4:
>  * Unify sg_free_table_entries with __sg_free_table
> v3: https://lore.kernel.org/lkml/cover.1627551226.git.leo...@nvidia.com/
>  * Rewrote to new API suggestion
>  * Split for more patches
> v2: https://lore.kernel.org/lkml/cover.1626605893.git.leo...@nvidia.com
> >  * Changed implementation of first patch, based on our discussion with
> >Christoph.
>https://lore.kernel.org/lkml/ynwavtt0qmqdx...@infradead.org/
> v1: https://lore.kernel.org/lkml/cover.1624955710.git.leo...@nvidia.com/
>  * Fixed sg_page with a _dma_ API in the umem.c
> v0: https://lore.kernel.org/lkml/cover.1624361199.git.leo...@nvidia.com
> 
> Maor Gottlieb (3):
>   lib/scatterlist: Provide a dedicated function to support table append
>   lib/scatterlist: Fix wrong update of orig_nents
>   RDMA: Use the sg_table directly and remove the opencoded version from
> umem
> 
>  drivers/gpu/drm/drm_prime.c |  13 +-
>  drivers/gpu/drm/i915/gem/i915_gem_userptr.c |  11 +-
>  drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c  |  14 +-
>  drivers/infiniband/core/umem.c  |  56 ---
>  drivers/infiniband/core/umem_dmabuf.c   |   5 +-
>  drivers/infiniband/hw/hns/hns_roce_db.c |   4 +-
>  drivers/infiniband/hw/irdma/verbs.c |   2 +-
>  drivers/infiniband/hw/mlx4/doorbell.c   |   3 +-
>  drivers/infiniband/hw/mlx4/mr.c |   4 +-
>  drivers/infiniband/hw/mlx5/doorbell.c   |   3 +-
>  drivers/infiniband/hw/mlx5/mr.c |   3 +-
>  drivers/infiniband/hw/qedr/verbs.c  |   2 +-
>  drivers/infiniband/sw/rdmavt/mr.c   |   2 +-
>  drivers/infiniband/sw/rxe/rxe_mr.c  |   2 +-
>  include/linux/scatterlist.h |  56 +--
>  include/rdma/ib_umem.h  |  11 +-
>  include/rdma/ib_verbs.h |  28 
>  lib/scatterlist.c   | 155 
>  lib/sg_pool.c   |   3 +-
>  tools/testing/scatterlist/main.c|  38 +++--
>  20 files changed, 258 insertions(+), 157 deletions(-)

Jason,

Did you add these patches to the -next? I can't find them.

Thanks

> 
> -- 
> 2.25.4
> 


Re: [PATCH rdma-next v4 0/3] SG fix together with update to RDMA umem

2021-08-30 Thread Leon Romanovsky
On Mon, Aug 30, 2021 at 10:31:28AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 30, 2021 at 11:21:00AM +0300, Leon Romanovsky wrote:
> > On Tue, Aug 24, 2021 at 05:25:28PM +0300, Maor Gottlieb wrote:
> > > From: Maor Gottlieb 
> > > 
> > > Changelog:
> > > v4:
> > >  * Unify sg_free_table_entries with __sg_free_table
> > > v3: https://lore.kernel.org/lkml/cover.1627551226.git.leo...@nvidia.com/
> > >  * Rewrote to new API suggestion
> > >  * Split for more patches
> > > v2: https://lore.kernel.org/lkml/cover.1626605893.git.leo...@nvidia.com
> > >  * Changed implementation of first patch, based on our discussion with
> > >Christoph.
> > >https://lore.kernel.org/lkml/ynwavtt0qmqdx...@infradead.org/
> > > v1: https://lore.kernel.org/lkml/cover.1624955710.git.leo...@nvidia.com/
> > >  * Fixed sg_page with a _dma_ API in the umem.c
> > > v0: https://lore.kernel.org/lkml/cover.1624361199.git.leo...@nvidia.com
> > > 
> > > Maor Gottlieb (3):
> > >   lib/scatterlist: Provide a dedicated function to support table append
> > >   lib/scatterlist: Fix wrong update of orig_nents
> > >   RDMA: Use the sg_table directly and remove the opencoded version from
> > > umem
> > > 
> > >  drivers/gpu/drm/drm_prime.c |  13 +-
> > >  drivers/gpu/drm/i915/gem/i915_gem_userptr.c |  11 +-
> > >  drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c  |  14 +-
> > >  drivers/infiniband/core/umem.c  |  56 ---
> > >  drivers/infiniband/core/umem_dmabuf.c   |   5 +-
> > >  drivers/infiniband/hw/hns/hns_roce_db.c |   4 +-
> > >  drivers/infiniband/hw/irdma/verbs.c |   2 +-
> > >  drivers/infiniband/hw/mlx4/doorbell.c   |   3 +-
> > >  drivers/infiniband/hw/mlx4/mr.c |   4 +-
> > >  drivers/infiniband/hw/mlx5/doorbell.c   |   3 +-
> > >  drivers/infiniband/hw/mlx5/mr.c |   3 +-
> > >  drivers/infiniband/hw/qedr/verbs.c  |   2 +-
> > >  drivers/infiniband/sw/rdmavt/mr.c   |   2 +-
> > >  drivers/infiniband/sw/rxe/rxe_mr.c  |   2 +-
> > >  include/linux/scatterlist.h |  56 +--
> > >  include/rdma/ib_umem.h  |  11 +-
> > >  include/rdma/ib_verbs.h |  28 
> > >  lib/scatterlist.c   | 155 
> > >  lib/sg_pool.c   |   3 +-
> > >  tools/testing/scatterlist/main.c|  38 +++--
> > >  20 files changed, 258 insertions(+), 157 deletions(-)
> > 
> > Jason,
> > 
> > Did you add these patches to the -next? I can't find them.
> 
> They sat in linux-next for a while outside the rdma-next tree
> 
> All synced up now

Thanks

> 
> Jason


[PATCH rdma-next v3 2/3] lib/scatterlist: Fix wrong update of orig_nents

2021-07-29 Thread Leon Romanovsky
From: Maor Gottlieb 

orig_nents should represent the number of entries with pages,
but __sg_alloc_table_from_pages sets orig_nents to the total number
of entries in the table. This is wrong when the API is used for
dynamic allocation, where not all the table entries are mapped with
pages. It wasn't observed until now, since the RDMA umem code, which
uses this API in the dynamic form, doesn't use orig_nents implicitly
or explicitly through the scatterlist APIs.

Fix it by changing the append API to track the SG append table
state and adding an API to free the append table according to the
total number of entries in the table.
Now all APIs set orig_nents to the number of entries with pages.

Fixes: 07da1223ec93 ("lib/scatterlist: Add support in dynamic allocation of SG table from pages")
Signed-off-by: Maor Gottlieb 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/core/umem.c   |  34 ---
 include/linux/scatterlist.h  |  17 +++-
 include/rdma/ib_umem.h   |   1 +
 lib/scatterlist.c| 161 +++
 tools/testing/scatterlist/main.c |  45 +
 5 files changed, 154 insertions(+), 104 deletions(-)
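Distilled from the umem hunk below, the calling convention of the new append
API looks like this (sketch; the pinning step and error handling are
trimmed, and max_segment stands in for ib_dma_max_seg_size(device)):

        struct sg_append_table sgt_append = {};
        int ret;

        while (npages) {
                /* pin up to 'pinned' pages into page_list first ... */
                npages -= pinned;
                ret = sg_alloc_append_table_from_pages(&sgt_append,
                                page_list, pinned, 0, pinned << PAGE_SHIFT,
                                max_segment, npages, GFP_KERNEL);
                if (ret)
                        break;
        }

        /* later: frees all total_nents entries, mapped or not */
        sg_free_append_table(&sgt_append);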

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index b741758e528f..42481e7a72e8 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -59,7 +59,7 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
unpin_user_page_range_dirty_lock(sg_page(sg),
DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);
 
-   sg_free_table(>sg_head);
+   sg_free_append_table(>sgt_append);
 }
 
 /**
@@ -155,8 +155,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, 
unsigned long addr,
unsigned long dma_attr = 0;
struct mm_struct *mm;
unsigned long npages;
-   int ret;
-   struct scatterlist *sg = NULL;
+   int pinned, ret;
unsigned int gup_flags = FOLL_WRITE;
 
/*
@@ -216,28 +215,33 @@ struct ib_umem *ib_umem_get(struct ib_device *device, 
unsigned long addr,
 
while (npages) {
cond_resched();
-   ret = pin_user_pages_fast(cur_base,
+   pinned = pin_user_pages_fast(cur_base,
  min_t(unsigned long, npages,
PAGE_SIZE /
sizeof(struct page *)),
  gup_flags | FOLL_LONGTERM, page_list);
-   if (ret < 0)
+   if (pinned < 0) {
+   ret = pinned;
goto umem_release;
+   }
 
-   cur_base += ret * PAGE_SIZE;
-   npages -= ret;
-   sg = sg_alloc_append_table_from_pages(>sg_head, page_list,
-   ret, 0, ret << PAGE_SHIFT,
-   ib_dma_max_seg_size(device), sg, npages,
-   GFP_KERNEL);
-   umem->sg_nents = umem->sg_head.nents;
-   if (IS_ERR(sg)) {
-   unpin_user_pages_dirty_lock(page_list, ret, 0);
-   ret = PTR_ERR(sg);
+   cur_base += pinned * PAGE_SIZE;
+   npages -= pinned;
+   ret = sg_alloc_append_table_from_pages(
+   >sgt_append, page_list, pinned, 0,
+   pinned << PAGE_SHIFT, ib_dma_max_seg_size(device),
+   npages, GFP_KERNEL);
+   umem->sg_nents = umem->sgt_append.sgt.nents;
+   if (ret) {
+   memcpy(>sg_head.sgl, >sgt_append.sgt,
+  sizeof(umem->sgt_append.sgt));
+   unpin_user_pages_dirty_lock(page_list, pinned, 0);
goto umem_release;
}
}
 
+   memcpy(>sg_head.sgl, >sgt_append.sgt,
+  sizeof(umem->sgt_append.sgt));
if (access & IB_ACCESS_RELAXED_ORDERING)
dma_attr |= DMA_ATTR_WEAK_ORDERING;
 
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 5c700f2a0d18..0c7aa5ccebfc 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -39,6 +39,12 @@ struct sg_table {
unsigned int orig_nents;/* original size of list */
 };
 
+struct sg_append_table {
+   struct sg_table sgt;/* The scatter list table */
+   struct scatterlist *prv;/* last populated sge in the table */
+   unsigned int total_nents;   /* Total entries in the table */
+};
+
 /*
  * Notes on SG table design.
  *
@@ -282,14 +288,15 @@ typedef void (sg_free_fn)(struct scatterlist *, unsigned 
int);
 void __sg_free_table(struct sg_table *, unsigned int, unsigned int,
 sg_free_fn *);
 void sg_free_table(struct sg_table *);
+

[PATCH rdma-next v3 3/3] RDMA: Use the sg_table directly and remove the opencoded version from umem

2021-07-29 Thread Leon Romanovsky
From: Maor Gottlieb 

This allows using the normal sg_table APIs and makes all the code
cleaner. Remove sgt, nents and nmap from ib_umem.

Signed-off-by: Maor Gottlieb 
Signed-off-by: Leon Romanovsky 
Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/core/umem.c  | 32 +
 drivers/infiniband/core/umem_dmabuf.c   |  5 ++--
 drivers/infiniband/hw/hns/hns_roce_db.c |  4 ++--
 drivers/infiniband/hw/irdma/verbs.c |  2 +-
 drivers/infiniband/hw/mlx4/doorbell.c   |  3 ++-
 drivers/infiniband/hw/mlx4/mr.c |  4 ++--
 drivers/infiniband/hw/mlx5/doorbell.c   |  3 ++-
 drivers/infiniband/hw/mlx5/mr.c |  3 ++-
 drivers/infiniband/hw/qedr/verbs.c  |  2 +-
 drivers/infiniband/sw/rdmavt/mr.c   |  2 +-
 drivers/infiniband/sw/rxe/rxe_mr.c  |  2 +-
 include/rdma/ib_umem.h  | 10 
 include/rdma/ib_verbs.h | 28 ++
 13 files changed, 59 insertions(+), 41 deletions(-)
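One detail worth noting in the hunk below: ib_dma_map_sg_attrs() returned a
mapped-entry count (0 on failure, hence the manual -ENOMEM), while
ib_dma_map_sgtable_attrs() returns an errno directly, so callers simplify
to the usual pattern:

        ret = ib_dma_map_sgtable_attrs(device, &umem->sgt_append.sgt,
                                       DMA_BIDIRECTIONAL, dma_attr);
        if (ret)
                goto umem_release;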

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 42481e7a72e8..86d479772fbc 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -51,11 +51,11 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
struct scatterlist *sg;
unsigned int i;
 
-   if (umem->nmap > 0)
-   ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents,
-   DMA_BIDIRECTIONAL);
+   if (dirty)
+   ib_dma_unmap_sgtable_attrs(dev, >sgt_append.sgt,
+  DMA_BIDIRECTIONAL, 0);
 
-   for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i)
+   for_each_sgtable_sg(>sgt_append.sgt, sg, i)
unpin_user_page_range_dirty_lock(sg_page(sg),
DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);
 
@@ -111,7 +111,7 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
/* offset into first SGL */
pgoff = umem->address & ~PAGE_MASK;
 
-   for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i) {
+   for_each_sgtable_dma_sg(>sgt_append.sgt, sg, i) {
/* Walk SGL and reduce max page size if VA/PA bits differ
 * for any address.
 */
@@ -121,7 +121,7 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
 * the maximum possible page size as the low bits of the iova
 * must be zero when starting the next chunk.
 */
-   if (i != (umem->nmap - 1))
+   if (i != (umem->sgt_append.sgt.nents - 1))
mask |= va;
pgoff = 0;
}
@@ -231,30 +231,19 @@ struct ib_umem *ib_umem_get(struct ib_device *device, 
unsigned long addr,
>sgt_append, page_list, pinned, 0,
pinned << PAGE_SHIFT, ib_dma_max_seg_size(device),
npages, GFP_KERNEL);
-   umem->sg_nents = umem->sgt_append.sgt.nents;
if (ret) {
-   memcpy(>sg_head.sgl, >sgt_append.sgt,
-  sizeof(umem->sgt_append.sgt));
unpin_user_pages_dirty_lock(page_list, pinned, 0);
goto umem_release;
}
}
 
-   memcpy(>sg_head.sgl, >sgt_append.sgt,
-  sizeof(umem->sgt_append.sgt));
if (access & IB_ACCESS_RELAXED_ORDERING)
dma_attr |= DMA_ATTR_WEAK_ORDERING;
 
-   umem->nmap =
-   ib_dma_map_sg_attrs(device, umem->sg_head.sgl, umem->sg_nents,
-   DMA_BIDIRECTIONAL, dma_attr);
-
-   if (!umem->nmap) {
-   ret = -ENOMEM;
+   ret = ib_dma_map_sgtable_attrs(device, >sgt_append.sgt,
+  DMA_BIDIRECTIONAL, dma_attr);
+   if (ret)
goto umem_release;
-   }
-
-   ret = 0;
goto out;
 
 umem_release:
@@ -314,7 +303,8 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, 
size_t offset,
return -EINVAL;
}
 
-   ret = sg_pcopy_to_buffer(umem->sg_head.sgl, umem->sg_nents, dst, length,
+   ret = sg_pcopy_to_buffer(umem->sgt_append.sgt.sgl,
+umem->sgt_append.sgt.orig_nents, dst, length,
 offset + ib_umem_offset(umem));
 
if (ret < 0)
diff --git a/drivers/infiniband/core/umem_dmabuf.c 
b/drivers/infiniband/core/umem_dmabuf.c
index c6e875619fac..e824baf4640d 100644
--- a/drivers/infiniband/core/umem_dmabuf.c
+++ b/drivers/infiniband/core/umem_dmabuf.c
@@ -55,9 +55,8 @@ int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf 
*umem_dmabuf)
cur += sg_dma_len(sg);
}
 
-   umem_dmabuf->umem.sg_head.sgl = umem_dmabu

[PATCH rdma-next v3 1/3] lib/scatterlist: Provide a dedicated function to support table append

2021-07-29 Thread Leon Romanovsky
From: Maor Gottlieb 

RDMA is the only in-kernel user that uses __sg_alloc_table_from_pages to
append pages dynamically. In the next patch, that mode will be extended
and that function will get more parameters, so separate it into a unique
function to make such a change clearer.

Signed-off-by: Maor Gottlieb 
Signed-off-by: Leon Romanovsky 
---
 drivers/gpu/drm/drm_prime.c | 13 ---
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 11 +++---
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c  | 14 +++-
 drivers/infiniband/core/umem.c  |  4 +--
 include/linux/scatterlist.h | 39 ++---
 lib/scatterlist.c   | 36 ++-
 tools/testing/scatterlist/main.c| 25 +
 7 files changed, 90 insertions(+), 52 deletions(-)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 2a54f86856af..cf3278041f9c 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -807,8 +807,8 @@ struct sg_table *drm_prime_pages_to_sg(struct drm_device 
*dev,
   struct page **pages, unsigned int 
nr_pages)
 {
struct sg_table *sg;
-   struct scatterlist *sge;
size_t max_segment = 0;
+   int err;
 
sg = kmalloc(sizeof(struct sg_table), GFP_KERNEL);
if (!sg)
@@ -818,13 +818,12 @@ struct sg_table *drm_prime_pages_to_sg(struct drm_device 
*dev,
max_segment = dma_max_mapping_size(dev->dev);
if (max_segment == 0)
max_segment = UINT_MAX;
-   sge = __sg_alloc_table_from_pages(sg, pages, nr_pages, 0,
- nr_pages << PAGE_SHIFT,
- max_segment,
- NULL, 0, GFP_KERNEL);
-   if (IS_ERR(sge)) {
+   err = sg_alloc_table_from_pages_segment(sg, pages, nr_pages, 0,
+   nr_pages << PAGE_SHIFT,
+   max_segment, GFP_KERNEL);
+   if (err) {
kfree(sg);
-   sg = ERR_CAST(sge);
+   sg = ERR_PTR(err);
}
return sg;
 }
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index 7487bab11f0b..458f797a9e1e 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -133,7 +133,6 @@ static int i915_gem_userptr_get_pages(struct 
drm_i915_gem_object *obj)
unsigned int max_segment = i915_sg_segment_size();
struct sg_table *st;
unsigned int sg_page_sizes;
-   struct scatterlist *sg;
struct page **pvec;
int ret;
 
@@ -153,13 +152,11 @@ static int i915_gem_userptr_get_pages(struct 
drm_i915_gem_object *obj)
spin_unlock(>mm.notifier_lock);
 
 alloc_table:
-   sg = __sg_alloc_table_from_pages(st, pvec, num_pages, 0,
-num_pages << PAGE_SHIFT, max_segment,
-NULL, 0, GFP_KERNEL);
-   if (IS_ERR(sg)) {
-   ret = PTR_ERR(sg);
+   ret = sg_alloc_table_from_pages_segment(st, pvec, num_pages, 0,
+   num_pages << PAGE_SHIFT,
+   max_segment, GFP_KERNEL);
+   if (ret)
goto err;
-   }
 
ret = i915_gem_gtt_prepare_pages(obj, st);
if (ret) {
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c 
b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
index 0488042fb287..fc372d2e52a1 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
@@ -363,7 +363,6 @@ static int vmw_ttm_map_dma(struct vmw_ttm_tt *vmw_tt)
int ret = 0;
static size_t sgl_size;
static size_t sgt_size;
-   struct scatterlist *sg;
 
if (vmw_tt->mapped)
return 0;
@@ -386,15 +385,12 @@ static int vmw_ttm_map_dma(struct vmw_ttm_tt *vmw_tt)
if (unlikely(ret != 0))
return ret;
 
-   sg = __sg_alloc_table_from_pages(_tt->sgt, vsgt->pages,
-   vsgt->num_pages, 0,
-   (unsigned long) vsgt->num_pages << PAGE_SHIFT,
-   dma_get_max_seg_size(dev_priv->drm.dev),
-   NULL, 0, GFP_KERNEL);
-   if (IS_ERR(sg)) {
-   ret = PTR_ERR(sg);
+   ret = sg_alloc_table_from_pages_segment(
+   _tt->sgt, vsgt->pages, vsgt->num_pages, 0,
+   (unsigned long)vsgt->num_pages << PAGE_SHIFT,
+   dma_get_max_seg_size(dev_priv->drm.dev), GFP_KERNEL);
+   if (ret)
goto out_sg_a

[PATCH rdma-next v3 0/3] SG fix together with update to RDMA umem

2021-07-29 Thread Leon Romanovsky
From: Leon Romanovsky 

Changelog:
v3:
 * Rewrote to new API suggestion
 * Split for more patches
v2: https://lore.kernel.org/lkml/cover.1626605893.git.leo...@nvidia.com
 * Changed implementation of first patch, based on our discussion with
   Christoph.
   https://lore.kernel.org/lkml/ynwavtt0qmqdx...@infradead.org/
v1: https://lore.kernel.org/lkml/cover.1624955710.git.leo...@nvidia.com/
 * Fixed sg_page with a _dma_ API in the umem.c
v0: https://lore.kernel.org/lkml/cover.1624361199.git.leo...@nvidia.com


Maor Gottlieb (3):
  lib/scatterlist: Provide a dedicated function to support table append
  lib/scatterlist: Fix wrong update of orig_nents
  RDMA: Use the sg_table directly and remove the opencoded version from
umem

 drivers/gpu/drm/drm_prime.c |  13 +-
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c |  11 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c  |  14 +-
 drivers/infiniband/core/umem.c  |  56 +++---
 drivers/infiniband/core/umem_dmabuf.c   |   5 +-
 drivers/infiniband/hw/hns/hns_roce_db.c |   4 +-
 drivers/infiniband/hw/irdma/verbs.c |   2 +-
 drivers/infiniband/hw/mlx4/doorbell.c   |   3 +-
 drivers/infiniband/hw/mlx4/mr.c |   4 +-
 drivers/infiniband/hw/mlx5/doorbell.c   |   3 +-
 drivers/infiniband/hw/mlx5/mr.c |   3 +-
 drivers/infiniband/hw/qedr/verbs.c  |   2 +-
 drivers/infiniband/sw/rdmavt/mr.c   |   2 +-
 drivers/infiniband/sw/rxe/rxe_mr.c  |   2 +-
 include/linux/scatterlist.h |  54 +-
 include/rdma/ib_umem.h  |  11 +-
 include/rdma/ib_verbs.h |  28 +++
 lib/scatterlist.c   | 189 
 tools/testing/scatterlist/main.c|  38 ++--
 19 files changed, 275 insertions(+), 169 deletions(-)

-- 
2.31.1



[PATCH rdma-next v2 1/2] lib/scatterlist: Fix wrong update of orig_nents

2021-07-18 Thread Leon Romanovsky
From: Maor Gottlieb 

orig_nents should represent the number of entries with pages,
but __sg_alloc_table_from_pages sets orig_nents to the total number
of entries in the table. This is wrong when the API is used for
dynamic allocation, where not all the table entries are mapped with
pages. It wasn't observed until now, since the RDMA umem code, which
uses this API in the dynamic form, doesn't use orig_nents implicitly
or explicitly through the scatterlist APIs.

Fix it by:
1. Setting orig_nents to the number of entries with pages in
   __sg_alloc_table_from_pages as well.
2. Adding a new output argument to __sg_alloc_table_from_pages,
   to fix the release flow, which will be set to the total number
   of entries in the table. Users like umem that use this function
   for dynamic allocation should free the table by calling
   sg_free_table_entries and passing the number of entries.
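That is, users of the dynamic form release the table with the new helper,
as in the umem hunk below:

        sg_free_table_entries(&umem->sg_head, umem->total_nents);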

Fixes: 07da1223ec93 ("lib/scatterlist: Add support in dynamic allocation of SG table from pages")
Signed-off-by: Maor Gottlieb 
Signed-off-by: Leon Romanovsky 
---
 drivers/gpu/drm/drm_prime.c |  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c |  2 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c  |  2 +-
 drivers/infiniband/core/umem.c  |  5 +-
 include/linux/scatterlist.h |  3 +-
 include/rdma/ib_umem.h  |  3 +-
 lib/scatterlist.c   | 88 ++---
 tools/testing/scatterlist/main.c| 15 +++-
 8 files changed, 80 insertions(+), 40 deletions(-)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 2a54f86856af..1739d10a2c55 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -821,7 +821,7 @@ struct sg_table *drm_prime_pages_to_sg(struct drm_device 
*dev,
sge = __sg_alloc_table_from_pages(sg, pages, nr_pages, 0,
  nr_pages << PAGE_SHIFT,
  max_segment,
- NULL, 0, GFP_KERNEL);
+ NULL, 0, GFP_KERNEL, NULL);
if (IS_ERR(sge)) {
kfree(sg);
sg = ERR_CAST(sge);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index 7487bab11f0b..c341d3e3386c 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -155,7 +155,7 @@ static int i915_gem_userptr_get_pages(struct 
drm_i915_gem_object *obj)
 alloc_table:
sg = __sg_alloc_table_from_pages(st, pvec, num_pages, 0,
 num_pages << PAGE_SHIFT, max_segment,
-NULL, 0, GFP_KERNEL);
+NULL, 0, GFP_KERNEL, NULL);
if (IS_ERR(sg)) {
ret = PTR_ERR(sg);
goto err;
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c 
b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
index 0488042fb287..2ad889272b0a 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
@@ -390,7 +390,7 @@ static int vmw_ttm_map_dma(struct vmw_ttm_tt *vmw_tt)
vsgt->num_pages, 0,
(unsigned long) vsgt->num_pages << PAGE_SHIFT,
dma_get_max_seg_size(dev_priv->drm.dev),
-   NULL, 0, GFP_KERNEL);
+   NULL, 0, GFP_KERNEL, NULL);
if (IS_ERR(sg)) {
ret = PTR_ERR(sg);
goto out_sg_alloc_fail;
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 0eb40025075f..cf4197363346 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -59,7 +59,7 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
unpin_user_page_range_dirty_lock(sg_page(sg),
DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);
 
-   sg_free_table(>sg_head);
+   sg_free_table_entries(>sg_head, umem->total_nents);
 }
 
 /**
@@ -229,8 +229,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, 
unsigned long addr,
sg = __sg_alloc_table_from_pages(>sg_head, page_list, ret,
0, ret << PAGE_SHIFT,
ib_dma_max_seg_size(device), sg, npages,
-   GFP_KERNEL);
-   umem->sg_nents = umem->sg_head.nents;
+   GFP_KERNEL, >total_nents);
if (IS_ERR(sg)) {
unpin_user_pages_dirty_lock(page_list, ret, 0);
ret = PTR_ERR(sg);
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index ecf87484814f..c796c165d5ee 100644
--- a/include/lin

[PATCH rdma-next v2 2/2] RDMA: Use dma_map_sgtable for map umem pages

2021-07-18 Thread Leon Romanovsky
From: Maor Gottlieb 

In order to avoid incorrect usage of sg_table fields, change umem to
use dma_map_sgtable to map the pages for DMA. Since dma_map_sgtable
updates the nents field (number of DMA entries), there is no longer
any need for the nmap variable, hence do some cleanups accordingly.

Signed-off-by: Maor Gottlieb 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/core/umem.c| 28 +++
 drivers/infiniband/core/umem_dmabuf.c |  1 -
 drivers/infiniband/hw/mlx4/mr.c   |  4 ++--
 drivers/infiniband/hw/mlx5/mr.c   |  3 ++-
 drivers/infiniband/sw/rdmavt/mr.c |  2 +-
 drivers/infiniband/sw/rxe/rxe_mr.c|  3 ++-
 include/rdma/ib_umem.h|  3 ++-
 include/rdma/ib_verbs.h   | 28 +++
 8 files changed, 48 insertions(+), 24 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index cf4197363346..77c2df1c91d1 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -51,11 +51,11 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
struct scatterlist *sg;
unsigned int i;
 
-   if (umem->nmap > 0)
-   ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents,
-   DMA_BIDIRECTIONAL);
+   if (dirty)
+   ib_dma_unmap_sgtable_attrs(dev, >sg_head,
+  DMA_BIDIRECTIONAL, 0);
 
-   for_each_sg(umem->sg_head.sgl, sg, umem->sg_nents, i)
+   for_each_sgtable_sg(>sg_head, sg, i)
unpin_user_page_range_dirty_lock(sg_page(sg),
DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);
 
@@ -111,7 +111,7 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
/* offset into first SGL */
pgoff = umem->address & ~PAGE_MASK;
 
-   for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i) {
+   for_each_sgtable_dma_sg(>sg_head, sg, i) {
/* Walk SGL and reduce max page size if VA/PA bits differ
 * for any address.
 */
@@ -121,7 +121,7 @@ unsigned long ib_umem_find_best_pgsz(struct ib_umem *umem,
 * the maximum possible page size as the low bits of the iova
 * must be zero when starting the next chunk.
 */
-   if (i != (umem->nmap - 1))
+   if (i != (umem->sg_head.nents - 1))
mask |= va;
pgoff = 0;
}
@@ -240,16 +240,10 @@ struct ib_umem *ib_umem_get(struct ib_device *device, 
unsigned long addr,
if (access & IB_ACCESS_RELAXED_ORDERING)
dma_attr |= DMA_ATTR_WEAK_ORDERING;
 
-   umem->nmap =
-   ib_dma_map_sg_attrs(device, umem->sg_head.sgl, umem->sg_nents,
-   DMA_BIDIRECTIONAL, dma_attr);
-
-   if (!umem->nmap) {
-   ret = -ENOMEM;
+   ret = ib_dma_map_sgtable_attrs(device, >sg_head,
+  DMA_BIDIRECTIONAL, dma_attr);
+   if (ret)
goto umem_release;
-   }
-
-   ret = 0;
goto out;
 
 umem_release:
@@ -309,8 +303,8 @@ int ib_umem_copy_from(void *dst, struct ib_umem *umem, 
size_t offset,
return -EINVAL;
}
 
-   ret = sg_pcopy_to_buffer(umem->sg_head.sgl, umem->sg_nents, dst, length,
-offset + ib_umem_offset(umem));
+   ret = sg_pcopy_to_buffer(umem->sg_head.sgl, umem->sg_head.orig_nents,
+dst, length, offset + ib_umem_offset(umem));
 
if (ret < 0)
return ret;
diff --git a/drivers/infiniband/core/umem_dmabuf.c 
b/drivers/infiniband/core/umem_dmabuf.c
index c6e875619fac..2884e4526d78 100644
--- a/drivers/infiniband/core/umem_dmabuf.c
+++ b/drivers/infiniband/core/umem_dmabuf.c
@@ -57,7 +57,6 @@ int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf 
*umem_dmabuf)
 
umem_dmabuf->umem.sg_head.sgl = umem_dmabuf->first_sg;
umem_dmabuf->umem.sg_head.nents = nmap;
-   umem_dmabuf->umem.nmap = nmap;
umem_dmabuf->sgt = sgt;
 
 wait_fence:
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 50becc0e4b62..ab5dc8eac7f8 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -200,7 +200,7 @@ int mlx4_ib_umem_write_mtt(struct mlx4_ib_dev *dev, struct 
mlx4_mtt *mtt,
mtt_shift = mtt->page_shift;
mtt_size = 1ULL << mtt_shift;
 
-   for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i) {
+   for_each_sgtable_dma_sg(>sg_head, sg, i) {
if (cur_start_addr + len == sg_dma_address(sg)) {
/* still the same block */
len += sg_dma_len(sg);
@@ -273,7 +273,7 @@ int mlx4_ib_

[PATCH rdma-next v2 0/2] SG fix together with update to RDMA umem

2021-07-18 Thread Leon Romanovsky
From: Leon Romanovsky 

Changelog:
v2:
 * Changed implementation of first patch, based on our discussion with
   Christoph.
   https://lore.kernel.org/lkml/ynwavtt0qmqdx...@infradead.org/
v1: https://lore.kernel.org/lkml/cover.1624955710.git.leo...@nvidia.com/
 * Fixed sg_page with a _dma_ API in the umem.c
v0: https://lore.kernel.org/lkml/cover.1624361199.git.leo...@nvidia.com

Maor Gottlieb (2):
  lib/scatterlist: Fix wrong update of orig_nents
  RDMA: Use dma_map_sgtable for map umem pages

 drivers/gpu/drm/drm_prime.c |  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c |  2 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c  |  2 +-
 drivers/infiniband/core/umem.c  | 33 +++-
 drivers/infiniband/core/umem_dmabuf.c   |  1 -
 drivers/infiniband/hw/mlx4/mr.c |  4 +-
 drivers/infiniband/hw/mlx5/mr.c |  3 +-
 drivers/infiniband/sw/rdmavt/mr.c   |  2 +-
 drivers/infiniband/sw/rxe/rxe_mr.c  |  3 +-
 include/linux/scatterlist.h |  3 +-
 include/rdma/ib_umem.h  |  6 +-
 include/rdma/ib_verbs.h | 28 +++
 lib/scatterlist.c   | 88 ++---
 tools/testing/scatterlist/main.c| 15 +++-
 14 files changed, 128 insertions(+), 64 deletions(-)

-- 
2.31.1



Re: [PATCH 02/13] vfio: Introduce a vfio_uninit_group_dev() API call

2021-07-15 Thread Leon Romanovsky
On Wed, Jul 14, 2021 at 09:20:31PM -0300, Jason Gunthorpe wrote:
> From: Max Gurtovoy 
> 
> This pairs with vfio_init_group_dev() and allows undoing any state that is
> stored in the vfio_device unrelated to registration. Add appropriately
> placed calls to all the drivers.
> 
> The following patch will use this to add pre-registration state for the
> device set.
> 
> Signed-off-by: Max Gurtovoy 
> Signed-off-by: Jason Gunthorpe 
> ---
>  Documentation/driver-api/vfio.rst|  4 ++-
>  drivers/vfio/fsl-mc/vfio_fsl_mc.c|  6 +++--
>  drivers/vfio/mdev/vfio_mdev.c| 13 +++---
>  drivers/vfio/pci/vfio_pci.c  |  6 +++--
>  drivers/vfio/platform/vfio_platform_common.c |  7 +++--
>  drivers/vfio/vfio.c  |  5 
>  include/linux/vfio.h |  1 +
>  samples/vfio-mdev/mbochs.c   |  2 ++
>  samples/vfio-mdev/mdpy.c | 25 ++
>  samples/vfio-mdev/mtty.c | 27 
>  10 files changed, 64 insertions(+), 32 deletions(-)

<...>

> @@ -674,6 +675,7 @@ static int vfio_fsl_mc_remove(struct fsl_mc_device 
> *mc_dev)
>  
>   dprc_remove_devices(mc_dev, NULL, 0);
>   vfio_fsl_uninit_device(vdev);
> + vfio_uninit_group_dev(&vdev->vdev);

This is the wrong place; the _uninit_ should come after vfio_fsl_mc_reflck_put().

>   vfio_fsl_mc_reflck_put(vdev->reflck);
>  
>   kfree(vdev);
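In other words, the teardown should unwind in the opposite order, something
like this sketch of the ordering being asked for:

        dprc_remove_devices(mc_dev, NULL, 0);
        vfio_fsl_uninit_device(vdev);
        vfio_fsl_mc_reflck_put(vdev->reflck);
        vfio_uninit_group_dev(&vdev->vdev);
        kfree(vdev);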


Re: [RFC v1 PATCH 3/3] driver: update all the code that use soc_device_match

2021-04-18 Thread Leon Romanovsky
On Mon, Apr 19, 2021 at 12:27:22PM +0800, Alice Guo (OSS) wrote:
> From: Alice Guo 
> 
> Update all the code that uses soc_device_match because of the added
> support for soc_device_match returning -EPROBE_DEFER.
> 
> Signed-off-by: Alice Guo 
> ---
>  drivers/bus/ti-sysc.c |  2 +-
>  drivers/clk/renesas/r8a7795-cpg-mssr.c|  4 +++-
>  drivers/clk/renesas/rcar-gen2-cpg.c   |  2 +-
>  drivers/clk/renesas/rcar-gen3-cpg.c   |  2 +-
>  drivers/dma/fsl-dpaa2-qdma/dpaa2-qdma.c   |  7 ++-
>  drivers/dma/ti/k3-psil.c  |  3 +++
>  drivers/dma/ti/k3-udma.c  |  2 +-
>  drivers/gpu/drm/bridge/nwl-dsi.c  |  2 +-
>  drivers/gpu/drm/meson/meson_drv.c |  4 +++-
>  drivers/gpu/drm/omapdrm/dss/dispc.c   |  2 +-
>  drivers/gpu/drm/omapdrm/dss/dpi.c |  4 +++-
>  drivers/gpu/drm/omapdrm/dss/dsi.c |  3 +++
>  drivers/gpu/drm/omapdrm/dss/dss.c |  3 +++
>  drivers/gpu/drm/omapdrm/dss/hdmi4_core.c  |  3 +++
>  drivers/gpu/drm/omapdrm/dss/venc.c|  4 +++-
>  drivers/gpu/drm/omapdrm/omap_drv.c|  3 +++
>  drivers/gpu/drm/rcar-du/rcar_du_crtc.c|  4 +++-
>  drivers/gpu/drm/rcar-du/rcar_lvds.c   |  2 +-
>  drivers/gpu/drm/tidss/tidss_dispc.c   |  4 +++-
>  drivers/iommu/ipmmu-vmsa.c|  7 +--
>  drivers/media/platform/rcar-vin/rcar-core.c   |  2 +-
>  drivers/media/platform/rcar-vin/rcar-csi2.c   |  2 +-
>  drivers/media/platform/vsp1/vsp1_uif.c|  4 +++-
>  drivers/mmc/host/renesas_sdhi_core.c  |  2 +-
>  drivers/mmc/host/renesas_sdhi_internal_dmac.c |  2 +-
>  drivers/mmc/host/sdhci-of-esdhc.c | 21 ++-
>  drivers/mmc/host/sdhci-omap.c |  2 +-
>  drivers/mmc/host/sdhci_am654.c|  2 +-
>  drivers/net/ethernet/renesas/ravb_main.c  |  4 +++-
>  drivers/net/ethernet/ti/am65-cpsw-nuss.c  |  2 +-
>  drivers/net/ethernet/ti/cpsw.c|  2 +-
>  drivers/net/ethernet/ti/cpsw_new.c|  2 +-
>  drivers/phy/ti/phy-omap-usb2.c|  4 +++-
>  drivers/pinctrl/renesas/core.c|  2 +-
>  drivers/pinctrl/renesas/pfc-r8a7790.c |  5 -
>  drivers/pinctrl/renesas/pfc-r8a7794.c |  5 -
>  drivers/soc/fsl/dpio/dpio-driver.c| 13 
>  drivers/soc/renesas/r8a774c0-sysc.c   |  5 -
>  drivers/soc/renesas/r8a7795-sysc.c|  2 +-
>  drivers/soc/renesas/r8a77990-sysc.c   |  5 -
>  drivers/soc/ti/k3-ringacc.c   |  2 +-
>  drivers/staging/mt7621-pci/pci-mt7621.c   |  2 +-
>  drivers/thermal/rcar_gen3_thermal.c   |  4 +++-
>  drivers/thermal/ti-soc-thermal/ti-bandgap.c   | 10 +++--
>  drivers/usb/gadget/udc/renesas_usb3.c |  2 +-
>  drivers/usb/host/ehci-platform.c  |  4 +++-
>  drivers/usb/host/xhci-rcar.c  |  2 +-
>  drivers/watchdog/renesas_wdt.c|  2 +-
>  48 files changed, 131 insertions(+), 52 deletions(-)
> 
> diff --git a/drivers/bus/ti-sysc.c b/drivers/bus/ti-sysc.c
> index 5fae60f8c135..00c59aa217c1 100644
> --- a/drivers/bus/ti-sysc.c
> +++ b/drivers/bus/ti-sysc.c
> @@ -2909,7 +2909,7 @@ static int sysc_init_soc(struct sysc *ddata)
>   }
>  
>   match = soc_device_match(sysc_soc_feat_match);
> - if (!match)
> + if (!match || IS_ERR(match))
>   return 0;
>  
>   if (match->data)
> diff --git a/drivers/clk/renesas/r8a7795-cpg-mssr.c 
> b/drivers/clk/renesas/r8a7795-cpg-mssr.c
> index c32d2c678046..90a18336a4c3 100644
> --- a/drivers/clk/renesas/r8a7795-cpg-mssr.c
> +++ b/drivers/clk/renesas/r8a7795-cpg-mssr.c
> @@ -439,6 +439,7 @@ static const unsigned int r8a7795es2_mod_nullify[] 
> __initconst = {
>  
>  static int __init r8a7795_cpg_mssr_init(struct device *dev)
>  {
> + const struct soc_device_attribute *match;
>   const struct rcar_gen3_cpg_pll_config *cpg_pll_config;
>   u32 cpg_mode;
>   int error;
> @@ -453,7 +454,8 @@ static int __init r8a7795_cpg_mssr_init(struct device 
> *dev)
>   return -EINVAL;
>   }
>  
> - if (soc_device_match(r8a7795es1)) {
> + match = soc_device_match(r8a7795es1);
> + if (!IS_ERR(match) && match) {

"if (!IS_ERR_OR_NULL(match))" in all places.

Thanks


Re: [PATCH rdma-core v7 4/6] pyverbs: Add dma-buf based MR support

2021-02-02 Thread Leon Romanovsky
On Tue, Feb 02, 2021 at 04:24:45PM +0100, Daniel Vetter wrote:
> On Mon, Feb 01, 2021 at 05:03:44PM +, Xiong, Jianxin wrote:
> > > -Original Message-
> > > From: Jason Gunthorpe 
> > > Sent: Monday, February 01, 2021 7:29 AM
> > > To: Daniel Vetter 
> > > Cc: Leon Romanovsky ; Gal Pressman 
> > > ; Xiong, Jianxin ; Yishai 
> > > Hadas
> > > ; linux-rdma ; Edward 
> > > Srouji ; dri-devel  > > de...@lists.freedesktop.org>; Christian Koenig 
> > > ; Doug Ledford ; Vetter, 
> > > Daniel
> > > 
> > > Subject: Re: [PATCH rdma-core v7 4/6] pyverbs: Add dma-buf based MR 
> > > support
> > >
> > > On Mon, Feb 01, 2021 at 03:10:00PM +0100, Daniel Vetter wrote:
> > > > On Mon, Feb 1, 2021 at 7:16 AM Leon Romanovsky  wrote:
> > > > >
> > > > > On Sun, Jan 31, 2021 at 05:31:16PM +0200, Gal Pressman wrote:
> > > > > > On 25/01/2021 21:57, Jianxin Xiong wrote:
> > > > > > > Define a new sub-class of 'MR' that uses dma-buf object for the
> > > > > > > memory region. Define a new class 'DmaBuf' as a wrapper for
> > > > > > > dma-buf allocation mechanism implemented in C.
> > > > > > >
> > > > > > > Update the cmake function for cython modules to allow building
> > > > > > > modules with mixed cython and c source files.
> > > > > > >
> > > > > > > Signed-off-by: Jianxin Xiong 
> > > > > > > buildlib/pyverbs_functions.cmake |  78 +++
> > > > > > >  pyverbs/CMakeLists.txt   |  11 +-
> > > > > > >  pyverbs/dmabuf.pxd   |  15 +++
> > > > > > >  pyverbs/dmabuf.pyx   |  73 ++
> > > > > > >  pyverbs/dmabuf_alloc.c   | 278 
> > > > > > > +++
> > > > > > >  pyverbs/dmabuf_alloc.h   |  19 +++
> > > > > > >  pyverbs/libibverbs.pxd   |   2 +
> > > > > > >  pyverbs/mr.pxd   |   6 +
> > > > > > >  pyverbs/mr.pyx   | 105 ++-
> > > > > > >  9 files changed, 557 insertions(+), 30 deletions(-)  create
> > > > > > > mode 100644 pyverbs/dmabuf.pxd  create mode 100644
> > > > > > > pyverbs/dmabuf.pyx  create mode 100644 pyverbs/dmabuf_alloc.c
> > > > > > > create mode 100644 pyverbs/dmabuf_alloc.h
> > > > >
> > > > > <...>
> > > > >
> > > > > > > index 000..05eae75
> > > > > > > +++ b/pyverbs/dmabuf_alloc.c
> > > > > > > @@ -0,0 +1,278 @@
> > > > > > > +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> > > > > > > +/*
> > > > > > > + * Copyright 2020 Intel Corporation. All rights reserved. See
> > > > > > > +COPYING file  */
> > > > > > > +
> > > > > > > +#include 
> > > > > > > +#include 
> > > > > > > +#include 
> > > > > > > +#include 
> > > > > > > +#include 
> > > > > > > +#include 
> > > > > > > +#include 
> > > > > > > +#include 
> > > > > > > +#include 
> > > > > > > +#include 
> > > > > >
> > > > > > I assume these should come from the kernel headers package, right?
> > > > >
> > > > > This is gross; all kernel headers should be placed in
> > > > > kernel-headers/* and the "update" script needs to be extended to
> > > > > take drm/* files too :(.
> > > >
> > > > drm kernel headers are in the libdrm package. You need that anyway for
> > > > doing the ioctls (if you don't hand-roll the restarting yourself).
> > > >
> > > > Also our userspace has gone over to just outright copying the driver
> > > > headers. Not the generic headers, but for the rendering side of gpus,
> > > > which is the topic here, there's really not much generic stuff.
> > > >
> > > > > Jianxin, are you fixing it?
> > > >
> > > > So fix is either to depend upon libdrm for building, or have copies of
> > > > the headers included in the package for the i915/amdgpu/radeon headers
> > > > (drm/drm.h probably not so good idea).
> > >
> > > We should have a cmake test to not build the drm parts if it can't be 
> > > built, and pyverbs should skip the tests.
> > >
> >
> > Yes, I will add a test for that. Also, on SLES, the headers could be under
> > /usr/include/libdrm instead of /usr/include/drm. The make test should check
> > that and use the proper path.
>
> Please use pkgconfig for this, libdrm installs a .pc file to make sure you
> can find the right headers.

rdma-core uses the cmake build system, and in our case cmake find_library()
is preferable over pkg-config.

Thanks

> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch


Re: [PATCH rdma-core v7 4/6] pyverbs: Add dma-buf based MR support

2021-01-31 Thread Leon Romanovsky
On Sun, Jan 31, 2021 at 05:31:16PM +0200, Gal Pressman wrote:
> On 25/01/2021 21:57, Jianxin Xiong wrote:
> > Define a new sub-class of 'MR' that uses dma-buf object for the memory
> > region. Define a new class 'DmaBuf' as a wrapper for dma-buf allocation
> > mechanism implemented in C.
> >
> > Update the cmake function for cython modules to allow building modules
> > with mixed cython and c source files.
> >
> > Signed-off-by: Jianxin Xiong 
> > ---
> >  buildlib/pyverbs_functions.cmake |  78 +++
> >  pyverbs/CMakeLists.txt   |  11 +-
> >  pyverbs/dmabuf.pxd   |  15 +++
> >  pyverbs/dmabuf.pyx   |  73 ++
> >  pyverbs/dmabuf_alloc.c   | 278 
> > +++
> >  pyverbs/dmabuf_alloc.h   |  19 +++
> >  pyverbs/libibverbs.pxd   |   2 +
> >  pyverbs/mr.pxd   |   6 +
> >  pyverbs/mr.pyx   | 105 ++-
> >  9 files changed, 557 insertions(+), 30 deletions(-)
> >  create mode 100644 pyverbs/dmabuf.pxd
> >  create mode 100644 pyverbs/dmabuf.pyx
> >  create mode 100644 pyverbs/dmabuf_alloc.c
> >  create mode 100644 pyverbs/dmabuf_alloc.h

<...>

> > index 000..05eae75
> > --- /dev/null
> > +++ b/pyverbs/dmabuf_alloc.c
> > @@ -0,0 +1,278 @@
> > +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> > +/*
> > + * Copyright 2020 Intel Corporation. All rights reserved. See COPYING file
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
>
> I assume these should come from the kernel headers package, right?

This is gross; all kernel headers should be placed in kernel-headers/*
and the "update" script needs to be extended to take drm/* files too :(.

Jianxin, are you fixing it?

Thanks


Re: [PATCH v14 4/4] RDMA/mlx5: Support dma-buf based userspace memory region

2020-12-08 Thread Leon Romanovsky
On Tue, Dec 08, 2020 at 02:39:15PM -0800, Jianxin Xiong wrote:
> Implement the new driver method 'reg_user_mr_dmabuf'.  Utilize the core
> functions to import dma-buf based memory region and update the mappings.
>
> Add code to handle dma-buf related page fault.
>
> Signed-off-by: Jianxin Xiong 
> Reviewed-by: Sean Hefty 
> Acked-by: Michael J. Ruhl 
> Acked-by: Christian Koenig 
> Acked-by: Daniel Vetter 
> ---
>  drivers/infiniband/hw/mlx5/main.c|   2 +
>  drivers/infiniband/hw/mlx5/mlx5_ib.h |  18 +
>  drivers/infiniband/hw/mlx5/mr.c  | 128 
> +--
>  drivers/infiniband/hw/mlx5/odp.c |  86 +--
>  4 files changed, 225 insertions(+), 9 deletions(-)

<...>

>
> +
> + umem = ib_umem_dmabuf_get(&dev->ib_dev, offset, length, fd, access_flags,
> +   &mlx5_ib_dmabuf_attach_ops);
> + if (IS_ERR(umem)) {
> + mlx5_ib_dbg(dev, "umem get failed (%ld)\n", PTR_ERR(umem));
> + return ERR_PTR(PTR_ERR(umem));

return ERR_CAST(umem);

> + }

<...>

> + dma_resv_lock(umem_dmabuf->attach->dmabuf->resv, NULL);
> + err = ib_umem_dmabuf_map_pages(umem_dmabuf);
> + if (!err) {
> + page_size = mlx5_umem_find_best_pgsz(&umem_dmabuf->umem, mkc,
> +  log_page_size, 0,
> +  umem_dmabuf->umem.iova);
> + if (unlikely(page_size < PAGE_SIZE)) {
> + ib_umem_dmabuf_unmap_pages(umem_dmabuf);
> + err = -EINVAL;
> + } else {
> + err = mlx5_ib_update_mr_pas(mr, xlt_flags);
> + }
> + }
> + dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);

Let's write this section in kernel coding style, please:

dma_resv_lock(umem_dmabuf->attach->dmabuf->resv, NULL);
err = ib_umem_dmabuf_map_pages(umem_dmabuf);
if (err) {
  dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);
  return err;
}
.

Thanks


Re: [PATCH v14 1/4] RDMA/umem: Support importing dma-buf as user memory region

2020-12-08 Thread Leon Romanovsky
On Tue, Dec 08, 2020 at 02:39:12PM -0800, Jianxin Xiong wrote:
> Dma-buf is a standard cross-driver buffer sharing mechanism that can be
> used to support peer-to-peer access from RDMA devices.
>
> Device memory exported via dma-buf is associated with a file descriptor.
> This is passed to the user space as a property associated with the
> buffer allocation. When the buffer is registered as a memory region,
> the file descriptor is passed to the RDMA driver along with other
> parameters.
>
> Implement the common code for importing dma-buf object and mapping
> dma-buf pages.
>
> Signed-off-by: Jianxin Xiong 
> Reviewed-by: Sean Hefty 
> Acked-by: Michael J. Ruhl 
> Acked-by: Christian Koenig 
> Acked-by: Daniel Vetter 
> ---
>  drivers/infiniband/core/Makefile  |   2 +-
>  drivers/infiniband/core/umem.c|   3 +
>  drivers/infiniband/core/umem_dmabuf.c | 174 
> ++
>  include/rdma/ib_umem.h|  47 -
>  4 files changed, 222 insertions(+), 4 deletions(-)
>  create mode 100644 drivers/infiniband/core/umem_dmabuf.c

<...>

> +int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf *umem_dmabuf)
> +{
> + struct sg_table *sgt;
> + struct scatterlist *sg;
> + struct dma_fence *fence;
> + unsigned long start, end, cur = 0;
> + unsigned int nmap = 0;
> + int i;
> +
> + dma_resv_assert_held(umem_dmabuf->attach->dmabuf->resv);
> +
> + if (umem_dmabuf->sgt)
> + goto wait_fence;
> +
> + sgt = dma_buf_map_attachment(umem_dmabuf->attach, DMA_BIDIRECTIONAL);
> + if (IS_ERR(sgt))
> + return PTR_ERR(sgt);
> +
> + /* modify the sg list in-place to match umem address and length */
> +
> + start = ALIGN_DOWN(umem_dmabuf->umem.address, PAGE_SIZE);
> + end = ALIGN(umem_dmabuf->umem.address + umem_dmabuf->umem.length,
> + PAGE_SIZE);
> + for_each_sgtable_dma_sg(sgt, sg, i) {
> + if (start < cur + sg_dma_len(sg) && cur < end)
> + nmap++;
> + if (cur <= start && start < cur + sg_dma_len(sg)) {
> + unsigned long offset = start - cur;
> +
> + umem_dmabuf->first_sg = sg;
> + umem_dmabuf->first_sg_offset = offset;
> + sg_dma_address(sg) += offset;
> + sg_dma_len(sg) -= offset;
> + cur += offset;
> + }
> + if (cur < end && end <= cur + sg_dma_len(sg)) {
> + unsigned long trim = cur + sg_dma_len(sg) - end;
> +
> + umem_dmabuf->last_sg = sg;
> + umem_dmabuf->last_sg_trim = trim;
> + sg_dma_len(sg) -= trim;
> + break;
> + }
> + cur += sg_dma_len(sg);
> + }
> +
> + umem_dmabuf->umem.sg_head.sgl = umem_dmabuf->first_sg;
> + umem_dmabuf->umem.sg_head.nents = nmap;
> + umem_dmabuf->umem.nmap = nmap;
> + umem_dmabuf->sgt = sgt;
> +
> +wait_fence:
> + /*
> +  * Although the sg list is valid now, the content of the pages
> +  * may be not up-to-date. Wait for the exporter to finish
> +  * the migration.
> +  */
> + fence = dma_resv_get_excl(umem_dmabuf->attach->dmabuf->resv);
> + if (fence)
> + return dma_fence_wait(fence, false);

You called dma_buf_map_attachment() earlier in this function, so if
you return an error here, the dma_buf won't be unmapped in pagefault_dmabuf_mr().
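One way to address that, purely as a sketch (whether the caller should do
the cleanup instead is exactly the open question here): undo the mapping
before propagating the wait error,

        long ret;

        fence = dma_resv_get_excl(umem_dmabuf->attach->dmabuf->resv);
        if (fence) {
                ret = dma_fence_wait(fence, false);
                if (ret) {
                        ib_umem_dmabuf_unmap_pages(umem_dmabuf);
                        return ret;
                }
        }
        return 0;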

Thanks


Re: [PATCH v13 1/4] RDMA/umem: Support importing dma-buf as user memory region

2020-12-08 Thread Leon Romanovsky
On Tue, Dec 08, 2020 at 06:13:20PM +, Xiong, Jianxin wrote:
> > -Original Message-
> > From: Leon Romanovsky 
> > Sent: Monday, December 07, 2020 11:06 PM
> > To: Xiong, Jianxin 
> > Cc: linux-r...@vger.kernel.org; dri-devel@lists.freedesktop.org; Doug 
> > Ledford ; Jason Gunthorpe ;
> > Sumit Semwal ; Christian Koenig 
> > ; Vetter, Daniel 
> > Subject: Re: [PATCH v13 1/4] RDMA/umem: Support importing dma-buf as user 
> > memory region
> >
> > On Mon, Dec 07, 2020 at 02:15:50PM -0800, Jianxin Xiong wrote:
> > > Dma-buf is a standard cross-driver buffer sharing mechanism that can
> > > be used to support peer-to-peer access from RDMA devices.
> > >
> > > Device memory exported via dma-buf is associated with a file descriptor.
> > > This is passed to the user space as a property associated with the
> > > buffer allocation. When the buffer is registered as a memory region,
> > > the file descriptor is passed to the RDMA driver along with other
> > > parameters.
> > >
> > > Implement the common code for importing dma-buf object and mapping
> > > dma-buf pages.
> > >
> > > Signed-off-by: Jianxin Xiong 
> > > Reviewed-by: Sean Hefty 
> > > Acked-by: Michael J. Ruhl 
> > > Acked-by: Christian Koenig 
> > > Acked-by: Daniel Vetter 
> > >
> > > Conflicts:
> > >   include/rdma/ib_umem.h
> >

<...>

> > > + /*
> > > +  * Although the sg list is valid now, the content of the pages
> > > +  * may be not up-to-date. Wait for the exporter to finish
> > > +  * the migration.
> > > +  */
> > > + fence = dma_resv_get_excl(umem_dmabuf->attach->dmabuf->resv);
> > > + if (fence)
> > > + dma_fence_wait(fence, false);
> >
> > Any reason do not check return result from dma_fence_wait()?
>
> This is called with the interruptible flag set to false and normally
> should only return 0.
> I do see similar usages, some that check the result and some that don't.
> Maybe we can add a WARN_ON here?

I have no idea :), I just saw that other places check the returned value.

<...>

> > > +
> > > +struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device,
> > > +unsigned long offset, size_t size,
> > > +int fd, int access,
> > > +const struct dma_buf_attach_ops *ops) {
> > > + struct dma_buf *dmabuf;
> > > + struct ib_umem_dmabuf *umem_dmabuf;
> > > + struct ib_umem *umem;
> > > + unsigned long end;
> > > + long ret = -EINVAL;
> >
> > It is wrong type for the returned value. One of the possible options is to 
> > declare "struct ib_umem *ret;" and set ret = ERR_PTR(-EINVAL) or
> > ret = ERR_CAST(dmabuf);
>
> At the actual point the value is returned, ERR_PTR(ret) is used. I think we 
> can change the
> variable name to "err" instead to avoid confusion.

The point is that "ret" should be declared as "struct ib_umem *" and not
as "long" and ERR_CAST() should be used instead of (void *).
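
I.e., a minimal sketch of the suggested shape (heavily abbreviated, the
"..." elides the rest of the function):

	struct ib_umem *ib_umem_dmabuf_get(struct ib_device *device, ...)
	{
		struct dma_buf *dmabuf;

		dmabuf = dma_buf_get(fd);
		if (IS_ERR(dmabuf))
			return ERR_CAST(dmabuf);
		...
		return ERR_PTR(-ENOMEM);
	}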

<...>

> > > +static inline struct ib_umem *ib_umem_dmabuf_get(struct ib_device 
> > > *device,
> > > +  unsigned long offset,
> > > +  size_t size, int fd,
> > > +  int access,
> > > +  struct dma_buf_attach_ops 
> > > *ops) {
> > > + return ERR_PTR(-EINVAL);
> >
> > Probably, It should be EOPNOTSUPP and not EINVAL.
>
> EINVAL is used here to be consistent with existing definitions in the same 
> file.

ok

Thanks


Re: [PATCH v13 2/4] RDMA/core: Add device method for registering dma-buf based memory region

2020-12-07 Thread Leon Romanovsky
On Mon, Dec 07, 2020 at 02:15:51PM -0800, Jianxin Xiong wrote:
> Dma-buf based memory region requires one extra parameter and is processed
> quite differently. Adding a separate method allows clean separation from
> regular memory regions.
>
> Signed-off-by: Jianxin Xiong 
> Reviewed-by: Sean Hefty 
> Acked-by: Michael J. Ruhl 
> Acked-by: Christian Koenig 
> Acked-by: Daniel Vetter 
> ---
>  drivers/infiniband/core/device.c | 1 +
>  include/rdma/ib_verbs.h  | 6 +-
>  2 files changed, 6 insertions(+), 1 deletion(-)
>

Thanks,
Reviewed-by: Leon Romanovsky 


Re: [PATCH v13 3/4] RDMA/uverbs: Add uverbs command for dma-buf based MR registration

2020-12-07 Thread Leon Romanovsky
On Mon, Dec 07, 2020 at 02:15:52PM -0800, Jianxin Xiong wrote:
> Implement a new uverbs ioctl method for memory registration with file
> descriptor as an extra parameter.
>
> Signed-off-by: Jianxin Xiong 
> Reviewed-by: Sean Hefty 
> Acked-by: Michael J. Ruhl 
> Acked-by: Christian Koenig 
> Acked-by: Daniel Vetter 
> ---
>  drivers/infiniband/core/uverbs_std_types_mr.c | 117 
> +-
>  include/uapi/rdma/ib_user_ioctl_cmds.h|  14 +++
>  2 files changed, 129 insertions(+), 2 deletions(-)
>

Thanks,
Reviewed-by: Leon Romanovsky 


Re: [PATCH v13 1/4] RDMA/umem: Support importing dma-buf as user memory region

2020-12-07 Thread Leon Romanovsky
On Mon, Dec 07, 2020 at 02:15:50PM -0800, Jianxin Xiong wrote:
> Dma-buf is a standard cross-driver buffer sharing mechanism that can be
> used to support peer-to-peer access from RDMA devices.
>
> Device memory exported via dma-buf is associated with a file descriptor.
> This is passed to the user space as a property associated with the
> buffer allocation. When the buffer is registered as a memory region,
> the file descriptor is passed to the RDMA driver along with other
> parameters.
>
> Implement the common code for importing dma-buf object and mapping
> dma-buf pages.
>
> Signed-off-by: Jianxin Xiong 
> Reviewed-by: Sean Hefty 
> Acked-by: Michael J. Ruhl 
> Acked-by: Christian Koenig 
> Acked-by: Daniel Vetter 
>
> Conflicts:
>   include/rdma/ib_umem.h

This probably leftover from rebase, am I right?

> ---
>  drivers/infiniband/core/Makefile  |   2 +-
>  drivers/infiniband/core/umem.c|   3 +
>  drivers/infiniband/core/umem_dmabuf.c | 173 
> ++
>  include/rdma/ib_umem.h|  43 -
>  4 files changed, 219 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/infiniband/core/umem_dmabuf.c
>
> diff --git a/drivers/infiniband/core/Makefile 
> b/drivers/infiniband/core/Makefile
> index ccf2670..8ab4eea 100644
> --- a/drivers/infiniband/core/Makefile
> +++ b/drivers/infiniband/core/Makefile
> @@ -40,5 +40,5 @@ ib_uverbs-y :=  uverbs_main.o 
> uverbs_cmd.o uverbs_marshall.o \
>   uverbs_std_types_srq.o \
>   uverbs_std_types_wq.o \
>   uverbs_std_types_qp.o
> -ib_uverbs-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
> +ib_uverbs-$(CONFIG_INFINIBAND_USER_MEM) += umem.o umem_dmabuf.o
>  ib_uverbs-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 7ca4112..cc131f8 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -2,6 +2,7 @@
>   * Copyright (c) 2005 Topspin Communications.  All rights reserved.
>   * Copyright (c) 2005 Cisco Systems.  All rights reserved.
>   * Copyright (c) 2005 Mellanox Technologies. All rights reserved.
> + * Copyright (c) 2020 Intel Corporation. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -278,6 +279,8 @@ void ib_umem_release(struct ib_umem *umem)
>  {
>   if (!umem)
>   return;
> + if (umem->is_dmabuf)
> + return ib_umem_dmabuf_release(to_ib_umem_dmabuf(umem));
>   if (umem->is_odp)
>   return ib_umem_odp_release(to_ib_umem_odp(umem));
>
> diff --git a/drivers/infiniband/core/umem_dmabuf.c 
> b/drivers/infiniband/core/umem_dmabuf.c
> new file mode 100644
> index 000..e50b955
> --- /dev/null
> +++ b/drivers/infiniband/core/umem_dmabuf.c
> @@ -0,0 +1,173 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> +/*
> + * Copyright (c) 2020 Intel Corporation. All rights reserved.
> + */
> +
> +#include 
> +#include 
> +#include 
> +
> +#include "uverbs.h"
> +
> +int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf *umem_dmabuf)
> +{
> + struct sg_table *sgt;
> + struct scatterlist *sg;
> + struct dma_fence *fence;
> + unsigned long start, end, cur;
> + unsigned int nmap;
> + int i;
> +
> + dma_resv_assert_held(umem_dmabuf->attach->dmabuf->resv);
> +
> + if (umem_dmabuf->sgt)
> + return 0;
> +
> + sgt = dma_buf_map_attachment(umem_dmabuf->attach, DMA_BIDIRECTIONAL);
> + if (IS_ERR(sgt))
> + return PTR_ERR(sgt);
> +
> + /* modify the sg list in-place to match umem address and length */
> +
> + start = ALIGN_DOWN(umem_dmabuf->umem.address, PAGE_SIZE);
> + end = ALIGN(umem_dmabuf->umem.address + umem_dmabuf->umem.length,
> + PAGE_SIZE);
> + cur = 0;
> + nmap = 0;

Better to put as part of variable initialization.
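
I.e., something like (sketch):

	unsigned long start, end, cur = 0;
	unsigned int nmap = 0;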

> + for_each_sgtable_dma_sg(sgt, sg, i) {
> + if (start < cur + sg_dma_len(sg) && cur < end)
> + nmap++;
> + if (cur <= start && start < cur + sg_dma_len(sg)) {
> + unsigned long offset = start - cur;
> +
> + umem_dmabuf->first_sg = sg;
> + umem_dmabuf->first_sg_offset = offset;
> + sg_dma_address(sg) += offset;
> + sg_dma_len(sg) -= offset;
> + cur += offset;
> + }
> + if (cur < end && end <= cur + sg_dma_len(sg)) {
> + unsigned long trim = cur + sg_dma_len(sg) - end;
> +
> + umem_dmabuf->last_sg = sg;
> + umem_dmabuf->last_sg_trim = trim;
> + sg_dma_len(sg) -= trim;
> + break;
> + }
> + cur += sg_dma_len(sg);

[PATCH rdma-next v5 2/4] tools/testing/scatterlist: Rejuvenate bit-rotten test

2020-10-04 Thread Leon Romanovsky
From: Tvrtko Ursulin 

A couple small tweaks are needed to make the test build and run
on current kernels.

Signed-off-by: Tvrtko Ursulin 
Cc: Maor Gottlieb 
Signed-off-by: Leon Romanovsky 
---
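For reference, the rejuvenated test builds and runs standalone (assuming
the default target of the Makefile touched below):

	$ cd tools/testing/scatterlist
	$ make
	$ ./main
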
 tools/testing/scatterlist/Makefile   |  3 ++-
 tools/testing/scatterlist/linux/mm.h | 35 
 2 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/tools/testing/scatterlist/Makefile 
b/tools/testing/scatterlist/Makefile
index cbb003d9305e..c65233876622 100644
--- a/tools/testing/scatterlist/Makefile
+++ b/tools/testing/scatterlist/Makefile
@@ -14,7 +14,7 @@ targets: include $(TARGETS)
 main: $(OFILES)

 clean:
-   $(RM) $(TARGETS) $(OFILES) scatterlist.c linux/scatterlist.h 
linux/highmem.h linux/kmemleak.h asm/io.h
+   $(RM) $(TARGETS) $(OFILES) scatterlist.c linux/scatterlist.h 
linux/highmem.h linux/kmemleak.h linux/slab.h asm/io.h
@rmdir asm

 scatterlist.c: ../../../lib/scatterlist.c
@@ -28,4 +28,5 @@ include: ../../../include/linux/scatterlist.h
@touch asm/io.h
@touch linux/highmem.h
@touch linux/kmemleak.h
+   @touch linux/slab.h
@cp $< linux/scatterlist.h
diff --git a/tools/testing/scatterlist/linux/mm.h 
b/tools/testing/scatterlist/linux/mm.h
index 6f9ac14aa800..6ae907f375d2 100644
--- a/tools/testing/scatterlist/linux/mm.h
+++ b/tools/testing/scatterlist/linux/mm.h
@@ -114,6 +114,12 @@ static inline void *kmalloc(unsigned int size, unsigned 
int flags)
return malloc(size);
 }

+static inline void *
+kmalloc_array(unsigned int n, unsigned int size, unsigned int flags)
+{
+   return malloc(n * size);
+}
+
 #define kfree(x) free(x)

 #define kmemleak_alloc(a, b, c, d)
@@ -122,4 +128,33 @@ static inline void *kmalloc(unsigned int size, unsigned 
int flags)
 #define PageSlab(p) (0)
 #define flush_kernel_dcache_page(p)

+#define MAX_ERRNO  4095
+
+#define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= (unsigned 
long)-MAX_ERRNO)
+
+static inline void * __must_check ERR_PTR(long error)
+{
+   return (void *) error;
+}
+
+static inline long __must_check PTR_ERR(__force const void *ptr)
+{
+   return (long) ptr;
+}
+
+static inline bool __must_check IS_ERR(__force const void *ptr)
+{
+   return IS_ERR_VALUE((unsigned long)ptr);
+}
+
+static inline int __must_check PTR_ERR_OR_ZERO(__force const void *ptr)
+{
+   if (IS_ERR(ptr))
+   return PTR_ERR(ptr);
+   else
+   return 0;
+}
+
+#define IS_ENABLED(x) (0)
+
 #endif
--
2.26.2



[PATCH rdma-next v5 1/4] lib/scatterlist: Add support in dynamic allocation of SG table from pages

2020-10-04 Thread Leon Romanovsky
From: Maor Gottlieb 

Extend __sg_alloc_table_from_pages to support dynamic allocation of
SG table from pages. It should be used by drivers that can't supply
all the pages at one time.

This function returns the last populated SGE in the table. Users should
pass it as an argument to the function from the second call onward.
As before, nents will be equal to the number of populated SGEs (chunks).

With this new extension, drivers can benefit from the optimization of merging
contiguous pages without a need to allocate all pages in advance and
hold them in a large buffer.

E.g. the Infiniband driver allocates a single page to hold the pages. For
a 1TB memory registration, the temporary buffer would consume only 4KB
instead of 2GB.

Signed-off-by: Maor Gottlieb 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Leon Romanovsky 
---
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c |  12 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c  |  15 ++-
 include/linux/scatterlist.h |  38 +++---
 lib/scatterlist.c   | 125 
 tools/testing/scatterlist/main.c|   9 +-
 5 files changed, 142 insertions(+), 57 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index 12b30075134a..f2eaed6aca3d 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -403,6 +403,7 @@ __i915_gem_userptr_alloc_pages(struct drm_i915_gem_object 
*obj,
unsigned int max_segment = i915_sg_segment_size();
struct sg_table *st;
unsigned int sg_page_sizes;
+   struct scatterlist *sg;
int ret;

st = kmalloc(sizeof(*st), GFP_KERNEL);
@@ -410,13 +411,12 @@ __i915_gem_userptr_alloc_pages(struct drm_i915_gem_object 
*obj,
return ERR_PTR(-ENOMEM);

 alloc_table:
-   ret = __sg_alloc_table_from_pages(st, pvec, num_pages,
- 0, num_pages << PAGE_SHIFT,
- max_segment,
- GFP_KERNEL);
-   if (ret) {
+   sg = __sg_alloc_table_from_pages(st, pvec, num_pages, 0,
+num_pages << PAGE_SHIFT, max_segment,
+NULL, 0, GFP_KERNEL);
+   if (IS_ERR(sg)) {
kfree(st);
-   return ERR_PTR(ret);
+   return ERR_CAST(sg);
}

ret = i915_gem_gtt_prepare_pages(obj, st);
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c 
b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
index ab524ab3b0b4..f22acd398b1f 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
@@ -419,6 +419,7 @@ static int vmw_ttm_map_dma(struct vmw_ttm_tt *vmw_tt)
int ret = 0;
static size_t sgl_size;
static size_t sgt_size;
+   struct scatterlist *sg;

if (vmw_tt->mapped)
return 0;
@@ -441,13 +442,15 @@ static int vmw_ttm_map_dma(struct vmw_ttm_tt *vmw_tt)
if (unlikely(ret != 0))
return ret;

-   ret = __sg_alloc_table_from_pages
-   (&vmw_tt->sgt, vsgt->pages, vsgt->num_pages, 0,
-(unsigned long) vsgt->num_pages << PAGE_SHIFT,
-dma_get_max_seg_size(dev_priv->dev->dev),
-GFP_KERNEL);
-   if (unlikely(ret != 0))
+   sg = __sg_alloc_table_from_pages(&vmw_tt->sgt, vsgt->pages,
+   vsgt->num_pages, 0,
+   (unsigned long) vsgt->num_pages << PAGE_SHIFT,
+   dma_get_max_seg_size(dev_priv->dev->dev),
+   NULL, 0, GFP_KERNEL);
+   if (IS_ERR(sg)) {
+   ret = PTR_ERR(sg);
goto out_sg_alloc_fail;
+   }

if (vsgt->num_pages > vmw_tt->sgt.nents) {
uint64_t over_alloc =
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 45cf7b69d852..36c47e7e66a2 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -165,6 +165,22 @@ static inline void sg_set_buf(struct scatterlist *sg, 
const void *buf,
 #define for_each_sgtable_dma_sg(sgt, sg, i)\
for_each_sg((sgt)->sgl, sg, (sgt)->nents, i)

+static inline void __sg_chain(struct scatterlist *chain_sg,
+ struct scatterlist *sgl)
+{
+   /*
+* offset and length are unused for chain entry. Clear them.
+*/
+   chain_sg->offset = 0;
+   chain_sg->length = 0;
+
+   /*
+* Set lowest bit to indicate a link pointer, and make sure to clear
+* the termination bit if it happens to be set.

[PATCH rdma-next v5 4/4] RDMA/umem: Move to allocate SG table from pages

2020-10-04 Thread Leon Romanovsky
From: Maor Gottlieb 

Remove the implementation of ib_umem_add_sg_table and instead
call __sg_alloc_table_from_pages() which already has the logic to
merge contiguous pages.

Besides removing duplicated functionality, it reduces the memory
consumption of the SG table significantly. Prior to this patch, the SG
table was allocated in advance without considering contiguous pages.

On a system using 2MB huge pages, without this change, the SG table
would contain 512x more SG entries than needed.
E.g. for 100GB memory registration:

 Number of entries  Size
Before  26214400  600.0MB
After   51200     1.2MB

Signed-off-by: Maor Gottlieb 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/core/umem.c | 94 +-
 1 file changed, 12 insertions(+), 82 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index c1ab6a4f2bc3..e9fecbdf391b 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -61,73 +61,6 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
	sg_free_table(&umem->sg_head);
 }

-/* ib_umem_add_sg_table - Add N contiguous pages to scatter table
- *
- * sg: current scatterlist entry
- * page_list: array of npage struct page pointers
- * npages: number of pages in page_list
- * max_seg_sz: maximum segment size in bytes
- * nents: [out] number of entries in the scatterlist
- *
- * Return new end of scatterlist
- */
-static struct scatterlist *ib_umem_add_sg_table(struct scatterlist *sg,
-   struct page **page_list,
-   unsigned long npages,
-   unsigned int max_seg_sz,
-   int *nents)
-{
-   unsigned long first_pfn;
-   unsigned long i = 0;
-   bool update_cur_sg = false;
-   bool first = !sg_page(sg);
-
-   /* Check if new page_list is contiguous with end of previous page_list.
-* sg->length here is a multiple of PAGE_SIZE and sg->offset is 0.
-*/
-   if (!first && (page_to_pfn(sg_page(sg)) + (sg->length >> PAGE_SHIFT) ==
-  page_to_pfn(page_list[0])))
-   update_cur_sg = true;
-
-   while (i != npages) {
-   unsigned long len;
-   struct page *first_page = page_list[i];
-
-   first_pfn = page_to_pfn(first_page);
-
-   /* Compute the number of contiguous pages we have starting
-* at i
-*/
-   for (len = 0; i != npages &&
- first_pfn + len == page_to_pfn(page_list[i]) &&
- len < (max_seg_sz >> PAGE_SHIFT);
-len++)
-   i++;
-
-   /* Squash N contiguous pages from page_list into current sge */
-   if (update_cur_sg) {
-   if ((max_seg_sz - sg->length) >= (len << PAGE_SHIFT)) {
-   sg_set_page(sg, sg_page(sg),
-   sg->length + (len << PAGE_SHIFT),
-   0);
-   update_cur_sg = false;
-   continue;
-   }
-   update_cur_sg = false;
-   }
-
-   /* Squash N contiguous pages into next sge or first sge */
-   if (!first)
-   sg = sg_next(sg);
-
-   (*nents)++;
-   sg_set_page(sg, first_page, len << PAGE_SHIFT, 0);
-   first = false;
-   }
-
-   return sg;
-}
-
 /**
  * ib_umem_find_best_pgsz - Find best HW page size to use for this MR
  *
@@ -217,7 +150,7 @@ struct ib_umem *ib_umem_get(struct ib_device *device, 
unsigned long addr,
struct mm_struct *mm;
unsigned long npages;
int ret;
-   struct scatterlist *sg;
+   struct scatterlist *sg = NULL;
unsigned int gup_flags = FOLL_WRITE;

/*
@@ -272,15 +205,9 @@ struct ib_umem *ib_umem_get(struct ib_device *device, 
unsigned long addr,

cur_base = addr & PAGE_MASK;

-   ret = sg_alloc_table(&umem->sg_head, npages, GFP_KERNEL);
-   if (ret)
-   goto vma;
-
if (!umem->writable)
gup_flags |= FOLL_FORCE;

-   sg = umem->sg_head.sgl;
-
while (npages) {
cond_resched();
ret = pin_user_pages_fast(cur_base,
@@ -292,15 +219,19 @@ struct ib_umem *ib_umem_get(struct ib_device *device, 
unsigned long addr,
goto umem_release;

cur_base += ret * PAGE_SIZE;
-   npages   -= ret;
-
-   sg = ib_umem_add_sg_table(sg, page_list, ret,
-   dma_get_max_seg_size(device->dma_device),

[PATCH rdma-next v5 0/4] Dynamically allocate SG table from the pages

2020-10-04 Thread Leon Romanovsky
From: Leon Romanovsky 

Changelog:
v5:
 * Use sg_init_table to allocate table and avoid changes in __sg_alloc_table
 * Fix offset issue
v4: https://lore.kernel.org/lkml/20200927064647.3106737-1-l...@kernel.org
 * Fixed formatting in first patch.
 * Added fix (clear tmp_netnts) in first patch to fix i915 failure.
 * Added test patches
v3: https://lore.kernel.org/linux-rdma/20200922083958.2150803-1-l...@kernel.org/
 * Squashed Christopher's suggestion to avoid introduced new API, but extend 
existing one.
v2: https://lore.kernel.org/linux-rdma/20200916140726.839377-1-l...@kernel.org
 * Fixed indentations and comments
 * Deleted sg_alloc_next()
 * Squashed lib/scatterlist patches into one
v1: https://lore.kernel.org/lkml/20200910134259.1304543-1-l...@kernel.org
 * Changed _sg_chain to be __sg_chain
 * Added dependency on ARCH_NO_SG_CHAIN
 * Removed struct sg_append
v0:
 * https://lore.kernel.org/lkml/20200903121853.1145976-1-l...@kernel.org

--
From Maor:

This series extends __sg_alloc_table_from_pages to allow chaining of
new pages to an already initialized SG table.

This allows drivers to utilize the optimization of merging contiguous
pages without needing to pre-allocate all the pages and hold them in
a very large temporary buffer prior to SG table initialization.

The second patch changes the Infiniband driver to use the new API. It
removes duplicate functionality from the code and benefits from the
optimization of allocating the SG table dynamically from pages.

On a system using 2MB huge pages, without this change, the SG table
would contain 512x more SG entries than needed.
E.g. for 100GB memory registration:

 Number of entries  Size
Before  26214400  600.0MB
After   51200     1.2MB
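
As an illustration, a minimal sketch of the append pattern (the chunking
helper is hypothetical; only __sg_alloc_table_from_pages() is from this
series):

	struct sg_table sgt;
	struct scatterlist *last = NULL;
	unsigned long left = total_pages;

	while (left) {
		/* pin_next_chunk() stands in for whatever fills page_list */
		unsigned int n = pin_next_chunk(page_list);

		left -= n;
		last = __sg_alloc_table_from_pages(&sgt, page_list, n, 0,
						   (unsigned long)n << PAGE_SHIFT,
						   max_segment, last, left,
						   GFP_KERNEL);
		if (IS_ERR(last))
			goto err;
	}
	sg_mark_end(last);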

Thanks

Maor Gottlieb (2):
  lib/scatterlist: Add support in dynamic allocation of SG table from
pages
  RDMA/umem: Move to allocate SG table from pages

Tvrtko Ursulin (2):
  tools/testing/scatterlist: Rejuvenate bit-rotten test
  tools/testing/scatterlist: Show errors in human readable form

 drivers/gpu/drm/i915/gem/i915_gem_userptr.c |  12 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c  |  15 ++-
 drivers/infiniband/core/umem.c  |  94 ++-
 include/linux/scatterlist.h |  38 +++---
 lib/scatterlist.c   | 125 
 tools/testing/scatterlist/Makefile  |   3 +-
 tools/testing/scatterlist/linux/mm.h|  35 ++
 tools/testing/scatterlist/main.c|  53 ++---
 8 files changed, 225 insertions(+), 150 deletions(-)

--
2.26.2



[PATCH rdma-next v5 3/4] tools/testing/scatterlist: Show errors in human readable form

2020-10-04 Thread Leon Romanovsky
From: Tvrtko Ursulin 

Instead of just asserting dump some more useful info about what the test
saw versus what it expected to see.

Signed-off-by: Tvrtko Ursulin 
Cc: Maor Gottlieb 
Signed-off-by: Leon Romanovsky 
---
 tools/testing/scatterlist/main.c | 44 
 1 file changed, 34 insertions(+), 10 deletions(-)

diff --git a/tools/testing/scatterlist/main.c b/tools/testing/scatterlist/main.c
index 4899359a31ac..b2c7e9f7b8d3 100644
--- a/tools/testing/scatterlist/main.c
+++ b/tools/testing/scatterlist/main.c
@@ -5,6 +5,15 @@

 #define MAX_PAGES (64)

+struct test {
+   int alloc_ret;
+   unsigned num_pages;
+   unsigned *pfn;
+   unsigned size;
+   unsigned int max_seg;
+   unsigned int expected_segments;
+};
+
 static void set_pages(struct page **pages, const unsigned *array, unsigned num)
 {
unsigned int i;
@@ -17,17 +26,32 @@ static void set_pages(struct page **pages, const unsigned 
*array, unsigned num)

 #define pfn(...) (unsigned []){ __VA_ARGS__ }

+static void fail(struct test *test, struct sg_table *st, const char *cond)
+{
+   unsigned int i;
+
+   fprintf(stderr, "Failed on '%s'!\n\n", cond);
+
+   printf("size = %u, max segment = %u, expected nents = %u\nst->nents = 
%u, st->orig_nents= %u\n",
+  test->size, test->max_seg, test->expected_segments, st->nents,
+  st->orig_nents);
+
+   printf("%u input PFNs:", test->num_pages);
+   for (i = 0; i < test->num_pages; i++)
+   printf(" %x", test->pfn[i]);
+   printf("\n");
+
+   exit(1);
+}
+
+#define VALIDATE(cond, st, test) \
+   if (!(cond)) \
+   fail((test), (st), #cond);
+
 int main(void)
 {
const unsigned int sgmax = SCATTERLIST_MAX_SEGMENT;
-   struct test {
-   int alloc_ret;
-   unsigned num_pages;
-   unsigned *pfn;
-   unsigned size;
-   unsigned int max_seg;
-   unsigned int expected_segments;
-   } *test, tests[] = {
+   struct test *test, tests[] = {
{ -EINVAL, 1, pfn(0), PAGE_SIZE, PAGE_SIZE + 1, 1 },
{ -EINVAL, 1, pfn(0), PAGE_SIZE, 0, 1 },
{ -EINVAL, 1, pfn(0), PAGE_SIZE, sgmax + 1, 1 },
@@ -66,8 +90,8 @@ int main(void)
if (test->alloc_ret)
continue;

-   assert(st.nents == test->expected_segments);
-   assert(st.orig_nents == test->expected_segments);
+   VALIDATE(st.nents == test->expected_segments, &st, test);
+   VALIDATE(st.orig_nents == test->expected_segments, &st, test);

		sg_free_table(&st);
}
--
2.26.2



Re: [PATCH rdma-next v4 4/4] RDMA/umem: Move to allocate SG table from pages

2020-09-30 Thread Leon Romanovsky
On Wed, Sep 30, 2020 at 12:14:06PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 30, 2020 at 06:05:15PM +0300, Maor Gottlieb wrote:
> > This is right only for the last iteration. E.g. in the first iteration in
> > case that there are more pages (left_pages), then we allocate
> > SG_MAX_SINGLE_ALLOC.  We don't know how many pages from the second iteration
> > will be squashed to the SGE from the first iteration.
>
> Well, it is 0 or 1 SGE's. Check if the first page is mergable and
> subtract one from the required length?
>
> I dislike this sg_mark_end() it is something that should be internal,
> IMHO.

I don't think so, but Maor provided a possible solution.
Can you take the patches?

Thanks

>
> Jason


Re: [PATCH rdma-next v4 4/4] RDMA/umem: Move to allocate SG table from pages

2020-09-30 Thread Leon Romanovsky
On Tue, Sep 29, 2020 at 04:59:29PM -0300, Jason Gunthorpe wrote:
> On Sun, Sep 27, 2020 at 09:46:47AM +0300, Leon Romanovsky wrote:
> > @@ -296,11 +223,17 @@ static struct ib_umem *__ib_umem_get(struct ib_device 
> > *device,
> > goto umem_release;
> >
> > cur_base += ret * PAGE_SIZE;
> > -   npages   -= ret;
> > -
> > -   sg = ib_umem_add_sg_table(sg, page_list, ret,
> > -   dma_get_max_seg_size(device->dma_device),
> > -   >sg_nents);
> > +   npages -= ret;
> > +   sg = __sg_alloc_table_from_pages(
> > +   &umem->sg_head, page_list, ret, 0, ret << PAGE_SHIFT,
> > +   dma_get_max_seg_size(device->dma_device), sg, npages,
> > +   GFP_KERNEL);
> > +   umem->sg_nents = umem->sg_head.nents;
> > +   if (IS_ERR(sg)) {
> > +   unpin_user_pages_dirty_lock(page_list, ret, 0);
> > +   ret = PTR_ERR(sg);
> > +   goto umem_release;
> > +   }
> > }
> >
> > sg_mark_end(sg);
>
> Does it still need the sg_mark_end?

It is preserved here for correctness; the release logic doesn't rely on
this marker, but it is better to leave it.
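
For reference, sg_next() is what depends on the marker; roughly (from
lib/scatterlist.c):

	struct scatterlist *sg_next(struct scatterlist *sg)
	{
		if (sg_is_last(sg))
			return NULL;

		sg++;
		if (unlikely(sg_is_chain(sg)))
			sg = sg_chain_ptr(sg);

		return sg;
	}

Any walker that advances via sg_next() until it returns NULL, rather than
carrying an explicit nents count, still needs the end marker.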

Thanks

>
> Jason


[PATCH rdma-next v4 1/4] lib/scatterlist: Add support in dynamic allocation of SG table from pages

2020-09-27 Thread Leon Romanovsky
From: Maor Gottlieb 

Extend __sg_alloc_table_from_pages to support dynamic allocation of
SG table from pages. It should be used by drivers that can't supply
all the pages at one time.

This function returns the last populated SGE in the table. Users should
pass it as an argument to the function from the second call onward.
As before, nents will be equal to the number of populated SGEs (chunks).

With this new extension, drivers can benefit from the optimization of merging
contiguous pages without a need to allocate all pages in advance and
hold them in a large buffer.

E.g. the Infiniband driver allocates a single page to hold the pages. For
a 1TB memory registration, the temporary buffer would consume only 4KB
instead of 2GB.

Signed-off-by: Maor Gottlieb 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Leon Romanovsky 
---
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c |  12 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c  |  15 +-
 include/linux/scatterlist.h |  43 +++---
 lib/scatterlist.c   | 160 +++-
 lib/sg_pool.c   |   3 +-
 tools/testing/scatterlist/main.c|   9 +-
 6 files changed, 165 insertions(+), 77 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index 12b30075134a..f2eaed6aca3d 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -403,6 +403,7 @@ __i915_gem_userptr_alloc_pages(struct drm_i915_gem_object 
*obj,
unsigned int max_segment = i915_sg_segment_size();
struct sg_table *st;
unsigned int sg_page_sizes;
+   struct scatterlist *sg;
int ret;

st = kmalloc(sizeof(*st), GFP_KERNEL);
@@ -410,13 +411,12 @@ __i915_gem_userptr_alloc_pages(struct drm_i915_gem_object 
*obj,
return ERR_PTR(-ENOMEM);

 alloc_table:
-   ret = __sg_alloc_table_from_pages(st, pvec, num_pages,
- 0, num_pages << PAGE_SHIFT,
- max_segment,
- GFP_KERNEL);
-   if (ret) {
+   sg = __sg_alloc_table_from_pages(st, pvec, num_pages, 0,
+num_pages << PAGE_SHIFT, max_segment,
+NULL, 0, GFP_KERNEL);
+   if (IS_ERR(sg)) {
kfree(st);
-   return ERR_PTR(ret);
+   return ERR_CAST(sg);
}

ret = i915_gem_gtt_prepare_pages(obj, st);
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c 
b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
index ab524ab3b0b4..f22acd398b1f 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
@@ -419,6 +419,7 @@ static int vmw_ttm_map_dma(struct vmw_ttm_tt *vmw_tt)
int ret = 0;
static size_t sgl_size;
static size_t sgt_size;
+   struct scatterlist *sg;

if (vmw_tt->mapped)
return 0;
@@ -441,13 +442,15 @@ static int vmw_ttm_map_dma(struct vmw_ttm_tt *vmw_tt)
if (unlikely(ret != 0))
return ret;

-   ret = __sg_alloc_table_from_pages
-   (&vmw_tt->sgt, vsgt->pages, vsgt->num_pages, 0,
-(unsigned long) vsgt->num_pages << PAGE_SHIFT,
-dma_get_max_seg_size(dev_priv->dev->dev),
-GFP_KERNEL);
-   if (unlikely(ret != 0))
+   sg = __sg_alloc_table_from_pages(&vmw_tt->sgt, vsgt->pages,
+   vsgt->num_pages, 0,
+   (unsigned long) vsgt->num_pages << PAGE_SHIFT,
+   dma_get_max_seg_size(dev_priv->dev->dev),
+   NULL, 0, GFP_KERNEL);
+   if (IS_ERR(sg)) {
+   ret = PTR_ERR(sg);
goto out_sg_alloc_fail;
+   }

if (vsgt->num_pages > vmw_tt->sgt.nents) {
uint64_t over_alloc =
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 45cf7b69d852..c24cc667b56b 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -165,6 +165,22 @@ static inline void sg_set_buf(struct scatterlist *sg, 
const void *buf,
 #define for_each_sgtable_dma_sg(sgt, sg, i)\
for_each_sg((sgt)->sgl, sg, (sgt)->nents, i)

+static inline void __sg_chain(struct scatterlist *chain_sg,
+ struct scatterlist *sgl)
+{
+   /*
+* offset and length are unused for chain entry. Clear them.
+*/
+   chain_sg->offset = 0;
+   chain_sg->length = 0;
+
+   /*
+* Set lowest bit to indicate a link pointer, and make sure to clear
+* the termination bit if it happens to be set.

[PATCH rdma-next v4 4/4] RDMA/umem: Move to allocate SG table from pages

2020-09-27 Thread Leon Romanovsky
From: Maor Gottlieb 

Remove the implementation of ib_umem_add_sg_table and instead
call __sg_alloc_table_from_pages() which already has the logic to
merge contiguous pages.

Besides removing duplicated functionality, it reduces the memory
consumption of the SG table significantly. Prior to this patch, the SG
table was allocated in advance without considering contiguous pages.

On a system using 2MB huge pages, without this change, the SG table
would contain 512x more SG entries than needed. E.g. for 100GB memory registration:

 Number of entries  Size
Before  26214400  600.0MB
After   51200     1.2MB

Signed-off-by: Maor Gottlieb 
Signed-off-by: Leon Romanovsky 
---
 drivers/infiniband/core/umem.c | 92 +-
 1 file changed, 12 insertions(+), 80 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 01b680b62846..0ef736970aba 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -63,73 +63,6 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
	sg_free_table(&umem->sg_head);
 }

-/* ib_umem_add_sg_table - Add N contiguous pages to scatter table
- *
- * sg: current scatterlist entry
- * page_list: array of npage struct page pointers
- * npages: number of pages in page_list
- * max_seg_sz: maximum segment size in bytes
- * nents: [out] number of entries in the scatterlist
- *
- * Return new end of scatterlist
- */
-static struct scatterlist *ib_umem_add_sg_table(struct scatterlist *sg,
-   struct page **page_list,
-   unsigned long npages,
-   unsigned int max_seg_sz,
-   int *nents)
-{
-   unsigned long first_pfn;
-   unsigned long i = 0;
-   bool update_cur_sg = false;
-   bool first = !sg_page(sg);
-
-   /* Check if new page_list is contiguous with end of previous page_list.
-* sg->length here is a multiple of PAGE_SIZE and sg->offset is 0.
-*/
-   if (!first && (page_to_pfn(sg_page(sg)) + (sg->length >> PAGE_SHIFT) ==
-  page_to_pfn(page_list[0])))
-   update_cur_sg = true;
-
-   while (i != npages) {
-   unsigned long len;
-   struct page *first_page = page_list[i];
-
-   first_pfn = page_to_pfn(first_page);
-
-   /* Compute the number of contiguous pages we have starting
-* at i
-*/
-   for (len = 0; i != npages &&
- first_pfn + len == page_to_pfn(page_list[i]) &&
- len < (max_seg_sz >> PAGE_SHIFT);
-len++)
-   i++;
-
-   /* Squash N contiguous pages from page_list into current sge */
-   if (update_cur_sg) {
-   if ((max_seg_sz - sg->length) >= (len << PAGE_SHIFT)) {
-   sg_set_page(sg, sg_page(sg),
-   sg->length + (len << PAGE_SHIFT),
-   0);
-   update_cur_sg = false;
-   continue;
-   }
-   update_cur_sg = false;
-   }
-
-   /* Squash N contiguous pages into next sge or first sge */
-   if (!first)
-   sg = sg_next(sg);
-
-   (*nents)++;
-   sg_set_page(sg, first_page, len << PAGE_SHIFT, 0);
-   first = false;
-   }
-
-   return sg;
-}
-
 /**
  * ib_umem_find_best_pgsz - Find best HW page size to use for this MR
  *
@@ -221,7 +154,7 @@ static struct ib_umem *__ib_umem_get(struct ib_device 
*device,
struct mm_struct *mm;
unsigned long npages;
int ret;
-   struct scatterlist *sg;
+   struct scatterlist *sg = NULL;
unsigned int gup_flags = FOLL_WRITE;

/*
@@ -276,15 +209,9 @@ static struct ib_umem *__ib_umem_get(struct ib_device 
*device,

cur_base = addr & PAGE_MASK;

-   ret = sg_alloc_table(&umem->sg_head, npages, GFP_KERNEL);
-   if (ret)
-   goto vma;
-
if (!umem->writable)
gup_flags |= FOLL_FORCE;

-   sg = umem->sg_head.sgl;
-
while (npages) {
cond_resched();
ret = pin_user_pages_fast(cur_base,
@@ -296,11 +223,17 @@ static struct ib_umem *__ib_umem_get(struct ib_device 
*device,
goto umem_release;

cur_base += ret * PAGE_SIZE;
-   npages   -= ret;
-
-   sg = ib_umem_add_sg_table(sg, page_list, ret,
-   dma_get_max_seg_size(device->dma_device),

[PATCH rdma-next v4 0/4] Dynamically allocate SG table from the pages

2020-09-27 Thread Leon Romanovsky
From: Leon Romanovsky 


Changelog:
v4:
 * Fixed formatting in first patch.
 * Added fix (clear tmp_netnts) in first patch to fix i915 failure.
 * Added test patches
v3: https://lore.kernel.org/linux-rdma/20200922083958.2150803-1-l...@kernel.org/
 * Squashed Christopher's suggestion to avoid introduced new API, but extend 
existing one.
v2: https://lore.kernel.org/linux-rdma/20200916140726.839377-1-l...@kernel.org
 * Fixed indentations and comments
 * Deleted sg_alloc_next()
 * Squashed lib/scatterlist patches into one
v1: https://lore.kernel.org/lkml/20200910134259.1304543-1-l...@kernel.org
 * Changed _sg_chain to be __sg_chain
 * Added dependency on ARCH_NO_SG_CHAIN
 * Removed struct sg_append
v0:
 * https://lore.kernel.org/lkml/20200903121853.1145976-1-l...@kernel.org

--
From Maor:

This series extends __sg_alloc_table_from_pages to allow chaining of
new pages to an already initialized SG table.

This allows drivers to utilize the optimization of merging contiguous
pages without needing to pre-allocate all the pages and hold them in
a very large temporary buffer prior to SG table initialization.

The second patch changes the Infiniband driver to use the new API. It
removes duplicate functionality from the code and benefits from the
optimization of allocating the SG table dynamically from pages.

On a system using 2MB huge pages, without this change, the SG table
would contain 512x more SG entries than needed.
E.g. for 100GB memory registration:

 Number of entries  Size
Before  26214400  600.0MB
After   51200     1.2MB

Thanks

Maor Gottlieb (2):
  lib/scatterlist: Add support in dynamic allocation of SG table from
pages
  RDMA/umem: Move to allocate SG table from pages

Tvrtko Ursulin (2):
  tools/testing/scatterlist: Rejuvenate bit-rotten test
  tools/testing/scatterlist: Show errors in human readable form

 drivers/gpu/drm/i915/gem/i915_gem_userptr.c |  12 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c  |  15 +-
 drivers/infiniband/core/umem.c  |  92 ++-
 include/linux/scatterlist.h |  43 +++---
 lib/scatterlist.c   | 160 +++-
 lib/sg_pool.c   |   3 +-
 tools/testing/scatterlist/Makefile  |   3 +-
 tools/testing/scatterlist/linux/mm.h|  35 +
 tools/testing/scatterlist/main.c|  53 +--
 9 files changed, 248 insertions(+), 168 deletions(-)

--
2.26.2



[PATCH rdma-next v4 2/4] tools/testing/scatterlist: Rejuvenate bit-rotten test

2020-09-27 Thread Leon Romanovsky
From: Tvrtko Ursulin 

A couple small tweaks are needed to make the test build and run
on current kernels.

Signed-off-by: Tvrtko Ursulin 
Signed-off-by: Maor Gottlieb 
Signed-off-by: Leon Romanovsky 
---
 tools/testing/scatterlist/Makefile   |  3 ++-
 tools/testing/scatterlist/linux/mm.h | 35 
 2 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/tools/testing/scatterlist/Makefile 
b/tools/testing/scatterlist/Makefile
index cbb003d9305e..c65233876622 100644
--- a/tools/testing/scatterlist/Makefile
+++ b/tools/testing/scatterlist/Makefile
@@ -14,7 +14,7 @@ targets: include $(TARGETS)
 main: $(OFILES)

 clean:
-   $(RM) $(TARGETS) $(OFILES) scatterlist.c linux/scatterlist.h 
linux/highmem.h linux/kmemleak.h asm/io.h
+   $(RM) $(TARGETS) $(OFILES) scatterlist.c linux/scatterlist.h 
linux/highmem.h linux/kmemleak.h linux/slab.h asm/io.h
@rmdir asm

 scatterlist.c: ../../../lib/scatterlist.c
@@ -28,4 +28,5 @@ include: ../../../include/linux/scatterlist.h
@touch asm/io.h
@touch linux/highmem.h
@touch linux/kmemleak.h
+   @touch linux/slab.h
@cp $< linux/scatterlist.h
diff --git a/tools/testing/scatterlist/linux/mm.h 
b/tools/testing/scatterlist/linux/mm.h
index 6f9ac14aa800..6ae907f375d2 100644
--- a/tools/testing/scatterlist/linux/mm.h
+++ b/tools/testing/scatterlist/linux/mm.h
@@ -114,6 +114,12 @@ static inline void *kmalloc(unsigned int size, unsigned 
int flags)
return malloc(size);
 }

+static inline void *
+kmalloc_array(unsigned int n, unsigned int size, unsigned int flags)
+{
+   return malloc(n * size);
+}
+
 #define kfree(x) free(x)

 #define kmemleak_alloc(a, b, c, d)
@@ -122,4 +128,33 @@ static inline void *kmalloc(unsigned int size, unsigned 
int flags)
 #define PageSlab(p) (0)
 #define flush_kernel_dcache_page(p)

+#define MAX_ERRNO  4095
+
+#define IS_ERR_VALUE(x) unlikely((unsigned long)(void *)(x) >= (unsigned 
long)-MAX_ERRNO)
+
+static inline void * __must_check ERR_PTR(long error)
+{
+   return (void *) error;
+}
+
+static inline long __must_check PTR_ERR(__force const void *ptr)
+{
+   return (long) ptr;
+}
+
+static inline bool __must_check IS_ERR(__force const void *ptr)
+{
+   return IS_ERR_VALUE((unsigned long)ptr);
+}
+
+static inline int __must_check PTR_ERR_OR_ZERO(__force const void *ptr)
+{
+   if (IS_ERR(ptr))
+   return PTR_ERR(ptr);
+   else
+   return 0;
+}
+
+#define IS_ENABLED(x) (0)
+
 #endif
--
2.26.2



[PATCH rdma-next v4 3/4] tools/testing/scatterlist: Show errors in human readable form

2020-09-27 Thread Leon Romanovsky
From: Tvrtko Ursulin 

Instead of just asserting dump some more useful info about what the test
saw versus what it expected to see.

Signed-off-by: Tvrtko Ursulin 
Signed-off-by: Maor Gottlieb 
Signed-off-by: Leon Romanovsky 
---
 tools/testing/scatterlist/main.c | 44 
 1 file changed, 34 insertions(+), 10 deletions(-)

diff --git a/tools/testing/scatterlist/main.c b/tools/testing/scatterlist/main.c
index 4899359a31ac..b2c7e9f7b8d3 100644
--- a/tools/testing/scatterlist/main.c
+++ b/tools/testing/scatterlist/main.c
@@ -5,6 +5,15 @@

 #define MAX_PAGES (64)

+struct test {
+   int alloc_ret;
+   unsigned num_pages;
+   unsigned *pfn;
+   unsigned size;
+   unsigned int max_seg;
+   unsigned int expected_segments;
+};
+
 static void set_pages(struct page **pages, const unsigned *array, unsigned num)
 {
unsigned int i;
@@ -17,17 +26,32 @@ static void set_pages(struct page **pages, const unsigned 
*array, unsigned num)

 #define pfn(...) (unsigned []){ __VA_ARGS__ }

+static void fail(struct test *test, struct sg_table *st, const char *cond)
+{
+   unsigned int i;
+
+   fprintf(stderr, "Failed on '%s'!\n\n", cond);
+
+   printf("size = %u, max segment = %u, expected nents = %u\nst->nents = 
%u, st->orig_nents= %u\n",
+  test->size, test->max_seg, test->expected_segments, st->nents,
+  st->orig_nents);
+
+   printf("%u input PFNs:", test->num_pages);
+   for (i = 0; i < test->num_pages; i++)
+   printf(" %x", test->pfn[i]);
+   printf("\n");
+
+   exit(1);
+}
+
+#define VALIDATE(cond, st, test) \
+   if (!(cond)) \
+   fail((test), (st), #cond);
+
 int main(void)
 {
const unsigned int sgmax = SCATTERLIST_MAX_SEGMENT;
-   struct test {
-   int alloc_ret;
-   unsigned num_pages;
-   unsigned *pfn;
-   unsigned size;
-   unsigned int max_seg;
-   unsigned int expected_segments;
-   } *test, tests[] = {
+   struct test *test, tests[] = {
{ -EINVAL, 1, pfn(0), PAGE_SIZE, PAGE_SIZE + 1, 1 },
{ -EINVAL, 1, pfn(0), PAGE_SIZE, 0, 1 },
{ -EINVAL, 1, pfn(0), PAGE_SIZE, sgmax + 1, 1 },
@@ -66,8 +90,8 @@ int main(void)
if (test->alloc_ret)
continue;

-   assert(st.nents == test->expected_segments);
-   assert(st.orig_nents == test->expected_segments);
+   VALIDATE(st.nents == test->expected_segments, &st, test);
+   VALIDATE(st.orig_nents == test->expected_segments, &st, test);

		sg_free_table(&st);
}
--
2.26.2



Re: [Intel-gfx] [PATCH rdma-next v3 1/2] lib/scatterlist: Add support in dynamic allocation of SG table from pages

2020-09-25 Thread Leon Romanovsky
On Thu, Sep 24, 2020 at 09:21:20AM +0100, Tvrtko Ursulin wrote:
>
> On 22/09/2020 09:39, Leon Romanovsky wrote:
> > From: Maor Gottlieb 
> >
> > Extend __sg_alloc_table_from_pages to support dynamic allocation of
> > SG table from pages. It should be used by drivers that can't supply
> > all the pages at one time.
> >
> > This function returns the last populated SGE in the table. Users should
> > pass it as an argument to the function from the second call and forward.
> > As before, nents will be equal to the number of populated SGEs (chunks).
>
> So it's appending and growing the "list", did I get that right? Sounds handy
> indeed. Some comments/questions below.

Yes, we (RDMA) use this function to chain contiguous pages.

>
> >
> > With this new extension, drivers can benefit the optimization of merging
> > contiguous pages without a need to allocate all pages in advance and
> > hold them in a large buffer.
> >
> > E.g. with the Infiniband driver that allocates a single page for hold
> > the
> > pages. For 1TB memory registration, the temporary buffer would consume
> > only
> > 4KB, instead of 2GB.
> >
> > Signed-off-by: Maor Gottlieb 
> > Signed-off-by: Leon Romanovsky 
> > ---
> >   drivers/gpu/drm/i915/gem/i915_gem_userptr.c |  12 +-
> >   drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c  |  15 +-
> >   include/linux/scatterlist.h |  43 +++---
> >   lib/scatterlist.c   | 158 +++-
> >   lib/sg_pool.c   |   3 +-
> >   tools/testing/scatterlist/main.c|   9 +-
> >   6 files changed, 163 insertions(+), 77 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
> > b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> > index 12b30075134a..f2eaed6aca3d 100644
> > --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> > +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
> > @@ -403,6 +403,7 @@ __i915_gem_userptr_alloc_pages(struct 
> > drm_i915_gem_object *obj,
> > unsigned int max_segment = i915_sg_segment_size();
> > struct sg_table *st;
> > unsigned int sg_page_sizes;
> > +   struct scatterlist *sg;
> > int ret;
> >
> > st = kmalloc(sizeof(*st), GFP_KERNEL);
> > @@ -410,13 +411,12 @@ __i915_gem_userptr_alloc_pages(struct 
> > drm_i915_gem_object *obj,
> > return ERR_PTR(-ENOMEM);
> >
> >   alloc_table:
> > -   ret = __sg_alloc_table_from_pages(st, pvec, num_pages,
> > - 0, num_pages << PAGE_SHIFT,
> > - max_segment,
> > - GFP_KERNEL);
> > -   if (ret) {
> > +   sg = __sg_alloc_table_from_pages(st, pvec, num_pages, 0,
> > +num_pages << PAGE_SHIFT, max_segment,
> > +NULL, 0, GFP_KERNEL);
> > +   if (IS_ERR(sg)) {
> > kfree(st);
> > -   return ERR_PTR(ret);
> > +   return ERR_CAST(sg);
> > }
> >
> > ret = i915_gem_gtt_prepare_pages(obj, st);
> > diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c 
> > b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
> > index ab524ab3b0b4..f22acd398b1f 100644
> > --- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
> > +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
> > @@ -419,6 +419,7 @@ static int vmw_ttm_map_dma(struct vmw_ttm_tt *vmw_tt)
> > int ret = 0;
> > static size_t sgl_size;
> > static size_t sgt_size;
> > +   struct scatterlist *sg;
> >
> > if (vmw_tt->mapped)
> > return 0;
> > @@ -441,13 +442,15 @@ static int vmw_ttm_map_dma(struct vmw_ttm_tt *vmw_tt)
> > if (unlikely(ret != 0))
> > return ret;
> >
> > -   ret = __sg_alloc_table_from_pages
> > -   (&vmw_tt->sgt, vsgt->pages, vsgt->num_pages, 0,
> > -(unsigned long) vsgt->num_pages << PAGE_SHIFT,
> > -dma_get_max_seg_size(dev_priv->dev->dev),
> > -GFP_KERNEL);
> > -   if (unlikely(ret != 0))
> > +   sg = __sg_alloc_table_from_pages(&vmw_tt->sgt, vsgt->pages,
> > +   vsgt->num_pages, 0,
> > +   (unsigned long) vsgt->num_pages << PAGE_SHIFT,
> > +   dma_get_max_seg_size(dev_priv->dev->dev),

[PATCH rdma-next v3 1/2] lib/scatterlist: Add support in dynamic allocation of SG table from pages

2020-09-22 Thread Leon Romanovsky
From: Maor Gottlieb 

Extend __sg_alloc_table_from_pages to support dynamic allocation of
SG table from pages. It should be used by drivers that can't supply
all the pages at one time.

This function returns the last populated SGE in the table. Users should
pass it as an argument to the function from the second call onward.
As before, nents will be equal to the number of populated SGEs (chunks).

With this new extension, drivers can benefit from the optimization of merging
contiguous pages without a need to allocate all pages in advance and
hold them in a large buffer.

E.g. the Infiniband driver allocates a single page to hold the pages. For
a 1TB memory registration, the temporary buffer would consume only 4KB
instead of 2GB.

Signed-off-by: Maor Gottlieb 
Signed-off-by: Leon Romanovsky 
---
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c |  12 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c  |  15 +-
 include/linux/scatterlist.h |  43 +++---
 lib/scatterlist.c   | 158 +++-
 lib/sg_pool.c   |   3 +-
 tools/testing/scatterlist/main.c|   9 +-
 6 files changed, 163 insertions(+), 77 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c 
b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index 12b30075134a..f2eaed6aca3d 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -403,6 +403,7 @@ __i915_gem_userptr_alloc_pages(struct drm_i915_gem_object 
*obj,
unsigned int max_segment = i915_sg_segment_size();
struct sg_table *st;
unsigned int sg_page_sizes;
+   struct scatterlist *sg;
int ret;

st = kmalloc(sizeof(*st), GFP_KERNEL);
@@ -410,13 +411,12 @@ __i915_gem_userptr_alloc_pages(struct drm_i915_gem_object 
*obj,
return ERR_PTR(-ENOMEM);

 alloc_table:
-   ret = __sg_alloc_table_from_pages(st, pvec, num_pages,
- 0, num_pages << PAGE_SHIFT,
- max_segment,
- GFP_KERNEL);
-   if (ret) {
+   sg = __sg_alloc_table_from_pages(st, pvec, num_pages, 0,
+num_pages << PAGE_SHIFT, max_segment,
+NULL, 0, GFP_KERNEL);
+   if (IS_ERR(sg)) {
kfree(st);
-   return ERR_PTR(ret);
+   return ERR_CAST(sg);
}

ret = i915_gem_gtt_prepare_pages(obj, st);
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c 
b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
index ab524ab3b0b4..f22acd398b1f 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c
@@ -419,6 +419,7 @@ static int vmw_ttm_map_dma(struct vmw_ttm_tt *vmw_tt)
int ret = 0;
static size_t sgl_size;
static size_t sgt_size;
+   struct scatterlist *sg;

if (vmw_tt->mapped)
return 0;
@@ -441,13 +442,15 @@ static int vmw_ttm_map_dma(struct vmw_ttm_tt *vmw_tt)
if (unlikely(ret != 0))
return ret;

-   ret = __sg_alloc_table_from_pages
-   (&vmw_tt->sgt, vsgt->pages, vsgt->num_pages, 0,
-(unsigned long) vsgt->num_pages << PAGE_SHIFT,
-dma_get_max_seg_size(dev_priv->dev->dev),
-GFP_KERNEL);
-   if (unlikely(ret != 0))
+   sg = __sg_alloc_table_from_pages(&vmw_tt->sgt, vsgt->pages,
+   vsgt->num_pages, 0,
+   (unsigned long) vsgt->num_pages << PAGE_SHIFT,
+   dma_get_max_seg_size(dev_priv->dev->dev),
+   NULL, 0, GFP_KERNEL);
+   if (IS_ERR(sg)) {
+   ret = PTR_ERR(sg);
goto out_sg_alloc_fail;
+   }

if (vsgt->num_pages > vmw_tt->sgt.nents) {
uint64_t over_alloc =
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 45cf7b69d852..c24cc667b56b 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -165,6 +165,22 @@ static inline void sg_set_buf(struct scatterlist *sg, 
const void *buf,
 #define for_each_sgtable_dma_sg(sgt, sg, i)\
for_each_sg((sgt)->sgl, sg, (sgt)->nents, i)

+static inline void __sg_chain(struct scatterlist *chain_sg,
+ struct scatterlist *sgl)
+{
+   /*
+* offset and length are unused for chain entry. Clear them.
+*/
+   chain_sg->offset = 0;
+   chain_sg->length = 0;
+
+   /*
+* Set lowest bit to indicate a link pointer, and make sure to clear
+* the termination bit if it happens to be set.

[PATCH rdma-next v3 0/2] Dynamically allocate SG table from the pages

2020-09-22 Thread Leon Romanovsky
From: Leon Romanovsky 

Changelog:
v3:
 * Squashed Christopher's suggestion to avoid introduced new API, but extend 
existing one.
v2: https://lore.kernel.org/linux-rdma/20200916140726.839377-1-l...@kernel.org
 * Fixed indentations and comments
 * Deleted sg_alloc_next()
 * Squashed lib/scatterlist patches into one
v1: https://lore.kernel.org/lkml/20200910134259.1304543-1-l...@kernel.org
 * Changed _sg_chain to be __sg_chain
 * Added dependency on ARCH_NO_SG_CHAIN
 * Removed struct sg_append
v0:
 * https://lore.kernel.org/lkml/20200903121853.1145976-1-l...@kernel.org

--
From Maor:

This series extends __sg_alloc_table_from_pages to allow chaining of
new pages to an already initialized SG table.

This allows drivers to utilize the optimization of merging contiguous
pages without needing to pre-allocate all the pages and hold them in
a very large temporary buffer prior to SG table initialization.

The second patch changes the Infiniband driver to use the new API. It
removes duplicate functionality from the code and benefits from the
optimization of allocating the SG table dynamically from pages.

On a system using 2MB huge pages, without this change, the SG table
would contain 512x more SG entries than needed.
E.g. for 100GB memory registration:

 Number of entries  Size
Before  26214400  600.0MB
After   51200     1.2MB

Thanks

Maor Gottlieb (2):
  lib/scatterlist: Add support in dynamic allocation of SG table from
pages
  RDMA/umem: Move to allocate SG table from pages

 drivers/gpu/drm/i915/gem/i915_gem_userptr.c |  12 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c  |  15 +-
 drivers/infiniband/core/umem.c  |  92 ++--
 include/linux/scatterlist.h |  43 +++---
 lib/scatterlist.c   | 158 +++-
 lib/sg_pool.c   |   3 +-
 tools/testing/scatterlist/main.c|   9 +-
 7 files changed, 175 insertions(+), 157 deletions(-)

--
2.26.2



Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-24 Thread Leon Romanovsky
On Tue, Dec 24, 2019 at 06:03:50PM -0800, John Hubbard wrote:
> On 12/22/19 5:23 AM, Leon Romanovsky wrote:
> > On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote:
> > > On 12/20/19 10:29 AM, Leon Romanovsky wrote:
> > > ...
> > > > > $ ./build.sh
> > > > > $ build/bin/run_tests.py
> > > > >
> > > > > If you get things that far I think Leon can get a reproduction for you
> > > >
> > > > I'm not so optimistic about that.
> > > >
> > >
> > > OK, I'm going to proceed for now on the assumption that I've got an 
> > > overflow
> > > problem that happens when huge pages are pinned. If I can get more 
> > > information,
> > > great, otherwise it's probably enough.
> > >
> > > One thing: for your repro, if you know the huge page size, and the system
> > > page size for that case, that would really help. Also the number of pins 
> > > per
> > > page, more or less, that you'd expect. Because Jason says that only 2M 
> > > huge
> > > pages are used...
> > >
> > > Because the other possibility is that the refcount really is going 
> > > negative,
> > > likely due to a mismatched pin/unpin somehow.
> > >
> > > If there's not an obvious repro case available, but you do have one (is 
> > > it easy
> > > to repro, though?), then *if* you have the time, I could point you to a 
> > > github
> > > branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index bb44c4d2ada7..8526fd03b978 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page)
> > >* get_user_pages and page_mkclean and other calls that race to set up 
> > > page
> > >* table entries.
> > >*/
> > > -#define GUP_PIN_COUNTING_BIAS (1U << 10)
> > > +#define GUP_PIN_COUNTING_BIAS (1U << 8)
> > >
> > >   void unpin_user_page(struct page *page);
> > >   void unpin_user_pages_dirty_lock(struct page **pages, unsigned long 
> > > npages,
> > >
> > > If that fails to repro, then we would be zeroing in on the root cause.
> > >
> > > The branch is here (I just tested it and it seems healthy):
> > >
> > > g...@github.com:johnhubbard/linux.git  
> > > pin_user_pages_tracking_v11_with_diags
> >
> > Hi,
> >
> > We tested the following branch and here comes results:
>
> Thanks for this testing run!
>
> > [root@server consume_mtts]# (master) $ grep foll_pin /proc/vmstat
> > nr_foll_pin_requested 0
> > nr_foll_pin_returned 0
> >
>
> Zero pinned pages!

Maybe we are missing some CONFIG_* option?
https://lore.kernel.org/linux-rdma/12a28917-f8c9-5092-2f01-92bb74714...@nvidia.com/T/#mf900896f5dfc86cdee9246219990c632ed77115f

>
> ...now I'm confused. Somehow FOLL_PIN and pin_user_pages*() calls are
> not happening. And although the backtraces below show some of my new
> routines (like try_grab_page), they also confirm the above: there is no
> pin_user_page*() call in the stack.
>
> In particular, it looks like ib_umem_get() is calling through to
> get_user_pages*(), rather than pin_user_pages*(). I don't see how this
> is possible, because the code on my screen shows ib_umem_get() calling
> pin_user_pages_fast().
>
> Any thoughts or ideas are welcome here.
>
> However, glossing over all of that and assuming that the new
> GUP_PIN_COUNTING_BIAS of 256 is applied, it's interesting that we still
> see any overflow. I'm less confident now that this is a true refcount
> overflow.

Earlier in this email thread, I posted a possible function call chain which
doesn't involve refcount overflow, but for some reason refcount overflow
was chosen as the avenue to explore.
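
For scale, a back-of-the-envelope bound (my numbers, assuming the scheme
under test): page->_refcount is a signed 32-bit atomic, so roughly 2^31 of
positive headroom; with GUP_PIN_COUNTING_BIAS = 2^10, that allows ~2^21
concurrent pins per compound head. Pinning a 2MB huge page once per 4KB
subpage adds 512 * 2^10 = 2^19 references, i.e. only ~2^12 (~4096)
simultaneous registrations of the same huge page before the counter
overflows.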

>
> Also, any information that would get me closer to being able to attempt
> my own reproduction of the problem are *very* welcome. :)

It is an ancient verification test (~10 years old) which is not easy to
make understandable and standalone :).

>
> thanks,
> --
> John Hubbard
> NVIDIA
>
> > [root@serer consume_mtts]# (master) $ dmesg
> > [  425.221459] [ cut here ]
> > [  425.225894] WARNING: CPU: 1 PID: 6738 at mm/gup.c:61 
> > try_grab_compound_head+0x90/0xa0
> > [  425.228021] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en 
> > ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 i

Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-22 Thread Leon Romanovsky
On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote:
> On 12/20/19 10:29 AM, Leon Romanovsky wrote:
> ...
> >> $ ./build.sh
> >> $ build/bin/run_tests.py
> >>
> >> If you get things that far I think Leon can get a reproduction for you
> >
> > I'm not so optimistic about that.
> >
>
> OK, I'm going to proceed for now on the assumption that I've got an overflow
> problem that happens when huge pages are pinned. If I can get more 
> information,
> great, otherwise it's probably enough.
>
> One thing: for your repro, if you know the huge page size, and the system
> page size for that case, that would really help. Also the number of pins per
> page, more or less, that you'd expect. Because Jason says that only 2M huge
> pages are used...
>
> Because the other possibility is that the refcount really is going negative,
> likely due to a mismatched pin/unpin somehow.
>
> If there's not an obvious repro case available, but you do have one (is it 
> easy
> to repro, though?), then *if* you have the time, I could point you to a github
> branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bb44c4d2ada7..8526fd03b978 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page)
>   * get_user_pages and page_mkclean and other calls that race to set up page
>   * table entries.
>   */
> -#define GUP_PIN_COUNTING_BIAS (1U << 10)
> +#define GUP_PIN_COUNTING_BIAS (1U << 8)
>
>  void unpin_user_page(struct page *page);
>  void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>
> If that fails to repro, then we would be zeroing in on the root cause.
>
> The branch is here (I just tested it and it seems healthy):
>
> g...@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags

Hi,

We tested the following branch and here are the results:
[root@server consume_mtts]# (master) $ grep foll_pin /proc/vmstat
nr_foll_pin_requested 0
nr_foll_pin_returned 0

[root@serer consume_mtts]# (master) $ dmesg
[  425.221459] [ cut here ]
[  425.225894] WARNING: CPU: 1 PID: 6738 at mm/gup.c:61 
try_grab_compound_head+0x90/0xa0
[  425.228021] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp 
pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel 
rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp 
scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm 
ib_cm ib_core [last unloaded: mlxfw]
[  425.235266] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G   O  
5.5.0-rc2+ #1
[  425.237480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1ubuntu1 04/01/2014
[  425.239738] RIP: 0010:try_grab_compound_head+0x90/0xa0
[  425.241170] Code: 06 48 8d 4f 34 f0 0f b1 57 34 74 cd 85 c0 74 cf 8d 14 06 
f0 0f b1 11 74 c0 eb f1 8d 14 06 f0 0f b1 11 74 b5 85 c0 75 f3 eb b5 <0f> 0b 31 
c0 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41
[  425.245739] RSP: 0018:c96878a8 EFLAGS: 00010082
[  425.247124] RAX: 8001 RBX: 7f780488a000 RCX: 0bb0
[  425.248956] RDX: ea000e031087 RSI: 8a00 RDI: ea000dc58000
[  425.250761] RBP: ea000e031080 R08: c9687974 R09: 000fffe0
[  425.252661] R10:  R11: 88836256 R12: 008a
[  425.254487] R13: 8003716000e7 R14: 7f780488a000 R15: c9687974
[  425.256309] FS:  7f780d9d3740() GS:8883b1c8() 
knlGS:
[  425.258401] CS:  0010 DS:  ES:  CR0: 80050033
[  425.259949] CR2: 02334048 CR3: 00039c68c001 CR4: 001606a0
[  425.261884] Call Trace:
[  425.262735]  gup_pgd_range+0x517/0x5a0
[  425.263819]  internal_get_user_pages_fast+0x210/0x250
[  425.265193]  ib_umem_get+0x298/0x550 [ib_uverbs]
[  425.266476]  mr_umem_get+0xc9/0x260 [mlx5_ib]
[  425.267699]  mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib]
[  425.269134]  ? xas_load+0x8/0x80
[  425.270074]  ? xa_load+0x48/0x90
[  425.271038]  ? lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs]
[  425.272757]  ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs]
[  425.274120]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 
[ib_uverbs]
[  425.276058]  ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs]
[  425.277657]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
[  425.279155]  ? __alloc_pages_nodemask+0x148/0x2b0
[  425.280445]  ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs]
[  425.281755]  do_vfs_ioctl+0x9d/0x650
[  425.282766]  ksys_ioctl+0x70/0x80
[  425.283745]  __x64_sys_ioctl+0x16/0x20
[  425.284912]  do_syscall_64+0x42/0x130
[  425.285973]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  425
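
For reference, the design under test boils down to bias-based accounting on
page->_refcount: each pin adds a large bias instead of 1, so pinned pages can
be detected from the refcount alone. A minimal standalone model of that idea
(illustration only, not the kernel code; the 1<<8 bias matches the diagnostic
branch being tested here):

#include <stdio.h>

#define GUP_PIN_COUNTING_BIAS (1U << 8)	/* 256, as in the test branch */

static int refcount = 1;		/* page->_refcount starts at 1 */

static void pin_page(void)   { refcount += GUP_PIN_COUNTING_BIAS; }
static void unpin_page(void) { refcount -= GUP_PIN_COUNTING_BIAS; }

/* Mirrors the idea behind page_maybe_dma_pinned() */
static int maybe_dma_pinned(void)
{
	return refcount >= GUP_PIN_COUNTING_BIAS;
}

int main(void)
{
	pin_page();
	printf("refcount=%d maybe_pinned=%d\n", refcount, maybe_dma_pinned());
	unpin_page();
	printf("refcount=%d maybe_pinned=%d\n", refcount, maybe_dma_pinned());
	return 0;
}

The zeroed nr_foll_pin_* counters above mean no caller went through
pin_user_pages*() at all, so no bias was ever applied to begin with.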

Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-21 Thread Leon Romanovsky
On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote:
> On 12/20/19 10:29 AM, Leon Romanovsky wrote:
> ...
> >> $ ./build.sh
> >> $ build/bin/run_tests.py
> >>
> >> If you get things that far I think Leon can get a reproduction for you
> >
> > I'm not so optimistic about that.
> >
>
> OK, I'm going to proceed for now on the assumption that I've got an overflow
> problem that happens when huge pages are pinned. If I can get more 
> information,
> great, otherwise it's probably enough.
>
> One thing: for your repro, if you know the huge page size, and the system
> page size for that case, that would really help. Also the number of pins per
> page, more or less, that you'd expect. Because Jason says that only 2M huge
> pages are used...
>
> Because the other possibility is that the refcount really is going negative,
> likely due to a mismatched pin/unpin somehow.
>
> If there's not an obvious repro case available, but you do have one (is it 
> easy
> to repro, though?), then *if* you have the time, I could point you to a github
> branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:

I'll see what I can do this Sunday.

>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bb44c4d2ada7..8526fd03b978 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page)
>   * get_user_pages and page_mkclean and other calls that race to set up page
>   * table entries.
>   */
> -#define GUP_PIN_COUNTING_BIAS (1U << 10)
> +#define GUP_PIN_COUNTING_BIAS (1U << 8)
>
>  void unpin_user_page(struct page *page);
>  void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>
> If that fails to repro, then we would be zeroing in on the root cause.
>
> The branch is here (I just tested it and it seems healthy):
>
> g...@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags
>
>
>
> thanks,
> --
> John Hubbard
> NVIDIA


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread Leon Romanovsky
On Thu, Dec 19, 2019 at 02:58:43PM -0800, John Hubbard wrote:
> On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
> ...
> > > 3. It would be nice if I could reproduce this. I have a two-node mlx5 
> > > Infiniband
> > > test setup, but I have done only the tiniest bit of user space IB coding, 
> > > so
> > > if you have any test programs that aren't too hard to deal with that could
> > > possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in 
> > > mind
> > > that I'm not an advanced IB programmer. At all. :)
> >
> > Clone this:
> >
> > https://github.com/linux-rdma/rdma-core.git
> >
> > Install all the required deps to build it (notably cython), see the 
> > README.md
> >
> > $ ./build.sh
> > $ build/bin/run_tests.py
> >
> > If you get things that far I think Leon can get a reproduction for you
> >
>
> Cool, it's up and running (1 failure, 3 skipped, out of 67 tests).
>
> This is a great test suite to have running, I'll add it to my scripts. Here's 
> the
> full output in case the failure or skip cases are a problem:
>
> $ sudo ./build/bin/run_tests.py --verbose
>
> test_create_ah (tests.test_addr.AHTest) ... ok
> test_create_ah_roce (tests.test_addr.AHTest) ... skipped "Can't run RoCE 
> tests on IB link layer"
> test_destroy_ah (tests.test_addr.AHTest) ... ok
> test_create_comp_channel (tests.test_cq.CCTest) ... ok
> test_destroy_comp_channel (tests.test_cq.CCTest) ... ok
> test_create_cq_ex (tests.test_cq.CQEXTest) ... ok
> test_create_cq_ex_bad_flow (tests.test_cq.CQEXTest) ... ok
> test_destroy_cq_ex (tests.test_cq.CQEXTest) ... ok
> test_create_cq (tests.test_cq.CQTest) ... ok
> test_create_cq_bad_flow (tests.test_cq.CQTest) ... ok
> test_destroy_cq (tests.test_cq.CQTest) ... ok
> test_rc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
> test_ud_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
> test_xrc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
> test_create_dm (tests.test_device.DMTest) ... ok
> test_create_dm_bad_flow (tests.test_device.DMTest) ... ok
> test_destroy_dm (tests.test_device.DMTest) ... ok
> test_destroy_dm_bad_flow (tests.test_device.DMTest) ... ok
> test_dm_read (tests.test_device.DMTest) ... ok
> test_dm_write (tests.test_device.DMTest) ... ok
> test_dm_write_bad_flow (tests.test_device.DMTest) ... ok
> test_dev_list (tests.test_device.DeviceTest) ... ok
> test_open_dev (tests.test_device.DeviceTest) ... ok
> test_query_device (tests.test_device.DeviceTest) ... ok
> test_query_device_ex (tests.test_device.DeviceTest) ... ok
> test_query_gid (tests.test_device.DeviceTest) ... ok
> test_query_port (tests.test_device.DeviceTest) ... FAIL
> test_query_port_bad_flow (tests.test_device.DeviceTest) ... ok
> test_create_dm_mr (tests.test_mr.DMMRTest) ... ok
> test_destroy_dm_mr (tests.test_mr.DMMRTest) ... ok
> test_buffer (tests.test_mr.MRTest) ... ok
> test_dereg_mr (tests.test_mr.MRTest) ... ok
> test_dereg_mr_twice (tests.test_mr.MRTest) ... ok
> test_lkey (tests.test_mr.MRTest) ... ok
> test_read (tests.test_mr.MRTest) ... ok
> test_reg_mr (tests.test_mr.MRTest) ... ok
> test_reg_mr_bad_flags (tests.test_mr.MRTest) ... ok
> test_reg_mr_bad_flow (tests.test_mr.MRTest) ... ok
> test_rkey (tests.test_mr.MRTest) ... ok
> test_write (tests.test_mr.MRTest) ... ok
> test_dereg_mw_type1 (tests.test_mr.MWTest) ... ok
> test_dereg_mw_type2 (tests.test_mr.MWTest) ... ok
> test_reg_mw_type1 (tests.test_mr.MWTest) ... ok
> test_reg_mw_type2 (tests.test_mr.MWTest) ... ok
> test_reg_mw_wrong_type (tests.test_mr.MWTest) ... ok
> test_odp_rc_traffic (tests.test_odp.OdpTestCase) ... ok
> test_odp_ud_traffic (tests.test_odp.OdpTestCase) ... skipped 'ODP is not 
> supported - ODP recv not supported'
> test_odp_xrc_traffic (tests.test_odp.OdpTestCase) ... ok
> test_default_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
> test_mem_align_allocators (tests.test_parent_domain.ParentDomainTestCase) ... 
> ok
> test_without_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
> test_alloc_pd (tests.test_pd.PDTest) ... ok
> test_create_pd_none_ctx (tests.test_pd.PDTest) ... ok
> test_dealloc_pd (tests.test_pd.PDTest) ... ok
> test_destroy_pd_twice (tests.test_pd.PDTest) ... ok
> test_multiple_pd_creation (tests.test_pd.PDTest) ... ok
> test_create_qp_ex_no_attr (tests.test_qp.QPTest) ... ok
> test_create_qp_ex_no_attr_connected (tests.test_qp.QPTest) ... ok
> test_create_qp_ex_with_attr (tests.test_qp.QPTest) ... ok
> test_create_qp_ex_with_attr_connected (tests.test_qp.QPTest) ... ok
> test_create_qp_no_attr (tests.test_qp.QPTest) ... ok
> test_create_qp_no_attr_connected (tests.test_qp.QPTest) ... ok
> test_create_qp_with_attr (tests.test_qp.QPTest) ... ok
> test_create_qp_with_attr_connected (tests.test_qp.QPTest) ... ok
> test_modify_qp (tests.test_qp.QPTest) ... ok
> test_query_qp (tests.test_qp.QPTest) ... ok
> test_rdmacm_sync_traffic (tests.test_rdmacm.CMTestCase) ... skipped 'No 
> devices with net interface'
>
> 

Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread Leon Romanovsky
On Thu, Dec 19, 2019 at 05:07:43PM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
> > On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > > Hi,
> > > >
> > > > This implements an API naming change (put_user_page*() -->
> > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > > extends that tracking to a few select subsystems. More subsystems will
> > > > be added in follow up work.
> > >
> > > Hi John,
> > >
> > > The patchset generates kernel panics in our IB testing. In our tests, we
> > > allocated a single memory block and registered multiple MRs using that
> > > single block.
> > >
> > > The possible bad flow is:
> > >   ib_umem_get() ->
> > >     pin_user_pages_fast(FOLL_WRITE) ->
> > >       internal_get_user_pages_fast(FOLL_WRITE) ->
> > >         gup_pgd_range() ->
> > >           gup_huge_pd() ->
> > >             gup_hugepte() ->
> > >               try_grab_compound_head() ->
> >
> > Hi Leon,
> >
> > Thanks very much for the detailed report! So we're overflowing...
> >
> > At first look, this seems likely to be hitting a weak point in the
> > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> > (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
> > 99-121). Basically it's pretty easy to overflow the page->_refcount
> > with huge pages if the pages have a *lot* of subpages.
> >
> > We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
>
> Considering that establishing these pins is entirely under user
> control, we can't have a limit here.
>
> If the number of allowed pins are exhausted then the
> pin_user_pages_fast() must fail back to the user.
>
> > 3. It would be nice if I could reproduce this. I have a two-node mlx5 
> > Infiniband
> > test setup, but I have done only the tiniest bit of user space IB coding, so
> > if you have any test programs that aren't too hard to deal with that could
> > possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
> > that I'm not an advanced IB programmer. At all. :)
>
> Clone this:
>
> https://github.com/linux-rdma/rdma-core.git
>
> Install all the required deps to build it (notably cython), see the README.md
>
> $ ./build.sh
> $ build/bin/run_tests.py
>
> If you get things that far I think Leon can get a reproduction for you

I'm not so optimistic about that.

Thanks

>
> Jason
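
To spell out the arithmetic behind John's "about 7 pins on 1GB huge pages"
estimate above: with a signed 32-bit page->_refcount and the default 1<<10
bias, a single pin of a full 1GB huge page bumps the compound head once per
4KB subpage. A small worked example (assumptions as stated, not kernel code):

#include <stdio.h>

int main(void)
{
	long long refcount_max = 1LL << 31;		/* signed 32-bit _refcount */
	long long subpages = (1LL << 30) / (1LL << 12);	/* 262144 subpages in 1GB */
	long long bias = 1LL << 10;			/* GUP_PIN_COUNTING_BIAS */
	long long per_pin = subpages * bias;		/* 2^28 added per full-page pin */

	printf("pins before overflow: %lld\n", refcount_max / per_pin); /* 8 */
	return 0;
}

The eighth increment reaches the sign bit, so only about seven pins are safe,
which is where the estimate comes from.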


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-19 Thread Leon Romanovsky
On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> Hi,
>
> This implements an API naming change (put_user_page*() -->
> unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> extends that tracking to a few select subsystems. More subsystems will
> be added in follow up work.

Hi John,

The patchset generates kernel panics in our IB testing. In our tests, we
allocated a single memory block and registered multiple MRs using that
single block.

The possible bad flow is:
  ib_umem_get() ->
    pin_user_pages_fast(FOLL_WRITE) ->
      internal_get_user_pages_fast(FOLL_WRITE) ->
        gup_pgd_range() ->
          gup_huge_pd() ->
            gup_hugepte() ->
              try_grab_compound_head() ->

static __maybe_unused struct page *try_grab_compound_head(struct page *page,
							   int refs,
							   unsigned int flags)
{
	if (flags & FOLL_GET)
		return try_get_compound_head(page, refs);
	else if (flags & FOLL_PIN)
		return try_pin_compound_head(page, refs);

	WARN_ON_ONCE(1);
	return NULL;
}

# (master) $ dmesg
[10924.70] mlx5_core :00:08.0 eth2: Link up
[10924.725383] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[10960.902254] [ cut here ]
[10960.905614] WARNING: CPU: 3 PID: 8838 at mm/gup.c:61 
try_grab_compound_head+0x92/0xd0
[10960.907313] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss 
nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt 
target_core_mod ib_srp rpcrdma rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib iw_cm 
ib_cm mlx5_ib ib_uverbs ib_core kvm_intel mlx5_core rfkill mlxfw sunrpc 
virtio_net pci_hyperv_intf kvm irqbypass net_failover crc32_pclmul i2c_piix4 
ptp crc32c_intel failover pcspkr ghash_clmulni_intel i2c_core pps_core 
sch_fq_codel ip_tables ata_generic pata_acpi serio_raw ata_piix floppy [last 
unloaded: mlxkvl]
[10960.917806] CPU: 3 PID: 8838 Comm: consume_mtts Tainted: G   OE 
5.5.0-rc2-for-upstream-perf-2019-12-18_10-06-50-78 #1
[10960.920530] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1ubuntu1 04/01/2014
[10960.923024] RIP: 0010:try_grab_compound_head+0x92/0xd0
[10960.924329] Code: e4 8d 14 06 48 8d 4f 34 f0 0f b1 57 34 0f 94 c2 84 d2 75 
cb 85 c0 74 cd 8d 14 06 f0 0f b1 11 0f 94 c2 84 d2 75 b9 66 90 eb ea <0f> 0b 31 
ff eb b7 85 c0 66 0f 1f 44 00 00 74 ab 8d 14 06 f0 0f b1
[10960.928512] RSP: 0018:c9000129f880 EFLAGS: 00010082
[10960.929831] RAX: 8001 RBX: 7f6397446000 RCX: 000fffe0
[10960.931422] RDX: 0004 RSI: 00011800 RDI: ea000f5d8000
[10960.933005] RBP: c9000129f93c R08: c9000129f93c R09: 0020
[10960.934584] R10: 88840774b200 R11: 88800230 R12: 7f6397446000
[10960.936212] R13: 0046 R14: 8003d76000e7 R15: 0080
[10960.937793] FS:  7f63a0590740() GS:88842f98() 
knlGS:
[10960.939962] CS:  0010 DS:  ES:  CR0: 80050033
[10960.941367] CR2: 023e9008 CR3: 000406d0a002 CR4: 007606e0
[10960.942975] DR0:  DR1:  DR2: 
[10960.944654] DR3:  DR6: fffe0ff0 DR7: 0400
[10960.946394] PKRU: 5554
[10960.947310] Call Trace:
[10960.948193]  gup_pgd_range+0x61e/0x950
[10960.949585]  internal_get_user_pages_fast+0x98/0x1c0
[10960.951313]  ib_umem_get+0x2b3/0x5a0 [ib_uverbs]
[10960.952929]  mr_umem_get+0xd8/0x280 [mlx5_ib]
[10960.954150]  ? xas_store+0x49/0x550
[10960.955187]  mlx5_ib_reg_user_mr+0x149/0x7a0 [mlx5_ib]
[10960.956478]  ? xas_load+0x9/0x80
[10960.957474]  ? xa_load+0x54/0x90
[10960.958465]  ? lookup_get_idr_uobject.part.10+0x12/0x80 [ib_uverbs]
[10960.959926]  ib_uverbs_reg_mr+0x138/0x2a0 [ib_uverbs]
[10960.961192]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xb1/0xf0 
[ib_uverbs]
[10960.963208]  ib_uverbs_cmd_verbs.isra.8+0x997/0xb30 [ib_uverbs]
[10960.964603]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
[10960.965949]  ? mem_cgroup_commit_charge+0x6a/0x140
[10960.967177]  ? page_add_new_anon_rmap+0x58/0xc0
[10960.968360]  ib_uverbs_ioctl+0xbc/0x130 [ib_uverbs]
[10960.969595]  do_vfs_ioctl+0xa6/0x640
[10960.970631]  ? syscall_trace_enter+0x1f8/0x2e0
[10960.971829]  ksys_ioctl+0x60/0x90
[10960.972825]  __x64_sys_ioctl+0x16/0x20
[10960.973888]  do_syscall_64+0x48/0x130
[10960.974949]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10960.976219] RIP: 0033:0x7f639fe9b267
[10960.977260] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 
c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
[10960.981413] RSP: 002b:7fff5335ca08 EFLAGS: 0246 ORIG_RAX: 
0010
[10960.983472] RAX: 
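
For anyone trying to approximate the reported pattern (one memory block, many
MRs over it) without the verification suite, a rough libibverbs sketch; the
device index, 2MB buffer size, and MR count are arbitrary choices, not the
actual test:

#include <infiniband/verbs.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	struct ibv_device **devs = ibv_get_device_list(NULL);
	struct ibv_context *ctx = devs ? ibv_open_device(devs[0]) : NULL;
	struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
	size_t len = 2UL << 20;	/* one 2MB huge page, as in the setup above */
	struct ibv_mr *mr[64];
	void *buf;
	int i;

	if (!pd)
		return 1;
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Each registration pins the same huge page again */
	for (i = 0; i < 64; i++) {
		mr[i] = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
		if (!mr[i]) {
			perror("ibv_reg_mr");
			break;
		}
	}
	printf("registered %d MRs over one block\n", i);
	return 0;
}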

Re: [PATCH v7 07/24] IB/umem: use get_user_pages_fast() to pin DMA pages

2019-11-24 Thread Leon Romanovsky
On Thu, Nov 21, 2019 at 10:36:43AM -0400, Jason Gunthorpe wrote:
> On Thu, Nov 21, 2019 at 12:07:46AM -0800, Christoph Hellwig wrote:
> > On Wed, Nov 20, 2019 at 11:13:37PM -0800, John Hubbard wrote:
> > > And get rid of the mmap_sem calls, as part of that. Note
> > > that get_user_pages_fast() will, if necessary, fall back to
> > > __gup_longterm_unlocked(), which takes the mmap_sem as needed.
> > >
> > > Reviewed-by: Jan Kara 
> > > Reviewed-by: Jason Gunthorpe 
> > > Reviewed-by: Ira Weiny 
> > > Signed-off-by: John Hubbard 
> >
> > Looks fine,
> >
> > Reviewed-by: Christoph Hellwig 
> >
> > Jason, can you queue this up for 5.5 to reduce this patch stack a bit?
>
> Yes, I said I'd do this in an earlier revision. Now that it is clear this
> won't go through Andrew's tree, applied to rdma for-next

Jason,

This patch broke RDMA completely.
The change from get_user_pages() to get_user_pages_fast() causes an endless
amount of splats due to the combination of the following code:

struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
			    size_t size, int access)
...
	if (!umem->writable)
		gup_flags |= FOLL_FORCE;

and

int get_user_pages_fast(unsigned long start, int nr_pages,
			unsigned int gup_flags, struct page **pages)
{
	unsigned long addr, len, end;
	int nr = 0, ret = 0;

	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
		return -EINVAL;

i.e. ib_umem_get() sets FOLL_FORCE for non-writable mappings, which the new
get_user_pages_fast() path rejects with a WARN:

[   72.623266] [ cut here ]
[   72.624143] WARNING: CPU: 1 PID: 2557 at mm/gup.c:2404 
get_user_pages_fast+0x115/0x180
[   72.625426] Modules linked in: xt_MASQUERADE iptable_nat nf_nat
   br_netfilter overlay ib_srp scsi_transport_srp rpcrdma rdma_ucm
   ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_umad iw_cm ib_ipoib
   ib_cm mlx5_ib ib_uverbs ib_core mlx5_core mlxfw ptp pps_core
[   72.629024] CPU: 1 PID: 2557 Comm: python3 Not tainted 
5.4.0-rc5-J6120-Geeb831ec5b80 #1
[   72.630435] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1ubuntu1 04/01/2014
[   72.631973] RIP: 0010:get_user_pages_fast+0x115/0x180
[   72.632873] Code: 8d 0c 10 85 c0 89 d0 0f 49 c1 eb 97 fa 4c 8d
   44 24 0c 4c 89 e9 44 89 e2 48 89 df e8 25 dd ff ff fb 8b 44 24 0c
   e9 75 ff ff ff <0f> 0b b8 ea ff ff ff e9 6d ff ff ff 65 4c 8b 2c
   25 00 5d 01 00 49
[   72.635967] RSP: 0018:c9edf930 EFLAGS: 00010202
[   72.636896] RAX:  RBX: 7f931c392000 RCX: 8883f4909000
[   72.638117] RDX: 00010011 RSI: 0001 RDI: 7f931c392000
[   72.639353] RBP: 7f931c392000 R08: 0020 R09: 8883f4909000
[   72.640602] R10:  R11: 8884068d6760 R12: 888409e45ba0
[   72.641858] R13: 8884068d6760 R14: 1600 R15: 8883f4909000
[   72.643115] FS:  7f9323754700() GS:88842fa8() 
knlGS:
[   72.644586] CS:  0010 DS:  ES:  CR0: 80050033
[   72.645628] CR2: 7f931c3ca000 CR3: 0003f83d6006 CR4: 007606a0
[   72.646875] DR0:  DR1:  DR2: 
[   72.648120] DR3:  DR6: fffe0ff0 DR7: 0400
[   72.649374] PKRU: 5554
[   72.649935] Call Trace:
[   72.650477]  ib_umem_get+0x298/0x550 [ib_uverbs]
[   72.651347]  mlx5_ib_db_map_user+0xad/0x130 [mlx5_ib]
[   72.652279]  mlx5_ib_create_cq+0x1e8/0xaa0 [mlx5_ib]
[   72.653207]  create_cq+0x1c8/0x2d0 [ib_uverbs]
[   72.654050]  ib_uverbs_create_cq+0x70/0xa0 [ib_uverbs]
[   72.654988]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 
[ib_uverbs]
[   72.656331]  ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs]
[   72.657398]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
[   72.658423]  ? kvm_clock_get_cycles+0xd/0x10
[   72.659229]  ? kmem_cache_alloc+0x176/0x1c0
[   72.660025]  ? filemap_map_pages+0x18c/0x350
[   72.660838]  ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs]
[   72.661756]  do_vfs_ioctl+0xa1/0x610
[   72.662458]  ksys_ioctl+0x70/0x80
[   72.663119]  __x64_sys_ioctl+0x16/0x20
[   72.663850]  do_syscall_64+0x42/0x110
[   72.664562]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   72.665492] RIP: 0033:0x7f9332958267
[   72.666185] Code: b3 66 90 48
8b 05 19 3c 2c 00 64 c7 00 26 00
00 00 48 c7 c0 ff ff ff ff c3 66
2e 0f 1f 84 00 00 00 00 00 b8 10
00 00 00 0f 05 <48> 3d 01 f0 ff
ff 73 01 c3 48 8b 0d e9 3b 2c 00
f7 d8 64 89 01 48
[   72.669347] RSP: 002b:7f9323751928 EFLAGS: 0246 ORIG_RAX: 
0010
[   72.670706] RAX: ffda RBX: 7f93237519b8 RCX: 7f9332958267
[   72.671937] RDX: 7f93237519a0 RSI: c0181b01 RDI: 0008
[   72.673176] RBP: 7f9323751980 R08: 0005 R09: 7f931c3ffef0
[   72.674415] R10: 1000 R11: 0246 R12: 7f931c400030
[   
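
The combination quoted above, reduced to its core: read-only umems get
FOLL_FORCE, while the fast path of that era accepted only
FOLL_WRITE | FOLL_LONGTERM, so the WARN fires on every read-only
registration. A compilable illustration (flag values copied from 5.4-era
headers, shown here only for the demonstration):

#include <stdio.h>

#define FOLL_WRITE	0x01
#define FOLL_FORCE	0x10
#define FOLL_LONGTERM	0x10000

int main(void)
{
	unsigned int gup_flags = FOLL_LONGTERM;	/* read-only MR: no FOLL_WRITE... */

	gup_flags |= FOLL_FORCE;		/* ...so ib_umem_get() adds FOLL_FORCE */

	if (gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM))
		printf("WARN_ON_ONCE fires: unexpected flags 0x%x\n",
		       gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM));
	return 0;
}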

Re: [PATCH v15 13/17] IB, arm64: untag user pointers in ib_uverbs_(re)reg_mr()

2019-05-07 Thread Leon Romanovsky
On Mon, May 06, 2019 at 04:50:20PM -0300, Jason Gunthorpe wrote:
> On Mon, May 06, 2019 at 06:30:59PM +0200, Andrey Konovalov wrote:
> > This patch is a part of a series that extends arm64 kernel ABI to allow to
> > pass tagged user pointers (with the top byte set to something else other
> > than 0x00) as syscall arguments.
> >
> > ib_uverbs_(re)reg_mr() use provided user pointers for vma lookups (through
> > e.g. mlx4_get_umem_mr()), which can only be done with untagged pointers.
> >
> > Untag user pointers in these functions.
> >
> > Signed-off-by: Andrey Konovalov 
> > ---
> >  drivers/infiniband/core/uverbs_cmd.c | 4 
> >  1 file changed, 4 insertions(+)
>
> I think this is OK.. We should really get it tested though.. Leon?

It can be done after v5.2-rc1.

Thanks

>
> Jason

Re: [RFC PATCH 0/5] cgroup support for GPU devices

2019-05-05 Thread Leon Romanovsky
On Sun, May 05, 2019 at 12:34:16PM -0400, Kenny Ho wrote:
> (sent again.  Not sure why my previous email was just a reply instead
> of reply-all.)
>
> On Sun, May 5, 2019 at 12:05 PM Leon Romanovsky  wrote:
> > We are talking about two different access patterns for this device
> > memory (DM). One is using this device memory (DM), and the second is
> > configuring/limiting it.
> > Usually those actions will be performed by different groups.
> >
> > The first group (programmers) uses a special API [1] through libibverbs [2]
> > without any notion of cgroups or any limitations. The second group
> > (sysadmins) is less interested in application specifics; for them "device
> > memory" means "memory" and not "rdma, nic specific, internal memory".
> Um... I am not sure that answered it, especially in the context of
> cgroup (this is just for my curiosity btw, I don't know much about
> rdma.)  You said sysadmins are less interested in application
> specifics but then how would they make the judgement call on how much
> "device memory" is provisioned to one application/container over
> another (let's say you have 5 cgroups sharing an rdma device)?  What are
> the consequences of under provisioning "device memory" to an
> application?  And if they are all just memory, can a sysadmin
> provision more system memory in place of device memory (like, are they
> interchangeable)?  I guess I am confused because if device memory is
> just memory (not rdma, nic specific) to sysadmins how would they know
> to set the right amount?

One of the immediate usages of this DM that comes to my mind is very
fast spinlocks for MPI applications. In such a case, the amount of DM
will be a property of the network topology in a given MPI cluster.

In this scenario, a precise amount of memory will ensure that all jobs
continue to give maximal performance despite any programmer's error in
DM allocation.

In an under-provisioning scenario, if the application is written correctly,
users will experience higher latency and lower performance due to the PCI
accesses.

Slide 3 in Liran's presentation gives a brief overview of the motivation.

Thanks

>
> Regards,
> Kenny
>
> > [1] ibv_alloc_dm()
> > http://man7.org/linux/man-pages/man3/ibv_alloc_dm.3.html
> > https://www.openfabrics.org/images/2018workshop/presentations/304_LLiss_OnDeviceMemory.pdf
> > [2] https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/
> >
> > >
> > > I think we need to be careful about drawing the line between
> > > duplication and over couplings between subsystems.  I have other
> > > thoughts and concerns and I will try to organize them into a response
> > > in the next few days.
> > >
> > > Regards,
> > > Kenny
> > >
> > >
> > > > >
> > > > > Is AMD interested in collaborating to help shape this framework?
> > > > > It is intended to be device-neutral, so could be leveraged by various
> > > > > types of devices.
> > > > > If you have an alternative solution well underway, then maybe
> > > > > we can work together to merge our efforts into one.
> > > > > In the end, the DRM community is best served with common solution.
> > > > >
> > > > >
> > > > > >
> > > > > >>> and with future work, we could extend to:
> > > > > >>> *  track and control share of GPU time (reuse of cpu/cpuacct)
> > > > > >>> *  apply mask of allowed execution engines (reuse of cpusets)
> > > > > >>>
> > > > > >>> Instead of introducing a new cgroup subsystem for GPU devices, a 
> > > > > >>> new
> > > > > >>> framework is proposed to allow devices to register with existing 
> > > > > >>> cgroup
> > > > > >>> controllers, which creates per-device cgroup_subsys_state within 
> > > > > >>> the
> > > > > >>> cgroup.  This gives device drivers their own private cgroup 
> > > > > >>> controls
> > > > > >>> (such as memory limits or other parameters) to be applied to 
> > > > > >>> device
> > > > > >>> resources instead of host system resources.
> > > > > >>> Device drivers (GPU or other) are then able to reuse the existing 
> > > > > >>> cgroup
> > > > > >>> controls, instead of inventing similar ones.
> > > > > >>>
> > >
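
For reference, a minimal sketch of the device-memory API referenced in [1]
above: allocate DM, copy into it from the host, and expose it through a
zero-based MR so peers can RDMA into it. The 64-byte size and access flags
are arbitrary examples; error handling is trimmed:

#include <infiniband/verbs.h>

int dm_example(struct ibv_context *ctx, struct ibv_pd *pd)
{
	struct ibv_alloc_dm_attr attr = { .length = 64 };	/* arbitrary size */
	struct ibv_dm *dm = ibv_alloc_dm(ctx, &attr);
	char lock[64] = { 0 };
	struct ibv_mr *mr;

	if (!dm)
		return -1;

	/* Host <-> device-memory copies go through dedicated helpers */
	if (ibv_memcpy_to_dm(dm, 0, lock, sizeof(lock)))
		return -1;

	/* DM MRs are registered zero-based, at an offset into the DM region */
	mr = ibv_reg_dm_mr(pd, dm, 0, sizeof(lock),
			   IBV_ACCESS_LOCAL_WRITE |
			   IBV_ACCESS_REMOTE_WRITE |
			   IBV_ACCESS_ZERO_BASED);
	return mr ? 0 : -1;
}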

Re: [RFC PATCH 0/5] cgroup support for GPU devices

2019-05-05 Thread Leon Romanovsky
On Fri, May 03, 2019 at 02:14:33PM -0700, Welty, Brian wrote:
>
> On 5/2/2019 3:48 PM, Kenny Ho wrote:
> > On 5/2/2019 1:34 AM, Leon Romanovsky wrote:
> >> Count us (Mellanox) too: our RDMA devices expose special,
> >> limited-size device memory to users, and we would like to provide
> >> an option to use cgroups to control its exposure.
>
> Hi Leon, great to hear and happy to work with you and RDMA community
> to shape this framework for use by RDMA devices as well.  The intent
> was to support more than GPU devices.
>
> Incidentally, I also wanted to ask about the rdma cgroup controller
> and if there is interest in updating the device registration implemented
> in that controller.  It could use the cgroup_device_register() that is
> proposed here.  But this is perhaps future work, so we can discuss it separately.

I'll try to take a look later this week.

>
>
> > Doesn't RDMA already have a separate cgroup?  Why not implement it there?
> >
>
> Hi Kenny, I can't answer for Leon, but I'm hopeful he agrees with the
> rationale I gave in the cover letter.  Namely, implementing this in the rdma
> controller would mean duplicating existing memcg controls there.

Exactly. I didn't feel comfortable adding the notion of "device memory"
to the RDMA cgroup and postponed that decision to a later point in time.
RDMA operates with verbs objects and all of our user-space API is based around
that concept. In the end, a system administrator would have a hard time
understanding the difference between memcg and RDMA memory.

>
> Is AMD interested in collaborating to help shape this framework?
> It is intended to be device-neutral, so could be leveraged by various
> types of devices.
> If you have an alternative solution well underway, then maybe
> we can work together to merge our efforts into one.
> In the end, the DRM community is best served with common solution.
>
>
> >
> >>> and with future work, we could extend to:
> >>> *  track and control share of GPU time (reuse of cpu/cpuacct)
> >>> *  apply mask of allowed execution engines (reuse of cpusets)
> >>>
> >>> Instead of introducing a new cgroup subsystem for GPU devices, a new
> >>> framework is proposed to allow devices to register with existing cgroup
> >>> controllers, which creates per-device cgroup_subsys_state within the
> >>> cgroup.  This gives device drivers their own private cgroup controls
> >>> (such as memory limits or other parameters) to be applied to device
> >>> resources instead of host system resources.
> >>> Device drivers (GPU or other) are then able to reuse the existing cgroup
> >>> controls, instead of inventing similar ones.
> >>>
> >>> Per-device controls would be exposed in cgroup filesystem as:
> >>> mount/<root>/<subsys>.devices/<dev_name>/<subsys files>
> >>> such as (for example):
> >>> mount/<root>/memory.devices/<dev_name>/memory.max
> >>> mount/<root>/memory.devices/<dev_name>/memory.current
> >>> mount/<root>/cpu.devices/<dev_name>/cpu.stat
> >>> mount/<root>/cpu.devices/<dev_name>/cpu.weight
> >>>
> >>> The drm/i915 patch in this series is based on top of other RFC work [1]
> >>> for i915 device memory support.
> >>>
> >>> AMD [2] and Intel [3] have proposed related work in this area within the
> >>> last few years, listed below as reference.  This new RFC reuses existing
> >>> cgroup controllers and takes a different approach than prior work.
> >>>
> >>> Finally, some potential discussion points for this series:
> >>> * merge proposed .devices into a single devices directory?
> >>> * allow devices to have multiple registrations for subsets of resources?
> >>> * document a 'common charging policy' for device drivers to follow?
> >>>
> >>> [1] https://patchwork.freedesktop.org/series/56683/
> >>> [2] 
> >>> https://lists.freedesktop.org/archives/dri-devel/2018-November/197106.html
> >>> [3] 
> >>> https://lists.freedesktop.org/archives/intel-gfx/2018-January/153156.html
> >>>
> >>>
> >>> Brian Welty (5):
> >>>   cgroup: Add cgroup_subsys per-device registration framework
> >>>   cgroup: Change kernfs_node for directories to store
> >>> cgroup_subsys_state
> >>>   memcg: Add per-device support to memory cgroup subsystem
> >>>   drm: Add memory cgroup registration and DRIVER_CGROUPS feature bit
> >>>   drm/i915: Use memory cgroup for enforcing device memory limit
> >>>
> >>>  drivers/

Re: [RFC PATCH 0/5] cgroup support for GPU devices

2019-05-02 Thread Leon Romanovsky
On Wed, May 01, 2019 at 10:04:33AM -0400, Brian Welty wrote:
> In containerized or virtualized environments, there is a desire to have
> controls in place for resources that can be consumed by users of a GPU
> device.  This RFC patch series proposes a framework for integrating
> use of existing cgroup controllers into device drivers.
> The i915 driver is updated in this series as our primary use case to
> leverage this framework and to serve as an example for discussion.
>
> The patch series enables device drivers to use cgroups to control the
> following resources within a GPU (or other accelerator device):
> *  control allocation of device memory (reuse of memcg)

Count us (Mellanox) too: our RDMA devices expose special,
limited-size device memory to users, and we would like to provide
an option to use cgroups to control its exposure.

> and with future work, we could extend to:
> *  track and control share of GPU time (reuse of cpu/cpuacct)
> *  apply mask of allowed execution engines (reuse of cpusets)
>
> Instead of introducing a new cgroup subsystem for GPU devices, a new
> framework is proposed to allow devices to register with existing cgroup
> controllers, which creates per-device cgroup_subsys_state within the
> cgroup.  This gives device drivers their own private cgroup controls
> (such as memory limits or other parameters) to be applied to device
> resources instead of host system resources.
> Device drivers (GPU or other) are then able to reuse the existing cgroup
> controls, instead of inventing similar ones.
>
> Per-device controls would be exposed in cgroup filesystem as:
> mount/<root>/<subsys>.devices/<dev_name>/<subsys files>
> such as (for example):
> mount/<root>/memory.devices/<dev_name>/memory.max
> mount/<root>/memory.devices/<dev_name>/memory.current
> mount/<root>/cpu.devices/<dev_name>/cpu.stat
> mount/<root>/cpu.devices/<dev_name>/cpu.weight
>
> The drm/i915 patch in this series is based on top of other RFC work [1]
> for i915 device memory support.
>
> AMD [2] and Intel [3] have proposed related work in this area within the
> last few years, listed below as reference.  This new RFC reuses existing
> cgroup controllers and takes a different approach than prior work.
>
> Finally, some potential discussion points for this series:
> * merge proposed .devices into a single devices directory?
> * allow devices to have multiple registrations for subsets of resources?
> * document a 'common charging policy' for device drivers to follow?
>
> [1] https://patchwork.freedesktop.org/series/56683/
> [2] https://lists.freedesktop.org/archives/dri-devel/2018-November/197106.html
> [3] https://lists.freedesktop.org/archives/intel-gfx/2018-January/153156.html
>
>
> Brian Welty (5):
>   cgroup: Add cgroup_subsys per-device registration framework
>   cgroup: Change kernfs_node for directories to store
> cgroup_subsys_state
>   memcg: Add per-device support to memory cgroup subsystem
>   drm: Add memory cgroup registration and DRIVER_CGROUPS feature bit
>   drm/i915: Use memory cgroup for enforcing device memory limit
>
>  drivers/gpu/drm/drm_drv.c  |  12 +
>  drivers/gpu/drm/drm_gem.c  |   7 +
>  drivers/gpu/drm/i915/i915_drv.c|   2 +-
>  drivers/gpu/drm/i915/intel_memory_region.c |  24 +-
>  include/drm/drm_device.h   |   3 +
>  include/drm/drm_drv.h  |   8 +
>  include/drm/drm_gem.h  |  11 +
>  include/linux/cgroup-defs.h|  28 ++
>  include/linux/cgroup.h |   3 +
>  include/linux/memcontrol.h |  10 +
>  kernel/cgroup/cgroup-v1.c  |  10 +-
>  kernel/cgroup/cgroup.c | 310 ++---
>  mm/memcontrol.c| 183 +++-
>  13 files changed, 552 insertions(+), 59 deletions(-)
>
> --
> 2.21.0
>
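
To make the proposed layout concrete, a small reader for a per-device
memory.current file. The path follows the layout proposed in this cover
letter and does not exist in mainline; the cgroup name ("mygrp") and device
name ("card0") are invented for the example:

#include <stdio.h>

int main(void)
{
	/* mount/<root>/<subsys>.devices/<dev_name>/<file>, instantiated */
	const char *path =
		"/sys/fs/cgroup/mygrp/memory.devices/card0/memory.current";
	unsigned long long bytes;
	FILE *f = fopen(path, "r");

	if (f && fscanf(f, "%llu", &bytes) == 1)
		printf("device memory charged: %llu bytes\n", bytes);
	if (f)
		fclose(f);
	return 0;
}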

Re: [PATCH v13 16/20] IB/mlx4, arm64: untag user pointers in mlx4_get_umem_mr

2019-04-29 Thread Leon Romanovsky
On Wed, Mar 20, 2019 at 03:51:30PM +0100, Andrey Konovalov wrote:
> This patch is a part of a series that extends arm64 kernel ABI to allow to
> pass tagged user pointers (with the top byte set to something else other
> than 0x00) as syscall arguments.
>
> mlx4_get_umem_mr() uses provided user pointers for vma lookups, which can
> only be done with untagged pointers.
>
> Untag user pointers in this function.
>
> Signed-off-by: Andrey Konovalov 
> ---
>  drivers/infiniband/hw/mlx4/mr.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
> index 395379a480cb..9a35ed2c6a6f 100644
> --- a/drivers/infiniband/hw/mlx4/mr.c
> +++ b/drivers/infiniband/hw/mlx4/mr.c
> @@ -378,6 +378,7 @@ static struct ib_umem *mlx4_get_umem_mr(struct ib_udata *udata, u64 start,
>  	 * again
>  	 */
>  	if (!ib_access_writable(access_flags)) {
> +		unsigned long untagged_start = untagged_addr(start);
>  		struct vm_area_struct *vma;
>
>  		down_read(&current->mm->mmap_sem);
> @@ -386,9 +387,9 @@ static struct ib_umem *mlx4_get_umem_mr(struct ib_udata *udata, u64 start,
>  		 * cover the memory, but for now it requires a single vma to
>  		 * entirely cover the MR to support RO mappings.
>  		 */
> -		vma = find_vma(current->mm, start);
> -		if (vma && vma->vm_end >= start + length &&
> -		    vma->vm_start <= start) {
> +		vma = find_vma(current->mm, untagged_start);
> +		if (vma && vma->vm_end >= untagged_start + length &&
> +		    vma->vm_start <= untagged_start) {
>  			if (vma->vm_flags & VM_WRITE)
>  				access_flags |= IB_ACCESS_LOCAL_WRITE;
>  		} else {
> --

Thanks,
Reviewed-by: Leon Romanovsky 

Interesting, the follow-up question is why mlx4 is the only driver in IB which
needs such code in umem_mr. I'll take a look at it.

Thanks
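
For context, on arm64 with top-byte-ignore the untagging amounts to clearing
the top byte of a user pointer. A simplified sketch of what untagged_addr()
accomplishes; the real kernel macro sign-extends from bit 55 so that kernel
addresses pass through unchanged:

/* Simplified model: valid for user addresses only. */
static inline unsigned long untag_user_addr(unsigned long addr)
{
	return addr & ~(0xffUL << 56);
}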

Re: [PATCH v6 7/8] mm/mmu_notifier: pass down vma and reasons why mmu notifier is happening v2

2019-04-10 Thread Leon Romanovsky
On Wed, Apr 10, 2019 at 04:41:57PM -0700, Ira Weiny wrote:
> On Tue, Mar 26, 2019 at 12:47:46PM -0400, Jerome Glisse wrote:
> > From: Jérôme Glisse 
> >
> > CPU page table updates can happen for many reasons, not only as a result
> > of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also
> > as a result of kernel activities (memory compression, reclaim, migration,
> > ...).
> >
> > Users of the mmu notifier API track changes to the CPU page table and take
> > specific actions for them. The current API only provides the range of
> > virtual addresses affected by the change, not why the change is happening.
> >
> > This patch just passes down the new information by adding it to the
> > mmu_notifier_range structure.
> >
> > Changes since v1:
> > - Initialize flags field from mmu_notifier_range_init() arguments
> >
> > Signed-off-by: Jérôme Glisse 
> > Cc: Andrew Morton 
> > Cc: linux...@kvack.org
> > Cc: Christian König 
> > Cc: Joonas Lahtinen 
> > Cc: Jani Nikula 
> > Cc: Rodrigo Vivi 
> > Cc: Jan Kara 
> > Cc: Andrea Arcangeli 
> > Cc: Peter Xu 
> > Cc: Felix Kuehling 
> > Cc: Jason Gunthorpe 
> > Cc: Ross Zwisler 
> > Cc: Dan Williams 
> > Cc: Paolo Bonzini 
> > Cc: Radim Krčmář 
> > Cc: Michal Hocko 
> > Cc: Christian Koenig 
> > Cc: Ralph Campbell 
> > Cc: John Hubbard 
> > Cc: k...@vger.kernel.org
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: linux-r...@vger.kernel.org
> > Cc: Arnd Bergmann 
> > ---
> >  include/linux/mmu_notifier.h | 6 +-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 62f94cd85455..0379956fff23 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -58,10 +58,12 @@ struct mmu_notifier_mm {
> >  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> >
> >  struct mmu_notifier_range {
> > +	struct vm_area_struct *vma;
> >  	struct mm_struct *mm;
> >  	unsigned long start;
> >  	unsigned long end;
> >  	unsigned flags;
> > +	enum mmu_notifier_event event;
> >  };
> >
> >  struct mmu_notifier_ops {
> > @@ -363,10 +365,12 @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *range,
> >  					   unsigned long start,
> >  					   unsigned long end)
> >  {
> > +	range->vma = vma;
> > +	range->event = event;
> >  	range->mm = mm;
> >  	range->start = start;
> >  	range->end = end;
> > -	range->flags = 0;
> > +	range->flags = flags;
>
> Which of the "user patch sets" uses the new flags?
>
> I'm not seeing that user yet.  In general I don't see anything wrong with the
> series and I like the idea of telling drivers why the invalidate has fired.
>
> But are the flags a future feature?

It seems that it is used in the HMM ODP patch.
https://patchwork.kernel.org/patch/10894281/

Thanks

>
> For the series:
>
> Reviewed-by: Ira Weiny 
>
> Ira
>
> >  }
> >
> >  #define ptep_clear_flush_young_notify(__vma, __address, __ptep)
> > \
> > --
> > 2.20.1
> >
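
A sketch of how a notifier consumer could use the new fields once this lands.
The event values come from the series; the driver context and callback body
are hypothetical:

/* Hypothetical driver callback filtering on the new event field. */
static int drv_invalidate_range_start(struct mmu_notifier *mn,
				      const struct mmu_notifier_range *range)
{
	/* A read-only mirror does not need to tear anything down for
	 * write-protection events, so skip them cheaply. */
	if (range->event == MMU_NOTIFY_PROTECTION_PAGE)
		return 0;

	drv_mirror_invalidate(mn, range->start, range->end); /* hypothetical */
	return 0;
}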



Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-17 Thread Leon Romanovsky
On Mon, Jul 16, 2018 at 04:12:49PM -0700, Andrew Morton wrote:
> On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko  wrote:
>
> > From: Michal Hocko 
> >
> > There are several blockable mmu notifiers which might sleep in
> > mmu_notifier_invalidate_range_start, and that is a problem for the
> > oom_reaper because it needs to guarantee forward progress, so it cannot
> > depend on any sleepable locks.
> >
> > Currently we simply back off and mark an oom victim with blockable mmu
> > notifiers as done after a short sleep. That can result in selecting a
> > new oom victim prematurely because the previous one still hasn't torn
> > its memory down yet.
> >
> > We can do much better though. Even if mmu notifiers use sleepable locks
> > there is no reason to automatically assume those locks are held.
> > Moreover majority of notifiers only care about a portion of the address
> > space and there is absolutely zero reason to fail when we are unmapping an
> > unrelated range. Many notifiers do really block and wait for HW which is
> > harder to handle and we have to bail out though.
> >
> > This patch handles the low-hanging fruit. 
> > __mmu_notifier_invalidate_range_start
> > gets a blockable flag and callbacks are not allowed to sleep if the
> > flag is set to false. This is achieved by using trylock instead of the
> > sleepable lock for most callbacks and continue as long as we do not
> > block down the call chain.
>
> I assume device driver developers are wondering "what does this mean
> for me".  As I understand it, the only time they will see
> blockable==false is when their driver is being called in response to an
> out-of-memory condition, yes?  So it is a very rare thing.

I can't say for everyone, but at least for me (mlx5), it is not a rare event.
I see OOM very often while running my tests in low-memory VMs.

Thanks

>
> Any suggestions regarding how the driver developers can test this code
> path?  I don't think we presently have a way to fake an oom-killing
> event?  Perhaps we should add such a thing, given the problems we're
> having with that feature.




Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-16 Thread Leon Romanovsky
On Tue, Jul 10, 2018 at 04:14:10PM +0200, Michal Hocko wrote:
> On Tue 10-07-18 16:40:40, Leon Romanovsky wrote:
> > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote:
> > > On Wed 27-06-18 09:44:21, Michal Hocko wrote:
> > > > This is the v2 of RFC based on the feedback I've received so far. The
> > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> > > > because I have no idea how.
> > > >
> > > > Any further feedback is highly appreciated of course.
> > >
> > > Any other feedback before I post this as non-RFC?
> >
> > From the mlx5 perspective, which is the primary user of umem_odp.c, your
> > change looks OK.
>
> Can I assume your Acked-by?
>
> Thanks for your review!

For mlx and umem_odp pieces,
Acked-by: Leon Romanovsky 

Thanks




Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-11 Thread Leon Romanovsky
On Wed, Jul 11, 2018 at 01:13:18PM +0200, Michal Hocko wrote:
> On Wed 11-07-18 13:14:47, Leon Romanovsky wrote:
> > On Wed, Jul 11, 2018 at 11:03:53AM +0200, Michal Hocko wrote:
> > > On Tue 10-07-18 19:20:20, Leon Romanovsky wrote:
> > > > On Tue, Jul 10, 2018 at 04:14:10PM +0200, Michal Hocko wrote:
> > > > > On Tue 10-07-18 16:40:40, Leon Romanovsky wrote:
> > > > > > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote:
> > > > > > > On Wed 27-06-18 09:44:21, Michal Hocko wrote:
> > > > > > > > This is the v2 of RFC based on the feedback I've received so 
> > > > > > > > far. The
> > > > > > > > code even compiles as a bonus ;) I haven't runtime tested it 
> > > > > > > > yet, mostly
> > > > > > > > because I have no idea how.
> > > > > > > >
> > > > > > > > Any further feedback is highly appreciated of course.
> > > > > > >
> > > > > > > Any other feedback before I post this as non-RFC?
> > > > > >
> > > > > > From the mlx5 perspective, which is the primary user of
> > > > > > umem_odp.c, your change looks OK.
> > > > >
> > > > > Can I assume your Acked-by?
> > > >
> > > > I didn't have a chance to test it because it applies on our rdma-next, 
> > > > but
> > > > fails to compile.
> > >
> > > What is the compilation problem? Is it caused by the patch or some other
> > > unrelated changed?
> >
Thanks for pushing me to take a look at it.
Your patch needs the following hunk to compile properly, at least on my
system.
>
> I suspect you were trying the original version. I've posted an updated
> patch here http://lkml.kernel.org/r/20180627074421.gf32...@dhcp22.suse.cz
> and all these issues should be fixed there. Including many other fixes.
>

Ohh, you used --reply-to; IMHO it is the best way to make sure that the
patch will be lost :)

> Could you have a look at that one please?

I grabbed it; the results will only be ready overnight.

Thanks

> --
> Michal Hocko
> SUSE Labs




Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-11 Thread Leon Romanovsky
On Wed, Jul 11, 2018 at 11:03:53AM +0200, Michal Hocko wrote:
> On Tue 10-07-18 19:20:20, Leon Romanovsky wrote:
> > On Tue, Jul 10, 2018 at 04:14:10PM +0200, Michal Hocko wrote:
> > > On Tue 10-07-18 16:40:40, Leon Romanovsky wrote:
> > > > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote:
> > > > > On Wed 27-06-18 09:44:21, Michal Hocko wrote:
> > > > > > This is the v2 of RFC based on the feedback I've received so far. 
> > > > > > The
> > > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, 
> > > > > > mostly
> > > > > > because I have no idea how.
> > > > > >
> > > > > > Any further feedback is highly appreciated of course.
> > > > >
> > > > > Any other feedback before I post this as non-RFC?
> > > >
> > > > From the mlx5 perspective, which is the primary user of umem_odp.c,
> > > > your change looks OK.
> > >
> > > Can I assume your Acked-by?
> >
> > I didn't have a chance to test it because it applies on our rdma-next, but
> > fails to compile.
>
> What is the compilation problem? Is it caused by the patch or some other
> unrelated changed?

Thanks for pushing me to take a look at it.
Your patch needs the following hunk to compile properly, at least on my system.

I'll run it through our regression tests.

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 369867501bed..1f364a157097 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -155,9 +155,9 @@ struct mmu_notifier_ops {
 * cannot block, mmu_notifier_ops.flags should have
 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
 */
-	void (*invalidate_range_start)(struct mmu_notifier *mn,
+	int (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end);
+				       unsigned long start, unsigned long end, bool blockable);
void (*invalidate_range_end)(struct mmu_notifier *mn,
 struct mm_struct *mm,
 unsigned long start, unsigned long end);
@@ -229,7 +229,7 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
 unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
  unsigned long address, pte_t pte);
-extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+extern int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
  unsigned long start, unsigned long end,
  bool blockable);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 6adac113e96d..92f70e4c6252 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -95,7 +95,7 @@ static inline int check_stable_address_space(struct mm_struct *mm)
return 0;
 }

-void __oom_reap_task_mm(struct mm_struct *mm);
+bool __oom_reap_task_mm(struct mm_struct *mm);

 extern unsigned long oom_badness(struct task_struct *p,
struct mem_cgroup *memcg, const nodemask_t *nodemask,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7e0c6e78ae5c..7c7bd6f3298e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1,6 +1,6 @@
 /*
  *  linux/mm/oom_kill.c
- *
+ *
  *  Copyright (C)  1998,2000  Rik van Riel
  * Thanks go out to Claus Fischer for some serious inspiration and
  * for goading me into coding this file...
@@ -569,7 +569,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	if (!__oom_reap_task_mm(mm)) {
 		up_read(&mm->mmap_sem);
 		ret = false;
-		goto out_unlock;
+		goto unlock_oom;
 	}

 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",

Thanks

> --
> Michal Hocko
> SUSE Labs
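
The pattern the series asks driver notifiers to adopt, sketched with
hypothetical driver types: take the lock when blocking is allowed, otherwise
trylock and report -EAGAIN so a non-blocking caller such as the OOM reaper
can back off:

static int drv_invalidate_range_start(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start, unsigned long end,
				      bool blockable)
{
	struct drv_ctx *ctx = container_of(mn, struct drv_ctx, mn);

	if (blockable)
		mutex_lock(&ctx->lock);
	else if (!mutex_trylock(&ctx->lock))
		return -EAGAIN;	/* cannot make progress without sleeping */

	drv_unmap_range(ctx, start, end); /* must not sleep when !blockable */
	mutex_unlock(&ctx->lock);
	return 0;
}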




Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-10 Thread Leon Romanovsky
On Tue, Jul 10, 2018 at 04:14:10PM +0200, Michal Hocko wrote:
> On Tue 10-07-18 16:40:40, Leon Romanovsky wrote:
> > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote:
> > > On Wed 27-06-18 09:44:21, Michal Hocko wrote:
> > > > This is the v2 of RFC based on the feedback I've received so far. The
> > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> > > > because I have no idea how.
> > > >
> > > > Any further feedback is highly appreciated of course.
> > >
> > > Any other feedback before I post this as non-RFC?
> >
> > From the mlx5 perspective, which is the primary user of umem_odp.c, your
> > change looks OK.
>
> Can I assume your Acked-by?

I didn't have a chance to test it because it applies on our rdma-next, but
fails to compile.

Thanks

>
> Thanks for your review!
> --
> Michal Hocko
> SUSE Labs
>




Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-10 Thread Leon Romanovsky
On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote:
> On Wed 27-06-18 09:44:21, Michal Hocko wrote:
> > This is the v2 of RFC based on the feedback I've received so far. The
> > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> > because I have no idea how.
> >
> > Any further feedback is highly appreciated of course.
>
> Any other feedback before I post this as non-RFC?

From the mlx5 perspective, which is the primary user of umem_odp.c, your change looks OK.

Thanks




Re: [PATCH rdma-next 01/21] drm/i915: Move u64-to-ptr helpers to general header

2018-05-15 Thread Leon Romanovsky
On Mon, May 14, 2018 at 02:10:54PM -0600, Jason Gunthorpe wrote:
> On Thu, May 03, 2018 at 04:36:55PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leo...@mellanox.com>
> >
> > The macro u64_to_ptr() and function ptr_to_u64() are useful enough
> > to be part of a general header, so move them there and allow the RDMA
> > subsystem to reuse them.
> >
> > Signed-off-by: Leon Romanovsky <leo...@mellanox.com>
> > ---
> >  drivers/gpu/drm/i915/i915_utils.h | 12 ++--
> >  include/linux/kernel.h| 12 
> >  2 files changed, 14 insertions(+), 10 deletions(-)
>
> This patch and the next one to kernel.h will need to be Ack'd by
> someone.. But I am not sure who.. AndrewM perhaps?

Why is that?

kernel.h doesn't have an explicit maintainer and is managed by
community effort.

Thanks

>
> Jason




[PATCH rdma-next 01/21] drm/i915: Move u64-to-ptr helpers to general header

2018-05-03 Thread Leon Romanovsky
From: Leon Romanovsky <leo...@mellanox.com>

The macro u64_to_ptr() and function ptr_to_u64() are useful enough
to be part of a general header, so move them there and allow the RDMA
subsystem to reuse them.

Signed-off-by: Leon Romanovsky <leo...@mellanox.com>
---
 drivers/gpu/drm/i915/i915_utils.h | 12 ++--
 include/linux/kernel.h| 12 
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_utils.h b/drivers/gpu/drm/i915/i915_utils.h
index 51dbfe5bb418..de3bfda7bf96 100644
--- a/drivers/gpu/drm/i915/i915_utils.h
+++ b/drivers/gpu/drm/i915/i915_utils.h
@@ -25,6 +25,8 @@
 #ifndef __I915_UTILS_H
 #define __I915_UTILS_H

+#include <linux/kernel.h>
+
 #undef WARN_ON
 /* Many gcc seem to no see through this and fall over :( */
 #if 0
@@ -102,16 +104,6 @@
__T;\
 })

-static inline u64 ptr_to_u64(const void *ptr)
-{
-   return (uintptr_t)ptr;
-}
-
-#define u64_to_ptr(T, x) ({\
-   typecheck(u64, x);  \
-   (T *)(uintptr_t)(x);\
-})
-
 #define __mask_next_bit(mask) ({   \
int __idx = ffs(mask) - 1;  \
mask &= ~BIT(__idx);\
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 6a1eb0b0aad9..a738393c9694 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -70,6 +70,18 @@
  */
 #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr))

+static inline u64 ptr_to_u64(const void *ptr)
+{
+   return (uintptr_t)ptr;
+}
+
+#define u64_to_ptr(T, x) ( \
+{  \
+   typecheck(u64, x);  \
+   (T *)(uintptr_t)(x);\
+}  \
+)
+
 #define u64_to_user_ptr(x) (   \
 {  \
typecheck(u64, x);  \
--
2.14.3
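
A usage sketch for the two helpers being moved, with an invented handle type.
This is the round-trip they exist for, e.g. stashing a kernel pointer in a
u64 uAPI field:

struct foo { int x; };	/* invented example type */

static u64 foo_to_handle(struct foo *f)
{
	return ptr_to_u64(f);
}

static struct foo *handle_to_foo(u64 handle)
{
	/* typecheck()ed cast back; only valid for handles created above */
	return u64_to_ptr(struct foo, handle);
}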



Re: [PATCH 00/12] radix-tree: split out struct radix_tree_root out to <linux/radix-tree-root.h>

2017-10-09 Thread Leon Romanovsky
On Mon, Oct 09, 2017 at 01:10:01AM +0900, Masahiro Yamada wrote:

<...>
>
> By splitting out the radix_tree_root definition,
> we can reduce the header file dependency.
>
> Reducing the header dependency will help speed up the kernel
> build, suppressing unnecessary recompiles of objects during
> git-bisect'ing, etc.

If we judge by the diffstat of this series, there won't be any
visible change in anything mentioned above.

<...>

>
> Masahiro Yamada (12):
>   radix-tree: replace <linux/bug.h> with <asm/bug.h>
>   radix-tree: split struct radix_tree_root to <linux/radix-tree-root.h>
>   irqdomain: replace <linux/radix-tree.h> with <linux/radix-tree-root.h>
>   writeback: replace <linux/radix-tree.h> with <linux/radix-tree-root.h>
>   iocontext.h: replace <linux/radix-tree.h> with <linux/radix-tree-root.h>
>   fs: replace <linux/radix-tree.h> with <linux/radix-tree-root.h>
>   blkcg: replace <linux/radix-tree.h> with <linux/radix-tree-root.h>
>   fscache: include <linux/radix-tree-root.h>
>   sh: intc: replace <linux/radix-tree.h> with <linux/radix-tree-root.h>
>   net/mlx4: replace <linux/radix-tree.h> with <linux/radix-tree-root.h>
>   net/mlx5: replace <linux/radix-tree.h> with <linux/radix-tree-root.h>
>   drm/i915: replace <linux/radix-tree.h> with <linux/radix-tree-root.h>
>
>  drivers/gpu/drm/i915/i915_gem.c|  1 +
>  drivers/gpu/drm/i915/i915_gem_context.c|  1 +
>  drivers/gpu/drm/i915/i915_gem_context.h|  2 +-
>  drivers/gpu/drm/i915/i915_gem_execbuffer.c |  1 +
>  drivers/gpu/drm/i915/i915_gem_object.h |  1 +
>  drivers/net/ethernet/mellanox/mlx4/cq.c|  1 +
>  drivers/net/ethernet/mellanox/mlx4/mlx4.h  |  2 +-
>  drivers/net/ethernet/mellanox/mlx4/qp.c|  1 +
>  drivers/net/ethernet/mellanox/mlx4/srq.c   |  1 +
>  drivers/sh/intc/internals.h|  2 +-
>  include/linux/backing-dev-defs.h   |  2 +-
>  include/linux/blk-cgroup.h |  2 +-
>  include/linux/fs.h |  2 +-
>  include/linux/fscache.h|  1 +
>  include/linux/iocontext.h  |  2 +-
>  include/linux/irqdomain.h  |  2 +-
>  include/linux/mlx4/device.h|  2 +-
>  include/linux/mlx4/qp.h|  1 +
>  include/linux/mlx5/driver.h|  2 +-
>  include/linux/mlx5/qp.h|  1 +
>  include/linux/radix-tree-root.h| 24 
>  include/linux/radix-tree.h |  8 ++--
>  22 files changed, 46 insertions(+), 16 deletions(-)
>  create mode 100644 include/linux/radix-tree-root.h
>
> --
> 2.7.4
>
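
A sketch of what the split-out header could contain, reconstructed from the
cover letter and the 4.14-era definition of struct radix_tree_root; the real
patch may differ in details:

#ifndef _LINUX_RADIX_TREE_ROOT_H
#define _LINUX_RADIX_TREE_ROOT_H

#include <linux/types.h>	/* gfp_t */

struct radix_tree_node;		/* only a pointer is embedded here */

struct radix_tree_root {
	gfp_t			gfp_mask;
	struct radix_tree_node	__rcu *rnode;
};

#endif /* _LINUX_RADIX_TREE_ROOT_H */

Headers that merely embed a root (fs.h, blk-cgroup.h, the mlx4/mlx5 headers,
etc.) can then include this instead of the full <linux/radix-tree.h>.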




Re: [PATCH 00/12] radix-tree: split out struct radix_tree_root out to <linux/radix-tree-root.h>

2017-10-09 Thread Leon Romanovsky
On Mon, Oct 09, 2017 at 02:58:58PM +0900, Masahiro Yamada wrote:
> 2017-10-09 3:52 GMT+09:00 Leon Romanovsky <leo...@mellanox.com>:
> > On Mon, Oct 09, 2017 at 01:10:01AM +0900, Masahiro Yamada wrote:
> >
> > <...>
> >>
> >> By splitting out the radix_tree_root definition,
> >> we can reduce the header file dependency.
> >>
> >> Reducing the header dependency will help for speeding the kernel
> >> build, suppressing unnecessary recompile of objects during
> >> git-bisect'ing, etc.
> >
> > If we judge by the diffstat of this series, there won't be any
> > visible change in anything mentioned above.
>
>
> Of course, judging by the diffstat is wrong.
>

I'm more than happy to be proven wrong, and you can surely help me.
Can you provide any quantitative support for your claims?

Thanks

>
>
> --
> Best Regards
> Masahiro Yamada

