Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks

Tariq Toukan Wed, 16 May 2018 00:05:05 -0700


On 15/05/2018 9:53 PM, Qing Huang wrote:



On 5/15/2018 2:19 AM, Tariq Toukan wrote:



On 14/05/2018 7:41 PM, Qing Huang wrote:



On 5/13/2018 2:00 AM, Tariq Toukan wrote:



On 11/05/2018 10:23 PM, Qing Huang wrote:

When a system is under memory presure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk

size, each ICM chunk can only hold 512 mtt entries (8 bytes foreach mtt

entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need
of DMA ops).

Signed-off-by: Qing Huang <[email protected]>
Acked-by: Daniel Jurgens <[email protected]>
Reviewed-by: Zhu Yanjun <[email protected]>
---
v2 -> v1: adjusted chunk size to reflect different architectures.

  drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++++++-------
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.cb/drivers/net/ethernet/mellanox/mlx4/icm.c

index a822f7a..ccb62b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
  #include "fw.h"
    /*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.

+ * We allocate in page size (default 4KB on many archs) chunks toavoid high+ * order memory allocations in fragmented/high usage memorysituation.

   */
  enum {
-    MLX4_ICM_ALLOC_SIZE    = 1 << 18,
-    MLX4_TABLE_CHUNK_SIZE    = 1 << 18
+    MLX4_ICM_ALLOC_SIZE    = 1 << PAGE_SHIFT,
+    MLX4_TABLE_CHUNK_SIZE    = 1 << PAGE_SHIFT


Which is actually PAGE_SIZE.


Yes, we wanted to avoid high order memory allocations.


Then please use PAGE_SIZE instead.

PAGE_SIZE is usually defined as 1 << PAGE_SHIFT. So I think PAGE_SHIFTis actually more appropriate here.


Definition of PAGE_SIZE varies among different archs.
It is not always as simple as 1 << PAGE_SHIFT.
It might be:
PAGE_SIZE (1UL << PAGE_SHIFT)
PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
etc...

Please replace 1 << PAGE_SHIFT with PAGE_SIZE.

Also, please add a comma at the end of the last entry.
Hmm..., followed the existing code style and checkpatch.pl didn'tcomplain about the comma.
I am in favor of having a comma also after the last element, so thatwhen another enum element is added we do not modify this line again,which would falsely affect git blame.
I know it didn't exist before your patch, but once we're here, let'sdo it.
I'm okay either way. If adding an extra comma is preferred by manypeople, someone should update checkpatch.pl to enforce it. :)

I agree.
Until then, please use an extra comma in this patch.

  };
static void mlx4_free_icm_pages(struct mlx4_dev *dev, structmlx4_icm_chunk *chunk)@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev,struct mlx4_icm_table *table,
      obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
      num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
- table->icm = kcalloc(num_icm, sizeof(*table->icm),GFP_KERNEL);
+    table->icm      = vzalloc(num_icm * sizeof(*table->icm));
Why not kvzalloc ?
I think table->icm pointer array doesn't really need physicallycontiguous memory. Sometimes high ordermemory allocation by kmalloc variants may trigger slow path and causetasks to be blocked.
This is control path so it is less latency-sensitive.
Let's not produce unnecessary degradation here, please call kvzallocso we maintain a similar behavior when contiguous memory is available,and a fallback for resiliency.
No sure what exactly degradation is caused by vzalloc here. I think it'sbetter to keep physically contiguous pagesto other requests which really need them. Besides slow path/memcompacting can be really expensive.

Degradation is expected when you replace a contig memory with non-contigmemory, without any perf test.We agree that when contig memory is not available, we should usenon-contig instead of simply failing, and for this you can call kvzalloc.

Thanks,
Qing

      if (!table->icm)
          return -ENOMEM;
      table->virt     = virt;

@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev,struct mlx4_icm_table *table,

              mlx4_free_icm(dev, table->icm[i], use_coherent);
          }
  -    kfree(table->icm);
+    vfree(table->icm);
        return -ENOMEM;
  }

@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev*dev, struct mlx4_icm_table *table)

              mlx4_free_icm(dev, table->icm[i], table->coherent);
          }
  -    kfree(table->icm);
+    vfree(table->icm);
  }


Thanks for your patch.

I need to verify there is no dramatic performance degradation here.
You can prepare and send a v3 in the meanwhile.

Thanks,
Tariq
--

To unsubscribe from this list: send the line "unsubscribelinux-rdma" in

the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks

Reply via email to