Hi Eric, thanks for your patch.

On 13/03/2017 2:58 AM, Eric Dumazet wrote:
When adding order-0 page allocations and page recycling in the receive path,
I introduced issues on PowerPC, or more generally on arches with large pages.

A GRO packet, aggregating 45 segments, ended up using 45 page frags
on 45 different pages. Before my changes we were very likely packing
up to 42 Ethernet frames per 64KB page.
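(For reference, assuming the usual ~1536-byte frag stride for a 1500-byte MTU:
65536 / 1536 is about 42, which is where the 42-frames-per-page figure comes from.)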

1) At skb freeing time, the put_page() calls on the skb frags now touch 45
   different 'struct page' instances, which adds more cache line misses.
   Too bad that standard Ethernet MTU is so small :/

2) Using one order-0 page per ring slot consumes ~42 times more memory
   on PowerPC.

3) Allocating order-0 pages is very likely to use pages from physically
   scattered locations, increasing TLB pressure on hosts with more
   than 256 GB of memory after days of uptime.

This patch uses a refined strategy, addressing these points.

We still use order-0 pages, but the page recycling technique is modified
so that we have better chances to lower the number of pages containing the
frags for a given GRO skb (a factor of 2 on x86, and 21 on PowerPC).

Page allocations are split into two halves:
- One currently visible by the NIC for DMA operations.
- The other contains pages that were already attached to old skbs, held in
  a quarantine.

When we receive a frame, we look at the oldest entry in the pool and
check if the page count is back to one, meaning old skbs/frags were
consumed and the page can be recycled.
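Just to check my understanding of the recycling step, here is a minimal sketch
of how I read it (the names rx_pool_entry and rx_recycle_or_alloc are mine,
not identifiers from the patch):

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/page_ref.h>

struct rx_pool_entry {
	struct page *page;
	dma_addr_t   dma;
};

/*
 * Hypothetical helper, not the patch code: the oldest quarantined page
 * is reusable only once its refcount has dropped back to one, i.e. all
 * skb frags that referenced it have been freed.
 */
static struct page *rx_recycle_or_alloc(struct rx_pool_entry *oldest)
{
	if (oldest->page && page_ref_count(oldest->page) == 1)
		return oldest->page;	/* safe to hand back to the NIC */

	/* Old skbs still hold references: take a fresh order-0 page. */
	return alloc_page(GFP_ATOMIC | __GFP_NOWARN);
}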

Page allocations are attempted with high-order pages first, trying
to lower TLB pressure.

I think MM-list people won't be happy with this.
We were doing a similar thing with order-5 pages in mlx5 Striding RQ:
Allocate and split high-order pages, to have:
- Physically contiguous memory,
- Fewer page allocations,
- Yet, fine-grained refcounts/truesize.
In case no high-order page is available, we fall back to using order-0 pages.

However, we changed this behavior, as it was fragmenting memory and quickly depleting the available high-order pages.
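
For reference, the scheme we moved away from looked roughly like this (a
simplified sketch, not the actual mlx5 code; names like alloc_striding_pages
and STRIDING_RQ_ORDER are hypothetical):

#include <linux/gfp.h>
#include <linux/mm.h>

#define STRIDING_RQ_ORDER 5	/* hypothetical name for the order-5 choice */

/*
 * Illustrative only: try one order-5 allocation; if it fails, fall back
 * to a single order-0 page. The caller splits the high-order block into
 * frags and keeps per-frag refcounts, so truesize stays fine grained.
 */
static struct page *alloc_striding_pages(unsigned int *order)
{
	struct page *page;

	page = alloc_pages(GFP_ATOMIC | __GFP_NOWARN |
			   __GFP_NOMEMALLOC | __GFP_NORETRY,
			   STRIDING_RQ_ORDER);
	if (page) {
		*order = STRIDING_RQ_ORDER;
		return page;
	}

	/* No high-order page available: order-0 fallback. */
	*order = 0;
	return alloc_page(GFP_ATOMIC | __GFP_NOWARN);
}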


On x86, memory allocations stay the same (one page per RX slot for MTU=1500).
But on PowerPC, this patch considerably reduces the allocated memory.

Performance gain on PowerPC is about 50% for a single TCP flow.

Nice!


On x86, I could not measure the difference, my test machine being
limited by the sender (33 Gbit per TCP flow).
22 fewer cache line misses per 64 KB GRO packet is probably on the order
of 2% or so.

Signed-off-by: Eric Dumazet <eduma...@google.com>
Cc: Tariq Toukan <tar...@mellanox.com>
Cc: Saeed Mahameed <sae...@mellanox.com>
Cc: Alexander Duyck <alexander.du...@gmail.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 462 +++++++++++++++------------
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   |  15 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  54 +++-
 3 files changed, 310 insertions(+), 221 deletions(-)


Thanks,
Tariq
