On 2021/4/1 5:58 PM, Eric Dumazet wrote:
On Thu, Apr 1, 2021 at 11:04 AM Xuan Zhuo <xuanz...@linux.alibaba.com> wrote:
On Thu, 1 Apr 2021 15:14:18 +0800, Jason Wang <jasow...@redhat.com> wrote:
On 2021/3/31 4:11 PM, Michael S. Tsirkin wrote:
On Mon, Mar 29, 2021 at 11:06:09AM +0200, Eric Dumazet wrote:
On Mon, Mar 29, 2021 at 10:52 AM Xuan Zhuo <xuanz...@linux.alibaba.com> wrote:
On Wed, 13 Jan 2021 08:18:19 -0800, Eric Dumazet <eric.duma...@gmail.com> wrote:
From: Eric Dumazet <eduma...@google.com>

Both virtio net and napi_get_frags() allocate skbs
with a very small skb->head

While using page fragments instead of a kmalloc backed skb->head might give
a small performance improvement in some cases, there is a huge risk of
underestimating memory usage.

For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
per page (order-3 page in x86), or even 64 on PowerPC

We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
but consuming far more memory for TCP buffers than instructed in tcp_mem[2]

Even if we force napi_alloc_skb() to only use order-0 pages, the issue
would still be there on arches with PAGE_SIZE >= 32768

This patch makes sure that small skb heads are kmalloc backed, so that
other objects in the slab page can be reused instead of being held as long
as skbs are sitting in socket queues.

Note that we might in the future use the sk_buff napi cache,
instead of going through a more expensive __alloc_skb()

Another idea would be to use separate page sizes depending
on the allocated length (to never have more than 4 frags per page)

I would like to thank Greg Thelen for his precious help on this matter;
analysing crash dumps is always a time-consuming task.
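
(Putting rough numbers on this, assuming x86-64 where the napi fragment
allocator carves order-3 compound pages into ~1KB slices for these tiny
heads; the figures are illustrative:)

   32768 bytes per order-3 page / 1024 bytes per slice = 32 slices

One long-lived skb accounts only ~1KB of truesize, yet can pin the whole
32KB compound page: a worst-case 32x under-estimation, which is exactly
the kind of overshoot beyond tcp_mem[2] described above.
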
This patch causes a performance degradation of about 10% in the
virtio-net + GRO scenario.

With this patch, GRO can no longer merge skbs based on frags; it can only
link them via frag_list. The problem this causes is that, compared with
merging packets into frags, freeing the resulting skb has to go through
kfree_skb_list() to release every skb in the chain, resulting in the
performance degradation.

virtio-net copies some data into the linear area after receiving a packet.
Besides the header this includes some payload, so the "headlen <= offset"
test fails. And skb->head_frag is false when kmalloc() is used for the
skb->head allocation.

Thanks for the report.

There is no way we can make things fast both for the existing strategies
used by _insert_your_driver and for hostile workloads where data can sit
for seconds or minutes in socket queues.

I think that if you want to gain this 10% back, you have to change
virtio_net to meet optimal behavior.

Normal drivers make sure to not pull payload in skb->head, only headers.
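
A rough illustration of that pattern (not from any particular driver;
netdev, napi, va, page, offset, len and truesize are assumed to come from
a hypothetical receive path, and 256 is an arbitrary header budget):

	/* Header-split RX sketch: only the protocol headers go into
	 * skb->head, the payload stays in the page fragment.
	 */
	unsigned int hlen = eth_get_headlen(netdev, va, 256);
	struct sk_buff *skb = napi_alloc_skb(napi, hlen);

	if (skb) {
		memcpy(__skb_put(skb, hlen), va, hlen);
		skb_add_rx_frag(skb, 0, page, offset + hlen,
				len - hlen, truesize);
	}
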
Hmm, we do have the hdr_len field, but we seem to ignore it on RX.
Jason, do you see any issues with using it for the head len?

This might work only if the device sets a correct hdr_len. I'm not sure
all of the devices can do this properly. E.g. for tap, we use
skb_headlen() in virtio_net_hdr_from_skb(), which depends heavily on the
behaviour of the underlying layers (device driver or GRO). And we only
set this hint for GSO packets, but virtio-net may try to do GRO for
non-GSO packets.
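
For reference, the path I mean looks roughly like this (simplified from
include/linux/virtio_net.h and elided like the excerpt below, so treat it
as a sketch rather than the exact source):

	int virtio_net_hdr_from_skb(const struct sk_buff *skb,
				    struct virtio_net_hdr *hdr,
				    bool little_endian, ...)
	{
		......

		if (skb_is_gso(skb)) {
			// hdr_len just mirrors whatever the lower layers
			// (driver, GRO) happened to leave linear
			hdr->hdr_len = __cpu_to_virtio16(little_endian,
							 skb_headlen(skb));
			......
		} else {
			hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
			// no hdr_len hint at all for non-GSO packets
		}

		......
	}
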

Thanks
Hi Jason,

I personally prefer to use build_skb() to create the skb, so the problem
here actually goes away.

The premise is that every buffer added by add_recvbuf_mergeable() must
reserve room for a skb_shared_info. Of course, rx frag coalescing then
won't work. But consider: suppose mrg_avg_pkt_len is 1500; we can still
store 17 * 1500 bytes, roughly 25KB of data, in one skb. If packets are
really big, mrg_avg_pkt_len will also increase, and the buffers allocated
later will grow with it. Once mrg_avg_pkt_len is greater than PAGE_SIZE/2,
rx frag coalescing is no longer needed anyway, because we can't allocate
two bufs of size mrg_avg_pkt_len from the same page.
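
(Spelling out the arithmetic, with PAGE_SIZE = 4096 and MAX_SKB_FRAGS = 17
assumed; numbers are illustrative:)

   mrg_avg_pkt_len = 1500      -> 17 * 1500 bytes ~= 25KB of payload per skb
   mrg_avg_pkt_len > 4096 / 2  -> two bufs of that size can no longer share
                                  one page, so coalescing is moot anyway
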

For the record, I implemented build_skb() 10 years ago, so you can trust
me when I say this will not help.

Using build_skb() will waste an additional skb_shared_info per MSS.
That's a 20% increase in memory, for nothing at all.


So I wonder whether something like the following would help. We know the
frag size; that means, if we know there's sufficient tailroom, we can use
build_skb() without reserving dedicated room for the skb_shared_info.
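
A minimal sketch of that idea (illustrative only, not an actual patch;
buf, len and truesize are assumed to come from the mergeable-buffer
receive path):

	/* Use build_skb() only when the fragment already has enough
	 * tailroom for the skb_shared_info; otherwise the caller falls
	 * back to copying into a freshly allocated head.
	 */
	static struct sk_buff *rx_try_build_skb(void *buf, unsigned int len,
						unsigned int truesize)
	{
		unsigned int tailroom = truesize - len;
		struct sk_buff *skb;

		if (tailroom < SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
			return NULL;	/* not enough room, use copy path */

		skb = build_skb(buf, truesize);	/* no copy, no extra alloc */
		if (skb)
			skb_put(skb, len);
		return skb;
	}
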

Thanks



Also there are cases when this won't be possible, say if you use an MTU
of 4000.

Thanks.

Optimal GRO packets are the ones whose payload sits in page fragments.

(I am speaking not only about raw performance, but also about the ability
of systems to cope with network outages and sudden increases of memory
usage in out-of-order queues.)

This has been quite clearly stated in my changelog.

Thanks.


int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
{
          struct skb_shared_info *pinfo, *skbinfo = skb_shinfo(skb);
          unsigned int offset = skb_gro_offset(skb);
          unsigned int headlen = skb_headlen(skb);

      .......

          if (headlen <= offset) {         // virtio-net fails this test
          ........ // merge by frags
                  goto done;
          } else if (skb->head_frag) {     // head_frag is false when kmalloc()
                                           // is used for skb->head allocation
          ........ // merge by frags
                  goto done;
          }

merge:
      ......

          if (NAPI_GRO_CB(p)->last == p)
                  skb_shinfo(p)->frag_list = skb;
          else
                  NAPI_GRO_CB(p)->last->next = skb;

      ......
          return 0;
}


test cmd:
   for i in $(seq 1 4)
   do
      redis-benchmark -r 10000000 -n 10000000 -t set -d 1024 -c 8 -P 32 \
         -h <ip> -p 6379 2>&1 | grep 'per second' &
   done

Reported-by: su-li...@linux.alibaba.com

Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
Signed-off-by: Eric Dumazet <eduma...@google.com>
Cc: Alexander Duyck <alexanderdu...@fb.com>
Cc: Paolo Abeni <pab...@redhat.com>
Cc: Michael S. Tsirkin <m...@redhat.com>
Cc: Greg Thelen <gthe...@google.com>
---
   net/core/skbuff.c | 9 +++++++--
   1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7626a33cce590e530f36167bd096026916131897..3a8f55a43e6964344df464a27b9b1faa0eb804f3 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -501,13 +501,17 @@ EXPORT_SYMBOL(__netdev_alloc_skb);
   struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
                                 gfp_t gfp_mask)
   {
-     struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+     struct napi_alloc_cache *nc;
        struct sk_buff *skb;
        void *data;

        len += NET_SKB_PAD + NET_IP_ALIGN;

-     if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
+     /* If requested length is either too small or too big,
+      * we use kmalloc() for skb->head allocation.
+      */
+     if (len <= SKB_WITH_OVERHEAD(1024) ||
+         len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
            (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
                skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
                if (!skb)
@@ -515,6 +519,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
                goto skb_success;
        }

+     nc = this_cpu_ptr(&napi_alloc_cache);
        len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
        len = SKB_DATA_ALIGN(len);

--
2.30.0.284.gd98b1dd5eaa7-goog
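
For reference, the new bounds work out roughly as follows on x86-64
(assuming SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) is 320 bytes;
the numbers are illustrative):

   SKB_WITH_OVERHEAD(1024)      = 1024 - 320 = 704
   SKB_WITH_OVERHEAD(PAGE_SIZE) = 4096 - 320 = 3776

So requests needing up to ~704 usable bytes (which covers GOOD_COPY_LEN
and GRO_MAX_HEAD) now take the kmalloc() path, and only mid-sized requests
keep using the per-cpu page-fragment cache.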

