Jann, it looks like that change of yours had a rather big negative impact
on this load.
On Sun, Feb 28, 2021 at 11:49 PM kernel test robot <[email protected]> wrote:
>
> FYI, we noticed a -47.9% regression of stress-ng.rawpkt.ops_per_sec due to
> commit:

Looking at the profile, nothing really obvious stands out, although
some of the numbers imply more polling, and less waiting, ie:

Lots less context switching:

>     12810 ± 17%    -63.6%       4658 ±  8%  sched_debug.cpu.nr_switches.avg

Less time spent sending packets:

>     68.52            -5.0      63.55        perf-profile.calltrace.cycles-pp.__dev_queue_xmit.packet_snd.sock_sendmsg.__sys_sendto.__x64_sys_sendto
>     69.00            -4.9      64.14        perf-profile.calltrace.cycles-pp.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe

and quite a lot more time spent in what looks like the receive path,
which allocates the packets:

>      5.47 ± 12%      +4.5       9.94 ±  2%  perf-profile.calltrace.cycles-pp.kmem_cache_alloc.skb_clone.packet_rcv.__netif_receive_skb_core.__netif_receive_skb_one_core
>      5.46 ± 12%      +4.5       9.92 ±  2%  perf-profile.calltrace.cycles-pp.skb_clone.packet_rcv.dev_queue_xmit_nit.dev_hard_start_xmit.__dev_queue_xmit
>      5.37 ± 12%      +4.5       9.88 ±  2%  perf-profile.calltrace.cycles-pp.kmem_cache_alloc.skb_clone.packet_rcv.dev_queue_xmit_nit.dev_hard_start_xmit
>      5.18 ± 13%      +4.5       9.71 ±  2%  perf-profile.calltrace.cycles-pp.__slab_alloc.kmem_cache_alloc.skb_clone.packet_rcv.__netif_receive_skb_core
>      5.08 ± 13%      +4.6       9.66 ±  2%  perf-profile.calltrace.cycles-pp.__slab_alloc.kmem_cache_alloc.skb_clone.packet_rcv.dev_queue_xmit_nit
>     29.57 ±  2%      +5.5      35.12        perf-profile.calltrace.cycles-pp.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
>     29.48 ±  2%      +5.6      35.08        perf-profile.calltrace.cycles-pp.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
>     26.62 ±  2%      +7.1      33.67        perf-profile.calltrace.cycles-pp.packet_recvmsg.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      8.83 ± 15%      +8.8      17.66 ±  2%  perf-profile.calltrace.cycles-pp._raw_spin_lock.get_partial_node.___slab_alloc.__slab_alloc.kmem_cache_alloc
>      8.58 ± 15%      +8.9      17.48 ±  2%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.get_partial_node.___slab_alloc.__slab_alloc
>      9.46 ± 14%      +9.0      18.44 ±  2%  perf-profile.calltrace.cycles-pp.get_partial_node.___slab_alloc.__slab_alloc.kmem_cache_alloc.skb_clone
>     10.19 ± 13%      +9.1      19.28 ±  2%  perf-profile.calltrace.cycles-pp.___slab_alloc.__slab_alloc.kmem_cache_alloc.skb_clone.packet_rcv
>     14.31 ±  6%     +14.0      28.27 ±  3%  perf-profile.calltrace.cycles-pp.skb_free_datagram.packet_recvmsg.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64
>      9.50 ± 11%     +16.7      26.21 ±  4%  perf-profile.calltrace.cycles-pp.kmem_cache_free.skb_free_datagram.packet_recvmsg.__sys_recvfrom.__x64_sys_recvfrom
>      6.09 ± 18%     +17.8      23.88 ±  5%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.unfreeze_partials.put_cpu_partial.kmem_cache_free
>      6.37 ± 18%     +17.9      24.23 ±  5%  perf-profile.calltrace.cycles-pp._raw_spin_lock.unfreeze_partials.put_cpu_partial.kmem_cache_free.skb_free_datagram
>      6.92 ± 17%     +18.0      24.94 ±  5%  perf-profile.calltrace.cycles-pp.unfreeze_partials.put_cpu_partial.kmem_cache_free.skb_free_datagram.packet_recvmsg
>      7.03 ± 16%     +18.0      25.06 ±  5%  perf-profile.calltrace.cycles-pp.put_cpu_partial.kmem_cache_free.skb_free_datagram.packet_recvmsg.__sys_recvfrom

..
and I think the reason is here:

>     26.26 ±  6%     +23.8      50.07 ±  3%  perf-profile.children.cycles-pp._raw_spin_lock

Look at that +23.8 for _raw_spin_lock, and it really shows up here too:

>     25.86 ±  6%     +23.4      49.22 ±  3%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath

I think what is going on is that your change caused the "contention on
the freelist" case to now loop - possibly several times, and
expensively with atomic operations - while you are holding the
'n->list_lock' spinlock in get_partial_node().

End result: contention on the freelist now becomes *huge* contention on
that list_lock instead.

Before, it would release the list lock, and generally then (maybe) try
again. Or more likely just get another page and avoid the contention.

So when you wrote:

   However, the current code accidentally stops looking at the partial
   list completely in that case. Especially on kernels without
   CONFIG_NUMA set, this means that get_partial() fails and
   new_slab_objects() falls back to new_slab(), allocating new pages.
   This could lead to an unnecessary increase in memory fragmentation.

it really looks like this might well have been very intentional indeed.
Or at least very beneficial for _some_ loads.
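Just so we're talking about the same thing - here's a rough sketch of
the shape of that loop as I remember it, with the pfmemalloc check and
the cpu-partial bookkeeping elided, so don't take the details as gospel:

    static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
                                  struct kmem_cache_cpu *c, gfp_t gfpflags)
    {
            struct page *page, *page2;
            void *object = NULL;
            int objects;

            /* Racy check - no partial slabs on this node, don't bother */
            if (!n || !n->nr_partial)
                    return NULL;

            spin_lock(&n->list_lock);
            list_for_each_entry_safe(page, page2, &n->partial, slab_list) {
                    void *t;

                    /*
                     * acquire_slab() tries to take the page freelist with a
                     * cmpxchg-style atomic, and returns NULL if it loses
                     * the race.
                     */
                    t = acquire_slab(s, n, page, object == NULL, &objects);
                    if (!t)
                            break;  /* old: give up, drop the lock, go get a
                                     * fresh page instead */
                            /* new: 'continue' - keep scanning the partial
                             * list and retrying the atomics with
                             * n->list_lock still held */

                    /* ... take the object / stash extra pages on the cpu
                     * partial list, and stop once we have enough ... */
            }
            spin_unlock(&n->list_lock);
            return object;
    }

So with the 'continue', every loser of the freelist race just keeps
doing more atomics inside the list_lock region, and everybody else then
piles up on that lock - which is exactly what that
native_queued_spin_lock_slowpath number looks like.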
Comments?

           Linus
