Jann, it looks like that change of yours had a rather big negative impact
on this load.
On Sun, Feb 28, 2021 at 11:49 PM kernel test robot <[email protected]> wrote:
>
> FYI, we noticed a -47.9% regression of stress-ng.rawpkt.ops_per_sec due to
> commit:

Looking at the profile, nothing really obvious stands out, although
some of the numbers imply more polling, and less waiting, ie:

Lots less context switching:

>     12810 ± 17%    -63.6%       4658 ±  8%  sched_debug.cpu.nr_switches.avg

Less time spent sending packets:

>     68.52            -5.0      63.55        perf-profile.calltrace.cycles-pp.__dev_queue_xmit.packet_snd.sock_sendmsg.__sys_sendto.__x64_sys_sendto
>     69.00            -4.9      64.14        perf-profile.calltrace.cycles-pp.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe

and quite a lot more time spent in what looks like the receive path,
which allocates the packets:

>      5.47 ± 12%      +4.5       9.94 ±  2%  perf-profile.calltrace.cycles-pp.kmem_cache_alloc.skb_clone.packet_rcv.__netif_receive_skb_core.__netif_receive_skb_one_core
>      5.46 ± 12%      +4.5       9.92 ±  2%  perf-profile.calltrace.cycles-pp.skb_clone.packet_rcv.dev_queue_xmit_nit.dev_hard_start_xmit.__dev_queue_xmit
>      5.37 ± 12%      +4.5       9.88 ±  2%  perf-profile.calltrace.cycles-pp.kmem_cache_alloc.skb_clone.packet_rcv.dev_queue_xmit_nit.dev_hard_start_xmit
>      5.18 ± 13%      +4.5       9.71 ±  2%  perf-profile.calltrace.cycles-pp.__slab_alloc.kmem_cache_alloc.skb_clone.packet_rcv.__netif_receive_skb_core
>      5.08 ± 13%      +4.6       9.66 ±  2%  perf-profile.calltrace.cycles-pp.__slab_alloc.kmem_cache_alloc.skb_clone.packet_rcv.dev_queue_xmit_nit
>     29.57 ±  2%      +5.5      35.12        perf-profile.calltrace.cycles-pp.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
>     29.48 ±  2%      +5.6      35.08        perf-profile.calltrace.cycles-pp.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
>     26.62 ±  2%      +7.1      33.67        perf-profile.calltrace.cycles-pp.packet_recvmsg.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      8.83 ± 15%      +8.8      17.66 ±  2%  perf-profile.calltrace.cycles-pp._raw_spin_lock.get_partial_node.___slab_alloc.__slab_alloc.kmem_cache_alloc
>      8.58 ± 15%      +8.9      17.48 ±  2%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.get_partial_node.___slab_alloc.__slab_alloc
>      9.46 ± 14%      +9.0      18.44 ±  2%  perf-profile.calltrace.cycles-pp.get_partial_node.___slab_alloc.__slab_alloc.kmem_cache_alloc.skb_clone
>     10.19 ± 13%      +9.1      19.28 ±  2%  perf-profile.calltrace.cycles-pp.___slab_alloc.__slab_alloc.kmem_cache_alloc.skb_clone.packet_rcv
>     14.31 ±  6%     +14.0      28.27 ±  3%  perf-profile.calltrace.cycles-pp.skb_free_datagram.packet_recvmsg.__sys_recvfrom.__x64_sys_recvfrom.do_syscall_64
>      9.50 ± 11%     +16.7      26.21 ±  4%  perf-profile.calltrace.cycles-pp.kmem_cache_free.skb_free_datagram.packet_recvmsg.__sys_recvfrom.__x64_sys_recvfrom
>      6.09 ± 18%     +17.8      23.88 ±  5%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.unfreeze_partials.put_cpu_partial.kmem_cache_free
>      6.37 ± 18%     +17.9      24.23 ±  5%  perf-profile.calltrace.cycles-pp._raw_spin_lock.unfreeze_partials.put_cpu_partial.kmem_cache_free.skb_free_datagram
>      6.92 ± 17%     +18.0      24.94 ±  5%  perf-profile.calltrace.cycles-pp.unfreeze_partials.put_cpu_partial.kmem_cache_free.skb_free_datagram.packet_recvmsg
>      7.03 ± 16%     +18.0      25.06 ±  5%  perf-profile.calltrace.cycles-pp.put_cpu_partial.kmem_cache_free.skb_free_datagram.packet_recvmsg.__sys_recvfrom

..
and I think the reason is here:

>     26.26 ±  6%     +23.8      50.07 ±  3%  perf-profile.children.cycles-pp._raw_spin_lock

Look at that +23.8 for _raw_spin_lock, and it really shows up here too:

>     25.86 ±  6%     +23.4      49.22 ±  3%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath

I think what is going on is that your change caused the "contention on
the freelist" case to now loop - possibly several times, and
expensively with atomic operations - while you are holding the
'n->list_lock' spinlock in get_partial_node().

End result: contention on the freelist now becomes *huge* contention on
that list_lock instead.

Before, it would release the list lock, and generally then (maybe) try
again. Or more likely just get another page and avoid the contention.

So when you wrote:

   However, the current code accidentally stops looking at the partial
   list completely in that case. Especially on kernels without
   CONFIG_NUMA set, this means that get_partial() fails and
   new_slab_objects() falls back to new_slab(), allocating new pages.
   This could lead to an unnecessary increase in memory fragmentation.

it really looks like this might well have been very intentional indeed.
Or at least very beneficial for _some_ loads.
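Just so we're talking about the same thing - here's a rough sketch of
the shape of that loop as I remember it, with the pfmemalloc check and
the cpu-partial bookkeeping elided, so don't take the details as gospel:

    static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
                                  struct kmem_cache_cpu *c, gfp_t gfpflags)
    {
            struct page *page, *page2;
            void *object = NULL;
            int objects;

            /* Racy check - no partial slabs on this node, don't bother */
            if (!n || !n->nr_partial)
                    return NULL;

            spin_lock(&n->list_lock);
            list_for_each_entry_safe(page, page2, &n->partial, slab_list) {
                    void *t;

                    /*
                     * acquire_slab() tries to take the page freelist with a
                     * cmpxchg-style atomic, and returns NULL if it loses
                     * the race.
                     */
                    t = acquire_slab(s, n, page, object == NULL, &objects);
                    if (!t)
                            break;  /* old: give up, drop the lock, go get a
                                     * fresh page instead */
                            /* new: 'continue' - keep scanning the partial
                             * list and retrying the atomics with
                             * n->list_lock still held */

                    /* ... take the object / stash extra pages on the cpu
                     * partial list, and stop once we have enough ... */
            }
            spin_unlock(&n->list_lock);
            return object;
    }

So with the 'continue', every loser of the freelist race just keeps
doing more atomics inside the list_lock region, and everybody else then
piles up on that lock - which is exactly what that
native_queued_spin_lock_slowpath number looks like.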
Comments?

           Linus
