Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Fri, 2016-12-02 at 16:37 +0100, Jesper Dangaard Brouer wrote: > On Thu, 01 Dec 2016 23:17:48 +0100 > Paolo Abeni wrote: > > > On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote: > > > (Cc. netdev, we might have an issue with Paolo's UDP accounting and > > > small socket queues) > > > > > > On Wed, 30 Nov 2016 16:35:20 + > > > Mel Gorman wrote: > > > > > > > > I don't quite get why you are setting the socket recv size > > > > > (with -- -s and -S) to such a small number, size + 256. > > > > > > > > > > > > > Maybe I missed something at the time I wrote that but why would it > > > > need to be larger? > > > > > > Well, to me it is quite obvious that we need some queue to avoid packet > > > drops. We have two processes netperf and netserver, that are sending > > > packets between each-other (UDP_STREAM mostly netperf -> netserver). > > > These PIDs are getting scheduled and migrated between CPUs, and thus > > > does not get executed equally fast, thus a queue is need absorb the > > > fluctuations. > > > > > > The network stack is even partly catching your config "mistake" and > > > increase the socket queue size, so we minimum can handle one max frame > > > (due skb "truesize" concept approx PAGE_SIZE + overhead). > > > > > > Hopefully for localhost testing a small queue should hopefully not > > > result in packet drops. Testing... ups, this does result in packet > > > drops. 
> > > Test command extracted from mmtests, UDP_STREAM size 1024:
> > >
> > >  netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \
> > >    -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
> > >
> > >  UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15895 AF_INET
> > >  to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
> > >  Socket  Message  Elapsed      Messages
> > >  Size    Size     Time         Okay Errors   Throughput
> > >  bytes   bytes    secs            #      #   10^6bits/sec
> > >
> > >    4608    1024   60.00    50024301      0    6829.98
> > >    2560           60.00    46133211           6298.72
> > >
> > > Dropped packets: 50024301-46133211=3891090
> > >
> > > To get a better drop indication, during this I run a command, to get
> > > system-wide network counters from the last second, so below numbers are
> > > per second.
> > >
> > >  $ nstat > /dev/null && sleep 1 && nstat
> > >  #kernel
> > >  IpInReceives                    885162             0.0
> > >  IpInDelivers                    885161             0.0
> > >  IpOutRequests                   885162             0.0
> > >  UdpInDatagrams                  776105             0.0
> > >  UdpInErrors                     109056             0.0
> > >  UdpOutDatagrams                 885160             0.0
> > >  UdpRcvbufErrors                 109056             0.0
> > >  IpExtInOctets                   931190476          0.0
> > >  IpExtOutOctets                  931189564          0.0
> > >  IpExtInNoECTPkts                885162             0.0
> > >
> > > So, 885Kpps but only 776Kpps delivered and 109Kpps drops. See
> > > UdpInErrors and UdpRcvbufErrors is equal (109056/sec). This drop
> > > happens kernel side in __udp_queue_rcv_skb[1], because receiving
> > > process didn't empty it's queue fast enough see [2].
> > >
> > > Although upstream changes are coming in this area, [2] is replaced with
> > > __udp_enqueue_schedule_skb, which I actually tested with... hmm
> > >
> > > Retesting with kernel 4.7.0-baseline+ ... show something else.
> > > To Paolo, you might want to look into this. And it could also explain why
> > > I've not see the mentioned speedup by mm-change, as I've been testing
> > > this patch on top of net-next (at 93ba550) with Paolo's UDP changes.
> >
> > Thank you for reporting this.
> > It seems that the commit 123b4a633580 ("udp: use it's own memory
> > accounting schema") is too strict while checking the rcvbuf.
> >
> > For very small value of rcvbuf, it allows a single skb to be enqueued,
> > while previously we allowed 2 of them to enter the queue, even if the
> > first one truesize exceeded rcvbuf, as in your test-case.
> >
> > Can you please try the following patch ?
>
> Sure, it looks much better with this patch.

Thank you for testing. I'll send a formal patch to David soon.

BTW I see a nice performance improvement compared to 4.7...

Cheers,

Paolo
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Thu, 01 Dec 2016 23:17:48 +0100 Paolo Abeni wrote: > On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote: > > (Cc. netdev, we might have an issue with Paolo's UDP accounting and > > small socket queues) > > > > On Wed, 30 Nov 2016 16:35:20 + > > Mel Gorman wrote: > > > > > > I don't quite get why you are setting the socket recv size > > > > (with -- -s and -S) to such a small number, size + 256. > > > > > > > > > > Maybe I missed something at the time I wrote that but why would it > > > need to be larger? > > > > Well, to me it is quite obvious that we need some queue to avoid packet > > drops. We have two processes netperf and netserver, that are sending > > packets between each-other (UDP_STREAM mostly netperf -> netserver). > > These PIDs are getting scheduled and migrated between CPUs, and thus > > does not get executed equally fast, thus a queue is need absorb the > > fluctuations. > > > > The network stack is even partly catching your config "mistake" and > > increase the socket queue size, so we minimum can handle one max frame > > (due skb "truesize" concept approx PAGE_SIZE + overhead). > > > > Hopefully for localhost testing a small queue should hopefully not > > result in packet drops. Testing... ups, this does result in packet > > drops. 
> > Test command extracted from mmtests, UDP_STREAM size 1024:
> >
> >  netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \
> >    -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
> >
> >  UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15895 AF_INET
> >  to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
> >  Socket  Message  Elapsed      Messages
> >  Size    Size     Time         Okay Errors   Throughput
> >  bytes   bytes    secs            #      #   10^6bits/sec
> >
> >    4608    1024   60.00    50024301      0    6829.98
> >    2560           60.00    46133211           6298.72
> >
> > Dropped packets: 50024301-46133211=3891090
> >
> > To get a better drop indication, during this I run a command, to get
> > system-wide network counters from the last second, so below numbers are
> > per second.
> >
> >  $ nstat > /dev/null && sleep 1 && nstat
> >  #kernel
> >  IpInReceives                    885162             0.0
> >  IpInDelivers                    885161             0.0
> >  IpOutRequests                   885162             0.0
> >  UdpInDatagrams                  776105             0.0
> >  UdpInErrors                     109056             0.0
> >  UdpOutDatagrams                 885160             0.0
> >  UdpRcvbufErrors                 109056             0.0
> >  IpExtInOctets                   931190476          0.0
> >  IpExtOutOctets                  931189564          0.0
> >  IpExtInNoECTPkts                885162             0.0
> >
> > So, 885Kpps but only 776Kpps delivered and 109Kpps drops. See
> > UdpInErrors and UdpRcvbufErrors is equal (109056/sec). This drop
> > happens kernel side in __udp_queue_rcv_skb[1], because receiving
> > process didn't empty it's queue fast enough see [2].
> >
> > Although upstream changes are coming in this area, [2] is replaced with
> > __udp_enqueue_schedule_skb, which I actually tested with... hmm
> >
> > Retesting with kernel 4.7.0-baseline+ ... show something else.
> > To Paolo, you might want to look into this. And it could also explain why
> > I've not see the mentioned speedup by mm-change, as I've been testing
> > this patch on top of net-next (at 93ba550) with Paolo's UDP changes.
>
> Thank you for reporting this.
>
> It seems that the commit 123b4a633580 ("udp: use it's own memory
> accounting schema") is too strict while checking the rcvbuf.
> For very small value of rcvbuf, it allows a single skb to be enqueued,
> while previously we allowed 2 of them to enter the queue, even if the
> first one truesize exceeded rcvbuf, as in your test-case.
>
> Can you please try the following patch ?

Sure, it looks much better with this patch.

 $ /home/jbrouer/git/mmtests/work/testdisk/sources/netperf-2.4.5-installed/bin/netperf \
   -t UDP_STREAM -l 60 -H 127.0.0.1 -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895

 UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15895 AF_INET
 to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
 Socket  Message  Elapsed      Messages
 Size    Size     Time         Okay Errors   Throughput
 bytes   bytes    secs            #      #   10^6bits/sec

   4608    1024   60.00    50191555      0    6852.82
   2560           60.00    50189872           6852.59

Only 50191555-50189872=1683 drops, approx 1683/60 = 28/sec

 $ nstat > /dev/null && sleep 1 && nstat
 #kernel
 IpInReceives                    885417             0.0
 IpInDelivers                    885416             0.0
 IpOutRequests                   885417             0.0
 UdpInDatagrams                  885382             0.0
 UdpInErrors                     29                 0.0
 UdpOutDatagrams
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote:
> (Cc. netdev, we might have an issue with Paolo's UDP accounting and
> small socket queues)
>
> On Wed, 30 Nov 2016 16:35:20 + Mel Gorman wrote:
>
> > > I don't quite get why you are setting the socket recv size
> > > (with -- -s and -S) to such a small number, size + 256.
> > >
> >
> > Maybe I missed something at the time I wrote that but why would it
> > need to be larger?
>
> Well, to me it is quite obvious that we need some queue to avoid packet
> drops. We have two processes netperf and netserver, that are sending
> packets between each-other (UDP_STREAM mostly netperf -> netserver).
> These PIDs are getting scheduled and migrated between CPUs, and thus
> does not get executed equally fast, thus a queue is need absorb the
> fluctuations.
>
> The network stack is even partly catching your config "mistake" and
> increase the socket queue size, so we minimum can handle one max frame
> (due skb "truesize" concept approx PAGE_SIZE + overhead).
>
> Hopefully for localhost testing a small queue should hopefully not
> result in packet drops. Testing... ups, this does result in packet
> drops.
>
> Test command extracted from mmtests, UDP_STREAM size 1024:
>
>  netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \
>    -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
>
>  UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15895 AF_INET
>  to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
>  Socket  Message  Elapsed      Messages
>  Size    Size     Time         Okay Errors   Throughput
>  bytes   bytes    secs            #      #   10^6bits/sec
>
>    4608    1024   60.00    50024301      0    6829.98
>    2560           60.00    46133211           6298.72
>
> Dropped packets: 50024301-46133211=3891090
>
> To get a better drop indication, during this I run a command, to get
> system-wide network counters from the last second, so below numbers are
> per second.
>  $ nstat > /dev/null && sleep 1 && nstat
>  #kernel
>  IpInReceives                    885162             0.0
>  IpInDelivers                    885161             0.0
>  IpOutRequests                   885162             0.0
>  UdpInDatagrams                  776105             0.0
>  UdpInErrors                     109056             0.0
>  UdpOutDatagrams                 885160             0.0
>  UdpRcvbufErrors                 109056             0.0
>  IpExtInOctets                   931190476          0.0
>  IpExtOutOctets                  931189564          0.0
>  IpExtInNoECTPkts                885162             0.0
>
> So, 885Kpps but only 776Kpps delivered and 109Kpps drops. See
> UdpInErrors and UdpRcvbufErrors is equal (109056/sec). This drop
> happens kernel side in __udp_queue_rcv_skb[1], because receiving
> process didn't empty it's queue fast enough see [2].
>
> Although upstream changes are coming in this area, [2] is replaced with
> __udp_enqueue_schedule_skb, which I actually tested with... hmm
>
> Retesting with kernel 4.7.0-baseline+ ... show something else.
> To Paolo, you might want to look into this. And it could also explain why
> I've not see the mentioned speedup by mm-change, as I've been testing
> this patch on top of net-next (at 93ba550) with Paolo's UDP changes.

Thank you for reporting this.

It seems that the commit 123b4a633580 ("udp: use it's own memory
accounting schema") is too strict while checking the rcvbuf.

For very small value of rcvbuf, it allows a single skb to be enqueued,
while previously we allowed 2 of them to enter the queue, even if the
first one truesize exceeded rcvbuf, as in your test-case.

Can you please try the following patch ?
Thank you,

Paolo
---
 net/ipv4/udp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e1d0bf8..2f5dc92 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1200,19 +1200,21 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	struct sk_buff_head *list = &sk->sk_receive_queue;
 	int rmem, delta, amt, err = -ENOMEM;
 	int size = skb->truesize;
+	int limit;
 
 	/* try to avoid the costly atomic add/sub pair when the receive
 	 * queue is full; always allow at least a packet
 	 */
 	rmem = atomic_read(&sk->sk_rmem_alloc);
-	if (rmem && (rmem + size > sk->sk_rcvbuf))
+	limit = size + sk->sk_rcvbuf;
+	if (rmem > limit)
 		goto drop;
 
 	/* we drop only if the receive buf is full and the receive
 	 * queue contains some other skb
 	 */
 	rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
-	if ((rmem > sk->sk_rcvbuf) && (rmem > size))
+	if (rmem > limit)
 		goto uncharge_drop;
 
 	spin_lock(&list->lock);
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
(Cc. netdev, we might have an issue with Paolo's UDP accounting and
small socket queues)

On Wed, 30 Nov 2016 16:35:20 + Mel Gorman wrote:

> > I don't quite get why you are setting the socket recv size
> > (with -- -s and -S) to such a small number, size + 256.
> >
>
> Maybe I missed something at the time I wrote that but why would it
> need to be larger?

Well, to me it is quite obvious that we need some queue to avoid packet
drops. We have two processes netperf and netserver, that are sending
packets between each-other (UDP_STREAM mostly netperf -> netserver).
These PIDs are getting scheduled and migrated between CPUs, and thus
does not get executed equally fast, thus a queue is need absorb the
fluctuations.

The network stack is even partly catching your config "mistake" and
increase the socket queue size, so we minimum can handle one max frame
(due skb "truesize" concept approx PAGE_SIZE + overhead).

Hopefully for localhost testing a small queue should hopefully not
result in packet drops. Testing... ups, this does result in packet
drops.

Test command extracted from mmtests, UDP_STREAM size 1024:

 netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \
   -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895

 UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15895 AF_INET
 to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
 Socket  Message  Elapsed      Messages
 Size    Size     Time         Okay Errors   Throughput
 bytes   bytes    secs            #      #   10^6bits/sec

   4608    1024   60.00    50024301      0    6829.98
   2560           60.00    46133211           6298.72

Dropped packets: 50024301-46133211=3891090

To get a better drop indication, during this I run a command, to get
system-wide network counters from the last second, so below numbers are
per second.
 $ nstat > /dev/null && sleep 1 && nstat
 #kernel
 IpInReceives                    885162             0.0
 IpInDelivers                    885161             0.0
 IpOutRequests                   885162             0.0
 UdpInDatagrams                  776105             0.0
 UdpInErrors                     109056             0.0
 UdpOutDatagrams                 885160             0.0
 UdpRcvbufErrors                 109056             0.0
 IpExtInOctets                   931190476          0.0
 IpExtOutOctets                  931189564          0.0
 IpExtInNoECTPkts                885162             0.0

So, 885Kpps but only 776Kpps delivered and 109Kpps drops. See
UdpInErrors and UdpRcvbufErrors is equal (109056/sec). This drop
happens kernel side in __udp_queue_rcv_skb[1], because receiving
process didn't empty it's queue fast enough see [2].

Although upstream changes are coming in this area, [2] is replaced with
__udp_enqueue_schedule_skb, which I actually tested with... hmm

Retesting with kernel 4.7.0-baseline+ ... show something else.
To Paolo, you might want to look into this. And it could also explain why
I've not see the mentioned speedup by mm-change, as I've been testing
this patch on top of net-next (at 93ba550) with Paolo's UDP changes.

 netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \
   -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895

 UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15895 AF_INET
 to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
 Socket  Message  Elapsed      Messages
 Size    Size     Time         Okay Errors   Throughput
 bytes   bytes    secs            #      #   10^6bits/sec

   4608    1024   60.00    47248301      0    6450.97
   2560           60.00    47245030           6450.52

Only dropped 47248301-47245030=3271

 $ nstat > /dev/null && sleep 1 && nstat
 #kernel
 IpInReceives                    810566             0.0
 IpInDelivers                    810566             0.0
 IpOutRequests                   810566             0.0
 UdpInDatagrams                  810468             0.0
 UdpInErrors                     99                 0.0
 UdpOutDatagrams                 810566             0.0
 UdpRcvbufErrors                 99                 0.0
 IpExtInOctets                   852713328          0.0
 IpExtOutOctets                  852713328          0.0
 IpExtInNoECTPkts                810563             0.0

And nstat is also much better with only 99 drop/sec.
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] http://lxr.free-electrons.com/source/net/ipv4/udp.c?v=4.8#L1454
[2] http://lxr.free-electrons.com/source/net/core/sock.c?v=4.8#L413

Extra: with net-next at 93ba550

If I use netperf default socket queue, then there is not a single
packet drop:

 netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 \
   -- -m 1024 -M 1024 -P 15895

 UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 15895 AF_INET
 to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
 Socket  Message  Elapsed      Messages
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Wed, Nov 30, 2016 at 04:06:12PM +0100, Jesper Dangaard Brouer wrote: > > > [...] > > > > This is the result from netperf running UDP_STREAM on localhost. It was > > > > selected on the basis that it is slab-intensive and has been the subject > > > > of previous SLAB vs SLUB comparisons with the caveat that this is not > > > > testing between two physical hosts. > > > > > > I do like you are using a networking test to benchmark this. Looking at > > > the results, my initial response is that the improvements are basically > > > too good to be true. > > > > > > > FWIW, LKP independently measured the boost to be 23% so it's expected > > there will be different results depending on exact configuration and CPU. > > Yes, noticed that, nice (which was a SCTP test) > https://lists.01.org/pipermail/lkp/2016-November/005210.html > > It is of-cause great. It is just strange I cannot reproduce it on my > high-end box, with manual testing. I'll try your test suite and try to > figure out what is wrong with my setup. > That would be great. I had seen the boost on multiple machines and LKP verifying it is helpful. > > > > Can you share how you tested this with netperf and the specific netperf > > > parameters? > > > > The mmtests config file used is > > configs/config-global-dhp__network-netperf-unbound so all details can be > > extrapolated or reproduced from that. > > I didn't know of mmtests: https://github.com/gormanm/mmtests > > It looks nice and quite comprehensive! :-) > Thanks. > > > e.g. > > > How do you configure the send/recv sizes? > > > > Static range of sizes specified in the config file. > > I'll figure it out... reading your shell code :-) > > export NETPERF_BUFFER_SIZES=64,128,256,1024,2048,3312,4096,8192,16384 > > https://github.com/gormanm/mmtests/blob/master/configs/config-global-dhp__network-netperf-unbound#L72 > > I see you are using netperf 2.4.5 and setting both the send an recv > size (-- -m and -M) which is fine. > Ok. 
> I don't quite get why you are setting the socket recv size (with -- -s > and -S) to such a small number, size + 256. > Maybe I missed something at the time I wrote that but why would it need to be larger? -- Mel Gorman SUSE Labs
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Wed, 30 Nov 2016 14:06:15 + Mel Gorman wrote: > On Wed, Nov 30, 2016 at 01:40:34PM +0100, Jesper Dangaard Brouer wrote: > > > > On Sun, 27 Nov 2016 13:19:54 + Mel Gorman > > wrote: > > > > [...] > > > SLUB has been the default small kernel object allocator for quite some > > > time > > > but it is not universally used due to performance concerns and a reliance > > > on high-order pages. The high-order concerns has two major components -- > > > high-order pages are not always available and high-order page allocations > > > potentially contend on the zone->lock. This patch addresses some concerns > > > about the zone lock contention by extending the per-cpu page allocator to > > > cache high-order pages. The patch makes the following modifications > > > > > > o New per-cpu lists are added to cache the high-order pages. This > > > increases > > > the cache footprint of the per-cpu allocator and overall usage but for > > > some workloads, this will be offset by reduced contention on > > > zone->lock. > > > > This will also help performance of NIC driver that allocator > > higher-order pages for their RX-ring queue (and chop it up for MTU). > > I do like this patch, even-though I'm working on moving drivers away > > from allocation these high-order pages. > > > > Acked-by: Jesper Dangaard Brouer > > > > Thanks. > > > [...] > > > This is the result from netperf running UDP_STREAM on localhost. It was > > > selected on the basis that it is slab-intensive and has been the subject > > > of previous SLAB vs SLUB comparisons with the caveat that this is not > > > testing between two physical hosts. > > > > I do like you are using a networking test to benchmark this. Looking at > > the results, my initial response is that the improvements are basically > > too good to be true. > > > > FWIW, LKP independently measured the boost to be 23% so it's expected > there will be different results depending on exact configuration and CPU. 
Yes, noticed that, nice (which was a SCTP test)
https://lists.01.org/pipermail/lkp/2016-November/005210.html

It is of-cause great. It is just strange I cannot reproduce it on my
high-end box, with manual testing. I'll try your test suite and try to
figure out what is wrong with my setup.

> > Can you share how you tested this with netperf and the specific netperf
> > parameters?
>
> The mmtests config file used is
> configs/config-global-dhp__network-netperf-unbound so all details can be
> extrapolated or reproduced from that.

I didn't know of mmtests: https://github.com/gormanm/mmtests
It looks nice and quite comprehensive! :-)

> > e.g.
> > How do you configure the send/recv sizes?
>
> Static range of sizes specified in the config file.

I'll figure it out... reading your shell code :-)

 export NETPERF_BUFFER_SIZES=64,128,256,1024,2048,3312,4096,8192,16384

https://github.com/gormanm/mmtests/blob/master/configs/config-global-dhp__network-netperf-unbound#L72

I see you are using netperf 2.4.5 and setting both the send an recv
size (-- -m and -M) which is fine. I don't quite get why you are
setting the socket recv size (with -- -s and -S) to such a small
number, size + 256.

 SOCKETSIZE_OPT="-s $((SIZE+256)) -S $((SIZE+256))"

 netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -i 3 3 -I 95 5 -H 127.0.0.1 \
   -- -s 320 -S 320 -m 64 -M 64 -P 15895

 netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -i 3 3 -I 95 5 -H 127.0.0.1 \
   -- -s 384 -S 384 -m 128 -M 128 -P 15895

 netperf-2.4.5-installed/bin/netperf -t UDP_STREAM -i 3 3 -I 95 5 -H 127.0.0.1 \
   -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895

> > Have you pinned netperf and netserver on different CPUs?
> >
>
> No. While it's possible to do a pinned test which helps stability, it
> also tends to be less reflective of what happens in a variety of
> workloads so I took the "harder" option.

Agree.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Wed 30-11-16 14:16:13, Mel Gorman wrote:
> On Wed, Nov 30, 2016 at 02:05:50PM +0100, Michal Hocko wrote:
[...]
> > But... Unless I am missing something this effectively means that we do
> > not exercise high order atomic reserves. Shouldn't we fallback to
> > the locked __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC) for
> > order > 0 && ALLOC_HARDER ? Or is this just hidden in some other code
> > path which I am not seeing?
> >
>
> Good spot, would this be acceptable to you?

It's not a queen of beauty but it works. A more elegant solution would
require more surgery I guess which is probably not worth it at this
stage.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 91dc68c2a717..94808f565f74 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2609,9 +2609,18 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> 				int nr_pages = rmqueue_bulk(zone, order,
> 						pcp->batch, list,
> 						migratetype, cold);
> -				pcp->count += (nr_pages << order);
> -				if (unlikely(list_empty(list)))
> +				if (unlikely(list_empty(list))) {
> +					/*
> +					 * Retry high-order atomic allocs
> +					 * from the buddy list which may
> +					 * use MIGRATE_HIGHATOMIC.
> +					 */
> +					if (order && (alloc_flags & ALLOC_HARDER))
> +						goto try_buddylist;
> +
> 					goto failed;
> +				}
> +				pcp->count += (nr_pages << order);
> 			}
> 
> 			if (cold)
> @@ -2624,6 +2633,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> 
> 		} while (check_new_pcp(page));
> 	} else {
> +try_buddylist:
> 		/*
> 		 * We most definitely don't want callers attempting to
> 		 * allocate greater than order-1 page units with __GFP_NOFAIL.
> -- 
> Mel Gorman
> SUSE Labs

-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Wed, Nov 30, 2016 at 02:05:50PM +0100, Michal Hocko wrote:
> On Sun 27-11-16 13:19:54, Mel Gorman wrote:
> [...]
> > @@ -2588,18 +2594,22 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> > 	struct page *page;
> > 	bool cold = ((gfp_flags & __GFP_COLD) != 0);
> > 
> > -	if (likely(order == 0)) {
> > +	if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> > 		struct per_cpu_pages *pcp;
> > 		struct list_head *list;
> > 
> > 		local_irq_save(flags);
> > 		do {
> > +			unsigned int pindex;
> > +
> > +			pindex = order_to_pindex(migratetype, order);
> > 			pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > -			list = &pcp->lists[migratetype];
> > +			list = &pcp->lists[pindex];
> > 			if (list_empty(list)) {
> > -				pcp->count += rmqueue_bulk(zone, 0,
> > +				int nr_pages = rmqueue_bulk(zone, order,
> > 						pcp->batch, list,
> > 						migratetype, cold);
> > +				pcp->count += (nr_pages << order);
> > 				if (unlikely(list_empty(list)))
> > 					goto failed;
> 
> just a nit, we can reorder the check and the count update because nobody
> could have stolen pages allocated by rmqueue_bulk.

Ok, it's minor but I can do that.

> I would also consider
> nr_pages a bit misleading because we get a number or allocated elements.
> Nothing to lose sleep over...
> 

I didn't think of a clearer name because in this sort of context, I
consider a high-order page to be a single page.

> > 			}
> 
> But... Unless I am missing something this effectively means that we do
> not exercise high order atomic reserves. Shouldn't we fallback to
> the locked __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC) for
> order > 0 && ALLOC_HARDER ? Or is this just hidden in some other code
> path which I am not seeing?
> 

Good spot, would this be acceptable to you?
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91dc68c2a717..94808f565f74 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2609,9 +2609,18 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 				int nr_pages = rmqueue_bulk(zone, order,
 						pcp->batch, list,
 						migratetype, cold);
-				pcp->count += (nr_pages << order);
-				if (unlikely(list_empty(list)))
+				if (unlikely(list_empty(list))) {
+					/*
+					 * Retry high-order atomic allocs
+					 * from the buddy list which may
+					 * use MIGRATE_HIGHATOMIC.
+					 */
+					if (order && (alloc_flags & ALLOC_HARDER))
+						goto try_buddylist;
+
 					goto failed;
+				}
+				pcp->count += (nr_pages << order);
 			}
 
 			if (cold)
@@ -2624,6 +2633,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 
 		} while (check_new_pcp(page));
 	} else {
+try_buddylist:
 		/*
 		 * We most definitely don't want callers attempting to
 		 * allocate greater than order-1 page units with __GFP_NOFAIL.
-- 
Mel Gorman
SUSE Labs
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Wed, Nov 30, 2016 at 01:40:34PM +0100, Jesper Dangaard Brouer wrote: > > On Sun, 27 Nov 2016 13:19:54 + Mel Gorman > wrote: > > [...] > > SLUB has been the default small kernel object allocator for quite some time > > but it is not universally used due to performance concerns and a reliance > > on high-order pages. The high-order concerns has two major components -- > > high-order pages are not always available and high-order page allocations > > potentially contend on the zone->lock. This patch addresses some concerns > > about the zone lock contention by extending the per-cpu page allocator to > > cache high-order pages. The patch makes the following modifications > > > > o New per-cpu lists are added to cache the high-order pages. This increases > > the cache footprint of the per-cpu allocator and overall usage but for > > some workloads, this will be offset by reduced contention on zone->lock. > > This will also help performance of NIC driver that allocator > higher-order pages for their RX-ring queue (and chop it up for MTU). > I do like this patch, even-though I'm working on moving drivers away > from allocation these high-order pages. > > Acked-by: Jesper Dangaard Brouer > Thanks. > [...] > > This is the result from netperf running UDP_STREAM on localhost. It was > > selected on the basis that it is slab-intensive and has been the subject > > of previous SLAB vs SLUB comparisons with the caveat that this is not > > testing between two physical hosts. > > I do like you are using a networking test to benchmark this. Looking at > the results, my initial response is that the improvements are basically > too good to be true. > FWIW, LKP independently measured the boost to be 23% so it's expected there will be different results depending on exact configuration and CPU. > Can you share how you tested this with netperf and the specific netperf > parameters? 
The mmtests config file used is configs/config-global-dhp__network-netperf-unbound so all details can be extrapolated or reproduced from that. > e.g. > How do you configure the send/recv sizes? Static range of sizes specified in the config file. > Have you pinned netperf and netserver on different CPUs? > No. While it's possible to do a pinned test which helps stability, it also tends to be less reflective of what happens in a variety of workloads so I took the "harder" option. > For localhost testing, when netperf and netserver run on the same CPU, > you observer half the performance, very intuitively. When pinning > netperf and netserver (via e.g. option -T 1,2) you observe the most > stable results. When allowing netperf and netserver to migrate between > CPUs (default setting), the real fun starts and unstable results, > because now the CPU scheduler is also being tested, and my experience > is also more "fun" memory situations occurs, as I guess we are hopping > between more per CPU alloc caches (also affecting the SLUB per CPU usage > pattern). > Yes which is another reason why I used an unbound configuration. I didn't want to get an artificial boost from pinned server/client using the same per-cpu caches. As a side-effect, it may mean that machines with fewer CPUs get a greater boost as there are fewer per-cpu caches being used. > > 2-socket modern machine > > 4.9.0-rc5 4.9.0-rc5 > > vanilla hopcpu-v3 > > The kernel from 4.9.0-rc5-vanilla to 4.9.0-rc5-hopcpu-v3 only contains > this single change right? Yes. -- Mel Gorman SUSE Labs
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Sun 27-11-16 13:19:54, Mel Gorman wrote:
[...]
> @@ -2588,18 +2594,22 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> 	struct page *page;
> 	bool cold = ((gfp_flags & __GFP_COLD) != 0);
> 
> -	if (likely(order == 0)) {
> +	if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> 		struct per_cpu_pages *pcp;
> 		struct list_head *list;
> 
> 		local_irq_save(flags);
> 		do {
> +			unsigned int pindex;
> +
> +			pindex = order_to_pindex(migratetype, order);
> 			pcp = &this_cpu_ptr(zone->pageset)->pcp;
> -			list = &pcp->lists[migratetype];
> +			list = &pcp->lists[pindex];
> 			if (list_empty(list)) {
> -				pcp->count += rmqueue_bulk(zone, 0,
> +				int nr_pages = rmqueue_bulk(zone, order,
> 						pcp->batch, list,
> 						migratetype, cold);
> +				pcp->count += (nr_pages << order);
> 				if (unlikely(list_empty(list)))
> 					goto failed;

just a nit, we can reorder the check and the count update because nobody
could have stolen pages allocated by rmqueue_bulk. I would also consider
nr_pages a bit misleading because we get a number or allocated elements.
Nothing to lose sleep over...

> 			}

But... Unless I am missing something this effectively means that we do
not exercise high order atomic reserves. Shouldn't we fallback to
the locked __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC) for
order > 0 && ALLOC_HARDER ? Or is this just hidden in some other code
path which I am not seeing?

Other than that the patch looks reasonable to me. Keeping some portion
of !costly pages on pcp lists sounds useful from the fragmentation point
of view as well AFAICS because it would be normally dissolved for
order-0 requests while we push on the reclaim more right now.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Sun, 27 Nov 2016 13:19:54 + Mel Gorman wrote:
[...]
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns have two major components --
> high-order pages are not always available and high-order page allocations
> potentially contend on the zone->lock. This patch addresses some concerns
> about the zone lock contention by extending the per-cpu page allocator to
> cache high-order pages. The patch makes the following modifications
>
> o New per-cpu lists are added to cache the high-order pages. This increases
>   the cache footprint of the per-cpu allocator and overall usage but for
>   some workloads, this will be offset by reduced contention on zone->lock.

This will also help the performance of NIC drivers that allocate higher-order pages for their RX-ring queues (and chop them up for MTU-sized buffers). I do like this patch, even though I'm working on moving drivers away from allocating these high-order pages.

Acked-by: Jesper Dangaard Brouer

[...]
> This is the result from netperf running UDP_STREAM on localhost. It was
> selected on the basis that it is slab-intensive and has been the subject
> of previous SLAB vs SLUB comparisons with the caveat that this is not
> testing between two physical hosts.

I do like that you are using a networking test to benchmark this. Looking at the results, my initial response is that the improvements are basically too good to be true. Can you share how you tested this with netperf and the specific netperf parameters? e.g. How do you configure the send/recv sizes? Have you pinned netperf and netserver on different CPUs?

For localhost testing, when netperf and netserver run on the same CPU, you observe half the performance, quite intuitively. When pinning netperf and netserver (via e.g. option -T 1,2) you observe the most stable results.
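Jesper's point about RX rings can be illustrated with a small userspace model. Nothing below comes from the patch or any real driver; the helper name, fragment size, and order are hypothetical, and the sketch only shows why one higher-order allocation amortizes allocator cost across many RX buffers.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096u

/* Hypothetical model of a driver chopping one higher-order page block
 * into fixed-size RX buffer fragments: an order-3 block (8 pages) cut
 * into 2 KiB fragments yields 16 buffers from one allocator call,
 * which is the contention the per-cpu high-order cache helps with. */
static unsigned int rx_frags_per_block(unsigned int order,
                                       unsigned int frag_size)
{
	return (MODEL_PAGE_SIZE << order) / frag_size;
}
```

The trade-off Jesper alludes to is that this amortization only works while higher-order blocks are actually available, which is why he mentions moving drivers away from the technique.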
When allowing netperf and netserver to migrate between CPUs (default setting), the real fun starts and results become unstable, because now the CPU scheduler is also being tested, and in my experience more "fun" memory situations occur, as I guess we are hopping between more per-CPU alloc caches (also affecting the SLUB per-CPU usage pattern).

> 2-socket modern machine
>                        4.9.0-rc5             4.9.0-rc5
>                          vanilla             hopcpu-v3

The kernel from 4.9.0-rc5-vanilla to 4.9.0-rc5-hopcpu-v3 only contains this single change, right? Netdev/Paolo recently (in net-next) optimized the UDP code path significantly, and I just want to make sure your results are not affected by these changes.

> Hmean    send-64         178.38 (  0.00%)      256.74 ( 43.93%)
> Hmean    send-128        351.49 (  0.00%)      507.52 ( 44.39%)
> Hmean    send-256        671.23 (  0.00%)     1004.19 ( 49.60%)
> Hmean    send-1024      2663.60 (  0.00%)     3910.42 ( 46.81%)
> Hmean    send-2048      5126.53 (  0.00%)     7562.13 ( 47.51%)
> Hmean    send-3312      7949.99 (  0.00%)    11565.98 ( 45.48%)
> Hmean    send-4096      9433.56 (  0.00%)    12929.67 ( 37.06%)
> Hmean    send-8192     15940.64 (  0.00%)    21587.63 ( 35.43%)
> Hmean    send-16384    26699.54 (  0.00%)    32013.79 ( 19.90%)
> Hmean    recv-64         178.38 (  0.00%)      256.72 ( 43.92%)
> Hmean    recv-128        351.49 (  0.00%)      507.47 ( 44.38%)
> Hmean    recv-256        671.20 (  0.00%)     1003.95 ( 49.57%)
> Hmean    recv-1024      2663.45 (  0.00%)     3909.70 ( 46.79%)
> Hmean    recv-2048      5126.26 (  0.00%)     7560.67 ( 47.49%)
> Hmean    recv-3312      7949.50 (  0.00%)    11564.63 ( 45.48%)
> Hmean    recv-4096      9433.04 (  0.00%)    12927.48 ( 37.04%)
> Hmean    recv-8192     15939.64 (  0.00%)    21584.59 ( 35.41%)
> Hmean    recv-16384    26698.44 (  0.00%)    32009.77 ( 19.89%)
>
> 1-socket 6 year old machine
>                        4.9.0-rc5             4.9.0-rc5
>                          vanilla             hopcpu-v3
> Hmean    send-64          87.47 (  0.00%)      127.14 ( 45.36%)
> Hmean    send-128        174.36 (  0.00%)      256.42 ( 47.06%)
> Hmean    send-256        347.52 (  0.00%)      509.41 ( 46.59%)
> Hmean    send-1024      1363.03 (  0.00%)     1991.54 ( 46.11%)
> Hmean    send-2048      2632.68 (  0.00%)     3759.51 ( 42.80%)
> Hmean    send-3312      4123.19 (  0.00%)     5873.28 ( 42.45%)
> Hmean    send-4096      5056.48 (  0.00%)     7072.81 ( 39.88%)
> Hmean    send-8192      8784.22 (  0.00%)    12143.92 ( 38.25%)
> Hmean    send-16384    15081.60 (  0.00%)    19812.71 ( 31.37%)
> Hmean    recv-64          86.19 (  0.00%)      126.59 ( 46.87%)
> Hmean    recv-128        173.93 (  0.00%)      255.21 ( 46.73%)
> Hmean    recv-256        346.19 (  0.00%)      506.72 ( 46.37%)
> Hmean    recv-1024      1358.28 (  0.00%)     1980.03 ( 45.77%)
> Hmean    recv-2048      2623.45 (  0.00%)     3729.35 ( 42.15%)
> Hmean    recv-3312      4108.63 (  0.00%)     5831.47 ( 41.93%)
> Hmean    recv-4096      5037.25 (  0.00%)     7021.59 ( 39.39%)
> Hmean    recv-8192      8762.32 (  0.00%)    12072.44 ( 37.78%)
> Hmean    recv-16384    15042.36 (  0.00%)    19690.14 ( 30.90%)
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Mon, Nov 28, 2016 at 12:00:41PM +0100, Vlastimil Babka wrote:
> > 1-socket 6 year old machine
> >                        4.9.0-rc5             4.9.0-rc5
> >                          vanilla             hopcpu-v3
> > Hmean    send-64          87.47 (  0.00%)      127.14 ( 45.36%)
> > Hmean    send-128        174.36 (  0.00%)      256.42 ( 47.06%)
> > Hmean    send-256        347.52 (  0.00%)      509.41 ( 46.59%)
> > Hmean    send-1024      1363.03 (  0.00%)     1991.54 ( 46.11%)
> > Hmean    send-2048      2632.68 (  0.00%)     3759.51 ( 42.80%)
> > Hmean    send-3312      4123.19 (  0.00%)     5873.28 ( 42.45%)
> > Hmean    send-4096      5056.48 (  0.00%)     7072.81 ( 39.88%)
> > Hmean    send-8192      8784.22 (  0.00%)    12143.92 ( 38.25%)
> > Hmean    send-16384    15081.60 (  0.00%)    19812.71 ( 31.37%)
> > Hmean    recv-64          86.19 (  0.00%)      126.59 ( 46.87%)
> > Hmean    recv-128        173.93 (  0.00%)      255.21 ( 46.73%)
> > Hmean    recv-256        346.19 (  0.00%)      506.72 ( 46.37%)
> > Hmean    recv-1024      1358.28 (  0.00%)     1980.03 ( 45.77%)
> > Hmean    recv-2048      2623.45 (  0.00%)     3729.35 ( 42.15%)
> > Hmean    recv-3312      4108.63 (  0.00%)     5831.47 ( 41.93%)
> > Hmean    recv-4096      5037.25 (  0.00%)     7021.59 ( 39.39%)
> > Hmean    recv-8192      8762.32 (  0.00%)    12072.44 ( 37.78%)
> > Hmean    recv-16384    15042.36 (  0.00%)    19690.14 ( 30.90%)
>
> That looks way much better than the "v1" RFC posting. Was it just because
> you stopped doing the "at first iteration, use migratetype as index", and
> initializing pindex to UINT_MAX hits so much quicker, or was there
> something more subtle that I missed? There was no changelog between "v1"
> and "v2".
>

FYI, the LKP test robot reported the following, so there is some independent basis for picking this up.

---8<---
FYI, we noticed a +23.0% improvement of netperf.Throughput_Mbps due to commit:

commit 79404c5a5c66481aa55c0cae685e49e0f44a0479 ("mm: page_alloc: High-order per-cpu page allocator")
https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-pagealloc-highorder-percpu-v3r1

-- 
Mel Gorman
SUSE Labs
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On 11/28/2016 07:54 PM, Christoph Lameter wrote:
> On Mon, 28 Nov 2016, Mel Gorman wrote:
>
>> If you have a series aimed at parts of the fragmentation problem or how
>> subsystems can avoid tracking 4K pages in some important cases then by
>> all means post them.
>
> I designed SLUB with defrag methods in mind. We could warm up some old
> patchsets that were never merged:
>
> https://lkml.org/lkml/2010/1/29/332

Note that some other solutions to the dentry cache problem (perhaps of a more low-hanging-fruit kind) were also discussed at the KS/LPC MM panel session: https://lwn.net/Articles/705758/
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Sun, Nov 27, 2016 at 01:19:54PM +, Mel Gorman wrote:
> While it is recognised that this is a mixed bag of results, the patch
> helps a lot more workloads than it hurts and intuitively, avoiding the
> zone->lock in some cases is a good thing.
>
> Signed-off-by: Mel Gorman

This seems like a net gain to me, and the patch looks good too.

Acked-by: Johannes Weiner

> @@ -255,6 +255,24 @@ enum zone_watermarks {
> 	NR_WMARK
> };
> 
> +/*
> + * One per migratetype for order-0 pages and one per high-order up to
> + * and including PAGE_ALLOC_COSTLY_ORDER. This may allow unmovable
> + * allocations to contaminate reclaimable pageblocks if high-order
> + * pages are heavily used.

I think that should be fine. Higher-order allocations rely on being able to compact movable blocks, not on reclaim freeing contiguous blocks, so poisoning reclaimable blocks is much less of a concern than poisoning movable blocks. And I'm not aware of any 0 < order < COSTLY movable allocations that would put movable blocks into an HO cache.
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Mon, 28 Nov 2016, Mel Gorman wrote:
> If you have a series aimed at parts of the fragmentation problem or how
> subsystems can avoid tracking 4K pages in some important cases then by
> all means post them.

I designed SLUB with defrag methods in mind. We could warm up some old patchsets that were never merged:

https://lkml.org/lkml/2010/1/29/332
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Mon, Nov 28, 2016 at 10:38:58AM -0600, Christoph Lameter wrote:
> > > that only insiders know how to tune and an overall fragile solution.
> > While I agree with all of this, it's also a problem independent of this
> > patch.
>
> It is related. The fundamental issue with fragmentation remains and IMHO we
> really need to tackle this.
>

Fragmentation is one issue. Allocation scalability is a separate issue. This patch is about scaling parallel allocations of small contiguous ranges. Even if there were fragmentation-related patches up for discussion, they would not be directly affected by this patch.

If you have a series aimed at parts of the fragmentation problem or how subsystems can avoid tracking 4K pages in some important cases then by all means post them.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Mon, 28 Nov 2016, Mel Gorman wrote:
> Yes, that's a problem for SLUB with or without this patch. It's always
> been the case that SLUB relying on high-order pages for performance is
> problematic.

This is a general issue in the kernel. Performance often requires larger contiguous ranges of memory.

> > that only insiders know how to tune and an overall fragile solution.
> While I agree with all of this, it's also a problem independent of this
> patch.

It is related. The fundamental issue with fragmentation remains, and IMHO we really need to tackle this.
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Mon, Nov 28, 2016 at 09:39:19AM -0600, Christoph Lameter wrote:
> On Sun, 27 Nov 2016, Mel Gorman wrote:
> >
> > SLUB has been the default small kernel object allocator for quite some time
> > but it is not universally used due to performance concerns and a reliance
> > on high-order pages. The high-order concerns have two major components --
> > high-order pages are not always available and high-order page allocations
> > potentially contend on the zone->lock. This patch addresses some concerns
> > about the zone lock contention by extending the per-cpu page allocator to
> > cache high-order pages. The patch makes the following modifications
>
> Note that SLUB will only use high order pages when available and fall back
> to order 0 if memory is fragmented. This means that the effect of this
> patch is going to gradually vanish as memory becomes more and more
> fragmented.
>

Yes, that's a problem for SLUB with or without this patch. It's always been the case that SLUB relying on high-order pages for performance is problematic.

> I think this patch is beneficial but we need to address long term the
> issue of memory fragmentation. That is not only a SLUB issue but an
> overall problem since we keep on having to maintain lists of 4k memory
> blocks in various subsystems. And as memory increases these lists are
> becoming larger and larger and more difficult to manage. Code complexity
> increases and fragility too (look at transparent hugepages). Ultimately we
> will need a clean way to manage the allocation and freeing of large
> physically contiguous pages. Reserving memory at boot (CMA, giant
> pages) is some sort of solution but this all devolves into lots of knobs
> that only insiders know how to tune and an overall fragile solution.
>

While I agree with all of this, it's also a problem independent of this patch.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Sun, 27 Nov 2016, Mel Gorman wrote:
>
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns have two major components --
> high-order pages are not always available and high-order page allocations
> potentially contend on the zone->lock. This patch addresses some concerns
> about the zone lock contention by extending the per-cpu page allocator to
> cache high-order pages. The patch makes the following modifications

Note that SLUB will only use high-order pages when available and fall back to order 0 if memory is fragmented. This means that the effect of this patch is going to gradually vanish as memory becomes more and more fragmented.

I think this patch is beneficial, but we need to address the issue of memory fragmentation long term. That is not only a SLUB issue but an overall problem, since we keep on having to maintain lists of 4k memory blocks in various subsystems. And as memory increases these lists become larger and larger and more difficult to manage. Code complexity increases, and fragility too (look at transparent hugepages). Ultimately we will need a clean way to manage the allocation and freeing of large physically contiguous pages. Reserving memory at boot (CMA, giant pages) is some sort of solution, but this all devolves into lots of knobs that only insiders know how to tune and an overall fragile solution.
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On Mon, Nov 28, 2016 at 12:00:41PM +0100, Vlastimil Babka wrote:
> On 11/27/2016 02:19 PM, Mel Gorman wrote:
> >
> > 2-socket modern machine
> >                        4.9.0-rc5             4.9.0-rc5
> >                          vanilla             hopcpu-v3
> > Hmean    send-64         178.38 (  0.00%)      256.74 ( 43.93%)
> > Hmean    send-128        351.49 (  0.00%)      507.52 ( 44.39%)
> > Hmean    send-256        671.23 (  0.00%)     1004.19 ( 49.60%)
> > Hmean    send-1024      2663.60 (  0.00%)     3910.42 ( 46.81%)
> > Hmean    send-2048      5126.53 (  0.00%)     7562.13 ( 47.51%)
> > Hmean    send-3312      7949.99 (  0.00%)    11565.98 ( 45.48%)
> > Hmean    send-4096      9433.56 (  0.00%)    12929.67 ( 37.06%)
> > Hmean    send-8192     15940.64 (  0.00%)    21587.63 ( 35.43%)
> > Hmean    send-16384    26699.54 (  0.00%)    32013.79 ( 19.90%)
> > Hmean    recv-64         178.38 (  0.00%)      256.72 ( 43.92%)
> > Hmean    recv-128        351.49 (  0.00%)      507.47 ( 44.38%)
> > Hmean    recv-256        671.20 (  0.00%)     1003.95 ( 49.57%)
> > Hmean    recv-1024      2663.45 (  0.00%)     3909.70 ( 46.79%)
> > Hmean    recv-2048      5126.26 (  0.00%)     7560.67 ( 47.49%)
> > Hmean    recv-3312      7949.50 (  0.00%)    11564.63 ( 45.48%)
> > Hmean    recv-4096      9433.04 (  0.00%)    12927.48 ( 37.04%)
> > Hmean    recv-8192     15939.64 (  0.00%)    21584.59 ( 35.41%)
> > Hmean    recv-16384    26698.44 (  0.00%)    32009.77 ( 19.89%)
> >
> > 1-socket 6 year old machine
> >                        4.9.0-rc5             4.9.0-rc5
> >                          vanilla             hopcpu-v3
> > Hmean    send-64          87.47 (  0.00%)      127.14 ( 45.36%)
> > Hmean    send-128        174.36 (  0.00%)      256.42 ( 47.06%)
> > Hmean    send-256        347.52 (  0.00%)      509.41 ( 46.59%)
> > Hmean    send-1024      1363.03 (  0.00%)     1991.54 ( 46.11%)
> > Hmean    send-2048      2632.68 (  0.00%)     3759.51 ( 42.80%)
> > Hmean    send-3312      4123.19 (  0.00%)     5873.28 ( 42.45%)
> > Hmean    send-4096      5056.48 (  0.00%)     7072.81 ( 39.88%)
> > Hmean    send-8192      8784.22 (  0.00%)    12143.92 ( 38.25%)
> > Hmean    send-16384    15081.60 (  0.00%)    19812.71 ( 31.37%)
> > Hmean    recv-64          86.19 (  0.00%)      126.59 ( 46.87%)
> > Hmean    recv-128        173.93 (  0.00%)      255.21 ( 46.73%)
> > Hmean    recv-256        346.19 (  0.00%)      506.72 ( 46.37%)
> > Hmean    recv-1024      1358.28 (  0.00%)     1980.03 ( 45.77%)
> > Hmean    recv-2048      2623.45 (  0.00%)     3729.35 ( 42.15%)
> > Hmean    recv-3312      4108.63 (  0.00%)     5831.47 ( 41.93%)
> > Hmean    recv-4096      5037.25 (  0.00%)     7021.59 ( 39.39%)
> > Hmean    recv-8192      8762.32 (  0.00%)    12072.44 ( 37.78%)
> > Hmean    recv-16384    15042.36 (  0.00%)    19690.14 ( 30.90%)
>
> That looks way much better than the "v1" RFC posting. Was it just because
> you stopped doing the "at first iteration, use migratetype as index", and
> initializing pindex to UINT_MAX hits so much quicker, or was there
> something more subtle that I missed? There was no changelog between "v1"
> and "v2".
>

The array is sized correctly, which avoids one useless check. The order-0 lists are always drained first, so in some rare cases only the fast paths are used. There was also a subtle correction in detecting when all of one list should be drained. In combination, this happened to boost performance a lot on the two machines I reported on. While 6 other machines were tested, not all of them saw such a dramatic boost, and if these machines are rebooted and retested every time, the high performance is not always consistent; it all depends on how often the fast paths are used.

> > Signed-off-by: Mel Gorman
>
> Acked-by: Vlastimil Babka
>

Thanks.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
On 11/27/2016 02:19 PM, Mel Gorman wrote:
> 2-socket modern machine
>                        4.9.0-rc5             4.9.0-rc5
>                          vanilla             hopcpu-v3
> Hmean    send-64         178.38 (  0.00%)      256.74 ( 43.93%)
> Hmean    send-128        351.49 (  0.00%)      507.52 ( 44.39%)
> Hmean    send-256        671.23 (  0.00%)     1004.19 ( 49.60%)
> Hmean    send-1024      2663.60 (  0.00%)     3910.42 ( 46.81%)
> Hmean    send-2048      5126.53 (  0.00%)     7562.13 ( 47.51%)
> Hmean    send-3312      7949.99 (  0.00%)    11565.98 ( 45.48%)
> Hmean    send-4096      9433.56 (  0.00%)    12929.67 ( 37.06%)
> Hmean    send-8192     15940.64 (  0.00%)    21587.63 ( 35.43%)
> Hmean    send-16384    26699.54 (  0.00%)    32013.79 ( 19.90%)
> Hmean    recv-64         178.38 (  0.00%)      256.72 ( 43.92%)
> Hmean    recv-128        351.49 (  0.00%)      507.47 ( 44.38%)
> Hmean    recv-256        671.20 (  0.00%)     1003.95 ( 49.57%)
> Hmean    recv-1024      2663.45 (  0.00%)     3909.70 ( 46.79%)
> Hmean    recv-2048      5126.26 (  0.00%)     7560.67 ( 47.49%)
> Hmean    recv-3312      7949.50 (  0.00%)    11564.63 ( 45.48%)
> Hmean    recv-4096      9433.04 (  0.00%)    12927.48 ( 37.04%)
> Hmean    recv-8192     15939.64 (  0.00%)    21584.59 ( 35.41%)
> Hmean    recv-16384    26698.44 (  0.00%)    32009.77 ( 19.89%)
>
> 1-socket 6 year old machine
>                        4.9.0-rc5             4.9.0-rc5
>                          vanilla             hopcpu-v3
> Hmean    send-64          87.47 (  0.00%)      127.14 ( 45.36%)
> Hmean    send-128        174.36 (  0.00%)      256.42 ( 47.06%)
> Hmean    send-256        347.52 (  0.00%)      509.41 ( 46.59%)
> Hmean    send-1024      1363.03 (  0.00%)     1991.54 ( 46.11%)
> Hmean    send-2048      2632.68 (  0.00%)     3759.51 ( 42.80%)
> Hmean    send-3312      4123.19 (  0.00%)     5873.28 ( 42.45%)
> Hmean    send-4096      5056.48 (  0.00%)     7072.81 ( 39.88%)
> Hmean    send-8192      8784.22 (  0.00%)    12143.92 ( 38.25%)
> Hmean    send-16384    15081.60 (  0.00%)    19812.71 ( 31.37%)
> Hmean    recv-64          86.19 (  0.00%)      126.59 ( 46.87%)
> Hmean    recv-128        173.93 (  0.00%)      255.21 ( 46.73%)
> Hmean    recv-256        346.19 (  0.00%)      506.72 ( 46.37%)
> Hmean    recv-1024      1358.28 (  0.00%)     1980.03 ( 45.77%)
> Hmean    recv-2048      2623.45 (  0.00%)     3729.35 ( 42.15%)
> Hmean    recv-3312      4108.63 (  0.00%)     5831.47 ( 41.93%)
> Hmean    recv-4096      5037.25 (  0.00%)     7021.59 ( 39.39%)
> Hmean    recv-8192      8762.32 (  0.00%)    12072.44 ( 37.78%)
> Hmean    recv-16384    15042.36 (  0.00%)    19690.14 ( 30.90%)

That looks way much better than the "v1" RFC posting. Was it just because you stopped doing the "at first iteration, use migratetype as index", and initializing pindex to UINT_MAX hits so much quicker, or was there something more subtle that I missed? There was no changelog between "v1" and "v2".

> Signed-off-by: Mel Gorman

Acked-by: Vlastimil Babka
[PATCH] mm: page_alloc: High-order per-cpu page allocator v3
Changelog since v2
o Correct initialisation to avoid -Woverflow warning

SLUB has been the default small kernel object allocator for quite some time but it is not universally used due to performance concerns and a reliance on high-order pages. The high-order concerns have two major components -- high-order pages are not always available and high-order page allocations potentially contend on the zone->lock. This patch addresses some concerns about the zone lock contention by extending the per-cpu page allocator to cache high-order pages. The patch makes the following modifications:

o New per-cpu lists are added to cache the high-order pages. This increases
  the cache footprint of the per-cpu allocator and overall usage but for
  some workloads, this will be offset by reduced contention on zone->lock.
  The first MIGRATE_PCPTYPES entries in the list are per-migratetype. The
  remaining entries are high-order caches up to and including
  PAGE_ALLOC_COSTLY_ORDER.

o pcp accounting during free is now confined to free_pcppages_bulk as it's
  impossible for the caller to know exactly how many pages were freed. Due
  to the high-order caches, the number of pages drained for a request is
  no longer precise.

o The high watermark for per-cpu pages is increased to reduce the
  probability that a single refill causes a drain on the next free.

The benefit depends on both the workload and the machine, as ultimately the determining factor is whether cache line bounces on zone->lock or contention is a problem. The patch was tested on a variety of workloads and machines, some of which are reported here.

This is the result from netperf running UDP_STREAM on localhost. It was selected on the basis that it is slab-intensive and has been the subject of previous SLAB vs SLUB comparisons, with the caveat that this is not testing between two physical hosts.
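The list layout described in the first bullet can be sketched as a small userspace model. This is illustrative only: the constants mirror common kernel values and the order_to_pindex() reimplementation below is an assumption based on the description above, not the patch's exact code.

```c
#include <assert.h>

/* Assumed values modelled on the description: the first MIGRATE_PCPTYPES
 * lists are per-migratetype order-0 caches, followed by one list per
 * order from 1 up to and including PAGE_ALLOC_COSTLY_ORDER. */
#define MIGRATE_PCPTYPES        3
#define PAGE_ALLOC_COSTLY_ORDER 3
#define NR_PCP_LISTS (MIGRATE_PCPTYPES + PAGE_ALLOC_COSTLY_ORDER)

/* Hypothetical mapping from (migratetype, order) to a pcp list index:
 * order-0 requests index by migratetype, while each higher order shares
 * a single list regardless of migratetype. */
static unsigned int order_to_pindex(int migratetype, unsigned int order)
{
	if (order == 0)
		return (unsigned int)migratetype;
	return MIGRATE_PCPTYPES + order - 1;
}
```

This also makes the comment in the patch about contamination concrete: because all migratetypes share one list per high order, an unmovable allocation can be satisfied from a block that originally came from a reclaimable pageblock.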
2-socket modern machine
                       4.9.0-rc5             4.9.0-rc5
                         vanilla             hopcpu-v3
Hmean    send-64         178.38 (  0.00%)      256.74 ( 43.93%)
Hmean    send-128        351.49 (  0.00%)      507.52 ( 44.39%)
Hmean    send-256        671.23 (  0.00%)     1004.19 ( 49.60%)
Hmean    send-1024      2663.60 (  0.00%)     3910.42 ( 46.81%)
Hmean    send-2048      5126.53 (  0.00%)     7562.13 ( 47.51%)
Hmean    send-3312      7949.99 (  0.00%)    11565.98 ( 45.48%)
Hmean    send-4096      9433.56 (  0.00%)    12929.67 ( 37.06%)
Hmean    send-8192     15940.64 (  0.00%)    21587.63 ( 35.43%)
Hmean    send-16384    26699.54 (  0.00%)    32013.79 ( 19.90%)
Hmean    recv-64         178.38 (  0.00%)      256.72 ( 43.92%)
Hmean    recv-128        351.49 (  0.00%)      507.47 ( 44.38%)
Hmean    recv-256        671.20 (  0.00%)     1003.95 ( 49.57%)
Hmean    recv-1024      2663.45 (  0.00%)     3909.70 ( 46.79%)
Hmean    recv-2048      5126.26 (  0.00%)     7560.67 ( 47.49%)
Hmean    recv-3312      7949.50 (  0.00%)    11564.63 ( 45.48%)
Hmean    recv-4096      9433.04 (  0.00%)    12927.48 ( 37.04%)
Hmean    recv-8192     15939.64 (  0.00%)    21584.59 ( 35.41%)
Hmean    recv-16384    26698.44 (  0.00%)    32009.77 ( 19.89%)

1-socket 6 year old machine
                       4.9.0-rc5             4.9.0-rc5
                         vanilla             hopcpu-v3
Hmean    send-64          87.47 (  0.00%)      127.14 ( 45.36%)
Hmean    send-128        174.36 (  0.00%)      256.42 ( 47.06%)
Hmean    send-256        347.52 (  0.00%)      509.41 ( 46.59%)
Hmean    send-1024      1363.03 (  0.00%)     1991.54 ( 46.11%)
Hmean    send-2048      2632.68 (  0.00%)     3759.51 ( 42.80%)
Hmean    send-3312      4123.19 (  0.00%)     5873.28 ( 42.45%)
Hmean    send-4096      5056.48 (  0.00%)     7072.81 ( 39.88%)
Hmean    send-8192      8784.22 (  0.00%)    12143.92 ( 38.25%)
Hmean    send-16384    15081.60 (  0.00%)    19812.71 ( 31.37%)
Hmean    recv-64          86.19 (  0.00%)      126.59 ( 46.87%)
Hmean    recv-128        173.93 (  0.00%)      255.21 ( 46.73%)
Hmean    recv-256        346.19 (  0.00%)      506.72 ( 46.37%)
Hmean    recv-1024      1358.28 (  0.00%)     1980.03 ( 45.77%)
Hmean    recv-2048      2623.45 (  0.00%)     3729.35 ( 42.15%)
Hmean    recv-3312      4108.63 (  0.00%)     5831.47 ( 41.93%)
Hmean    recv-4096      5037.25 (  0.00%)     7021.59 ( 39.39%)
Hmean    recv-8192      8762.32 (  0.00%)    12072.44 ( 37.78%)
Hmean    recv-16384    15042.36 (  0.00%)    19690.14 ( 30.90%)

This is somewhat dramatic but it's also not universal.
For example, it was observed on an older HP machine using pcc-cpufreq that there was almost no difference, but pcc-cpufreq is also a known performance hazard. These are quite different results, but they illustrate that the patch's impact is dependent on the CPU. The results are similar for TCP_STREAM on the two-socket machine.

The observations on sockperf are different.

2-socket modern machine
sockperf-tcp-throughput
                       4.9.0-rc5             4.9.0-rc5