Re: Unstable local network throughput
On 17 August 2016 at 08:43, Ben RUBSON wrote:
>
>> On 17 Aug 2016, at 17:38, Adrian Chadd wrote:
>>
>> [snip]
>>
>> ok, so this is what I was seeing when I was working on this stuff last.
>>
>> The big abusers are:
>>
>> * so_snd lock, for TX'ing producer/consumer socket data
>> * tcp stack pcb locking (which rss tries to work around, but it again
>> doesn't help producer/consumer locking, only multiple sockets)
>> * for some of the workloads, the scheduler spinlocks are pretty
>> heavily contended and that's likely worth digging into.
>>
>> Thanks! I'll go try this on a couple of boxes I have with
>> intel/chelsio 40g hardware in it and see if I can reproduce it. (My
>> test boxes have the 40g NICs in NUMA domain 1...)
>
> You're welcome, happy to help and troubleshoot :)
>
> What about the performance which differs from one reboot to another,
> as if the NUMA domains have switched ? (0 to 1 & 1 to 0)
> Did you already see this ?

I've seen some varying behaviours, yeah. There are a lot of missing
pieces in kernel-side NUMA, so a lot of the kernel memory allocation
behaviours are undefined. Well, they're defined; it's just that there's
no way right now for the kernel (eg mbufs, etc) to allocate
domain-local memory. So it's "by accident", and sometimes it's fine;
sometimes it's not.

-adrian
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
Re: Unstable local network throughput
> On 17 Aug 2016, at 17:38, Adrian Chadd wrote:
>
> [snip]
>
> ok, so this is what I was seeing when I was working on this stuff last.
>
> The big abusers are:
>
> * so_snd lock, for TX'ing producer/consumer socket data
> * tcp stack pcb locking (which rss tries to work around, but it again
> doesn't help producer/consumer locking, only multiple sockets)
> * for some of the workloads, the scheduler spinlocks are pretty
> heavily contended and that's likely worth digging into.
>
> Thanks! I'll go try this on a couple of boxes I have with
> intel/chelsio 40g hardware in it and see if I can reproduce it. (My
> test boxes have the 40g NICs in NUMA domain 1...)

You're welcome, happy to help and troubleshoot :)

What about the performance which differs from one reboot to another,
as if the NUMA domains have switched ? (0 to 1 & 1 to 0)
Did you already see this ?

Ben
Re: Unstable local network throughput
[snip]

ok, so this is what I was seeing when I was working on this stuff last.

The big abusers are:

* so_snd lock, for TX'ing producer/consumer socket data
* tcp stack pcb locking (which rss tries to work around, but it again
doesn't help producer/consumer locking, only multiple sockets)
* for some of the workloads, the scheduler spinlocks are pretty
heavily contended and that's likely worth digging into.

Thanks! I'll go try this on a couple of boxes I have with
intel/chelsio 40g hardware in it and see if I can reproduce it. (My
test boxes have the 40g NICs in NUMA domain 1...)

-adrian
Re: Unstable local network throughput
> On 15 Aug 2016, at 16:49, Ben RUBSON wrote:
>
>> On 12 Aug 2016, at 00:52, Adrian Chadd wrote:
>>
>> Which ones of these hit the line rate comfortably?
>
> So Adrian, I ran tests again using FreeBSD 11-RC1.
> I put iperf throughput in result files (so that we can classify them),
> as well as top -P ALL and pcm-memory.x.
> iperf results : columns 3&4 are for srv1->srv2, columns 5&6 are for
> srv2->srv1 (both flows running at the same time).
>
> Results, expected throughput (best first) :
> 11, 01, 05, 07, 06
>
> Results, bad (best first) :
> 04, 02, 09, 03
>
> Results, worst (best first) :
> 10, 08
>
> 00) Idle system
> http://pastebin.com/raw/K1iMVHVF

And strangely enough, from one server reboot to another, results are
not the same. They can be excellent, as 01), and they can be
dramatically bad, as 01b) :

> 01) No pinning
> http://pastebin.com/raw/7J3HibX0

01b) http://pastebin.com/raw/HbSPjigZ (-36GB/s)

I kept this "bad boot" state and performed the other tests (with
lock_profiling stats for 10 seconds) :

> 02) numactl -l fixed-domain-rr -m 0 -c 0
> http://pastebin.com/raw/Yt7yYr0K

02b) http://pastebin.com/raw/n7aZF7ad (+16GB/s)

> 03) numactl -l fixed-domain-rr -m 0 -c 0
> + cpuset -l <0-11> -x
> http://pastebin.com/raw/1FAgDUSU

03b) http://pastebin.com/raw/QHbauimp (+24GB/s)

> 04) numactl -l fixed-domain-rr -m 0 -c 0
> + cpuset -l <12-23> -x
> http://pastebin.com/raw/fTAxrzBb

04b) http://pastebin.com/raw/7gJFZdqB (+10GB/s)

> 05) numactl -l fixed-domain-rr -m 1 -c 1
> http://pastebin.com/raw/kuAHzKu2

05b) http://pastebin.com/raw/TwhHGKNa (-36GB/s)

> 06) numactl -l fixed-domain-rr -m 1 -c 1
> + cpuset -l <0-11> -x
> http://pastebin.com/raw/tgtaZgwb

06b) http://pastebin.com/raw/zSZ7r09Y (-36GB/s)

> 07) numactl -l fixed-domain-rr -m 1 -c 1
> + cpuset -l <12-23> -x
> http://pastebin.com/raw/16ReuGFF

07b) http://pastebin.com/raw/qCsaGBVn (-36GB/s)

These results are very strange, as if NUMA domains were "inverted"...
dmesg : http://pastebin.com/raw/i5USqLix

If I'm lucky enough, after several reboots, I can produce the same
performance results as in test 01).
dmesg : http://pastebin.com/raw/VvfQv6TM
01c) http://pastebin.com/raw/BVxgSyBN

> 08) No pinning, default kernel (no NUMA option)
> http://pastebin.com/raw/Ah74fKRx
>
> 09) default kernel (no NUMA option)
> + cpuset -l <0-11>
> + cpuset -l <0-11> -x
> http://pastebin.com/raw/YE0PxEu8
>
> 10) default kernel (no NUMA option)
> + cpuset -l <12-23>
> + cpuset -l <12-23> -x
> http://pastebin.com/raw/RPh8aM49
>
> 11) No pinning, default kernel (no NUMA option), NUMA BIOS disabled
> http://pastebin.com/raw/LyGcLKDd
Re: Unstable local network throughput
> On 16 Aug 2016, at 21:36, Adrian Chadd wrote:
>
> On 16 August 2016 at 02:58, Ben RUBSON wrote:
>>
>>> On 16 Aug 2016, at 03:45, Adrian Chadd wrote:
>>>
>>> Hi,
>>>
>>> ok, can you try 5) but also running with the interrupt threads
>>> pinned to CPU 1?
>>
>> What do you mean by interrupt threads ?
>>
>> Perhaps you mean the NIC interrupts ?
>> In this case see 6) and 7) where NIC IRQs are pinned to CPUs 0-11 (6)
>> and 11-23 (7) ?
>
> Hm, interesting. ok. So, I wonder what the maximum per-domain memory
> throughput is.

Datasheet says 59GB/s ?
http://ark.intel.com/products/83352/Intel-Xeon-Processor-E5-2620-v3-15M-Cache-2_40-GHz?q=E5-2620%20v3
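[The ~59 GB/s datasheet figure can be sanity-checked from the hardware Ben reported (E5-2620 v3, DDR4-1866); the sketch below assumes all four memory channels of one socket are populated, which is an assumption about this particular box:]

```shell
# Peak per-socket DRAM bandwidth = transfers/s * bus width * channels.
# DDR4-1866 does 1866 mega-transfers/s over a 64-bit (8-byte) bus per
# channel; the E5-2620 v3 memory controller has 4 channels per socket.
mts=1866       # mega-transfers per second (DDR4-1866)
width=8        # bytes per transfer per channel
channels=4     # assumed fully populated
echo "$((mts * width * channels)) MB/s peak per socket"
```

This prints 59712 MB/s, i.e. roughly the 59 GB/s on the ARK page, so the datasheet number is just the theoretical channel peak, not something achievable by a single stream crossing QPI.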
Re: Unstable local network throughput
On 16 August 2016 at 02:58, Ben RUBSON wrote:
>
>> On 16 Aug 2016, at 03:45, Adrian Chadd wrote:
>>
>> Hi,
>>
>> ok, can you try 5) but also running with the interrupt threads pinned
>> to CPU 1?
>
> What do you mean by interrupt threads ?
>
> Perhaps you mean the NIC interrupts ?
> In this case see 6) and 7) where NIC IRQs are pinned to CPUs 0-11 (6)
> and 11-23 (7) ?

Hm, interesting. ok. So, I wonder what the maximum per-domain memory
throughput is.

I don't have any other easy things to instrument right now - the
"everything disabled" method likely works best because of how the
system is interleaving memory for you (instead of the OS trying to do
it). Not pinning things means latency can be kept down to work around
lock contention (ie, if a lock is held by thread A, and thread B needs
to make some progress, it can make progress on another CPU, keeping
CPU A held for a shorter period of time.)

Would you mind compiling in LOCK_PROFILING and doing, say, these tests
with lock profiling enabled? It'll impact performance, sure, but I'd
like to see what the locking looks like.

sysctl debug.lock.prof.reset=1
sysctl debug.lock.prof.enable=1
(run test for a few seconds)
sysctl debug.lock.prof.enable=0
sysctl debug.lock.prof.stats (and capture)

* interrupts - domain 0, work - domain 1
* interrupts - domain 1, work - domain 1
* interrupts - domain 1, work - domain 0

Thanks!

-adrian
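[The capture sequence above can be wrapped in a small function; as a sketch it only prints the commands unless "run" is passed, since actually executing them needs a kernel built with options LOCK_PROFILING, and the 10-second window is an assumption to match the "lock_profiling stats for 10 seconds" runs:]

```shell
# Print (or, with the argument 'run', execute) the lock-profiling
# capture sequence from the message. The real run must happen while
# the iperf test is in flight, on a LOCK_PROFILING kernel.
lockprof_capture() {
    for cmd in \
        "sysctl debug.lock.prof.reset=1" \
        "sysctl debug.lock.prof.enable=1" \
        "sleep 10" \
        "sysctl debug.lock.prof.enable=0" \
        "sysctl debug.lock.prof.stats"
    do
        if [ "$1" = "run" ]; then
            $cmd            # word-split and execute the step
        else
            echo "$cmd"     # dry run: just show the steps
        fi
    done
}
lockprof_capture print
```

Redirect the final `sysctl debug.lock.prof.stats` to a file when running for real, so each of the three pinning layouts gets its own capture.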
Re: Unstable local network throughput
> On 16 Aug 2016, at 03:45, Adrian Chadd wrote:
>
> Hi,
>
> ok, can you try 5) but also running with the interrupt threads pinned
> to CPU 1?

What do you mean by interrupt threads ?

Perhaps you mean the NIC interrupts ?
In this case see 6) and 7) where NIC IRQs are pinned to CPUs 0-11 (6)
and 11-23 (7) ?

Ben
Re: Unstable local network throughput
Hi,

ok, can you try 5) but also running with the interrupt threads pinned
to CPU 1?

It looks like the interrupt threads are running on CPU 0, and my
/guess/ (looking at the CPU usage distributions) is that sometimes the
userland bits run on the same CPU or numa domain as the interrupt
bits, and that likely decreases some latency -> increasing throughput
slightly.

Thanks,

-adrian
Re: Unstable local network throughput
> On 12 Aug 2016, at 00:52, Adrian Chadd wrote:
>
> Which ones of these hit the line rate comfortably?

So Adrian, I ran tests again using FreeBSD 11-RC1.
I put iperf throughput in result files (so that we can classify them),
as well as top -P ALL and pcm-memory.x.
iperf results : columns 3&4 are for srv1->srv2, columns 5&6 are for
srv2->srv1 (both flows running at the same time).

Results, expected throughput (best first) :
11, 01, 05, 07, 06

Results, bad (best first) :
04, 02, 09, 03

Results, worst (best first) :
10, 08

00) Idle system
http://pastebin.com/raw/K1iMVHVF

01) No pinning
http://pastebin.com/raw/7J3HibX0

02) numactl -l fixed-domain-rr -m 0 -c 0
http://pastebin.com/raw/Yt7yYr0K

03) numactl -l fixed-domain-rr -m 0 -c 0
+ cpuset -l <0-11> -x
http://pastebin.com/raw/1FAgDUSU

04) numactl -l fixed-domain-rr -m 0 -c 0
+ cpuset -l <12-23> -x
http://pastebin.com/raw/fTAxrzBb

05) numactl -l fixed-domain-rr -m 1 -c 1
http://pastebin.com/raw/kuAHzKu2

06) numactl -l fixed-domain-rr -m 1 -c 1
+ cpuset -l <0-11> -x
http://pastebin.com/raw/tgtaZgwb

07) numactl -l fixed-domain-rr -m 1 -c 1
+ cpuset -l <12-23> -x
http://pastebin.com/raw/16ReuGFF

08) No pinning, default kernel (no NUMA option)
http://pastebin.com/raw/Ah74fKRx

09) default kernel (no NUMA option)
+ cpuset -l <0-11>
+ cpuset -l <0-11> -x
http://pastebin.com/raw/YE0PxEu8

10) default kernel (no NUMA option)
+ cpuset -l <12-23>
+ cpuset -l <12-23> -x
http://pastebin.com/raw/RPh8aM49

11) No pinning, default kernel (no NUMA option), NUMA BIOS disabled
http://pastebin.com/raw/LyGcLKDd

Ben
Re: Unstable local network throughput
Which ones of these hit the line rate comfortably?

-a

On 11 August 2016 at 15:35, Ben RUBSON wrote:
>
>> On 11 Aug 2016, at 18:36, Adrian Chadd wrote:
>>
>> Hi!
>>
>> mlx4_core0: mem
>> 0xfbe0-0xfbef,0xfb00-0xfb7f irq 64 at device 0.0
>> numa-domain 1 on pci16
>> mlx4_core: Initializing mlx4_core: Mellanox ConnectX VPI driver v2.1.6
>> (Aug 11 2016)
>>
>> so the NIC is in numa-domain 1. Try pinning the worker threads to
>> numa-domain 1 when you run the test:
>>
>> numactl -l first-touch-rr -m 1 -c 1 ./test-program
>>
>> You can also try pinning the NIC threads to numa-domain 1 versus 0 (so
>> the second set of CPUs, not the first set.)
>>
>> vmstat -ia | grep mlx (get the list of interrupt thread ids)
>> then for each:
>>
>> cpuset -d 1 -x
>>
>> Run pcm-memory.x each time so we can see the before and after effects
>> on local versus remote memory access.
>>
>> Thanks!
>
> Adrian, here are the results :
>
> Idle system :
> http://pastebin.com/raw/K1iMVHVF
>
> No pinning :
> http://pastebin.com/raw/w5KuexQ3
> CPU : http://pastebin.com/raw/8zgRaazN
>
> numactl -l fixed-domain-rr -m 1 -c 1 :
> http://pastebin.com/raw/VWweYF9H
> CPU : http://pastebin.com/raw/QjaVH32X
>
> numactl -l fixed-domain-rr -m 0 -c 0 :
> http://pastebin.com/raw/71hfGJdw
> CPU : http://pastebin.com/raw/hef058Na
>
> numactl -l fixed-domain-rr -m 1 -c 1
> + cpuset -l -x :
> http://pastebin.com/raw/nEQkgMK2
> CPU : http://pastebin.com/raw/R652KAdJ
>
> numactl -l fixed-domain-rr -m 0 -c 0
> + cpuset -l -x :
> http://pastebin.com/raw/GdYJHyae
> CPU : http://pastebin.com/raw/Ggfx9uF9
>
> No pinning, default kernel (no NUMA option) :
> http://pastebin.com/raw/iQ2u8d8k
> CPU : http://pastebin.com/raw/Xr77KpcM
>
> default kernel (no NUMA option)
> + cpuset -l
> + cpuset -l -x :
> http://pastebin.com/raw/VBWg4SZs
>
> default kernel (no NUMA option)
> + cpuset -l
> + cpuset -l -x :
> http://pastebin.com/raw/SrJLZxuT
>
> No pinning, default kernel (no NUMA option), NUMA BIOS disabled :
> http://pastebin.com/raw/P5LrUASN
>
> I would say :
> - FreeBSD <= 10.3 : disable NUMA in BIOS
> - FreeBSD >= 11 : disable NUMA in BIOS or enable NUMA in kernel.
> But let's wait your analysis :)
>
> Ben
Re: Unstable local network throughput
> On 11 Aug 2016, at 18:36, Adrian Chadd wrote:
>
> Hi!
>
> mlx4_core0: mem
> 0xfbe0-0xfbef,0xfb00-0xfb7f irq 64 at device 0.0
> numa-domain 1 on pci16
> mlx4_core: Initializing mlx4_core: Mellanox ConnectX VPI driver v2.1.6
> (Aug 11 2016)
>
> so the NIC is in numa-domain 1. Try pinning the worker threads to
> numa-domain 1 when you run the test:
>
> numactl -l first-touch-rr -m 1 -c 1 ./test-program
>
> You can also try pinning the NIC threads to numa-domain 1 versus 0 (so
> the second set of CPUs, not the first set.)
>
> vmstat -ia | grep mlx (get the list of interrupt thread ids)
> then for each:
>
> cpuset -d 1 -x
>
> Run pcm-memory.x each time so we can see the before and after effects
> on local versus remote memory access.
>
> Thanks!

Adrian, here are the results :

Idle system :
http://pastebin.com/raw/K1iMVHVF

No pinning :
http://pastebin.com/raw/w5KuexQ3
CPU : http://pastebin.com/raw/8zgRaazN

numactl -l fixed-domain-rr -m 1 -c 1 :
http://pastebin.com/raw/VWweYF9H
CPU : http://pastebin.com/raw/QjaVH32X

numactl -l fixed-domain-rr -m 0 -c 0 :
http://pastebin.com/raw/71hfGJdw
CPU : http://pastebin.com/raw/hef058Na

numactl -l fixed-domain-rr -m 1 -c 1
+ cpuset -l -x :
http://pastebin.com/raw/nEQkgMK2
CPU : http://pastebin.com/raw/R652KAdJ

numactl -l fixed-domain-rr -m 0 -c 0
+ cpuset -l -x :
http://pastebin.com/raw/GdYJHyae
CPU : http://pastebin.com/raw/Ggfx9uF9

No pinning, default kernel (no NUMA option) :
http://pastebin.com/raw/iQ2u8d8k
CPU : http://pastebin.com/raw/Xr77KpcM

default kernel (no NUMA option)
+ cpuset -l
+ cpuset -l -x :
http://pastebin.com/raw/VBWg4SZs

default kernel (no NUMA option)
+ cpuset -l
+ cpuset -l -x :
http://pastebin.com/raw/SrJLZxuT

No pinning, default kernel (no NUMA option), NUMA BIOS disabled :
http://pastebin.com/raw/P5LrUASN

I would say :
- FreeBSD <= 10.3 : disable NUMA in BIOS
- FreeBSD >= 11 : disable NUMA in BIOS or enable NUMA in kernel.
But let's wait your analysis :)

Ben
Re: Unstable local network throughput
adrian did mean fixed-domain-rr. :-P sorry!

(Sorry, needed to update my NUMA boxes, things "changed" since I wrote
this.)

-a
Re: Unstable local network throughput
On 08/11/16 12:54 PM, Ben RUBSON wrote:
>
>> On 11 Aug 2016, at 19:51, Ben RUBSON wrote:
>>
>>> On 11 Aug 2016, at 18:36, Adrian Chadd wrote:
>>>
>>> so the NIC is in numa-domain 1. Try pinning the worker threads to
>>> numa-domain 1 when you run the test:
>>>
>>> numactl -l first-touch-rr -m 1 -c 1 ./test-program
>>
>> # numactl -l first-touch-rr -m 1 -c 1 /usr/local/bin/iperf -c
>> 192.168.2.1 -l 128KB -P 16 -i 2 -t 6000
>> Could not parse policy: '128KB'
>>
>> I did not manage to give arguments to command. Any idea ?
>
> I answer to myself, this should do the trick :
> numactl -l first-touch-rr -m 1 -c 1 -- /usr/local/bin/iperf -c
> 192.168.2.1 -l 128KB -P 16 -i 2 -t 6000

This has annoyed me quite a bit, too. Setting the POSIXLY_CORRECT
environment variable would also make it behave "correctly".

> However of course it still gives the error below :
>
>> # numactl -l first-touch-rr -m 1 -c 1 /usr/local/bin/iperf
>>
>> numactl: numa_setaffinity: Invalid argument
>>
>> And sounds like -m is not allowed with first-touch-rr.
>> What should I use ?

Adrian probably meant fixed-domain-rr.
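[Why `--` fixes the "Could not parse policy: '128KB'" error: numactl stops parsing its own options at `--` and hands everything after it, flags included, to the target command. A toy stand-in for that behaviour (not numactl's actual parser, just the same getopt convention):]

```shell
# Toy wrapper mimicking numactl's option handling: consume its own
# arguments up to '--', then run the remainder verbatim, so the wrapped
# command keeps flags like '-l 128KB' for itself.
run_wrapped() {
    while [ $# -gt 0 ] && [ "$1" != "--" ]; do
        echo "wrapper option: $1"
        shift
    done
    [ "${1:-}" = "--" ] && shift   # drop the separator itself
    "$@"                           # execute the wrapped command
}

run_wrapped -m 1 -c 1 -- echo "target saw: -l 128KB"
```

Without the `--`, the wrapper (like numactl) would try to interpret the target's flags as its own, which is exactly the parse error Ben hit.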
Re: Unstable local network throughput
> On 11 Aug 2016, at 18:36, Adrian Chadd wrote:
>
> [...]
>
> vmstat -ia | grep mlx (get the list of interrupt thread ids)
> then for each:
>
> cpuset -d 1 -x
>
> Run pcm-memory.x each time so we can see the before and after effects
> on local versus remote memory access.

Waiting for the correct commands to use, I made some tests with :

cpuset -l 0-11
or
cpuset -l 12-23

and :

c=0
vmstat -ia | grep mlx | sed 's/^irq\(.*\):.*/\1/' | while read i
do
  cpuset -l $c -x $i
  c=$((c+1))
  [ $c -gt 11 ] && c=0
done

or

c=12
vmstat -ia | grep mlx | sed 's/^irq\(.*\):.*/\1/' | while read i
do
  cpuset -l $c -x $i
  c=$((c+1))
  [ $c -gt 23 ] && c=12
done

Results :

No pinning
http://pastebin.com/raw/CrK1CQpm

Pinning workers to 0-11
Pinning NIC IRQ to 0-11
http://pastebin.com/raw/kLEQ6TKL

Pinning workers to 12-23
Pinning NIC IRQ to 12-23
http://pastebin.com/raw/qGxw9KL2

Pinning workers to 12-23
Pinning NIC IRQ to 0-11
http://pastebin.com/raw/tFjii629

Comments :
Strangely, the best iperf throughput results are when there is no
pinning. Whereas before running the kernel with your new options, the
best results were with everything pinned to 0-11.

Feel free to ask me for further testing.

Ben
Re: Unstable local network throughput
> On 11 Aug 2016, at 19:51, Ben RUBSON wrote:
>
>> On 11 Aug 2016, at 18:36, Adrian Chadd wrote:
>>
>> so the NIC is in numa-domain 1. Try pinning the worker threads to
>> numa-domain 1 when you run the test:
>>
>> numactl -l first-touch-rr -m 1 -c 1 ./test-program
>
> # numactl -l first-touch-rr -m 1 -c 1 /usr/local/bin/iperf -c
> 192.168.2.1 -l 128KB -P 16 -i 2 -t 6000
> Could not parse policy: '128KB'
>
> I did not manage to give arguments to command. Any idea ?

I answer to myself, this should do the trick :

numactl -l first-touch-rr -m 1 -c 1 -- /usr/local/bin/iperf -c
192.168.2.1 -l 128KB -P 16 -i 2 -t 6000

However of course it still gives the error below :

> # numactl -l first-touch-rr -m 1 -c 1 /usr/local/bin/iperf
>
> numactl: numa_setaffinity: Invalid argument
>
> And sounds like -m is not allowed with first-touch-rr.
> What should I use ?
>
> Thank you !
Re: Unstable local network throughput
> On 11 Aug 2016, at 18:36, Adrian Chadd wrote:
>
> Hi!

Hi Adrian,

> mlx4_core0: mem
> 0xfbe0-0xfbef,0xfb00-0xfb7f irq 64 at device 0.0
> numa-domain 1 on pci16
> mlx4_core: Initializing mlx4_core: Mellanox ConnectX VPI driver v2.1.6
> (Aug 11 2016)
>
> so the NIC is in numa-domain 1. Try pinning the worker threads to
> numa-domain 1 when you run the test:
>
> numactl -l first-touch-rr -m 1 -c 1 ./test-program

# numactl -l first-touch-rr -m 1 -c 1 /usr/local/bin/iperf -c
192.168.2.1 -l 128KB -P 16 -i 2 -t 6000
Could not parse policy: '128KB'

I did not manage to give arguments to command. Any idea ?

# numactl -l first-touch-rr -m 1 -c 1 /usr/local/bin/iperf

numactl: numa_setaffinity: Invalid argument

And sounds like -m is not allowed with first-touch-rr.
What should I use ?

Thank you !

> You can also try pinning the NIC threads to numa-domain 1 versus 0 (so
> the second set of CPUs, not the first set.)
>
> vmstat -ia | grep mlx (get the list of interrupt thread ids)
> then for each:
>
> cpuset -d 1 -x
>
> Run pcm-memory.x each time so we can see the before and after effects
> on local versus remote memory access.
>
> Thanks!
>
> -adrian
Re: Unstable local network throughput
Hi!

mlx4_core0: mem
0xfbe0-0xfbef,0xfb00-0xfb7f irq 64 at device 0.0
numa-domain 1 on pci16
mlx4_core: Initializing mlx4_core: Mellanox ConnectX VPI driver v2.1.6
(Aug 11 2016)

so the NIC is in numa-domain 1. Try pinning the worker threads to
numa-domain 1 when you run the test:

numactl -l first-touch-rr -m 1 -c 1 ./test-program

You can also try pinning the NIC threads to numa-domain 1 versus 0 (so
the second set of CPUs, not the first set.)

vmstat -ia | grep mlx (get the list of interrupt thread ids)
then for each:

cpuset -d 1 -x

Run pcm-memory.x each time so we can see the before and after effects
on local versus remote memory access.

Thanks!

-adrian
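[The vmstat/cpuset step above can be scripted. The sketch below extracts the IRQ numbers and prints the cpuset commands rather than running them, so it can be checked anywhere; the sample `vmstat -ia | grep mlx` output and IRQ numbers 300/301 are invented for illustration:]

```shell
# Invented stand-in for 'vmstat -ia | grep mlx' output; on a real box,
# replace this function with the actual pipeline.
sample_vmstat() {
cat <<'EOF'
irq300: mlx4_core0                 123456          12
irq301: mlx4_core0                 234567          23
EOF
}

# Pull the IRQ number out of each line and emit the pinning command.
# This echoes the commands for a dry run; drop the 'echo' to actually
# move the interrupt threads to NUMA domain 1.
sample_vmstat | sed -n 's/^irq\([0-9]*\):.*/\1/p' | while read -r irq; do
    echo "cpuset -d 1 -x $irq"
done
```

Re-running pcm-memory.x after each change, as suggested, shows whether the local/remote DRAM access split actually moved with the pinning.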
Re: Unstable local network throughput
> On 11 Aug 2016, at 00:11, Adrian Chadd wrote:
>
> hi,
>
> ok, lets start by getting the NUMA bits into the kernel so you can
> mess with things.
>
> add this to the kernel
>
> options MAXMEMDOM=8
> (which hopefully is enough)
> options VM_NUMA_ALLOC
> options DEVICE_NUMA
>
> Then reboot and post your 'dmesg' output to the list. This should show
> exactly which domain devices are in.

http://pastebin.com/raw/yaYEytME

> Install the 'intel-pcm' package. There's a 'pcm-numa.x' command - do
> kldload cpuctl, then run pcm-numa.x and see if it works. It should
> give us some useful information about NUMA.
> (Same as pcm-memory.x, pcm-pcie.x, etc.)

Yes these tools work :

# pcm-numa.x

Intel(r) Performance Counter Monitor: NUMA monitoring utility
Copyright (c) 2009-2016 Intel Corporation

Number of physical cores: 12
Number of logical cores: 24
Number of online logical cores: 24
Threads (logical cores) per physical core: 2
Num sockets: 2
Physical cores per socket: 6
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 4
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 2.40 GHz
Package thermal spec power: 85 Watt; Package minimum power: 31 Watt;
Package maximum power: 170 Watt;
ERROR: QPI LL monitoring device (0:127:9:2) is missing. The QPI
statistics will be incomplete or missing.
Socket 0: 2 memory controllers detected with total number of 5
channels. 1 QPI ports detected.
ERROR: QPI LL monitoring device (0:255:9:2) is missing. The QPI
statistics will be incomplete or missing.
Socket 1: 2 memory controllers detected with total number of 5
channels. 1 QPI ports detected.
Socket 0 Max QPI link 0 speed: 16.0 GBytes/second (8.0 GT/second)
Socket 1 Max QPI link 0 speed: 16.0 GBytes/second (8.0 GT/second)

Detected Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz "Intel(r)
microarchitecture codename Haswell-EP/EN/EX"
Update every 1.0 seconds
Time elapsed: 1010 ms

Core | IPC  | Instructions | Cycles | Local DRAM accesses | Remote DRAM accesses
   0 | 0.70 |    1158 K    | 1655 K |        577          |        245
   1 | 0.33 |     186 K    |  557 K |        160          |         15
   2 | 0.43 |     317 K    |  745 K |        385          |         31
   3 | 0.36 |     260 K    |  718 K |        232          |         33
   4 | 0.31 |     186 K    |  602 K |        188          |         11
   5 | 0.39 |     314 K    |  806 K |        371          |         43
   6 | 0.36 |     235 K    |  659 K |        257          |         46
   7 | 0.35 |     200 K    |  576 K |        133          |         44
   8 | 0.42 |     423 K    | 1011 K |        226          |         20
   9 | 0.60 |    1309 K    | 2199 K |        379          |        104
  10 | 0.34 |     192 K    |  562 K |        161          |         26
  11 | 0.38 |     257 K    |  684 K |        158          |         44
  12 | 0.35 |     185 K    |  528 K |         39          |        121
  13 | 0.32 |     199 K    |  616 K |         51          |        171
  14 | 0.31 |     184 K    |  594 K |         34          |        130
  15 | 0.35 |     272 K    |  783 K |         47          |        256
  16 | 0.31 |     178 K    |  579 K |         26          |        127
  17 | 0.37 |     272 K    |  729 K |         87          |        204
  18 | 0.52 |     485 K    |  942 K |         35          |        204
  19 | 0.40 |     285 K    |  723 K |         16          |        147
  20 | 0.31 |     195 K    |  620 K |         10          |        134
  21 | 0.33 |     201 K    |  615 K |         30          |        114
  22 | 0.29 |     176 K    |  612 K |         24          |        110
  23 | 0.52 |     896 K    | 1716 K |         86          |        895
-------------------------------------------------------------------------------
   * | 0.43 |    8575 K    |   19 M |       3712          |       3275

> Then next is playing around with interrupt thread / userland cpuset
> and memory affinity. We can look at that next.

Waiting for your instructions !

Ben
Re: Unstable local network throughput
On 10 August 2016 at 12:50, Ben RUBSON wrote:
>
>> On 10 Aug 2016, at 21:47, Adrian Chadd wrote:
>>
>> hi,
>>
>> yeah, I'd like you to do some further testing with NUMA. Are you able
>> to run freebsd-11 or -HEAD on these boxes?
>
> Hi Adrian,
>
> Yes I currently have 11 BETA3 running on them.
> I could also run BETA4.

hi,

ok, lets start by getting the NUMA bits into the kernel so you can
mess with things.

add this to the kernel:

options MAXMEMDOM=8
(which hopefully is enough)
options VM_NUMA_ALLOC
options DEVICE_NUMA

Then reboot and post your 'dmesg' output to the list. This should show
exactly which domain devices are in.

Install the 'intel-pcm' package. There's a 'pcm-numa.x' command - do
kldload cpuctl, then run pcm-numa.x and see if it works. It should
give us some useful information about NUMA.
(Same as pcm-memory.x, pcm-pcie.x, etc.)

Then next is playing around with interrupt thread / userland cpuset
and memory affinity. We can look at that next.

Currently the kernel doesn't know about NUMA-local memory for device
driver memory, kernel allocations for mbufs, etc, but we should still
get a "good enough" idea about things. We can talk about that here
once the above steps are done.

Thanks!

-adrian
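[The three options above, assembled into a kernel config file; the config name MYKERNEL and building on top of GENERIC are assumptions (any custom config works), the options themselves are the ones from the message. On FreeBSD the build step would then be `make buildkernel KERNCONF=MYKERNEL` from /usr/src:]

```shell
# Write a hypothetical MYKERNEL config (name is an assumption) that
# layers the NUMA options from the message on top of GENERIC.
KERNCONF=MYKERNEL
cat > "$KERNCONF" <<'EOF'
include GENERIC
ident   MYKERNEL

options MAXMEMDOM=8     # max NUMA domains the VM will track
options VM_NUMA_ALLOC   # NUMA-aware kernel memory allocation
options DEVICE_NUMA     # report device NUMA domains in dmesg
EOF
grep -c '^options' "$KERNCONF"
```

On a real box this file belongs in /usr/src/sys/amd64/conf/; after rebooting, the "numa-domain N" annotations in dmesg confirm DEVICE_NUMA took effect.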
Re: Unstable local network throughput
> On 10 Aug 2016, at 21:47, Adrian Chadd wrote:
>
> hi,
>
> yeah, I'd like you to do some further testing with NUMA. Are you able
> to run freebsd-11 or -HEAD on these boxes?

Hi Adrian,

Yes I currently have 11 BETA3 running on them.
I could also run BETA4.

Ben
Re: Unstable local network throughput
hi,

yeah, I'd like you to do some further testing with NUMA. Are you able
to run freebsd-11 or -HEAD on these boxes?

-adrian

On 8 August 2016 at 07:01, Ben RUBSON wrote:
>
>> On 04 Aug 2016, at 11:40, Ben RUBSON wrote:
>>
>>> On 02 Aug 2016, at 22:11, Ben RUBSON wrote:
>>>
>>>> On 02 Aug 2016, at 21:35, Hans Petter Selasky wrote:
>>>>
>>>> The CX-3 driver doesn't bind the worker threads to specific CPU
>>>> cores by default, so if your CPU has more than one so-called numa,
>>>> you'll end up that the bottle-neck is the high-speed link between
>>>> the CPU cores and not the card. A quick and dirty workaround is to
>>>> "cpuset" iperf and the interrupt and taskqueue threads to specific
>>>> CPU cores.
>>>
>>> My CPUs : 2x E5-2620v3 with DDR4@1866.
>>
>> OK, so I cpuset all Mellanox interrupts to one NUMA, as well as the
>> iPerf processes, and I'm able to reach max bandwidth.
>> Choosing the wrong NUMA (or both, or one for interrupts, the other
>> one for iPerf, etc...) totally kills throughput.
>>
>> However, full-duplex throughput is still limited, I can't manage to
>> reach 2x40Gb/s, throttle is at about 45Gb/s.
>> I tried many different cpuset layouts, but I never went above 45Gb/s.
>> (Linux allowed me to reach 2x40Gb/s so hardware is not a bottleneck)
>
> OK, I then found a workaround.
>
> In the motherboards' BIOS, I disabled the following option :
> Advanced / ACPI Settings / NUMA
>
> And I'm now able to go up to 2x40Gb/s !
> I'm then even able to achieve this throughput without any cpuset !
>
> Strange that Linux was able to deal with this setting, but I'm pretty
> sure production performance will be easier to maintain with only 1
> NUMA.
>
> Feel free to ask me if you want further testing with 2 NUMA.
>
> Ben
Re: Unstable local network throughput
> On 04 Aug 2016, at 11:40, Ben RUBSON wrote:
>
>> On 02 Aug 2016, at 22:11, Ben RUBSON wrote:
>>
>>> On 02 Aug 2016, at 21:35, Hans Petter Selasky wrote:
>>>
>>> The CX-3 driver doesn't bind the worker threads to specific CPU
>>> cores by default, so if your CPU has more than one so-called numa,
>>> you'll end up that the bottle-neck is the high-speed link between
>>> the CPU cores and not the card. A quick and dirty workaround is to
>>> "cpuset" iperf and the interrupt and taskqueue threads to specific
>>> CPU cores.
>>
>> My CPUs : 2x E5-2620v3 with DDR4@1866.
>
> OK, so I cpuset all Mellanox interrupts to one NUMA, as well as the
> iPerf processes, and I'm able to reach max bandwidth.
> Choosing the wrong NUMA (or both, or one for interrupts, the other one
> for iPerf, etc...) totally kills throughput.
>
> However, full-duplex throughput is still limited, I can't manage to
> reach 2x40Gb/s, throttle is at about 45Gb/s.
> I tried many different cpuset layouts, but I never went above 45Gb/s.
> (Linux allowed me to reach 2x40Gb/s so hardware is not a bottleneck)

OK, I then found a workaround.

In the motherboards' BIOS, I disabled the following option :
Advanced / ACPI Settings / NUMA

And I'm now able to go up to 2x40Gb/s !
I'm then even able to achieve this throughput without any cpuset !

Strange that Linux was able to deal with this setting, but I'm pretty
sure production performance will be easier to maintain with only 1
NUMA.

Feel free to ask me if you want further testing with 2 NUMA.

Ben
Re: Unstable local network throughput
> On 05 Aug 2016, at 10:30, Hans Petter Selasky wrote: > > On 08/04/16 23:49, Ben RUBSON wrote: >>> >>> On 04 Aug 2016, at 20:15, Ryan Stone wrote: >>> >>> On Thu, Aug 4, 2016 at 11:33 AM, Ben RUBSON wrote: >>> But even without RSS, I should be able to go up to 2x40Gbps, don't you >>> think so ? >>> Nobody already did this ? >>> >>> Try this patch >>> (...) >> >> I also just tested the NODEBUG kernel but it did not help. > > Hi, > > When running these tests, do you see any CPUs fully utilised?

No, CPUs look like this on both servers :

27 processes: 1 running, 26 sleeping
CPU 0:  1.1% user, 0.0% nice, 16.7% system,  0.0% interrupt, 82.2% idle
CPU 1:  1.1% user, 0.0% nice, 18.9% system,  0.0% interrupt, 80.0% idle
CPU 2:  1.9% user, 0.0% nice, 17.8% system,  0.0% interrupt, 80.4% idle
CPU 3:  1.1% user, 0.0% nice, 15.2% system,  0.0% interrupt, 83.7% idle
CPU 4:  0.4% user, 0.0% nice, 16.3% system,  0.0% interrupt, 83.3% idle
CPU 5:  1.1% user, 0.0% nice, 14.4% system,  0.0% interrupt, 84.4% idle
CPU 6:  2.6% user, 0.0% nice, 17.4% system,  0.0% interrupt, 80.0% idle
CPU 7:  2.2% user, 0.0% nice, 15.2% system,  0.0% interrupt, 82.6% idle
CPU 8:  1.1% user, 0.0% nice,  3.0% system, 15.9% interrupt, 80.0% idle
CPU 9:  0.0% user, 0.0% nice,  3.0% system, 32.2% interrupt, 64.8% idle
CPU 10: 0.0% user, 0.0% nice,  0.4% system, 58.9% interrupt, 40.7% idle
CPU 11: 0.0% user, 0.0% nice,  0.4% system, 77.4% interrupt, 22.2% idle
CPU 12: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 13: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 14: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 15: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 16: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 17: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 18: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 19: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 20: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 21: 0.0% user, 0.0% nice,  0.0% system,  0.4% interrupt, 99.6% idle
CPU 22: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 23: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle

Load is correctly spread over the NUMA domain connected to the NIC (the first 12 CPUs). There is clearly enough power to fill the full-duplex link !

I tried many cpuset configurations (IRQs over the 12 CPUs etc...), but no improvement at all.

> Did you check the RX/TX pauseframes settings and the mlx4 sysctl statistics > counters, if there is packet loss?

I tried to disable RX/TX pause frames, but it did not help. And the "sysctl -a | grep mlx | grep err" counters are all 0. I also played with the ring size, adaptive interrupt moderation... with no luck.

Ben
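The error-counter check above ("sysctl -a | grep mlx | grep err") can be scripted to flag only non-zero counters; a sketch, run here against fabricated sample sysctl lines (the OID names are made up for illustration, not actual mlx4 OIDs):

```shell
# Sample sysctl output; these counter names are assumptions.
sample='dev.mlxen.0.stat.rx_err: 0
dev.mlxen.0.stat.tx_crc_err: 0
dev.mlxen.0.stat.rx_packets: 123456'

# Same filter idea as in the mail above, plus dropping zero-valued counters:
nonzero=$(printf '%s\n' "$sample" | grep mlx | grep err | grep -v ': 0$' || true)
echo "non-zero error counters: ${nonzero:-none}"
```

An empty result matches the report above: no error counters incremented, so packet loss at the NIC is unlikely to be the bottleneck.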
Re: Unstable local network throughput
On 08/04/16 23:49, Ben RUBSON wrote: On 04 Aug 2016, at 20:15, Ryan Stone wrote: On Thu, Aug 4, 2016 at 11:33 AM, Ben RUBSON wrote: But even without RSS, I should be able to go up to 2x40Gbps, don't you think so ? Nobody already did this ? Try this patch (...) I also just tested the NODEBUG kernel but it did not help. Hi, When running these tests, do you see any CPUs fully utilized? Did you check the RX/TX pauseframes settings and the mlx4 sysctl statistics counters, if there is packet loss? --HPS
Re: Unstable local network throughput
> > On 04 Aug 2016, at 20:15, Ryan Stone wrote: > > On Thu, Aug 4, 2016 at 11:33 AM, Ben RUBSON wrote: > But even without RSS, I should be able to go up to 2x40Gbps, don't you think > so ? > Nobody already did this ? > > Try this patch > (...) I also just tested the NODEBUG kernel but it did not help.
Re: Unstable local network throughput
> On 04 Aug 2016, at 20:15, Ryan Stone wrote: > > On Thu, Aug 4, 2016 at 11:33 AM, Ben RUBSON wrote: > But even without RSS, I should be able to go up to 2x40Gbps, don't you think > so ? > Nobody already did this ? > > Try this patch, which should improve performance when multiple TCP streams > are running in parallel over an mlx4_en port: > > https://people.freebsd.org/~rstone/patches/mlxen_counters.diff

Thank you very much Ryan. I just tried it, but it does not help :/

Below is the CPU load during bidirectional traffic. We clearly see the 4 CPUs allocated to Mellanox IRQs, the others to iPerf processes. No improvement if IRQs are spread over the 12 NUMA CPUs, just slightly less throughput. Note that I get the same results if I only use 2 CPUs for IRQs.

27 processes: 1 running, 26 sleeping
CPU 0:  1.1% user, 0.0% nice, 16.7% system,  0.0% interrupt, 82.2% idle
CPU 1:  1.1% user, 0.0% nice, 18.9% system,  0.0% interrupt, 80.0% idle
CPU 2:  1.9% user, 0.0% nice, 17.8% system,  0.0% interrupt, 80.4% idle
CPU 3:  1.1% user, 0.0% nice, 15.2% system,  0.0% interrupt, 83.7% idle
CPU 4:  0.4% user, 0.0% nice, 16.3% system,  0.0% interrupt, 83.3% idle
CPU 5:  1.1% user, 0.0% nice, 14.4% system,  0.0% interrupt, 84.4% idle
CPU 6:  2.6% user, 0.0% nice, 17.4% system,  0.0% interrupt, 80.0% idle
CPU 7:  2.2% user, 0.0% nice, 15.2% system,  0.0% interrupt, 82.6% idle
CPU 8:  1.1% user, 0.0% nice,  3.0% system, 15.9% interrupt, 80.0% idle
CPU 9:  0.0% user, 0.0% nice,  3.0% system, 32.2% interrupt, 64.8% idle
CPU 10: 0.0% user, 0.0% nice,  0.4% system, 58.9% interrupt, 40.7% idle
CPU 11: 0.0% user, 0.0% nice,  0.4% system, 77.4% interrupt, 22.2% idle
CPU 12: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 13: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 14: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 15: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 16: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 17: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 18: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 19: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 20: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 21: 0.0% user, 0.0% nice,  0.0% system,  0.4% interrupt, 99.6% idle
CPU 22: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 23: 0.0% user, 0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Re: Unstable local network throughput
On Thu, Aug 4, 2016 at 11:33 AM, Ben RUBSON wrote: > But even without RSS, I should be able to go up to 2x40Gbps, don't you > think so ? > Nobody already did this ? > Try this patch, which should improve performance when multiple TCP streams are running in parallel over an mlx4_en port: https://people.freebsd.org/~rstone/patches/mlxen_counters.diff
Re: Unstable local network throughput
> On 04 Aug 2016, at 17:33, Hans Petter Selasky wrote: > > On 08/04/16 17:24, Ben RUBSON wrote: >> >>> On 04 Aug 2016, at 11:40, Ben RUBSON wrote: >>> On 02 Aug 2016, at 22:11, Ben RUBSON wrote: > On 02 Aug 2016, at 21:35, Hans Petter Selasky wrote: > > The CX-3 driver doesn't bind the worker threads to specific CPU cores by > default, so if your CPU has more than one so-called numa, you'll end up > that the bottle-neck is the high-speed link between the CPU cores and not > the card. A quick and dirty workaround is to "cpuset" iperf and the > interrupt and taskqueue threads to specific CPU cores. My CPUs : 2x E5-2620v3 with DDR4@1866. >>> >>> OK, so I cpuset all Mellanox interrupts to one NUMA, as well as the iPerf >>> processes, and I'm able to reach max bandwidth. >>> Choosing the wrong NUMA (or both, or one for interrupts, the other one for >>> iPerf, etc...) totally kills throughput. >>> >>> However, full-duplex throughput is still limited, I can't manage to reach >>> 2x40Gb/s, throttle is at about 45Gb/s. >>> I tried many different cpuset layouts, but I never went above 45Gb/s. >>> (Linux allowed me to reach 2x40Gb/s so hardware is not a bottleneck) >>> > Are you using "options RSS" and "options PCBGROUP" in your kernel config? >>> >>> I will then give RSS a try. >> >> Without RSS : >> A ---> B : 40Gbps (unidirectional) >> A <--> B : 45Gbps (bidirectional) >> >> With RSS : >> A ---> B : 28Gbps (unidirectional) >> A <--> B : 28Gbps (bidirectional) >> >> Sounds like RSS does not help :/ >> >> Why, without RSS, do I have difficulties to reach 2x40Gbps (full-duplex) ? >> > > Hi, > > Possibly because the packets are arriving at the wrong CPU compared to what > RSS expects. Then RSS will invoke a taskqueue to process the packets on the > correct CPU, if I'm not mistaken. But even without RSS, I should be able to go up to 2x40Gbps, don't you think so ? Nobody already did this ? 
Re: Unstable local network throughput
On 08/04/16 17:24, Ben RUBSON wrote: On 04 Aug 2016, at 11:40, Ben RUBSON wrote: On 02 Aug 2016, at 22:11, Ben RUBSON wrote: On 02 Aug 2016, at 21:35, Hans Petter Selasky wrote: The CX-3 driver doesn't bind the worker threads to specific CPU cores by default, so if your CPU has more than one so-called numa, you'll end up that the bottle-neck is the high-speed link between the CPU cores and not the card. A quick and dirty workaround is to "cpuset" iperf and the interrupt and taskqueue threads to specific CPU cores. My CPUs : 2x E5-2620v3 with DDR4@1866. OK, so I cpuset all Mellanox interrupts to one NUMA, as well as the iPerf processes, and I'm able to reach max bandwidth. Choosing the wrong NUMA (or both, or one for interrupts, the other one for iPerf, etc...) totally kills throughput. However, full-duplex throughput is still limited, I can't manage to reach 2x40Gb/s, throttle is at about 45Gb/s. I tried many different cpuset layouts, but I never went above 45Gb/s. (Linux allowed me to reach 2x40Gb/s so hardware is not a bottleneck) Are you using "options RSS" and "options PCBGROUP" in your kernel config? I will then give RSS a try. Without RSS : A ---> B : 40Gbps (unidirectional) A <--> B : 45Gbps (bidirectional) With RSS : A ---> B : 28Gbps (unidirectional) A <--> B : 28Gbps (bidirectional) Sounds like RSS does not help :/ Why, without RSS, do I have difficulties to reach 2x40Gbps (full-duplex) ? Hi, Possibly because the packets are arriving at the wrong CPU compared to what RSS expects. Then RSS will invoke a taskqueue to process the packets on the correct CPU, if I'm not mistaken. The mlx4 driver does not fully support RSS. The mlx5 driver does. --HPS
Re: Unstable local network throughput
> On 04 Aug 2016, at 11:40, Ben RUBSON wrote: > >> On 02 Aug 2016, at 22:11, Ben RUBSON wrote: >> >>> On 02 Aug 2016, at 21:35, Hans Petter Selasky wrote: >>> >>> The CX-3 driver doesn't bind the worker threads to specific CPU cores by >>> default, so if your CPU has more than one so-called numa, you'll end up >>> that the bottle-neck is the high-speed link between the CPU cores and not >>> the card. A quick and dirty workaround is to "cpuset" iperf and the >>> interrupt and taskqueue threads to specific CPU cores. >> >> My CPUs : 2x E5-2620v3 with DDR4@1866. > > OK, so I cpuset all Mellanox interrupts to one NUMA, as well as the iPerf > processes, and I'm able to reach max bandwidth. > Choosing the wrong NUMA (or both, or one for interrupts, the other one for > iPerf, etc...) totally kills throughput. > > However, full-duplex throughput is still limited, I can't manage to reach > 2x40Gb/s, throttle is at about 45Gb/s. > I tried many different cpuset layouts, but I never went above 45Gb/s. > (Linux allowed me to reach 2x40Gb/s so hardware is not a bottleneck) > >>> Are you using "options RSS" and "options PCBGROUP" in your kernel config? > > I will then give RSS a try. Without RSS : A ---> B : 40Gbps (unidirectional) A <--> B : 45Gbps (bidirectional) With RSS : A ---> B : 28Gbps (unidirectional) A <--> B : 28Gbps (bidirectional) Sounds like RSS does not help :/ Why, without RSS, do I have difficulties to reach 2x40Gbps (full-duplex) ? Thank U !
Re: Unstable local network throughput
> On 02 Aug 2016, at 22:11, Ben RUBSON wrote: > >> On 02 Aug 2016, at 21:35, Hans Petter Selasky wrote: >> >> The CX-3 driver doesn't bind the worker threads to specific CPU cores by >> default, so if your CPU has more than one so-called numa, you'll end up that >> the bottle-neck is the high-speed link between the CPU cores and not the >> card. A quick and dirty workaround is to "cpuset" iperf and the interrupt >> and taskqueue threads to specific CPU cores. > > My CPUs : 2x E5-2620v3 with DDR4@1866. OK, so I cpuset all Mellanox interrupts to one NUMA, as well as the iPerf processes, and I'm able to reach max bandwidth. Choosing the wrong NUMA (or both, or one for interrupts, the other one for iPerf, etc...) totally kills throughput. However, full-duplex throughput is still limited, I can't manage to reach 2x40Gb/s, throttle is at about 45Gb/s. I tried many different cpuset layouts, but I never went above 45Gb/s. (Linux allowed me to reach 2x40Gb/s so hardware is not a bottleneck) >> Are you using "options RSS" and "options PCBGROUP" in your kernel config? I will then give RSS a try. Any other clue perhaps regarding the full-duplex limitation ? Many thanks ! Ben
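The "cpuset all Mellanox interrupts to one NUMA, as well as the iPerf processes" step above can be sketched as follows. The CPU range 0-11 and the awk/grep patterns are assumptions (check kern.sched.topology_spec and "vmstat -i" output on your own box), and <server_ip> is a placeholder:

```shell
# Run iperf pinned to the NIC-local NUMA domain (assumed here to be CPUs 0-11).
cpuset -l 0-11 iperf -c <server_ip> -i 1 -t 60 -P 4

# Bind the mlx4 interrupts to the same CPUs.
# IRQ numbers come from "vmstat -i" lines such as "irq264: mlx4_core0";
# the field-mangling below is an assumption about that format.
for irq in $(vmstat -i | awk '/mlx4/ {sub("irq","",$1); sub(":","",$1); print $1}'); do
    cpuset -l 0-11 -x "$irq"
done
```

cpuset's -x flag binds an interrupt, -l gives the CPU list; the same -l with -t <tid> pins individual kernel threads, which is how the taskqueue threads mentioned above would be handled.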
Re: Unstable local network throughput
> On 03 Aug 2016, at 20:02, Hans Petter Selasky wrote: > > The mlx4 send and receive queues have each their set of taskqueues. Look in > output from "ps auxww". I can't find them, I even unloaded/reloaded the driver in order to catch the differences, but I did not find any relevant process. Here are the processes I have when the driver is loaded (I removed my own process lines) :

# ps auxww
USER  PID  %CPU %MEM  VSZ   RSS  TT STAT STARTED  TIME        COMMAND
root  11   2398.1 0.0 0     384  -  RL   Mon10pm  65969:09.19 [idle]
root  0    0.0  0.0  0     8288 -  DLs  Mon10pm  4:59.31     [kernel]
root  1    0.0  0.0  9492  872  -  ILs  Mon10pm  0:00.04     /sbin/init --
root  2    0.0  0.0  0     96   -  DL   Mon10pm  0:00.82     [cam]
root  3    0.0  0.0  0     176  -  DL   Mon10pm  0:17.88     [zfskern]
root  4    0.0  0.0  0     16   -  DL   Mon10pm  0:00.00     [sctp_iterator]
root  5    0.0  0.0  0     16   -  DL   Mon10pm  0:00.75     [enc_daemon0]
root  6    0.0  0.0  0     16   -  DL   Mon10pm  0:00.50     [enc_daemon1]
root  7    0.0  0.0  0     16   -  DL   Mon10pm  0:00.05     [enc_daemon2]
root  8    0.0  0.0  0     16   -  DL   Mon10pm  0:00.05     [enc_daemon3]
root  9    0.0  0.0  0     16   -  DL   Mon10pm  0:00.00     [g_mirror swap]
root  10   0.0  0.0  0     16   -  DL   Mon10pm  0:00.00     [audit]
root  12   0.0  0.0  0     1408 -  WL   Mon10pm  186:01.05   [intr]
root  13   0.0  0.0  0     48   -  DL   Mon10pm  0:05.24     [geom]
root  14   0.0  0.0  0     16   -  DL   Mon10pm  1:07.19     [rand_harvestq]
root  15   0.0  0.0  0     160  -  DL   Mon10pm  0:08.22     [usb]
root  16   0.0  0.0  0     32   -  DL   Mon10pm  0:00.23     [pagedaemon]
root  17   0.0  0.0  0     16   -  DL   Mon10pm  0:00.00     [vmdaemon]
root  18   0.0  0.0  0     16   -  DL   Mon10pm  0:00.00     [pagezero]
root  19   0.0  0.0  0     16   -  DL   Mon10pm  0:00.12     [bufdaemon]
root  20   0.0  0.0  0     16   -  DL   Mon10pm  0:00.13     [vnlru]
root  21   0.0  0.0  0     16   -  DL   Mon10pm  2:13.02     [syncer]
root  124  0.0  0.0  12360 1736 -  Is   Mon10pm  0:00.00     adjkerntz -i
root  618  0.0  0.0  13628 4868 -  Ss   Mon10pm  0:00.03     /sbin/devd
Re: Unstable local network throughput
On 08/03/16 18:57, Ben RUBSON wrote: taskqueue threads ? The mlx4 send and receive queues have each their set of taskqueues. Look in output from "ps auxww". --HPS
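One likely reason the taskqueue threads are hard to spot in a plain "ps auxww" listing: they are kernel threads, which ps folds into the [kernel] and [intr] processes unless asked to show threads individually. A sketch of commands that list them per-thread (standard FreeBSD ps/procstat flags, but the grep pattern is an assumption about the thread names):

```shell
# -H lists each thread on its own line, making taskqueue/interrupt
# threads inside [kernel] and [intr] visible:
ps auxwwH | grep -i mlx

# Alternative per-thread view:
procstat -ta | grep -i mlx
```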
Re: Unstable local network throughput
> On 02 Aug 2016, at 21:35, Hans Petter Selasky wrote: > > The CX-3 driver doesn't bind the worker threads to specific CPU cores by > default, so if your CPU has more than one so-called numa, you'll end up that > the bottle-neck is the high-speed link between the CPU cores and not the > card. A quick and dirty workaround is to "cpuset" iperf and the interrupt and > taskqueue threads to specific CPU cores. Hans Petter, I'm testing "cpuset" and sometimes get better results, I'm still trying to get the best. I've cpuset iperf & Mellanox interrupts, but what do you mean by taskqueue threads ? Thank U ! Ben
Re: Unstable local network throughput
> On 03 Aug 2016, at 04:32, Eugene Grosbein wrote: > > If you have gateway_enable="YES" (sysctl net.inet.ip.forwarding=1) > then try to disable this forwarding setting and rerun your tests to compare > results. Thank you Eugene for this, but net.inet.ip.forwarding is disabled by default and I did not enable it. # sysctl net.inet.ip.forwarding net.inet.ip.forwarding: 0
Re: Unstable local network throughput
On 03.08.2016 1:43, Ben RUBSON wrote: Hello, I'm trying to reach the 40Gb/s max throughput between 2 hosts running a ConnectX-3 Mellanox network adapter. If you have gateway_enable="YES" (sysctl net.inet.ip.forwarding=1) then try to disable this forwarding setting and rerun your tests to compare results.
Re: Unstable local network throughput
> On 02 Aug 2016, at 21:35, Hans Petter Selasky wrote: > > Hi, Thank you for your answer Hans Petter ! > The CX-3 driver doesn't bind the worker threads to specific CPU cores by > default, so if your CPU has more than one so-called numa, you'll end up that > the bottle-neck is the high-speed link between the CPU cores and not the > card. A quick and dirty workaround is to "cpuset" iperf and the interrupt and > taskqueue threads to specific CPU cores. My CPUs : 2x E5-2620v3 with DDR4@1866. What is strange is that even without using the card (iPerf on localhost), as my results show, I have very low and unstable random throughput (compared to Linux on the same host). > Are you using "options RSS" and "options PCBGROUP" in your kernel config? I only installed FreeBSD 10.3 and updated it, so I use the GENERIC kernel. RSS and PCBGROUP are not defined in /usr/src/sys/amd64/conf/GENERIC, so I think I do not use them. > Are you also testing CX-4 cards from Mellanox? No, I only have CX-3 at my disposal :) Ben PS : in my previous mail I sometimes used GB/s, of course you must read Gb/s everywhere.
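Since RSS and PCBGROUP are build-time options absent from GENERIC, trying them means compiling a custom kernel. A minimal config sketch, following the usual FreeBSD include-GENERIC pattern (MYKERNEL is a placeholder name):

```
# /usr/src/sys/amd64/conf/MYKERNEL
include GENERIC
ident   MYKERNEL
options RSS
options PCBGROUP
```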
Re: Unstable local network throughput
On 08/02/16 20:43, Ben RUBSON wrote: Hello, I'm trying to reach the 40Gb/s max throughput between 2 hosts running a ConnectX-3 Mellanox network adapter. FreeBSD 10.3 just installed, last updates performed. Network adapters running last firmwares / last drivers. No workload at all, just iPerf as the benchmark tool. Hi, The CX-3 driver doesn't bind the worker threads to specific CPU cores by default, so if your CPU has more than one so-called numa, you'll end up that the bottle-neck is the high-speed link between the CPU cores and not the card. A quick and dirty workaround is to "cpuset" iperf and the interrupt and taskqueue threads to specific CPU cores. Are you using "options RSS" and "options PCBGROUP" in your kernel config? Are you also testing CX-4 cards from Mellanox? --HPS
Unstable local network throughput
Hello,

I'm trying to reach the 40Gb/s max throughput between 2 hosts running a ConnectX-3 Mellanox network adapter. FreeBSD 10.3 just installed, latest updates applied. Network adapters running the latest firmware / latest drivers. No workload at all, just iPerf as the benchmark tool.

### Step 1 :

I have never managed to go beyond about 30Gb/s. I did the usual tuning (MTU, kern.ipc.maxsockbuf, net.inet.tcp.sendbuf_max, net.inet.tcp.recvbuf_max...). I played with adapter interrupt moderation. I played with iPerf options (window / buffer size, number of threads...). But it did not help. Results fluctuate, throughput is not sustained, and using 2 or more iPerf threads did not help but degraded the results' "quality".

### Step 2 :

Let's start Linux on these 2 physical hosts. I only had to use jumbo frames in order to achieve the 40Gb/s max throughput... OK, the network between the 2 hosts is not the root cause, and my hardware can run these adapters up to their max throughput. Good point.

### Step 3 :

Go back to FreeBSD on these physical hosts. Let's run this simple command to test FreeBSD itself :

# iperf -c 127.0.0.1 -i 1 -t 60

Strangely enough, the highest results are around 35Gb/s. Even more strange, from one run to another, I do not get identical results : sometimes 17Gb/s, sometimes 20, sometimes 30... Throughput can also suddenly drop, then increase again... Power management in the BIOS is totally disabled, as well as FreeBSD powerd, so the CPU frequency is not throttled. Another strange thing : increasing the number of iPerf threads (-P 2 for example) does not improve the results at all. iPerf3 gave the same random results.

### Step 4 :

Let's start Linux again on these 2 hosts. Let's run the same simple command :

# iperf -c 127.0.0.1 -i 1 -t 60

Result : 45Gb/s. With 2 threads : 90Gb/s. With 4 threads : 180Gb/s. So here we have the expected results, and they stay identical over time.

### Step 5 :

Does FreeBSD suffer when sending or when receiving ? Let's start one host with Linux, the other one with FreeBSD. Results :
Linux --> FreeBSD : around 30Gb/s.
FreeBSD --> Linux : 40Gb/s.
So it sounds like FreeBSD suffers when receiving.

### Step 6 :

FreeBSD 11-BETA3 gave the same random results.

### Questions :

I think my tests show that there is something wrong with FreeBSD (tuning ? something else ?). Do you have the same kind of random results on your hosts ? Could you help me achieve sustained throughput @step3, as we have @step4 (I think this is what we should expect) ? There would then be no reason not to achieve max throughput through the Mellanox adapters themselves.

Thank you very much !

Best regards,

Ben
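The socket-buffer part of the "usual tuning" in Step 1 can be sized from the bandwidth-delay product; a sketch, assuming a 40Gb/s link and a ~100 microsecond local RTT (both values are assumptions to be replaced by measurements):

```shell
# Bandwidth-delay product: bytes in flight needed to keep the pipe full.
rate_bits=40000000000   # 40 Gb/s (assumed link rate)
rtt_us=100              # ~100 us RTT (assumed; measure with ping on your LAN)
bdp_bytes=$(( rate_bits / 8 * rtt_us / 1000000 ))
echo "BDP: ${bdp_bytes} bytes"

# The FreeBSD ceilings named in Step 1 would then be raised to at least
# this value, e.g.:
#   sysctl kern.ipc.maxsockbuf=<value >= bdp_bytes, plus overhead>
#   sysctl net.inet.tcp.sendbuf_max=<...>
#   sysctl net.inet.tcp.recvbuf_max=<...>
```

If a single iPerf stream's window stays below the BDP, throughput caps out regardless of NIC or CPU headroom, which is one reason single-stream results can look "random" near the limit.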