Re: Unstable local network throughput

2016-08-17 Thread Adrian Chadd
On 17 August 2016 at 08:43, Ben RUBSON  wrote:
>
>> On 17 Aug 2016, at 17:38, Adrian Chadd  wrote:
>>
>> [snip]
>>
>> ok, so this is what I was seeing when I was working on this stuff last.
>>
>> The big abusers are:
>>
>> * so_snd lock, for TX'ing producer/consumer socket data
>> * tcp stack pcb locking (which rss tries to work around, but it again
>> doesn't help producer/consumer locking, only multiple sockets)
>> * for some of the workloads, the scheduler spinlocks are pretty
>> heavily contended and that's likely worth digging into.
>>
>> Thanks! I'll go try this on a couple of boxes I have with
>> intel/chelsio 40g hardware in it and see if I can reproduce it. (My
>> test boxes have the 40g NICs in NUMA domain 1...)
>
> You're welcome, happy to help and troubleshoot :)
>
> What about the performance which differs from one reboot to another,
> as if the NUMA domains have switched ? (0 to 1 & 1 to 0)
> Did you already see this ?

I've seen some varying behaviours, yeah. There are a lot of missing
pieces in kernel-side NUMA, so a lot of the kernel memory allocation
behaviours are undefined. Well, they're defined; it's just that there's
no way right now for the kernel (eg mbufs, etc) to allocate domain-local
memory. So it's "by accident", and sometimes it's fine; sometimes it's
not.



-adrian
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-17 Thread Ben RUBSON

> On 17 Aug 2016, at 17:38, Adrian Chadd  wrote:
> 
> [snip]
> 
> ok, so this is what I was seeing when I was working on this stuff last.
> 
> The big abusers are:
> 
> * so_snd lock, for TX'ing producer/consumer socket data
> * tcp stack pcb locking (which rss tries to work around, but it again
> doesn't help producer/consumer locking, only multiple sockets)
> * for some of the workloads, the scheduler spinlocks are pretty
> heavily contended and that's likely worth digging into.
> 
> Thanks! I'll go try this on a couple of boxes I have with
> intel/chelsio 40g hardware in it and see if I can reproduce it. (My
> test boxes have the 40g NICs in NUMA domain 1...)

You're welcome, happy to help and troubleshoot :)

What about the performance which differs from one reboot to another,
as if the NUMA domains have switched ? (0 to 1 & 1 to 0)
Did you already see this ?

Ben

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-17 Thread Adrian Chadd
[snip]

ok, so this is what I was seeing when I was working on this stuff last.

The big abusers are:

* so_snd lock, for TX'ing producer/consumer socket data
* tcp stack pcb locking (which rss tries to work around, but it again
doesn't help producer/consumer locking, only multiple sockets)
* for some of the workloads, the scheduler spinlocks are pretty
heavily contended and that's likely worth digging into.

Thanks! I'll go try this on a couple of boxes I have with
intel/chelsio 40g hardware in it and see if I can reproduce it. (My
test boxes have the 40g NICs in NUMA domain 1...)



-adrian
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-17 Thread Ben RUBSON

> On 15 Aug 2016, at 16:49, Ben RUBSON  wrote:
> 
>> On 12 Aug 2016, at 00:52, Adrian Chadd  wrote:
>> 
>> Which ones of these hit the line rate comfortably?
> 
> So Adrian, I ran tests again using FreeBSD 11-RC1.
> I put iperf throughput in result files (so that we can classify them), as 
> well as top -P ALL and pcm-memory.x.
> iperf results : columns 3&4 are for srv1->srv2, columns 5&6 are for 
> srv2->srv1 (both flows running at the same time).
> 
> 
> 
> Results, expected throughput (best first) :
> 11, 01, 05, 07, 06
> 
> Results, bad (best first) :
> 04, 02, 09, 03
> 
> Results, worst (best first) :
> 10, 08
> 
> 
> 
> 00) Idle system
> http://pastebin.com/raw/K1iMVHVF

And strangely enough, from one server reboot to another, results are not the 
same.
They can be excellent, as in 01), and they can be dramatically bad, as in 01b) :

> 01) No pinning
> http://pastebin.com/raw/7J3HibX0
01b) http://pastebin.com/raw/HbSPjigZ (-36GB/s)

I kept this "bad boot" state and performed the other tests (with lock_profiling 
stats for 10 seconds) :

> 02) numactl -l fixed-domain-rr -m 0 -c 0
> http://pastebin.com/raw/Yt7yYr0K
02b) http://pastebin.com/raw/n7aZF7ad (+16GB/s)

> 03) numactl -l fixed-domain-rr -m 0 -c 0
> + cpuset -l <0-11> -x 
> http://pastebin.com/raw/1FAgDUSU
03b) http://pastebin.com/raw/QHbauimp (+24GB/s)

> 04) numactl -l fixed-domain-rr -m 0 -c 0
> + cpuset -l <12-23> -x 
> http://pastebin.com/raw/fTAxrzBb
04b) http://pastebin.com/raw/7gJFZdqB (+10GB/s)

> 05) numactl -l fixed-domain-rr -m 1 -c 1
> http://pastebin.com/raw/kuAHzKu2
05b) http://pastebin.com/raw/TwhHGKNa (-36GB/s)

> 06) numactl -l fixed-domain-rr -m 1 -c 1
> + cpuset -l <0-11> -x 
> http://pastebin.com/raw/tgtaZgwb
06b) http://pastebin.com/raw/zSZ7r09Y (-36GB/s)

> 07) numactl -l fixed-domain-rr -m 1 -c 1
> + cpuset -l <12-23> -x 
> http://pastebin.com/raw/16ReuGFF
07b) http://pastebin.com/raw/qCsaGBVn (-36GB/s)

These results are very strange, as if NUMA domains were "inverted"...
dmesg : http://pastebin.com/raw/i5USqLix

If I'm lucky enough, after several reboots, I can reproduce the same performance 
results as in test 01).
dmesg : http://pastebin.com/raw/VvfQv6TM
01c) http://pastebin.com/raw/BVxgSyBN

> 08) No pinning, default kernel (no NUMA option)
> http://pastebin.com/raw/Ah74fKRx
> 
> 09) default kernel (no NUMA option)
> + cpuset -l <0-11>
> + cpuset -l <0-11> -x 
> http://pastebin.com/raw/YE0PxEu8
> 
> 10) default kernel (no NUMA option)
> + cpuset -l <12-23>
> + cpuset -l <12-23> -x 
> http://pastebin.com/raw/RPh8aM49
> 
> 
> 
> 11) No pinning, default kernel (no NUMA option), NUMA BIOS disabled
> http://pastebin.com/raw/LyGcLKDd

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-16 Thread Ben RUBSON

> On 16 Aug 2016, at 21:36, Adrian Chadd  wrote:
> 
> On 16 August 2016 at 02:58, Ben RUBSON  wrote:
>> 
>>> On 16 Aug 2016, at 03:45, Adrian Chadd  wrote:
>>> 
>>> Hi,
>>> 
>>> ok, can you try 5) but also running with the interrupt threads pinned to 
>>> CPU 1?
>> 
>> What do you mean by interrupt threads ?
>> 
>> Perhaps you mean the NIC interrupts ?
>> In this case see 6) and 7) where NIC IRQs are pinned to CPUs 0-11 (6) and 
>> 11-23 (7) ?
> 
> Hm, interesting. ok. So, I wonder what the maximum per-domain memory
> throughput is.

Datasheet says 59GB/s ?
http://ark.intel.com/products/83352/Intel-Xeon-Processor-E5-2620-v3-15M-Cache-2_40-GHz?q=E5-2620%20v3
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-16 Thread Adrian Chadd
On 16 August 2016 at 02:58, Ben RUBSON  wrote:
>
>> On 16 Aug 2016, at 03:45, Adrian Chadd  wrote:
>>
>> Hi,
>>
>> ok, can you try 5) but also running with the interrupt threads pinned to CPU 
>> 1?
>
> What do you mean by interrupt threads ?
>
> Perhaps you mean the NIC interrupts ?
> In this case see 6) and 7) where NIC IRQs are pinned to CPUs 0-11 (6) and 
> 11-23 (7) ?

Hm, interesting. ok. So, I wonder what the maximum per-domain memory
throughput is.

I don't have any other easy things to instrument right now - the
"everything disabled" method likely works best because of how the
system is interleaving memory for you (instead of the OS trying to do
it). Not pinning things means latency can be kept down to work around
lock contention (ie, if a lock is held by thread A, and thread B needs
to make some progress, it can make progress on another CPU, keeping
CPU A held for a shorter period of time.)

Would you mind compiling in LOCK_PROFILING and doing, say, these tests
with lock profiling enabled? It'll impact performance, sure, but I'd
like to see what the locking looks like.

sysctl debug.lock.prof.reset=1
sysctl debug.lock.prof.enable=1
(run test for a few seconds)
sysctl debug.lock.prof.enable=0
sysctl debug.lock.prof.stats (and capture)

* interrupts - domain 0, work - domain 1
* interrupts - domain 1, work - domain 1
* interrupts - domain 1, work - domain 0
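
For what it's worth, a rough sketch of how that capture could be scripted on
the test box (it assumes a kernel built with "options LOCK_PROFILING"; the 10
second window and the output file name are just examples):

#!/bin/sh
OUT=lockprof-$(date +%Y%m%d-%H%M%S).txt
sysctl debug.lock.prof.reset=1      # clear previous stats
sysctl debug.lock.prof.enable=1     # start collecting
sleep 10                            # let the iperf test run for ~10 seconds
sysctl debug.lock.prof.enable=0     # stop collecting
sysctl debug.lock.prof.stats > "$OUT"
echo "lock profiling stats written to $OUT"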

Thanks!



-adrian
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-16 Thread Ben RUBSON

> On 16 Aug 2016, at 03:45, Adrian Chadd  wrote:
> 
> Hi,
> 
> ok, can you try 5) but also running with the interrupt threads pinned to CPU 
> 1?

What do you mean by interrupt threads ?

Perhaps you mean the NIC interrupts ?
In this case see 6) and 7) where NIC IRQs are pinned to CPUs 0-11 (6) and 11-23 
(7) ?

Ben

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-15 Thread Adrian Chadd
Hi,

ok, can you try 5) but also running with the interrupt threads pinned to CPU 1?

It looks like the interrupt threads are running on CPU 0, and my
/guess/ (looking at the CPU usage distributions) is that sometimes the
userland bits run on the same CPU or numa domain as the interrupt
bits, which likely decreases some latency -> increasing throughput
slightly.

Thanks,



-adrian
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-15 Thread Ben RUBSON

> On 12 Aug 2016, at 00:52, Adrian Chadd  wrote:
> 
> Which ones of these hit the line rate comfortably?

So Adrian, I ran tests again using FreeBSD 11-RC1.
I put iperf throughput in result files (so that we can classify them), as well 
as top -P ALL and pcm-memory.x.
iperf results : columns 3&4 are for srv1->srv2, columns 5&6 are for srv2->srv1 
(both flows running at the same time).



Results, expected throughput (best first) :
11, 01, 05, 07, 06

Results, bad (best first) :
04, 02, 09, 03

Results, worst (best first) :
10, 08



00) Idle system
http://pastebin.com/raw/K1iMVHVF



01) No pinning
http://pastebin.com/raw/7J3HibX0

02) numactl -l fixed-domain-rr -m 0 -c 0
http://pastebin.com/raw/Yt7yYr0K

03) numactl -l fixed-domain-rr -m 0 -c 0
+ cpuset -l <0-11> -x 
http://pastebin.com/raw/1FAgDUSU

04) numactl -l fixed-domain-rr -m 0 -c 0
+ cpuset -l <12-23> -x 
http://pastebin.com/raw/fTAxrzBb

05) numactl -l fixed-domain-rr -m 1 -c 1
http://pastebin.com/raw/kuAHzKu2

06) numactl -l fixed-domain-rr -m 1 -c 1
+ cpuset -l <0-11> -x 
http://pastebin.com/raw/tgtaZgwb

07) numactl -l fixed-domain-rr -m 1 -c 1
+ cpuset -l <12-23> -x 
http://pastebin.com/raw/16ReuGFF



08) No pinning, default kernel (no NUMA option)
http://pastebin.com/raw/Ah74fKRx

09) default kernel (no NUMA option)
+ cpuset -l <0-11>
+ cpuset -l <0-11> -x 
http://pastebin.com/raw/YE0PxEu8

10) default kernel (no NUMA option)
+ cpuset -l <12-23>
+ cpuset -l <12-23> -x 
http://pastebin.com/raw/RPh8aM49



11) No pinning, default kernel (no NUMA option), NUMA BIOS disabled
http://pastebin.com/raw/LyGcLKDd



Ben

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-11 Thread Adrian Chadd
Which ones of these hit the line rate comfortably?



-a


On 11 August 2016 at 15:35, Ben RUBSON  wrote:
>
>> On 11 Aug 2016, at 18:36, Adrian Chadd  wrote:
>>
>> Hi!
>>
>> mlx4_core0:  mem
>> 0xfbe0-0xfbef,0xfb00-0xfb7f irq 64 at device 0.0
>> numa-domain 1 on pci16
>> mlx4_core: Initializing mlx4_core: Mellanox ConnectX VPI driver v2.1.6
>> (Aug 11 2016)
>>
>> so the NIC is in numa-domain 1. Try pinning the worker threads to
>> numa-domain 1 when you run the test:
>>
>> numactl -l first-touch-rr -m 1 -c 1 ./test-program
>>
>> You can also try pinning the NIC threads to numa-domain 1 versus 0 (so
>> the second set of CPUs, not the first set.)
>>
>> vmstat -ia | grep mlx (get the list of interrupt thread ids)
>> then for each:
>>
>> cpuset -d 1 -x 
>>
>> Run pcm-memory.x each time so we can see the before and after effects
>> on local versus remote memory access.
>>
>> Thanks!
>
> Adrian, here are the results :
>
>
>
> Idle system :
> http://pastebin.com/raw/K1iMVHVF
>
>
>
> No pinning :
> http://pastebin.com/raw/w5KuexQ3
> CPU : http://pastebin.com/raw/8zgRaazN
>
> numactl -l fixed-domain-rr -m 1 -c 1 :
> http://pastebin.com/raw/VWweYF9H
> CPU : http://pastebin.com/raw/QjaVH32X
>
> numactl -l fixed-domain-rr -m 0 -c 0 :
> http://pastebin.com/raw/71hfGJdw
> CPU : http://pastebin.com/raw/hef058Na
>
> numactl -l fixed-domain-rr -m 1 -c 1
> + cpuset -l  -x  :
> http://pastebin.com/raw/nEQkgMK2
> CPU : http://pastebin.com/raw/R652KAdJ
>
> numactl -l fixed-domain-rr -m 0 -c 0
> + cpuset -l  -x  :
> http://pastebin.com/raw/GdYJHyae
> CPU : http://pastebin.com/raw/Ggfx9uF9
>
>
>
> No pinning, default kernel (no NUMA option) :
> http://pastebin.com/raw/iQ2u8d8k
> CPU : http://pastebin.com/raw/Xr77KpcM
>
> default kernel (no NUMA option)
> + cpuset -l 
> + cpuset -l  -x  :
> http://pastebin.com/raw/VBWg4SZs
>
> default kernel (no NUMA option)
> + cpuset -l 
> + cpuset -l  -x  :
> http://pastebin.com/raw/SrJLZxuT
>
>
>
> No pinning, default kernel (no NUMA option), NUMA BIOS disabled :
> http://pastebin.com/raw/P5LrUASN
>
>
>
> I would say :
> - FreeBSD <= 10.3 : disable NUMA in BIOS
> - FreeBSD >= 11   : disable NUMA in BIOS or enable NUMA in kernel.
> But let's wait your analysis :)
>
>
>
> Ben
>
> ___
> freebsd-net@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-11 Thread Ben RUBSON

> On 11 Aug 2016, at 18:36, Adrian Chadd  wrote:
> 
> Hi!
> 
> mlx4_core0:  mem
> 0xfbe0-0xfbef,0xfb00-0xfb7f irq 64 at device 0.0
> numa-domain 1 on pci16
> mlx4_core: Initializing mlx4_core: Mellanox ConnectX VPI driver v2.1.6
> (Aug 11 2016)
> 
> so the NIC is in numa-domain 1. Try pinning the worker threads to
> numa-domain 1 when you run the test:
> 
> numactl -l first-touch-rr -m 1 -c 1 ./test-program
> 
> You can also try pinning the NIC threads to numa-domain 1 versus 0 (so
> the second set of CPUs, not the first set.)
> 
> vmstat -ia | grep mlx (get the list of interrupt thread ids)
> then for each:
> 
> cpuset -d 1 -x 
> 
> Run pcm-memory.x each time so we can see the before and after effects
> on local versus remote memory access.
> 
> Thanks!

Adrian, here are the results :



Idle system :
http://pastebin.com/raw/K1iMVHVF



No pinning :
http://pastebin.com/raw/w5KuexQ3
CPU : http://pastebin.com/raw/8zgRaazN

numactl -l fixed-domain-rr -m 1 -c 1 :
http://pastebin.com/raw/VWweYF9H
CPU : http://pastebin.com/raw/QjaVH32X

numactl -l fixed-domain-rr -m 0 -c 0 :
http://pastebin.com/raw/71hfGJdw
CPU : http://pastebin.com/raw/hef058Na

numactl -l fixed-domain-rr -m 1 -c 1
+ cpuset -l  -x  :
http://pastebin.com/raw/nEQkgMK2
CPU : http://pastebin.com/raw/R652KAdJ

numactl -l fixed-domain-rr -m 0 -c 0
+ cpuset -l  -x  :
http://pastebin.com/raw/GdYJHyae
CPU : http://pastebin.com/raw/Ggfx9uF9



No pinning, default kernel (no NUMA option) :
http://pastebin.com/raw/iQ2u8d8k
CPU : http://pastebin.com/raw/Xr77KpcM

default kernel (no NUMA option)
+ cpuset -l 
+ cpuset -l  -x  :
http://pastebin.com/raw/VBWg4SZs

default kernel (no NUMA option)
+ cpuset -l 
+ cpuset -l  -x  :
http://pastebin.com/raw/SrJLZxuT



No pinning, default kernel (no NUMA option), NUMA BIOS disabled :
http://pastebin.com/raw/P5LrUASN



I would say :
- FreeBSD <= 10.3 : disable NUMA in BIOS
- FreeBSD >= 11   : disable NUMA in BIOS or enable NUMA in kernel.
But let's wait your analysis :)



Ben

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-11 Thread Adrian Chadd
adrian did mean fixed-domain-rr. :-P sorry!

(Sorry, needed to update my NUMA boxes, things "changed" since I wrote this.)


-a
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-11 Thread Eric van Gyzen
On 08/11/16 12:54 PM, Ben RUBSON wrote:
> 
>> On 11 Aug 2016, at 19:51, Ben RUBSON  wrote:
>>
>>
>>> On 11 Aug 2016, at 18:36, Adrian Chadd  wrote:
>>>
>>> Hi!
>>
>> Hi Adrian,
>>
>>> mlx4_core0:  mem
>>> 0xfbe0-0xfbef,0xfb00-0xfb7f irq 64 at device 0.0
>>> numa-domain 1 on pci16
>>> mlx4_core: Initializing mlx4_core: Mellanox ConnectX VPI driver v2.1.6
>>> (Aug 11 2016)
>>>
>>> so the NIC is in numa-domain 1. Try pinning the worker threads to
>>> numa-domain 1 when you run the test:
>>>
>>> numactl -l first-touch-rr -m 1 -c 1 ./test-program
>>
>> # numactl -l first-touch-rr -m 1 -c 1 /usr/local/bin/iperf -c 192.168.2.1 -l 
>> 128KB -P 16 -i 2 -t 6000  
>> Could not parse policy: '128KB'
>>
>> I did not manage to give arguments to command. Any idea ?
> 
> I answer to myself, this should do the trick :
> numactl -l first-touch-rr -m 1 -c 1 -- /usr/local/bin/iperf -c 192.168.2.1 -l 
> 128KB -P 16 -i 2 -t 6000

This has annoyed me quite a bit, too.  Setting the POSIXLY_CORRECT
environment variable would also make it behave "correctly".

> However of course it still gives the error below :
> 
>> # numactl -l first-touch-rr -m 1 -c 1 /usr/local/bin/iperf   
>> 
>> numactl: numa_setaffinity: Invalid argument
>>
>> And sounds like -m is not allowed with first-touch-rr.
>> What should I use ?

Adrian probably meant fixed-domain-rr.
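
Putting the two together, something like this should work (just a sketch; the
iperf arguments are the ones from the quoted command, and whether numactl
honours POSIXLY_CORRECT is per the note above):

numactl -l fixed-domain-rr -m 1 -c 1 -- \
    /usr/local/bin/iperf -c 192.168.2.1 -l 128KB -P 16 -i 2 -t 6000

# or, stopping option parsing at the first non-option argument:
env POSIXLY_CORRECT=1 numactl -l fixed-domain-rr -m 1 -c 1 \
    /usr/local/bin/iperf -c 192.168.2.1 -l 128KB -P 16 -i 2 -t 6000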

>> Thank you !
>>
>>> You can also try pinning the NIC threads to numa-domain 1 versus 0 (so
>>> the second set of CPUs, not the first set.)
>>>
>>> vmstat -ia | grep mlx (get the list of interrupt thread ids)
>>> then for each:
>>>
>>> cpuset -d 1 -x 
>>>
>>> Run pcm-memory.x each time so we can see the before and after effects
>>> on local versus remote memory access.
>>>
>>> Thanks!
>>>
>>>
>>>
>>> -adrian
>>
> 
> ___
> freebsd-net@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
> 
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-11 Thread Ben RUBSON

> On 11 Aug 2016, at 18:36, Adrian Chadd  wrote:
> 
> Hi!
> 
> mlx4_core0:  mem
> 0xfbe0-0xfbef,0xfb00-0xfb7f irq 64 at device 0.0
> numa-domain 1 on pci16
> mlx4_core: Initializing mlx4_core: Mellanox ConnectX VPI driver v2.1.6
> (Aug 11 2016)
> 
> so the NIC is in numa-domain 1. Try pinning the worker threads to
> numa-domain 1 when you run the test:
> 
> numactl -l first-touch-rr -m 1 -c 1 ./test-program
> 
> You can also try pinning the NIC threads to numa-domain 1 versus 0 (so
> the second set of CPUs, not the first set.)
> 
> vmstat -ia | grep mlx (get the list of interrupt thread ids)
> then for each:
> 
> cpuset -d 1 -x 
> 
> Run pcm-memory.x each time so we can see the before and after effects
> on local versus remote memory access.
> 
> Thanks!

Waiting for the correct commands to use, I made some tests with :

  cpuset -l 0-11 
or
  cpuset -l 12-23 

and :

  # round-robin the mlx IRQs across CPUs 0-11 (bash/ksh syntax)
  c=0
  vmstat -ia | grep mlx | sed 's/^irq\(.*\):.*/\1/' | while read i
  do
    cpuset -l $c -x $i ; ((c++)) ; [[ $c -gt 11 ]] && c=0
  done
or
  # round-robin the mlx IRQs across CPUs 12-23 (bash/ksh syntax)
  c=12
  vmstat -ia | grep mlx | sed 's/^irq\(.*\):.*/\1/' | while read i
  do
    cpuset -l $c -x $i ; ((c++)) ; [[ $c -gt 23 ]] && c=12
  done

Results :

No pinning
http://pastebin.com/raw/CrK1CQpm

Pinning workers to 0-11
Pinning NIC IRQ to 0-11
http://pastebin.com/raw/kLEQ6TKL

Pinning workers to 12-23
Pinning NIC IRQ to 12-23
http://pastebin.com/raw/qGxw9KL2

Pinning workers to 12-23
Pinning NIC IRQ to 0-11
http://pastebin.com/raw/tFjii629

Comments :

Strangely, the best iPerf throughput results are when there is no pinning.
Whereas before running the kernel with your new options, the best results were with
everything pinned to 0-11.

Feel free to ask me further testing.

Ben

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-11 Thread Ben RUBSON

> On 11 Aug 2016, at 19:51, Ben RUBSON  wrote:
> 
> 
>> On 11 Aug 2016, at 18:36, Adrian Chadd  wrote:
>> 
>> Hi!
> 
> Hi Adrian,
> 
>> mlx4_core0:  mem
>> 0xfbe0-0xfbef,0xfb00-0xfb7f irq 64 at device 0.0
>> numa-domain 1 on pci16
>> mlx4_core: Initializing mlx4_core: Mellanox ConnectX VPI driver v2.1.6
>> (Aug 11 2016)
>> 
>> so the NIC is in numa-domain 1. Try pinning the worker threads to
>> numa-domain 1 when you run the test:
>> 
>> numactl -l first-touch-rr -m 1 -c 1 ./test-program
> 
> # numactl -l first-touch-rr -m 1 -c 1 /usr/local/bin/iperf -c 192.168.2.1 -l 
> 128KB -P 16 -i 2 -t 6000  
> Could not parse policy: '128KB'
> 
> I did not manage to give arguments to command. Any idea ?

I answer to myself, this should do the trick :
numactl -l first-touch-rr -m 1 -c 1 -- /usr/local/bin/iperf -c 192.168.2.1 -l 
128KB -P 16 -i 2 -t 6000

However of course it still gives the error below :

> # numactl -l first-touch-rr -m 1 -c 1 /usr/local/bin/iperf
>
> numactl: numa_setaffinity: Invalid argument
> 
> And sounds like -m is not allowed with first-touch-rr.
> What should I use ?
> 
> Thank you !
> 
>> You can also try pinning the NIC threads to numa-domain 1 versus 0 (so
>> the second set of CPUs, not the first set.)
>> 
>> vmstat -ia | grep mlx (get the list of interrupt thread ids)
>> then for each:
>> 
>> cpuset -d 1 -x 
>> 
>> Run pcm-memory.x each time so we can see the before and after effects
>> on local versus remote memory access.
>> 
>> Thanks!
>> 
>> 
>> 
>> -adrian
> 

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-11 Thread Ben RUBSON

> On 11 Aug 2016, at 18:36, Adrian Chadd  wrote:
> 
> Hi!

Hi Adrian,

> mlx4_core0:  mem
> 0xfbe0-0xfbef,0xfb00-0xfb7f irq 64 at device 0.0
> numa-domain 1 on pci16
> mlx4_core: Initializing mlx4_core: Mellanox ConnectX VPI driver v2.1.6
> (Aug 11 2016)
> 
> so the NIC is in numa-domain 1. Try pinning the worker threads to
> numa-domain 1 when you run the test:
> 
> numactl -l first-touch-rr -m 1 -c 1 ./test-program

# numactl -l first-touch-rr -m 1 -c 1 /usr/local/bin/iperf -c 192.168.2.1 -l 
128KB -P 16 -i 2 -t 6000  
Could not parse policy: '128KB'

I did not manage to give arguments to command. Any idea ?

# numactl -l first-touch-rr -m 1 -c 1 /usr/local/bin/iperf  
 
numactl: numa_setaffinity: Invalid argument

And sounds like -m is not allowed with first-touch-rr.
What should I use ?

Thank you !

> You can also try pinning the NIC threads to numa-domain 1 versus 0 (so
> the second set of CPUs, not the first set.)
> 
> vmstat -ia | grep mlx (get the list of interrupt thread ids)
> then for each:
> 
> cpuset -d 1 -x 
> 
> Run pcm-memory.x each time so we can see the before and after effects
> on local versus remote memory access.
> 
> Thanks!
> 
> 
> 
> -adrian

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-11 Thread Adrian Chadd
Hi!

mlx4_core0:  mem
0xfbe0-0xfbef,0xfb00-0xfb7f irq 64 at device 0.0
numa-domain 1 on pci16
mlx4_core: Initializing mlx4_core: Mellanox ConnectX VPI driver v2.1.6
(Aug 11 2016)

so the NIC is in numa-domain 1. Try pinning the worker threads to
numa-domain 1 when you run the test:

numactl -l first-touch-rr -m 1 -c 1 ./test-program

You can also try pinning the NIC threads to numa-domain 1 versus 0 (so
the second set of CPUs, not the first set.)

vmstat -ia | grep mlx (get the list of interrupt thread ids)
then for each:

cpuset -d 1 -x 

Run pcm-memory.x each time so we can see the before and after effects
on local versus remote memory access.
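
If it helps, one way to script that for all of the mlx interrupts (a sketch; it
assumes the vmstat -ia lines look like "irqNNN: mlx4_core0 ..."):

vmstat -ia | grep mlx | sed 's/^irq\(.*\):.*/\1/' | while read i
do
  cpuset -d 1 -x $i    # pin this interrupt to numa-domain 1, as above
done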

Thanks!



-adrian
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-10 Thread Ben RUBSON

> On 11 Aug 2016, at 00:11, Adrian Chadd  wrote:
> 
> hi,
> 
> ok, lets start by getting the NUMA bits into the kernel so you can
> mess with things.
> 
> add this to the kernel
> 
> options MAXMEMDOM=8
> (which hopefully is enough)
> options VM_NUMA_ALLOC
> options DEVICE_NUMA
> 
> Then reboot and post your 'dmesg' output to the list. This should show
> exactly which domain devices are in.

http://pastebin.com/raw/yaYEytME

> Install the 'intel-pcm' package. There's a 'pcm-numa.x' command - do
> kldload cpuctl, then run pcm-numa.x and see if it works. It should
> give us some useful information about NUMA.
> (Same as pcm-memory.x, pcm-pcie.x, etc.)

Yes these tools work :

# pcm-numa.x

 Intel(r) Performance Counter Monitor: NUMA monitoring utility 
 Copyright (c) 2009-2016 Intel Corporation

Number of physical cores: 12
Number of logical cores: 24
Number of online logical cores: 24
Threads (logical cores) per physical core: 2
Num sockets: 2
Physical cores per socket: 6
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 4
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 2400000000 Hz
Package thermal spec power: 85 Watt; Package minimum power: 31 Watt; Package 
maximum power: 170 Watt; 
ERROR: QPI LL monitoring device (0:127:9:2) is missing. The QPI statistics will 
be incomplete or missing.
Socket 0: 2 memory controllers detected with total number of 5 channels. 1 QPI 
ports detected.
ERROR: QPI LL monitoring device (0:255:9:2) is missing. The QPI statistics will 
be incomplete or missing.
Socket 1: 2 memory controllers detected with total number of 5 channels. 1 QPI 
ports detected.
Socket 0
Max QPI link 0 speed: 16.0 GBytes/second (8.0 GT/second)
Socket 1
Max QPI link 0 speed: 16.0 GBytes/second (8.0 GT/second)

Detected Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz "Intel(r) microarchitecture 
codename Haswell-EP/EN/EX"
Update every 1.0 seconds
Time elapsed: 1010 ms
Core | IPC  | Instructions | Cycles | Local DRAM accesses | Remote DRAM Accesses
   0   0.70     1158 K       1655 K          577                   245
   1   0.33      186 K        557 K          160                    15
   2   0.43      317 K        745 K          385                    31
   3   0.36      260 K        718 K          232                    33
   4   0.31      186 K        602 K          188                    11
   5   0.39      314 K        806 K          371                    43
   6   0.36      235 K        659 K          257                    46
   7   0.35      200 K        576 K          133                    44
   8   0.42      423 K       1011 K          226                    20
   9   0.60     1309 K       2199 K          379                   104
  10   0.34      192 K        562 K          161                    26
  11   0.38      257 K        684 K          158                    44
  12   0.35      185 K        528 K           39                   121
  13   0.32      199 K        616 K           51                   171
  14   0.31      184 K        594 K           34                   130
  15   0.35      272 K        783 K           47                   256
  16   0.31      178 K        579 K           26                   127
  17   0.37      272 K        729 K           87                   204
  18   0.52      485 K        942 K           35                   204
  19   0.40      285 K        723 K           16                   147
  20   0.31      195 K        620 K           10                   134
  21   0.33      201 K        615 K           30                   114
  22   0.29      176 K        612 K           24                   110
  23   0.52      896 K       1716 K           86                   895
---------------------------------------------------------------------------------
   *   0.43     8575 K         19 M         3712                  3275

> Then next is playing around with interrupt thread / userland cpuset
> and memory affinity. We can look at that next.

Waiting for your instructions !

Ben

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-10 Thread Adrian Chadd
On 10 August 2016 at 12:50, Ben RUBSON  wrote:
>
>> On 10 Aug 2016, at 21:47, Adrian Chadd  wrote:
>>
>> hi,
>>
>> yeah, I'd like you to do some further testing with NUMA. Are you able
>> to run freebsd-11 or -HEAD on these boxes?
>
> Hi Adrian,
>
> Yes I currently have 11 BETA3 running on them.
> I could also run BETA4.

hi,

ok, lets start by getting the NUMA bits into the kernel so you can
mess with things.

add this to the kernel

options MAXMEMDOM=8
(which hopefully is enough)
options VM_NUMA_ALLOC
options DEVICE_NUMA

Then reboot and post your 'dmesg' output to the list. This should show
exactly which domain devices are in.
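
(A minimal sketch of getting those options in, assuming /usr/src has the 11
sources and a GENERIC-derived config simply called NUMA - adjust names to
taste:)

cd /usr/src
cp sys/amd64/conf/GENERIC sys/amd64/conf/NUMA
cat >> sys/amd64/conf/NUMA <<'EOF'
options MAXMEMDOM=8
options VM_NUMA_ALLOC
options DEVICE_NUMA
EOF
make -j8 buildkernel KERNCONF=NUMA && make installkernel KERNCONF=NUMA
shutdown -r now
# after reboot:
dmesg | grep numa-domain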

Install the 'intel-pcm' package. There's a 'pcm-numa.x' command - do
kldload cpuctl, then run pcm-numa.x and see if it works. It should
give us some useful information about NUMA.
(Same as pcm-memory.x, pcm-pcie.x, etc.)
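
(Roughly, on the test box - the one-second sampling interval is just an
example:)

pkg install intel-pcm
kldload cpuctl
pcm-numa.x 1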

Then next is playing around with interrupt thread / userland cpuset
and memory affinity. We can look at that next. Currently the kernel
doesn't know about NUMA local memory for device driver memory, kernel
allocations for mbufs, etc, but we should still get a "good enough"
idea about things. We can talk about that here once the above steps
are done.

Thanks!



-adrian
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-10 Thread Ben RUBSON

> On 10 Aug 2016, at 21:47, Adrian Chadd  wrote:
> 
> hi,
> 
> yeah, I'd like you to do some further testing with NUMA. Are you able
> to run freebsd-11 or -HEAD on these boxes?

Hi Adrian,

Yes I currently have 11 BETA3 running on them.
I could also run BETA4.

Ben

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-10 Thread Adrian Chadd
hi,

yeah, I'd like you to do some further testing with NUMA. Are you able
to run freebsd-11 or -HEAD on these boxes?


-adrian


On 8 August 2016 at 07:01, Ben RUBSON  wrote:
>
>> On 04 Aug 2016, at 11:40, Ben RUBSON  wrote:
>>
>>
>>> On 02 Aug 2016, at 22:11, Ben RUBSON  wrote:
>>>
 On 02 Aug 2016, at 21:35, Hans Petter Selasky  wrote:

 The CX-3 driver doesn't bind the worker threads to specific CPU cores by 
 default, so if your CPU has more than one so-called numa, you'll end up 
 that the bottle-neck is the high-speed link between the CPU cores and not 
 the card. A quick and dirty workaround is to "cpuset" iperf and the 
 interrupt and taskqueue threads to specific CPU cores.
>>>
>>> My CPUs : 2x E5-2620v3 with DDR4@1866.
>>
>> OK, so I cpuset all Mellanox interrupts to one NUMA, as well as the iPerf 
>> processes, and I'm able to reach max bandwidth.
>> Choosing the wrong NUMA (or both, or one for interrupts, the other one for 
>> iPerf, etc...) totally kills throughput.
>>
>> However, full-duplex throughput is still limited, I can't manage to reach 
>> 2x40Gb/s, throttle is at about 45Gb/s.
>> I tried many different cpuset layouts, but I never went above 45Gb/s.
>> (Linux allowed me to reach 2x40Gb/s so hardware is not a bottleneck)
>
> OK, I then found a workaround.
>
> In the motherboards' BIOS, I disabled the following option :
> Advanced / ACPI Settings / NUMA
>
> And I'm now able to go up to 2x40Gb/s !
> I'm then even able to achieve this throughput without any cpuset !
>
> Strange that Linux was able to deal with this setting, but I'm pretty sure 
> production performance will be easier to maintain with only 1 NUMA.
>
> Feel free to ask me if you want further testing with 2 NUMA.
>
> Ben
>
> ___
> freebsd-net@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-08 Thread Ben RUBSON

> On 04 Aug 2016, at 11:40, Ben RUBSON  wrote:
> 
> 
>> On 02 Aug 2016, at 22:11, Ben RUBSON  wrote:
>> 
>>> On 02 Aug 2016, at 21:35, Hans Petter Selasky  wrote:
>>> 
>>> The CX-3 driver doesn't bind the worker threads to specific CPU cores by 
>>> default, so if your CPU has more than one so-called numa, you'll end up 
>>> that the bottle-neck is the high-speed link between the CPU cores and not 
>>> the card. A quick and dirty workaround is to "cpuset" iperf and the 
>>> interrupt and taskqueue threads to specific CPU cores.
>> 
>> My CPUs : 2x E5-2620v3 with DDR4@1866.
> 
> OK, so I cpuset all Mellanox interrupts to one NUMA, as well as the iPerf 
> processes, and I'm able to reach max bandwidth.
> Choosing the wrong NUMA (or both, or one for interrupts, the other one for 
> iPerf, etc...) totally kills throughput.
> 
> However, full-duplex throughput is still limited, I can't manage to reach 
> 2x40Gb/s, throttle is at about 45Gb/s.
> I tried many different cpuset layouts, but I never went above 45Gb/s.
> (Linux allowed me to reach 2x40Gb/s so hardware is not a bottleneck)

OK, I then found a workaround.

In the motherboards' BIOS, I disabled the following option :
Advanced / ACPI Settings / NUMA

And I'm now able to go up to 2x40Gb/s !
I'm then even able to achieve this throughput without any cpuset !

Strange that Linux was able to deal with this setting, but I'm pretty sure 
production performance will be easier to maintain with only 1 NUMA.

Feel free to ask me if you want further testing with 2 NUMA.

Ben

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-08 Thread Ben RUBSON

> On 05 Aug 2016, at 10:30, Hans Petter Selasky  wrote:
> 
> On 08/04/16 23:49, Ben RUBSON wrote:
>>> 
>>> On 04 Aug 2016, at 20:15, Ryan Stone  wrote:
>>> 
>>> On Thu, Aug 4, 2016 at 11:33 AM, Ben RUBSON  wrote:
>>> But even without RSS, I should be able to go up to 2x40Gbps, don't you 
>>> think so ?
>>> Nobody already did this ?
>>> 
>>> Try this patch
>>> (...)
>> 
>> I also just tested the NODEBUG kernel but it did not help.
> 
> Hi,
> 
> When running these tests, do you see any CPUs fully utilised?

No, CPUs look like this on both servers :

27 processes:  1 running, 26 sleeping
CPU 0:   1.1% user,  0.0% nice, 16.7% system,  0.0% interrupt, 82.2% idle
CPU 1:   1.1% user,  0.0% nice, 18.9% system,  0.0% interrupt, 80.0% idle
CPU 2:   1.9% user,  0.0% nice, 17.8% system,  0.0% interrupt, 80.4% idle
CPU 3:   1.1% user,  0.0% nice, 15.2% system,  0.0% interrupt, 83.7% idle
CPU 4:   0.4% user,  0.0% nice, 16.3% system,  0.0% interrupt, 83.3% idle
CPU 5:   1.1% user,  0.0% nice, 14.4% system,  0.0% interrupt, 84.4% idle
CPU 6:   2.6% user,  0.0% nice, 17.4% system,  0.0% interrupt, 80.0% idle
CPU 7:   2.2% user,  0.0% nice, 15.2% system,  0.0% interrupt, 82.6% idle
CPU 8:   1.1% user,  0.0% nice,  3.0% system, 15.9% interrupt, 80.0% idle
CPU 9:   0.0% user,  0.0% nice,  3.0% system, 32.2% interrupt, 64.8% idle
CPU 10:  0.0% user,  0.0% nice,  0.4% system, 58.9% interrupt, 40.7% idle
CPU 11:  0.0% user,  0.0% nice,  0.4% system, 77.4% interrupt, 22.2% idle
CPU 12:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 13:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 14:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 15:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 16:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 17:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 18:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 19:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 20:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 21:  0.0% user,  0.0% nice,  0.0% system,  0.4% interrupt, 99.6% idle
CPU 22:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 23:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle

Load is correctly spread over the NUMA domain connected to the NIC (the first 12 CPUs).
There is clearly enough CPU power to saturate the full-duplex link !
I tried many cpuset configurations (IRQs spread over the 12 CPUs, etc...), but no
improvement at all.

> Did you check the RX/TX pauseframes settings and the mlx4 sysctl statistics 
> counters, if there is packet loss?

I tried to disable RX/TX pauseframes, but it did not help.
And "sysctl -a | grep mlx | grep err" counters are all 0.
I also played with ring size, adaptive interrupt moderation... with no luck.

Ben

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-05 Thread Hans Petter Selasky

On 08/04/16 23:49, Ben RUBSON wrote:


On 04 Aug 2016, at 20:15, Ryan Stone  wrote:

On Thu, Aug 4, 2016 at 11:33 AM, Ben RUBSON  wrote:
But even without RSS, I should be able to go up to 2x40Gbps, don't you think so 
?
Nobody already did this ?

Try this patch
(...)


I also just tested the NODEBUG kernel but it did not help.


Hi,

When running these tests, do you see any CPUs fully utilized?

Did you check the RX/TX pauseframes settings and the mlx4 sysctl 
statistics counters, if there is packet loss?


--HPS

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-04 Thread Ben RUBSON
> 
> On 04 Aug 2016, at 20:15, Ryan Stone  wrote:
> 
> On Thu, Aug 4, 2016 at 11:33 AM, Ben RUBSON  wrote:
> But even without RSS, I should be able to go up to 2x40Gbps, don't you think 
> so ?
> Nobody already did this ?
> 
> Try this patch
> (...)

I also just tested the NODEBUG kernel but it did not help.
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-04 Thread Ben RUBSON

> On 04 Aug 2016, at 20:15, Ryan Stone  wrote:
> 
> On Thu, Aug 4, 2016 at 11:33 AM, Ben RUBSON  wrote:
> But even without RSS, I should be able to go up to 2x40Gbps, don't you think 
> so ?
> Nobody already did this ?
> 
> Try this patch, which should improve performance when multiple TCP streams 
> are running in parallel over an mlx4_en port:
> 
> https://people.freebsd.org/~rstone/patches/mlxen_counters.diff

Thank you very much Ryan.
I just tried it, but it does not help :/

Below is the CPU load during bidirectional traffic.
We clearly see the 4 CPUs allocated to Mellanox IRQs, the others to iPerf 
processes.
No improvement if IRQs are spread over the 12 NUMA CPUs, but slightly less 
throughput.
Note that I get the same results if I only use 2 CPUs for IRQs.

27 processes:  1 running, 26 sleeping
CPU 0:   1.1% user,  0.0% nice, 16.7% system,  0.0% interrupt, 82.2% idle
CPU 1:   1.1% user,  0.0% nice, 18.9% system,  0.0% interrupt, 80.0% idle
CPU 2:   1.9% user,  0.0% nice, 17.8% system,  0.0% interrupt, 80.4% idle
CPU 3:   1.1% user,  0.0% nice, 15.2% system,  0.0% interrupt, 83.7% idle
CPU 4:   0.4% user,  0.0% nice, 16.3% system,  0.0% interrupt, 83.3% idle
CPU 5:   1.1% user,  0.0% nice, 14.4% system,  0.0% interrupt, 84.4% idle
CPU 6:   2.6% user,  0.0% nice, 17.4% system,  0.0% interrupt, 80.0% idle
CPU 7:   2.2% user,  0.0% nice, 15.2% system,  0.0% interrupt, 82.6% idle
CPU 8:   1.1% user,  0.0% nice,  3.0% system, 15.9% interrupt, 80.0% idle
CPU 9:   0.0% user,  0.0% nice,  3.0% system, 32.2% interrupt, 64.8% idle
CPU 10:  0.0% user,  0.0% nice,  0.4% system, 58.9% interrupt, 40.7% idle
CPU 11:  0.0% user,  0.0% nice,  0.4% system, 77.4% interrupt, 22.2% idle
CPU 12:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 13:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 14:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 15:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 16:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 17:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 18:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 19:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 20:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 21:  0.0% user,  0.0% nice,  0.0% system,  0.4% interrupt, 99.6% idle
CPU 22:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 23:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-04 Thread Ryan Stone
On Thu, Aug 4, 2016 at 11:33 AM, Ben RUBSON  wrote:

> But even without RSS, I should be able to go up to 2x40Gbps, don't you
> think so ?
> Nobody already did this ?
>

Try this patch, which should improve performance when multiple TCP streams
are running in parallel over an mlx4_en port:

https://people.freebsd.org/~rstone/patches/mlxen_counters.diff
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-04 Thread Ben RUBSON

> On 04 Aug 2016, at 17:33, Hans Petter Selasky  wrote:
> 
> On 08/04/16 17:24, Ben RUBSON wrote:
>> 
>>> On 04 Aug 2016, at 11:40, Ben RUBSON  wrote:
>>> 
 On 02 Aug 2016, at 22:11, Ben RUBSON  wrote:
 
> On 02 Aug 2016, at 21:35, Hans Petter Selasky  wrote:
> 
> The CX-3 driver doesn't bind the worker threads to specific CPU cores by 
> default, so if your CPU has more than one so-called numa, you'll end up 
> that the bottle-neck is the high-speed link between the CPU cores and not 
> the card. A quick and dirty workaround is to "cpuset" iperf and the 
> interrupt and taskqueue threads to specific CPU cores.
 
 My CPUs : 2x E5-2620v3 with DDR4@1866.
>>> 
>>> OK, so I cpuset all Mellanox interrupts to one NUMA, as well as the iPerf 
>>> processes, and I'm able to reach max bandwidth.
>>> Choosing the wrong NUMA (or both, or one for interrupts, the other one for 
>>> iPerf, etc...) totally kills throughput.
>>> 
>>> However, full-duplex throughput is still limited, I can't manage to reach 
>>> 2x40Gb/s, throttle is at about 45Gb/s.
>>> I tried many different cpuset layouts, but I never went above 45Gb/s.
>>> (Linux allowed me to reach 2x40Gb/s so hardware is not a bottleneck)
>>> 
> Are you using "options RSS" and "options PCBGROUP" in your kernel config?
>>> 
>>> I will then give RSS a try.
>> 
>> Without RSS :
>> A ---> B : 40Gbps (unidirectional)
>> A <--> B : 45Gbps (bidirectional)
>> 
>> With RSS :
>> A ---> B : 28Gbps (unidirectional)
>> A <--> B : 28Gbps (bidirectional)
>> 
>> Sounds like RSS does not help :/
>> 
>> Why, without RSS, do I have difficulties to reach 2x40Gbps (full-duplex) ?
>> 
> 
> Hi,
> 
> Possibly because the packets are arriving at the wrong CPU compared to what 
> RSS expects. Then RSS will invoke a taskqueue to process the packets on the 
> correct CPU, if I'm not mistaken.

But even without RSS, I should be able to go up to 2x40Gbps, don't you think so 
?
Nobody already did this ?

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-04 Thread Hans Petter Selasky

On 08/04/16 17:24, Ben RUBSON wrote:



On 04 Aug 2016, at 11:40, Ben RUBSON  wrote:


On 02 Aug 2016, at 22:11, Ben RUBSON  wrote:


On 02 Aug 2016, at 21:35, Hans Petter Selasky  wrote:

The CX-3 driver doesn't bind the worker threads to specific CPU cores by default, so if 
your CPU has more than one so-called numa, you'll end up that the bottle-neck is the 
high-speed link between the CPU cores and not the card. A quick and dirty workaround is 
to "cpuset" iperf and the interrupt and taskqueue threads to specific CPU cores.


My CPUs : 2x E5-2620v3 with DDR4@1866.


OK, so I cpuset all Mellanox interrupts to one NUMA, as well as the iPerf 
processes, and I'm able to reach max bandwidth.
Choosing the wrong NUMA (or both, or one for interrupts, the other one for 
iPerf, etc...) totally kills throughput.

However, full-duplex throughput is still limited, I can't manage to reach 
2x40Gb/s, throttle is at about 45Gb/s.
I tried many different cpuset layouts, but I never went above 45Gb/s.
(Linux allowed me to reach 2x40Gb/s so hardware is not a bottleneck)


Are you using "options RSS" and "options PCBGROUP" in your kernel config?


I will then give RSS a try.


Without RSS :
A ---> B : 40Gbps (unidirectional)
A <--> B : 45Gbps (bidirectional)

With RSS :
A ---> B : 28Gbps (unidirectional)
A <--> B : 28Gbps (bidirectional)

Sounds like RSS does not help :/

Why, without RSS, do I have difficulties to reach 2x40Gbps (full-duplex) ?



Hi,

Possibly because the packets are arriving at the wrong CPU compared to 
what RSS expects. Then RSS will invoke a taskqueue to process the 
packets on the correct CPU, if I'm not mistaken.


The mlx4 driver does not fully support RSS. Then mlx5 does.

--HPS
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-04 Thread Ben RUBSON

> On 04 Aug 2016, at 11:40, Ben RUBSON  wrote:
> 
>> On 02 Aug 2016, at 22:11, Ben RUBSON  wrote:
>> 
>>> On 02 Aug 2016, at 21:35, Hans Petter Selasky  wrote:
>>> 
>>> The CX-3 driver doesn't bind the worker threads to specific CPU cores by 
>>> default, so if your CPU has more than one so-called numa, you'll end up 
>>> that the bottle-neck is the high-speed link between the CPU cores and not 
>>> the card. A quick and dirty workaround is to "cpuset" iperf and the 
>>> interrupt and taskqueue threads to specific CPU cores.
>> 
>> My CPUs : 2x E5-2620v3 with DDR4@1866.
> 
> OK, so I cpuset all Mellanox interrupts to one NUMA, as well as the iPerf 
> processes, and I'm able to reach max bandwidth.
> Choosing the wrong NUMA (or both, or one for interrupts, the other one for 
> iPerf, etc...) totally kills throughput.
> 
> However, full-duplex throughput is still limited, I can't manage to reach 
> 2x40Gb/s, throttle is at about 45Gb/s.
> I tried many different cpuset layouts, but I never went above 45Gb/s.
> (Linux allowed me to reach 2x40Gb/s so hardware is not a bottleneck)
> 
>>> Are you using "options RSS" and "options PCBGROUP" in your kernel config?
> 
> I will then give RSS a try.

Without RSS :
A ---> B : 40Gbps (unidirectional)
A <--> B : 45Gbps (bidirectional)

With RSS :
A ---> B : 28Gbps (unidirectional)
A <--> B : 28Gbps (bidirectional)

Sounds like RSS does not help :/

Why, without RSS, do I have difficulty reaching 2x40Gbps (full-duplex) ?

Thank U !
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-04 Thread Ben RUBSON

> On 02 Aug 2016, at 22:11, Ben RUBSON  wrote:
> 
>> On 02 Aug 2016, at 21:35, Hans Petter Selasky  wrote:
>> 
>> The CX-3 driver doesn't bind the worker threads to specific CPU cores by 
>> default, so if your CPU has more than one so-called numa, you'll end up that 
>> the bottle-neck is the high-speed link between the CPU cores and not the 
>> card. A quick and dirty workaround is to "cpuset" iperf and the interrupt 
>> and taskqueue threads to specific CPU cores.
> 
> My CPUs : 2x E5-2620v3 with DDR4@1866.

OK, so I cpuset all Mellanox interrupts to one NUMA, as well as the iPerf 
processes, and I'm able to reach max bandwidth.
Choosing the wrong NUMA (or both, or one for interrupts, the other one for 
iPerf, etc...) totally kills throughput.

However, full-duplex throughput is still limited, I can't manage to reach 
2x40Gb/s, throttle is at about 45Gb/s.
I tried many different cpuset layouts, but I never went above 45Gb/s.
(Linux allowed me to reach 2x40Gb/s so hardware is not a bottleneck)

>> Are you using "options RSS" and "options PCBGROUP" in your kernel config?

I will then give RSS a try.

Any other clue perhaps regarding the full-duplex limitation ?

Many thanks !

Ben

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-03 Thread Ben RUBSON

> On 03 Aug 2016, at 20:02, Hans Petter Selasky  wrote:
> 
> The mlx4 send and receive queues have each their set of taskqueues. Look in 
> output from "ps auxww".

I can't find them. I even unloaded/reloaded the driver in order to catch the
differences, but I did not find any relevant process.
Here are the processes I have when the driver is loaded (I removed my own process
lines) :

# ps auxxw
USER    PID   %CPU %MEM    VSZ   RSS TT  STAT STARTED        TIME COMMAND
root     11 2398.1  0.0      0   384  -  RL   Mon10pm 65969:09.19 [idle]
root      0    0.0  0.0      0  8288  -  DLs  Mon10pm     4:59.31 [kernel]
root      1    0.0  0.0   9492   872  -  ILs  Mon10pm     0:00.04 /sbin/init --
root      2    0.0  0.0      0    96  -  DL   Mon10pm     0:00.82 [cam]
root      3    0.0  0.0      0   176  -  DL   Mon10pm     0:17.88 [zfskern]
root      4    0.0  0.0      0    16  -  DL   Mon10pm     0:00.00 [sctp_iterator]
root      5    0.0  0.0      0    16  -  DL   Mon10pm     0:00.75 [enc_daemon0]
root      6    0.0  0.0      0    16  -  DL   Mon10pm     0:00.50 [enc_daemon1]
root      7    0.0  0.0      0    16  -  DL   Mon10pm     0:00.05 [enc_daemon2]
root      8    0.0  0.0      0    16  -  DL   Mon10pm     0:00.05 [enc_daemon3]
root      9    0.0  0.0      0    16  -  DL   Mon10pm     0:00.00 [g_mirror swap]
root     10    0.0  0.0      0    16  -  DL   Mon10pm     0:00.00 [audit]
root     12    0.0  0.0      0  1408  -  WL   Mon10pm   186:01.05 [intr]
root     13    0.0  0.0      0    48  -  DL   Mon10pm     0:05.24 [geom]
root     14    0.0  0.0      0    16  -  DL   Mon10pm     1:07.19 [rand_harvestq]
root     15    0.0  0.0      0   160  -  DL   Mon10pm     0:08.22 [usb]
root     16    0.0  0.0      0    32  -  DL   Mon10pm     0:00.23 [pagedaemon]
root     17    0.0  0.0      0    16  -  DL   Mon10pm     0:00.00 [vmdaemon]
root     18    0.0  0.0      0    16  -  DL   Mon10pm     0:00.00 [pagezero]
root     19    0.0  0.0      0    16  -  DL   Mon10pm     0:00.12 [bufdaemon]
root     20    0.0  0.0      0    16  -  DL   Mon10pm     0:00.13 [vnlru]
root     21    0.0  0.0      0    16  -  DL   Mon10pm     2:13.02 [syncer]
root    124    0.0  0.0  12360  1736  -  Is   Mon10pm     0:00.00 adjkerntz -i
root    618    0.0  0.0  13628  4868  -  Ss   Mon10pm     0:00.03 /sbin/devd
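
Maybe something like "procstat -ta | grep -i mlx" (or grepping for "taskq")
would show them if they are kernel threads rather than separate processes -
just a guess, the thread names may differ.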

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-03 Thread Hans Petter Selasky

On 08/03/16 18:57, Ben RUBSON wrote:

taskqueue threads ?


The mlx4 send and receive queues have each their set of taskqueues. Look 
in output from "ps auxww".


--HPS
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-03 Thread Ben RUBSON

> On 02 Aug 2016, at 21:35, Hans Petter Selasky  wrote:
> 
> The CX-3 driver doesn't bind the worker threads to specific CPU cores by 
> default, so if your CPU has more than one so-called numa, you'll end up that 
> the bottle-neck is the high-speed link between the CPU cores and not the 
> card. A quick and dirty workaround is to "cpuset" iperf and the interrupt and 
> taskqueue threads to specific CPU cores.

Hans Petter,

I'm testing "cpuset" and sometimes get better results; I'm still trying to find
the best layout.
I've cpuset iperf & the Mellanox interrupts, but what do you mean by taskqueue
threads ?

Thank  U !

Ben
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-02 Thread Ben RUBSON

> On 03 Aug 2016, at 04:32, Eugene Grosbein  wrote:
> 
> If you have gateway_enable="YES" (sysctl net.inet.ip.forwarding=1)
> then try to disable this forwarding setting and rerun your tests to compare 
> results.

Thank you Eugene for this, but net.inet.ip.forwarding is disabled by default 
and I did not enable it.
# sysctl net.inet.ip.forwarding
net.inet.ip.forwarding: 0
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-02 Thread Eugene Grosbein

On 03.08.2016 1:43, Ben RUBSON wrote:

Hello,

I'm trying to reach the 40Gb/s max throughtput between 2 hosts running a 
ConnectX-3 Mellanox network adapter.


If you have gateway_enable="YES" (sysctl net.inet.ip.forwarding=1)
then try to disable this forwarding setting and rerun your tests to compare 
results.


___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"

Re: Unstable local network throughput

2016-08-02 Thread Ben RUBSON

> On 02 Aug 2016, at 21:35, Hans Petter Selasky  wrote:
> 
> Hi,

Thank you for your answer Hans Petter !

> The CX-3 driver doesn't bind the worker threads to specific CPU cores by 
> default, so if your CPU has more than one so-called numa, you'll end up that 
> the bottle-neck is the high-speed link between the CPU cores and not the 
> card. A quick and dirty workaround is to "cpuset" iperf and the interrupt and 
> taskqueue threads to specific CPU cores.

My CPUs : 2x E5-2620v3 with DDR4@1866.
What is strange is that even without using the card (iPerf on localhost), as my 
results show, I have very low and unstable random throughput (compared to Linux 
on the same host).

> Are you using "options RSS" and "options PCBGROUP" in your kernel config?

I only installed FreeBSD 10.3 and updated it, so I use the GENERIC kernel.
RSS and PCBGROUP are not defined in /usr/src/sys/amd64/conf/GENERIC, so I think 
I do not use them.

> Are you also testing CX-4 cards from Mellanox?

No, I only have CX-3 at my disposal :)

Ben

PS : in my previous mail I sometimes used GB/s, of course you must read Gb/s 
everywhere.
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Re: Unstable local network throughput

2016-08-02 Thread Hans Petter Selasky

On 08/02/16 20:43, Ben RUBSON wrote:

Hello,

I'm trying to reach the 40Gb/s max throughtput between 2 hosts running a 
ConnectX-3 Mellanox network adapter.

FreeBSD 10.3 just installed, last updates performed.
Network adapters running last firmwares / last drivers.
No workload at all, just iPerf as the benchmark tool.



Hi,

The CX-3 driver doesn't bind the worker threads to specific CPU cores by 
default, so if your CPU has more than one so-called numa, you'll end up 
that the bottle-neck is the high-speed link between the CPU cores and 
not the card. A quick and dirty workaround is to "cpuset" iperf and the 
interrupt and taskqueue threads to specific CPU cores.
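
As a rough illustration of that workaround (the 0-11 CPU list is only an
example for one package of a 2x6-core/HTT box, and the IRQ extraction assumes
vmstat -ia prints lines like "irqNNN: mlx4_core0 ..."):

# pin the iperf process (server side shown) to one package
cpuset -l 0-11 iperf -s
# pin the mlx interrupts to the same cores
vmstat -ia | grep mlx | sed 's/^irq\(.*\):.*/\1/' | while read i
do
  cpuset -l 0-11 -x $i
done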


Are you using "options RSS" and "options PCBGROUP" in your kernel config?

Are you also testing CX-4 cards from Mellanox?

--HPS
___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"


Unstable local network throughput

2016-08-02 Thread Ben RUBSON
Hello,

I'm trying to reach the 40Gb/s max throughput between 2 hosts running a 
ConnectX-3 Mellanox network adapter.

FreeBSD 10.3 just installed, last updates performed.
Network adapters running last firmwares / last drivers.
No workload at all, just iPerf as the benchmark tool.

### Step 1 :
I never managed to go beyond around 30Gb/s.
I did the usual tuning (MTU, kern.ipc.maxsockbuf, net.inet.tcp.sendbuf_max, 
net.inet.tcp.recvbuf_max...).
I played with adapter interrupt moderation.
I played with iPerf options (window / buffer size, number of threads...).
But it did not help.
Results fluctuate, throughput is not sustained, and using 2 or more iPerf 
threads did not help but degraded the results "quality".
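
(For the record, the tuning was along these lines - the values below are only
illustrative, not necessarily the exact ones used; the mlxen0 interface name is
assumed :)

# /etc/sysctl.conf - example values only
kern.ipc.maxsockbuf=16777216
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216
# jumbo frames on the Mellanox interface
ifconfig mlxen0 mtu 9000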

### Step 2 :
Let's start Linux on these 2 physical hosts.
I only had to use jumbo frames in order to achieve the 40Gb/s max throughput...
OK, network between the 2 hosts is not the root cause, and my hardware can run 
these adapters up to their max throughput.
Good point.

### Step 3 :
Go back to FreeBSD on these physical hosts.
Let's run this simple command to test FreeBSD itself :
# iperf -c 127.0.0.1 -i 1 -t 60
Strangely enough, the highest results are around 35GB/s.
Even more strange, from one run to another, I do not get identical results : 
sometimes 17Gb/s, sometimes 20, sometimes 30...
Throughput can also suddenly drop down, then increase again...
Power management in BIOS is totally disabled, as well as FreeBSD powerd, so CPU 
frequency is not throttled.
Another strange thing: increasing the number of iPerf threads (-P 2 for 
example) does not improve the results at all.
iPerf3 gave the same random results.
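
(For reference, the loopback test itself - iperf 2.x, server and client on the
same host; the -P variant is the multi-thread case mentioned above :)

iperf -s &
iperf -c 127.0.0.1 -i 1 -t 60
iperf -c 127.0.0.1 -i 1 -t 60 -P 2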

### Step 4 :
Let's start Linux again on these 2 hosts.
Let's run the same simple command :
# iperf -c 127.0.0.1 -i 1 -t 60
Result : 45Gb/s.
With 2 threads : 90Gb/s.
With 4 threads : 180Gb/s.
So here we have the expected results, and they stay identical over time.

### Step 5 :
Does FreeBSD suffer when sending or when receiving ?
Let's start one host with Linux, the other one with FreeBSD.
Results :
Linux --> FreeBSD : around 30GB/s.
FreeBSD --> Linux : 40Gb/s.
So it sounds like FreeBSD suffers when receiving.

### Step 6 :
FreeBSD 11-BETA3 gave the same random results.

### Questions :
I think my tests show that there is something wrong with FreeBSD (tuning ? 
something else ?).
Do you have the same kind of random results on your hosts ?
Could you help me get sustained throughput @step3, as we have @step4 
(I think this is what we should expect) ?
There would then be no reason not to achieve max throughput through Mellanox 
adapters themselves.

Thank you very much !

Best regards,

Ben

___
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"