Hello,

most recent (I don't say modern, as that implies some progress) Linux Ethernet 
drivers split their IRQs across at least six instances. We operate two download 
channels at 1.1 GB/s, which is close to 10 Gb Ethernet capacity. The role of the 
ROACH here is to tag each 8K block with an 8-byte counter, so we can measure 
the loss rate precisely.
Leaving things ruled by the OS leads to up to 1% loss (8-core AMD at 4 GHz, 
uniform memory access). Unacceptable. Thinking about the CPU architecture, I 
assigned all the IRQs of an Ethernet card to a single core. There is still some 
loss, but around 10**-7. Acceptable. I think the initial problem comes from the 
fact that when you distribute the Ethernet device management randomly across 
cores, you encounter cache misses which force global memory synchronization.
Of course, you have to get rid of the irqbalance daemon, which comes by default 
at least in Debian and Ubuntu.
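
As an illustration of this approach (a sketch only, not Jean's actual procedure): 
the interface name and target core below are placeholders, the script must run as 
root, and irqbalance must be stopped first (e.g. "systemctl stop irqbalance") or 
it will rewrite the affinities behind your back.

#!/usr/bin/env python3
"""Pin every IRQ belonging to one NIC to a single CPU core (sketch)."""
import sys

IFACE = "eth4"   # placeholder interface name
CORE = 3         # placeholder target core

def nic_irqs(iface):
    """Return IRQ numbers whose /proc/interrupts line mentions the interface."""
    irqs = []
    with open("/proc/interrupts") as f:
        for line in f:
            fields = line.split()
            # Lines look like: " 42:  1234 ...  IR-PCI-MSI ...  eth4-TxRx-0"
            if fields and fields[0].endswith(":") and iface in line:
                irqs.append(int(fields[0].rstrip(":")))
    return irqs

def main():
    irqs = nic_irqs(IFACE)
    if not irqs:
        sys.exit(f"no IRQs found for {IFACE}")
    for irq in irqs:
        # smp_affinity_list takes a CPU list (e.g. "3"), avoiding bitmask juggling.
        with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
            f.write(str(CORE))
        print(f"IRQ {irq} -> core {CORE}")

if __name__ == "__main__":
    main()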

Regards
Jean Borsenberger
On Friday, September 11, 2020 at 20:46 CEST, Hariharan Krishnan 
<[email protected]> wrote:
Hello David,

Thank you for the detailed pointers on the networking issue; I'll certainly have 
a look at your suggestion about the NUMA topology. We use the Bifrost framework 
for our application, which subscribes our application code to the multicast 
network. I had already checked the interface statistics, and the kernel is 
clearly dropping the packets in our case. It is true that IRQs are polled (I 
guess the low interrupt counts indicate that), but IRQ-to-core mapping still 
seems essential to reach the throughput we need with loss below 0.5%. Right now 
we are losing almost all packets (95%). I was wondering whether there is a way 
to do the mapping more systematically, rather than trial and error with the 
affinity bitmask, which seems to change on reboot in a way I don't fully 
understand.

I was also curious about your statement that the "average processing rate isn't 
keeping up with the data rate": are you suggesting that the application's run 
time per block is greater than what the data rate allows? (Which often isn't 
the case, right?)

Regards,
Hari
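
One way to make the mapping less of a guessing game (a sketch, not from the 
thread, with a placeholder interface name): read the NIC's NUMA node and local 
CPU list from sysfs, which is stable across reboots, and choose IRQ and process 
affinities from that list.

#!/usr/bin/env python3
"""Look up which cores are local to a NIC, instead of guessing bitmasks (sketch)."""
IFACE = "eth4"  # placeholder interface name

def read(path):
    with open(path) as f:
        return f.read().strip()

numa_node = read(f"/sys/class/net/{IFACE}/device/numa_node")
local_cpus = read(f"/sys/class/net/{IFACE}/device/local_cpulist")

print(f"{IFACE} sits on NUMA node {numa_node}")
print(f"cores local to {IFACE}: {local_cpus}")
# A reasonable plan: pin the NIC's IRQs and the receive threads to cores from
# this list (skipping core 0), and leave the rest of the system alone.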
<[email protected]> wrote:Hi, Hari, I think modern Linux network drivers use 
a "polling" approach rather than an interrupt driven approach, so I've found 
IRQ affinity to be less important than it used to be.  This can be observed as 
relatively low interrupt counts in /proc/interrupts.  The main things that I've 
found beneficial are: 1. Ensuring that the processing code runs on CPU cores in 
the same socket that the NIC's PCIe slot is connected to.  If you have a 
multi-socket NUMA system you will want to become familiar with its NUMA 
topology.  The "hwloc" package includes the cool "lstopo" utility that will 
show you a lot about your system's topology.  Even on a single socket system it 
can help to stay away from core 0 where many OS things tend to run. 2. Ensuring 
that memory allocations happen after your processes/threads have had their CPU 
affinity set, either by "taskset" or "numactl" or its own built-in CPU affinity 
setting code.  This is mostly for NUMA systems. 3. Ensuring that various 
buffers are sized appropriately.  There are a number of settings that can be 
tweaked in this category, most via "sysctl".  I won't dare to make any specific 
recommendations here.  Everybody seems to have their own set of "these are the 
settings I used last time".  One of the most important things you can do in 
your packet receiving code is to keep track of how many packets you receive 
over a certain time interval.  If this value does not match the expected number 
of packets then you have a problem.  Any difference usually will be that the 
received packet count is lower than the expected packet count.  Some people 
call these dropped packets, but I prefer to call them "missed packets" at this 
point because all we can say is that we didn't get them.  We don't yet know 
what happened to them (maybe they were dropped, maybe they were misdirected, 
maybe they were never sent), but it helps to know where to look to find out. 4. 
Places to check for missing packets getting "dropped": 4.1 If you are using 
"normal" (aka SOCK_DGRAM) sockets to receive UDP packets, you will see a line 
in /proc/net/udp for your socket.  The last number on that line will be the 
number of packets that the kernel wanted to give to your socket but couldn't 
because the socket's receive buffer was full so the kernel had to drop the 
packet. 4.2 If you are using "packet" (aka SOCK_RAW) sockets to receive UDP 
packets, there are ways to get the total number of packets the kernel has 
handled for that socket and the number it had to drop because of lack of 
kernel/application buffer space.  I forget the details, but I'm sure you can 
google for it.  If you're using Hashpipe's packet socket support it has a 
function that will fetch these values for you. 4.3 The ifconfig utility will 
give you a count of "RX errors".  This is a generic category and I don't know 
all possible contributions to it, but one is that the NIC couldn't pass packets 
to the kernel. 4.4 Using "ethtool -S IFACE" (eg "ethtool -S eth4") will show 
you loads of stats.  These values all come from counters on the NIC.  Two 
interesting ones are called something like "rx_dropped" and "rx_fifo_errors".  
A non-zero rx_fifo_errors value means that the kernel was not keeping up with 
the packet rate for long enough that the NIC/kernel buffers filled up and 
packets had to be dropped. 4.5 If you're using a lower-level kernel bypass 
approach (e.g. IBVerbs or DPDK), then you may have to dig a little harder to 
find the packet drop counters as th kernel is no longer involved and all the 
previously mentioned counters will be useless (with the possible exception of 
the NIC counters). 4.6 You may be able to login to and query your switch for 
interface statistics.  That can show various data and packet rates as well as 
bytes sent, packets sent, and some various error counters. One thing to 
remember about buffer sizes is that if your average processing rate isn't 
keeping up with the data rate, larger buffers won't solve your problem.  Larger 
buffers will only allow the system to withstand slightly longer temporary lulls 
in throughput ("hiccups") if the overall throughput of the system (including 
the lulls/hiccups) is as fast or (ideally) faster than the incoming data rate. 
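
A minimal sketch of point 2 (and the same-socket advice in point 1), not from 
the original message: set CPU affinity before touching any large buffers, so 
first-touch page allocation lands on the NUMA node local to the NIC. The core 
numbers and buffer size are placeholders.

#!/usr/bin/env python3
"""Set CPU affinity *before* allocating the big receive buffers (sketch)."""
import os

LOCAL_CORES = {2, 3, 4, 5}        # placeholder: cores local to the NIC's socket
BUFFER_BYTES = 256 * 1024 * 1024  # placeholder buffer size

# Restrict this process (and any threads it spawns later) to the chosen cores.
os.sched_setaffinity(0, LOCAL_CORES)

# Allocate and touch the buffer only after affinity is set, so its pages are
# placed on the local NUMA node.
ring_buffer = bytearray(BUFFER_BYTES)
for off in range(0, BUFFER_BYTES, 4096):
    ring_buffer[off] = 0

print(f"running on cores {sorted(os.sched_getaffinity(0))}, "
      f"buffer of {BUFFER_BYTES // 2**20} MiB allocated")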
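
A sketch of the receive-side accounting described above (port, packet size, and 
expected packet rate are made-up values for a hypothetical UDP stream): count 
packets per interval and compare against what the sender should have produced.

#!/usr/bin/env python3
"""Count received packets per interval and compare with the expected rate (sketch)."""
import socket
import time

PORT = 12345            # placeholder UDP port
PKT_BYTES = 8192        # placeholder payload size
EXPECTED_PPS = 100_000  # placeholder expected packets per second
INTERVAL = 1.0          # seconds between reports

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", PORT))

count = 0
t0 = time.monotonic()
while True:
    sock.recv(PKT_BYTES)
    count += 1
    now = time.monotonic()
    if now - t0 >= INTERVAL:
        expected = EXPECTED_PPS * (now - t0)
        missed = expected - count
        print(f"{count} packets in {now - t0:.2f} s, "
              f"~{missed:.0f} missed vs. expected")
        count = 0
        t0 = now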
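
For 4.1, a sketch of reading the per-socket drop counter from /proc/net/udp 
(the port is a placeholder; the drops value is the last field of the socket's 
line and counts packets the kernel could not queue because the socket's receive 
buffer was full):

#!/usr/bin/env python3
"""Read the kernel's drop counter for a UDP port from /proc/net/udp (sketch)."""
PORT = 12345  # placeholder UDP port

with open("/proc/net/udp") as f:
    next(f)  # skip the header line
    for line in f:
        fields = line.split()
        # fields[1] is "local_address:port", both parts in hex.
        local_port = int(fields[1].split(":")[1], 16)
        if local_port == PORT:
            drops = int(fields[-1])
            print(f"port {PORT}: {drops} packets dropped by the kernel")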
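
For 4.2, the detail Dave says he forgets is, I believe, the PACKET_STATISTICS 
socket option on AF_PACKET sockets; a sketch follows (the constants are copied 
from the Linux headers, the interface name is a placeholder, and the script 
needs root or CAP_NET_RAW):

#!/usr/bin/env python3
"""Query kernel receive/drop counters for an AF_PACKET socket (sketch)."""
import socket
import struct

SOL_PACKET = 263       # from <linux/socket.h>; not exported by Python's socket module
PACKET_STATISTICS = 6  # from <linux/if_packet.h>
ETH_P_ALL = 0x0003
IFACE = "eth4"         # placeholder interface name

sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
sock.bind((IFACE, 0))

# ... receive packets for a while with sock.recv() ...

# struct tpacket_stats { unsigned int tp_packets; unsigned int tp_drops; };
# Note: the kernel resets these counters each time they are read.
raw = sock.getsockopt(SOL_PACKET, PACKET_STATISTICS, 8)
tp_packets, tp_drops = struct.unpack("II", raw)
print(f"{IFACE}: kernel handled {tp_packets} packets, dropped {tp_drops}")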
Hope this helps,
Dave

On Sep 9, 2020, at 22:15, Hariharan Krishnan <[email protected]> wrote:

Hello Everyone,

I'm trying to tune the NIC on a server running Ubuntu 18.04 to listen to a 
multicast network and optimize it for throughput through IRQ affinity binding. 
It is a Mellanox card and I tried using "mlnx_tune" for this, but haven't been 
successful. I would really appreciate any help in this regard. Looking forward 
to responses from the group. Thank you.

Regards,
Hari