Hi,
I'm investigating a packet loss problem experienced on an Intel S5520UR
motherboard with two 82575EB Ethernet controllers:
bash-4.2$ lspci | grep -i net
01:00.0 Ethernet controller: Intel Corporation 82575EB Gigabit Network
Connection (rev 02)
01:00.1 Ethernet controller: Intel Corporation 82575EB Gigabit Network
Connection (rev 02)
I am running Fedora 19 on this system with kernel 3.10.11-200.fc19.x86_64
and igb driver version 5.0.3-k.
This system receives lots of market data multicasts. My application
detects packet loss by observing sequence number gaps, and it keeps a
record of the number of packets received.
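For reference, the gap accounting is essentially the following (a minimal
sketch, not my application's actual code; it assumes sequence numbers that
normally increase by exactly 1):

```shell
# Count packets lost, given one decoded sequence number per input line.
# A jump of more than 1 between consecutive numbers counts as a gap.
seq_gaps() {
    awk 'NR > 1 && $1 != prev + 1 { lost += $1 - prev - 1 }
         { prev = $1 }
         END { print lost + 0 }'
}

# Example: sequence numbers 4 and 5 are missing, so 2 packets were lost:
printf '%s\n' 1 2 3 6 7 | seq_gaps   # prints 2
```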
I have a faster Intel S2600GZ4 system attached to the same switch that
is not experiencing packet loss, so I believe the switch is delivering
the data.
There is no indication of packet drops in the output of "ethtool -S",
nor in the statistics contained in /proc/net/dev, /proc/net/snmp,
/proc/net/softnet_stat, or /proc/net/udp.
This puzzled me. At first, I thought the switch must be dropping the
data. The only thing that seemed at all out of order was the time_squeeze
column in /proc/net/softnet_stat, which showed some squeezes occurring,
but I also see those on the system that does not drop data.
When I looked more closely, I noticed the following. On a day when my
application on the system dropping packets received 5590 fewer packets than
the faster system that is not dropping data, I observed the following:
ethtool -S snapshot before:
rx_packets: 6718661095
rx_queue_0_packets: 1787543493
rx_queue_1_packets: 1571739058
rx_queue_2_packets: 1571740779
rx_queue_3_packets: 1787506454
ethtool -S snapshot after:
rx_packets: 7364660195
rx_queue_0_packets: 1959438862
rx_queue_1_packets: 1722841174
rx_queue_2_packets: 1722842720
rx_queue_3_packets: 1959400538
I noticed that the rx_packets counter was incrementing by more than the
sum of the rx_queue_[[:digit:]]_packets counters. In fact:

  rx_packets delta:            7364660195 - 6718661095 = 645999100
  sum of rx_queue_*_packets:   7364523294 - 6718529784 = 645993510
  difference:                  645999100  - 645993510  =      5590
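The bookkeeping can be checked directly with shell arithmetic (the values
are the snapshot counters above; 64-bit integer math works in any modern
shell):

```shell
# Delta of the aggregate rx_packets counter between the two snapshots:
rx_delta=$(( 7364660195 - 6718661095 ))                          # 645999100

# Deltas of the summed per-queue counters:
q_old=$(( 1787543493 + 1571739058 + 1571740779 + 1787506454 ))   # 6718529784
q_new=$(( 1959438862 + 1722841174 + 1722842720 + 1959400538 ))   # 7364523294
q_delta=$(( q_new - q_old ))                                     # 645993510

# Packets counted by the NIC but never attributed to any rx queue:
echo $(( rx_delta - q_delta ))                                   # prints 5590
```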
This matches precisely the difference in the number of packets received
by the application on the good system and the one dropping packets.
The rx_queue_[[:digit:]]_(drops|csum_err|alloc_failed) counters are all zero.
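Those counters can be filtered out of the "ethtool -S" output with a
pattern like the one below (counter names as printed by the igb driver;
the sample input lines are synthetic, for illustration only):

```shell
# Show only the per-queue drop/error counters from `ethtool -S ethN`.
queue_drop_counters() {
    grep -E 'rx_queue_[0-9]+_(drops|csum_err|alloc_failed)'
}

# Illustration on a few synthetic stat lines (only the last three match):
printf '%s\n' \
    'rx_queue_0_packets: 1787543493' \
    'rx_queue_0_drops: 0' \
    'rx_queue_0_csum_err: 0' \
    'rx_queue_0_alloc_failed: 0' | queue_drop_counters
```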
So I am wondering: why is this happening? It appears as if packets are
being dropped, but why isn't this recorded in the rx_queue_[[:digit:]]_drops
counters?
After observing this, I reconfigured the CPU affinity and was able to
reduce the packet drops. I had previously been directing all network
interrupts to cpu 0 to avoid packet drops caused by C-state transitions.
Since there are two NICs in question, I assigned one of them to a
different core, and that seems to help.
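For the record, the affinity change amounts to writing per-CPU bitmasks to
smp_affinity (the IRQ number below is a placeholder; the real ones come
from /proc/interrupts):

```shell
# Build a hex smp_affinity bitmask selecting a single CPU.
cpu_mask() { printf '%x\n' $(( 1 << $1 )); }

cpu_mask 0   # prints 1 -> keep the first NIC's interrupts on cpu 0
cpu_mask 2   # prints 4 -> move the second NIC's interrupts to cpu 2

# Applied as root, e.g. (<irq> is a placeholder -- see /proc/interrupts):
#   echo "$(cpu_mask 2)" > /proc/irq/<irq>/smp_affinity
```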
Is there a hardware or software problem here that results in the drop counters
not being incremented? Or am I thinking about this the wrong way?
Thanks,
Andy
_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired