Hi,

I'm investigating a packet loss problem experienced on an Intel S5520UR
motherboard with two 82575EB ethernet controllers:

bash-4.2$ lspci | grep -i net
01:00.0 Ethernet controller: Intel Corporation 82575EB Gigabit Network 
Connection (rev 02)
01:00.1 Ethernet controller: Intel Corporation 82575EB Gigabit Network 
Connection (rev 02)

I am running Fedora 19 on this system with kernel 3.10.11-200.fc19.x86_64
and igb driver version 5.0.3-k.

This system receives lots of market data multicasts.  My application
detects packet loss by observing sequence number gaps, and it keeps a
record of the number of packets received.

I have a faster Intel S2600GZ4 system attached to the same switch that
is not experiencing packet loss, so I believe the switch is delivering
the data.

There is no indication of packet drops in the output of "ethtool -S",
nor in the statistics contained in /proc/net/dev, /proc/net/snmp,
/proc/net/softnet_stat, or /proc/net/udp.

This puzzled me.  At first, I thought the switch must be dropping the
data.  The only thing that seemed at all out of order was the time_squeeze
column in softnet_stat, which showed some squeezes occurring.  But I also
see those on the system that does not drop data.
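For reference, the squeeze figures I'm reading come from column 3 of
/proc/net/softnet_stat (values are hex, one row per CPU); a quick sketch
to decode them, assuming bash arithmetic:

```shell
# Column 3 of /proc/net/softnet_stat is time_squeeze: the number of
# times net_rx_action stopped early because it ran out of budget or
# time.  Fields are hex; each row corresponds to one CPU.
while read processed dropped squeezed rest; do
    echo "squeezed=$((0x$squeezed))"
done < /proc/net/softnet_stat
```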

When I looked more closely, I noticed the following.  On a day when my
application on the system dropping packets received 5590 fewer packets than the
faster system that is not dropping data, I observed the following:

ethtool -S snapshot before:
     rx_packets: 6718661095
     rx_queue_0_packets: 1787543493
     rx_queue_1_packets: 1571739058
     rx_queue_2_packets: 1571740779
     rx_queue_3_packets: 1787506454

ethtool -S snapshot after:
     rx_packets: 7364660195
     rx_queue_0_packets: 1959438862
     rx_queue_1_packets: 1722841174
     rx_queue_2_packets: 1722842720
     rx_queue_3_packets: 1959400538

I noticed that the rx_packets counter was incrementing by more than the
sum of the rx_queue_[[:digit:]]_packets counters.  In fact:

(rx_packets[new] - rx_packets[old]) - (sum rx_queue_*_packets[new] - sum rx_queue_*_packets[old])
  = (7364660195 - 6718661095) - (7364523294 - 6718529784)
  = 645999100 - 645993510
  = 5590

This matches precisely the difference in the number of packets received
by the application on the good system and the one dropping packets.
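For anyone who wants to check the arithmetic, here it is with the
snapshot values above hard-coded (the "ethtool -S" parsing itself is
omitted):

```shell
# Values copied from the two ethtool -S snapshots above.
rx_old=6718661095; rx_new=7364660195
q_old=$((1787543493 + 1571739058 + 1571740779 + 1787506454))  # 6718529784
q_new=$((1959438862 + 1722841174 + 1722842720 + 1959400538))  # 7364523294
echo $(( (rx_new - rx_old) - (q_new - q_old) ))  # prints 5590
```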

The rx_queue_[[:digit:]]_(drops|csum_err|alloc_failed) counters are all zero.

So I am wondering: why is this happening?  It appears as if packets are
being dropped, but why isn't this recorded in the rx_queue_[[:digit:]]_drops
counters?

After observing this, I reconfigured the CPU affinity and was able
to reduce the packet loss.  I had previously been directing all
network interrupts to CPU 0 to avoid C-state-related packet drops.
Since there are 2 NICs in question, I assigned one of them to a
different core, and that seems to help.
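For completeness, this is roughly how I'm pinning the interrupts (a
sketch only; the interface names eth0/eth1 and the mask values are
assumptions, and the actual IRQ numbers come from /proc/interrupts):

```shell
# Pin each NIC's IRQs to its own core via the smp_affinity bitmask
# (mask 1 = CPU 0, mask 2 = CPU 1).  Must be run as root, and
# irqbalance must be stopped or it will rewrite these masks.
for irq in $(awk '/eth0/ { sub(":", "", $1); print $1 }' /proc/interrupts); do
    echo 1 > /proc/irq/$irq/smp_affinity
done
for irq in $(awk '/eth1/ { sub(":", "", $1); print $1 }' /proc/interrupts); do
    echo 2 > /proc/irq/$irq/smp_affinity
done
```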

Is there a hardware or software problem here that results in the drop counters
not being incremented?  Or am I thinking about this the wrong way?

Thanks,
Andy

_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit 
http://communities.intel.com/community/wired