Hi Carolyn,

Really appreciated you coming back to us. I honestly can't remember what
led us to find the cause in the end, but your suggestions were informative.

Ultimately, we traced the cause to interactions with the conntrack table.
It seems some bright spark had set up monitoring to count the entries in
/proc/net/ip_conntrack, rather than using the provided count variable. This
meant that as connections increased past the 30,000 mark, the count took
longer and longer. This was locking something within the stack, probably
the conntrack table itself, in the process and preventing the interfaces
from responding correctly. Since that was corrected, we've been able to go
back to getting good performance out of the NICs.
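
For anyone who finds this thread later, here is a minimal Python sketch of
the difference between the two approaches. It is an illustration, not the
monitoring script we actually ran. The counter paths are assumptions (which
one exists depends on the kernel version and loaded modules), and the
function names are made up for this example.

```python
from pathlib import Path

# Assumed locations of the kernel-maintained counter. Which of these exists
# depends on the kernel version and on which conntrack modules are loaded.
COUNT_PATHS = (
    "/proc/sys/net/netfilter/nf_conntrack_count",
    "/proc/sys/net/ipv4/netfilter/ip_conntrack_count",
)


def read_conntrack_count(paths=COUNT_PATHS):
    """Return the count from the first readable counter file, or None."""
    for path in paths:
        try:
            # One small read: the kernel already maintains this number.
            return int(Path(path).read_text().strip())
        except (OSError, ValueError):
            continue
    return None


def count_table_lines(table_path):
    """The slow approach: read the whole table and count its lines.

    The cost grows with the number of tracked connections.
    """
    with open(table_path) as table:
        return sum(1 for _ in table)


if __name__ == "__main__":
    print(read_conntrack_count())
```

The first function reads a single integer regardless of table size, whereas
the second has to read every entry each time the monitor polls.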

However, one piece of feedback: the adapter resets that were triggered
during these events had a significantly detrimental effect on throughput
(as you would expect). What is interesting is that a Broadcom-based NIC
brought in to provide alternative test data did not trigger resets under
the same conditions. It subsequently performed much better, although still
with an exponential drop-off as connections increased much higher.

Anyway, problem solved. Put it down to human error, albeit in a
quite humorous way.

Regards,

Simon

On 20 September 2012 00:00, Wyborny, Carolyn <[email protected]> wrote:

> > -----Original Message-----
> > From: Simon Utting [mailto:[email protected]]
> > Sent: Thursday, September 13, 2012 2:39 PM
> > To: [email protected]
> > Subject: [E1000-devel] Adapter reset on 82576 and 82580 bonded pair
> >
> > Hi,
> >
> > Apologies if this is missing any information, I will try to be as
> > thorough as possible. We have hit a wall and are looking for guidance
> > in continuing troubleshooting, because the driver seems to be resetting
> > the adapter. This is speculation without a deeper understanding :-)
> >
> [...]
> > - on the majority of machines, at regular, but unpredictable, intervals
> > we see unresponsive network connectivity from the physical machines
> > (and therefore obviously the VMs they host)
> [...]
> > I appreciate that there will need to be further diagnostic work done
> > to ascertain the problem. Any guidance is appreciated.
> >
> Hello Simon,
>
> I apologize for the delay in responding. Your setup is complicated and I
> need to consult a few experts for some advice. What version of Xen are
> you running?  What Linux kernel version are you using for Dom0?  It seems
> possible that our interrupts are getting dropped somewhere along the way,
> possibly by Xen, as our drivers run in Dom0. If this happens, the driver is
> stuck until something (probably the watchdog) fires the interrupt vector
> again. Depending upon the timing, this can either result in a short pause,
> or (if the rings fill up) a spurious TX hang. In this case, it's not really
> a TX hang, but the ISR gets delayed so long it thinks the hardware is hung
> when it starts cleaning and sees how old the descriptors are.
>
> Things to try:
> - Make sure you are running latest stable Xen and Dom0, along with our
> latest driver on everything.
> - Switch to MSI or legacy interrupts
> - Could you migrate one of the machines to KVM to see if the problem goes
> away? I understand this may not be possible, but it would help eliminate
> Xen from the problem.
> - Change the watchdog timer to a much shorter interval - maybe 1/10 of a
> second or something like that. This won't eliminate the underlying problem
> but will make the delays a lot shorter and easier to overlook. If this
> appears to solve the problem, it's kind of a smoking gun that our
> interrupts are disappearing.
>
> Let me know how it goes.
>
> Thanks,
>
> Carolyn
>
> Carolyn Wyborny
> Linux Development
> LAN Access Division
> Intel Corporation
>
>
>
_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired
