Покотиленко Костик wrote:
> On Tue, 09/02/2010 at 23:34 +0200, Покотиленко Костик wrote:
>
>>>> Also if ACPI is having an effect on the issue one other thing you
>>>> might try changing in the BIOS would be to disable all CPU
>>>> C-states. The system will consume more power as a result, but the
>>>> CPU also ends up usually being much more responsive as a result,
>>>> and we have seen in the past that this can sometimes resolve
>>>> performance issues.
>>>
>>> I'll turn those off:
>>>
>>> CPU C State=1 ;Options: 1=Enabled: 0=Disabled
>>> C1E=1 ;Options: 1=Enabled: 0=Disabled
>>
>> Turned off "CPU C State" and "Spread spectrum", C1E turned off
>> automatically.
>
> With "CPU C State" and "Spread spectrum" turned off, after 47 hours I
> got:
>
> NETDEV WATCHDOG: eth1 (igb): transmit timed out
> Modules linked in: ...
> Call Trace:
> ...
>
> Let me summarize:
>
> - None of the kernel (29, 30) and driver combinations solved the
>   problem
> - None of the BIOS options helped
> - I've figured out that when TX Unit Hang occurs on the 2 configured
>   ports, the Loopback test also fails on the 2 unconfigured/unused
>   ports
> - When the NIC stops working, the rest of the system feels OK
>
> So the problem is localized a bit, but its source is not clear. Is it
> hardware related or software...
>
> Also, the system is in use by ~300 customers, so more downtime than we
> already have is not desirable.
>
> The server has 2 onboard NICs, with one of which we have had a similar
> problem, and a PCI-e Quad port NIC.
>
> We can still live with 2 NICs, so one of the options for further
> testing I see is to go back to using the onboard NICs and put the
> PCI-e Quad port NIC into another server I support and do a loopback
> (Port1<->Port2, Port3<->Port4) stress test, but there is a 2.6.26
> kernel there (changing it is not an option).
>
> Let me know what you think and what the other options for further
> testing are. I'm going to try 2.6.32 before switching the NIC to
> another server. I did not do this before because there were issues
> backporting it to Lenny.

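For a long-running loopback stress test like the one described above, it may help to count the watchdog timeouts per interface in saved kernel log output rather than watching for them by hand. The following is only a rough sketch and not part of the igb driver or any Intel tool: the function name and the sample log text are hypothetical, and the message format is assumed to match the single trace line quoted above.

```python
import re
from collections import Counter

# Assumed message format, taken from the quoted trace:
#   NETDEV WATCHDOG: eth1 (igb): transmit timed out
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): transmit timed out"
)


def count_tx_timeouts(log_text):
    """Count watchdog transmit timeouts per (interface, driver) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            counts[(match.group("iface"), match.group("driver"))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample input; real input would be saved kernel log text.
    sample = (
        "NETDEV WATCHDOG: eth1 (igb): transmit timed out\n"
        "unrelated kernel message\n"
        "NETDEV WATCHDOG: eth1 (igb): transmit timed out\n"
    )
    for (iface, driver), n in sorted(count_tx_timeouts(sample).items()):
        print(f"{iface} ({driver}): {n} transmit timeout(s)")
```

The text to scan could come from the output of dmesg saved to a file on the test machine. Comparing the counts per port pair on the two servers would show whether the timeouts follow the adapter.
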
At this point it feels like we have pretty much eliminated the drivers
as being an issue, since the unused pair of ports is affected by
whatever is causing the first pair to fail. The issue most likely
resides somewhere in the path between the adapter's on-board PCIe bridge
and the PCIe root complex on the system.

I think testing the NIC in another system would be our best option for
now. This will help to determine whether the problem is in the PCIe
bridge on the adapter or in the root complex of the server. If the issue
follows the adapter you will likely need to get it replaced, but if the
issue disappears we will need to start investigating all BIOS options on
the system related to PCIe.

Thanks,

Alex

_______________________________________________
E1000-devel mailing list
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired
