Hi, all

I am using Linux kernel 3.4.34. Many tests involving network operations, such as MTU changes, NIC up/down, and multi-queue creation, are running on this Linux host. The host runs on vSphere and has 5 NICs, all of them e1000 (Intel Corporation 82545EM Gigabit Ethernet Controller (Copper) (rev 01), PCI ID 8086:100f). The e1000 driver version is 7.3.21-k8-NAPI.
Before the issue occurs, many "Reset adapter" messages are always printed, such as:
/************************/
e1000 0000:02:01.0: eth1: Reset adapter
/************************/

When this problem happened, the following messages appeared.

/*****************************************************/
Jul 6 19:08:28 localhost kernel: e1000 0000:02:08.0: eth7: e1000_reinit_safe set __E1000_RESETTING
Jul 6 19:08:28 localhost kernel: e1000 0000:02:08.0: eth7: e1000_reinit_safe take adapter's mutex
Jul 6 19:08:28 localhost kernel: e1000 0000:02:08.0: eth7: e1000_watchdog take adapter's mutex
Jul 6 19:08:28 localhost kernel: e1000 0000:02:03.0: eth3: e1000_reinit_safe release adapter's mutex
Jul 6 19:08:28 localhost kernel: e1000 0000:02:03.0: eth3: e1000_reinit_safe reset __E1000_RESETTING
/*****************************************************/

I analyzed the source code. There is a time window between setting __E1000_RESETTING and setting __E1000_DOWN.

After e1000_reinit_safe sets __E1000_RESETTING and takes the adapter's mutex, but before it sets __E1000_DOWN, e1000_watchdog is scheduled and takes the adapter's mutex. e1000_reinit_safe then shuts down the NIC while e1000_watchdog is still being processed, and the e1000 NIC hangs.

My solution is to prevent e1000_watchdog from being scheduled in this window between __E1000_RESETTING and __E1000_DOWN.
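
To make the window concrete, here is a minimal userspace sketch of the two checks. It is not driver code: model_adapter, MODEL_RESETTING, MODEL_DOWN, model_test_bit, model_set_bit and the two watchdog_may_run_* helpers are hypothetical names invented for illustration, and the bit helpers are plain single-threaded stand-ins for the kernel's atomic test_bit()/set_bit().

```c
#include <stdbool.h>

/* Hypothetical userspace stand-ins for the driver's state bits. */
enum model_state_bit {
	MODEL_RESETTING = 0,	/* stands in for __E1000_RESETTING */
	MODEL_DOWN = 1		/* stands in for __E1000_DOWN */
};

/* Hypothetical, reduced model of the adapter: only the flags word. */
struct model_adapter {
	unsigned long flags;
};

/* Non-atomic model of test_bit(); good enough for a single thread. */
static inline bool model_test_bit(int bit, const unsigned long *flags)
{
	return (*flags >> bit) & 1UL;
}

/* Non-atomic model of set_bit(). */
static inline void model_set_bit(int bit, unsigned long *flags)
{
	*flags |= 1UL << bit;
}

/* Check as it is today: only the DOWN bit stops the watchdog. */
static inline bool watchdog_may_run_old(const struct model_adapter *a)
{
	return !model_test_bit(MODEL_DOWN, &a->flags);
}

/* Proposed check: the RESETTING bit stops the watchdog as well. */
static inline bool watchdog_may_run_new(const struct model_adapter *a)
{
	return !model_test_bit(MODEL_DOWN, &a->flags) &&
	       !model_test_bit(MODEL_RESETTING, &a->flags);
}
```

With only MODEL_RESETTING set, which models the window described above, watchdog_may_run_old() still returns true while watchdog_may_run_new() returns false. Once MODEL_DOWN is set, both return false.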

Is there anything wrong with this solution?

Best Regards!
zhuyj

On 08/15/2013 03:09 PM, zhuyj wrote:
Hi, maintainer

Would you like to comment on this patch?
Thanks a lot.

Best Regards!
Zhu Yanjun

On 08/15/2013 03:01 PM, zhuyj wrote:
Hi,

After networking test cases have been running for a long time, the e1000 NIC driver may stop working. At that point the system itself is still okay: we can execute non-network commands (such as ls, cp, etc.), but if we execute a network command (ifconfig), the system hangs there and no longer responds. We added some logging to the driver and found that this was caused by nested mutex acquisition. Normally, one mutex is taken and then released before another mutex is taken. When the issue occurs, the log shows that the first mutex was taken and not released before a mutex was taken again:

/*****************************************************/
Jul 6 19:08:28 localhost kernel: e1000 0000:02:08.0: eth7: e1000_reinit_safe set __E1000_RESETTING
Jul 6 19:08:28 localhost kernel: e1000 0000:02:08.0: eth7: e1000_reinit_safe take adapter's mutex
Jul 6 19:08:28 localhost kernel: e1000 0000:02:08.0: eth7: e1000_watchdog take adapter's mutex
Jul 6 19:08:28 localhost kernel: e1000 0000:02:03.0: eth3: e1000_reinit_safe release adapter's mutex
Jul 6 19:08:28 localhost kernel: e1000 0000:02:03.0: eth3: e1000_reinit_safe reset __E1000_RESETTING
/*****************************************************/

We made and applied the following patch, and the problem disappeared.
Please comment on this patch.
Thanks a lot.

/***********************************************/
diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 7569ebb..2878308 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -2441,7 +2441,8 @@ static void e1000_watchdog(struct work_struct *work)
 	struct e1000_tx_ring *txdr = adapter->tx_ring;
 	u32 link, tctl;

-	if (test_bit(__E1000_DOWN, &adapter->flags))
+	if (test_bit(__E1000_DOWN, &adapter->flags) ||
+	    test_bit(__E1000_RESETTING, &adapter->flags))
 		return;

/***********************************************/

zhuyj


_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel