[E1000-devel] Year End Hampers

2011-11-18 Thread info

 OUR NEW WEBSITE: WWW.XTRUZION.CO.ZA
 CLICK THE LINK BELOW FOR PRICES!
 
http://127.0.0.1:29663/BrowseCategory.aspx?categoryGuid=76761c03-6492-4841-ae00-74b97c24c140&name=Hampers
 
--
All the data continuously generated in your IT infrastructure 
contains a definitive record of customers, application performance, 
security threats, fraudulent activity, and more. Splunk takes this 
data and makes sense of it. IT sense. And common sense.
http://p.sf.net/sfu/splunk-novd2d
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit 
http://communities.intel.com/community/wired


Re: [E1000-devel] [BUG] e1000: possible deadlock scenario caught by lockdep

2011-11-18 Thread Steven Rostedt
On Fri, 2011-11-18 at 15:17 -0800, Jesse Brandeburg wrote:

> this is a proposed patch to fix the issue:
> if it works for you please let me know and I will submit it officially
> through our process

Well, the one time I tested it, it didn't crash and it didn't give a
lockdep splat either. I'll add it and reboot it a few more times and
I'll holler if something bad happens.

Tested-by: Steven Rostedt 

-- Steve





Re: [E1000-devel] Issue with (Intel 82576 Gigabit - Quad Port Server Adapter, ET2

2011-11-18 Thread Wyborny, Carolyn


>-----Original Message-----
>From: John Haechten [mailto:jhaech...@crossroads.com]
>Sent: Friday, November 18, 2011 3:23 PM
>To: e1000-devel@lists.sourceforge.net
>Subject: [E1000-devel] Issue with (Intel 82576 Gigabit - Quad Port
>Server Adapter, ET2
>
>I have run into an issue with an Intel 4-port Ethernet adapter card
>where ALL the Ethernet ports on a particular card appear to get overrun
>with errors.  All 4 ports hang even though only 1 port was doing
>active data transfer.
>In my configuration, I have 2 Ethernet ports on the motherboard, plus
>the following PCI-Express cards:
>- PCI-Express card with 10Gb x 2 Port  (Intel 82599EB 10-Gigabit)
>- PCI-Express card with 1Gb x 4 Port  (Intel 82576 Gigabit - Quad
>Port
>Server Adapter, Gigabit ET2 Intel no. E1G44ET2)
>eth0,1 are 1Gb ports on the motherboard
>eth2,3 are 10Gb ports
>eth4,5,6,7 are 1Gb ports
>Ethernet ports 4,5,6,7 all show about the same symptoms at the same
>time, so it looks like the board itself has an issue, or possibly its
>firmware is incompatible.
>A sample of the errors:
>  RX packets:43568 errors:16922171142300
>dropped:4230542785575 overruns:4230546819150
>frame:16922171142300
>
>  TX packets:5079 errors:8461085571150 dropped:0 overruns:0
>carrier:8461085571150
>
>My client is able to mount across all the Ethernet ports.  I create
>individual mounts for each port.
>I am able to do directory listings, etc. on the client and see the
>files, etc. across each mount.
>However, when I start transferring data over a single 1Gb port by
>copying the data or using "dd", all 4 of the 1Gb Ethernet ports stop
>working and appear to hang.
>I am able to use the 2 x 10Gb Ports, so I think this is an issue with
>this particular adapter card.
>
>When I investigate the port, all the error counts are huge, even for
>ports that were not actively involved in the data transfer.
>
>However, this only appears to be true for the ports on that PCI-Express
>board, not the motherboard, which has the same chipset.
>
>What can cause this type of problem?
>
>I am running CentOS 6.0
>
>OS distribution:   CentOS Linux release 6.0 (Final)
>
>Linux kernel:  2.6.32-71.29.1.el6.x86_64
>
>NOTE: I was running this configuration with CentOS 5.5, Linux kernel
>2.6.35, and did not have this issue.  I then upgraded to CentOS 6.0.
>I can transfer data across the 10Gb ports just fine.  I was able to use
>the 2 x 10Gb ports and the 4 x 1Gb ports simultaneously with CentOS 5.5.
>Is there a compatibility issue because the 10Gb driver uses NAPI and
>the 1Gb driver does not?
>I have included the Ethernet Driver info for the 1Gb and the 10Gb
>drivers.
>Also, I am trying to maximize my performance over the 1Gb and 10Gb
>ports, once they are all working.  Are there driver parameters that I
>need to reconfigure to achieve this?
>
>-bash-4.1# lspci | grep Ethernet
>
>01:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network
>Connection (rev 01)
>
>01:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network
>Connection (rev 01)
>
>06:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit
>SFI/SFP+ Network Connection (rev 01)
>
>06:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit
>SFI/SFP+ Network Connection (rev 01)
>
>09:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network
>Connection (rev ff)
>
>09:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network
>Connection (rev ff)
>
>0a:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network
>Connection (rev ff)
>
>0a:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network
>Connection (rev ff)
>
>-bash-4.1# ifconfig
>eth0  Link encap:Ethernet  HWaddr 00:25:90:35:0B:30 inet
>addr:192.168.40.233  Bcast:192.168.40.255
>Mask:255.255.255.0
>  inet6 addr: fe80::225:90ff:fe35:b30/64 Scope:Link
>  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>  RX packets:18796 errors:0 dropped:0 overruns:0 frame:0
>  TX packets:12948 errors:0 dropped:0 overruns:0 carrier:0
>  collisions:0 txqueuelen:1000
>  RX bytes:2598944 (2.4 MiB)  TX bytes:3662092 (3.4 MiB)
>
>eth2  Link encap:Ethernet  HWaddr 00:1B:21:D4:63:DC
>  inet addr:192.168.31.110  Bcast:192.168.31.255
>Mask:255.255.255.0
>  inet6 addr: fe80::21b:21ff:fed4:63dc/64 Scope:Link
>  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>  RX packets:9760518 errors:8 dropped:0 overruns:0 frame:8
>  TX packets:27964383 errors:0 dropped:0 overruns:0 carrier:0
>  collisions:0 txqueuelen:1000
>  RX bytes:47591717266 (44.3 GiB)  TX bytes:36305129570
>(33.8GiB)
>
>eth3  Link encap:Ethernet  HWaddr 00:1B:21:D4:63:DD
>  inet addr:192.168.32.110  Bcast:192.168.32.255
>Mask:255.255.255.0
>  inet6 addr: fe80::21b:21ff:fed4:63dd/64 Scope:Link
>  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>  RX packets:6269483 errors:29 dropped:0

[E1000-devel] Issue with (Intel 82576 Gigabit - Quad Port Server Adapter, ET2

2011-11-18 Thread John Haechten
I have run into an issue with an Intel 4-port Ethernet adapter card where ALL 
the Ethernet ports on a particular card appear to get overrun with
errors.  All 4 ports hang even though only 1 port was doing active 
data transfer.
In my configuration, I have 2 Ethernet ports on the motherboard, plus the 
following PCI-Express cards:
- PCI-Express card with 10Gb x 2 Port  (Intel 82599EB 10-Gigabit)
- PCI-Express card with 1Gb x 4 Port  (Intel 82576 Gigabit - Quad Port
Server Adapter, Gigabit ET2 Intel no. E1G44ET2)
eth0,1 are 1Gb ports on the motherboard
eth2,3 are 10Gb ports
eth4,5,6,7 are 1Gb ports
Ethernet ports 4,5,6,7 all show about the same symptoms at the same time, so it 
looks like the board itself has an issue, or possibly its firmware is 
incompatible.
A sample of the errors:
  RX packets:43568 errors:16922171142300
dropped:4230542785575 overruns:4230546819150
frame:16922171142300

  TX packets:5079 errors:8461085571150 dropped:0 overruns:0
carrier:8461085571150

My client is able to mount across all the Ethernet ports.  I create individual 
mounts for each port.
I am able to do directory listings, etc. on the client and see the files, etc. 
across each mount.
However, when I start transferring data over a single 1Gb port by copying the 
data or using "dd", all 4 of the 1Gb Ethernet ports stop working and appear 
to hang.
I am able to use the 2 x 10Gb Ports, so I think this is an issue with this 
particular adapter card.

When I investigate the port, all the error counts are huge, even for ports that 
were not actively involved in the data transfer.

However, this only appears to be true for the ports on that PCI-Express board, 
not the motherboard, which has the same chipset.

What can cause this type of problem?

I am running CentOS 6.0

OS distribution:   CentOS Linux release 6.0 (Final)

Linux kernel:  2.6.32-71.29.1.el6.x86_64

NOTE: I was running this configuration with CentOS 5.5, Linux kernel 2.6.35, 
and did not have this issue.  I then upgraded to CentOS 6.0.
I can transfer data across the 10Gb ports just fine.  I was able to use the 2 x 
10Gb ports and the 4 x 1Gb ports simultaneously with CentOS 5.5.
Is there a compatibility issue because the 10Gb driver uses NAPI and the 
1Gb driver does not?
I have included the Ethernet Driver info for the 1Gb and the 10Gb drivers.
Also, I am trying to maximize my performance over the 1Gb and 10Gb ports, once 
they are all working.  Are there driver parameters that I need to reconfigure 
to achieve this?

-bash-4.1# lspci | grep Ethernet

01:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection 
(rev 01)

01:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection 
(rev 01)

06:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ 
Network Connection (rev 01)

06:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ 
Network Connection (rev 01)

09:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection 
(rev ff)

09:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection 
(rev ff)

0a:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection 
(rev ff)

0a:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection 
(rev ff)

-bash-4.1# ifconfig
eth0  Link encap:Ethernet  HWaddr 00:25:90:35:0B:30 inet 
addr:192.168.40.233  Bcast:192.168.40.255
Mask:255.255.255.0
  inet6 addr: fe80::225:90ff:fe35:b30/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:18796 errors:0 dropped:0 overruns:0 frame:0
  TX packets:12948 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:2598944 (2.4 MiB)  TX bytes:3662092 (3.4 MiB)

eth2  Link encap:Ethernet  HWaddr 00:1B:21:D4:63:DC
  inet addr:192.168.31.110  Bcast:192.168.31.255
Mask:255.255.255.0
  inet6 addr: fe80::21b:21ff:fed4:63dc/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:9760518 errors:8 dropped:0 overruns:0 frame:8
  TX packets:27964383 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:47591717266 (44.3 GiB)  TX bytes:36305129570 (33.8GiB)

eth3  Link encap:Ethernet  HWaddr 00:1B:21:D4:63:DD
  inet addr:192.168.32.110  Bcast:192.168.32.255
Mask:255.255.255.0
  inet6 addr: fe80::21b:21ff:fed4:63dd/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:6269483 errors:29 dropped:0 overruns:0 frame:29
  TX packets:14876350 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:32168238434 (29.9 GiB)  TX bytes:18210194044 (16.9GiB)

eth4  Link encap:Ethernet  HWaddr 00:1B:21:BA:F6:B8
  inet addr:192.168.11.60  Bcast:192.168.11.255
Mask:255.255.255.0
  

Re: [E1000-devel] [BUG] e1000: possible deadlock scenario caught by lockdep

2011-11-18 Thread Jesse Brandeburg
On Fri, 18 Nov 2011 08:57:37 -0800
Jesse Brandeburg  wrote:

> CC'd netdev, and e1000-devel
> 
> On Thu, 17 Nov 2011 17:27:00 -0800
> Steven Rostedt  wrote:
> > Here you see that we are calling 
> > cancel_delayed_work_sync(&adapter->watchdog_task);
> > 
> > The problem is that adapter->watchdog_task grabs the mutex &adapter->mutex.
> > 
> > If the work has started and it blocked on that mutex, the
> > cancel_delayed_work_sync() will block indefinitely and we have a
> > deadlock.
> > 
> > Not sure what's the best way around this. Can we call e1000_down()
> > without grabbing the adapter->mutex?
> 
> Thanks for the report, I'll look at it today and see if I can work out
> a way to avoid the bonk.

this is a proposed patch to fix the issue:
if it works for you please let me know and I will submit it officially
through our process

e1000: fix lockdep splat in shutdown handler

From: Jesse Brandeburg 

As reported by Steven Rostedt, e1000 gained a lockdep splat during
the recent merge window.  The issue is that cancel_delayed_work is
called while holding our private mutex.

There is no reason that I can see to hold the mutex during PCI
shutdown; it was mostly paranoia that made me put the mutex_lock
around the call to e1000_down.

In a quick survey, lots of drivers handle locking differently when
being called by the PCI layer.  The assumption here is that we
don't need the mutex's protection in this function, because the
driver cannot be unloaded while in the shutdown handler, which is
only called at reboot or poweroff.

Reported-by: Steven Rostedt 
Signed-off-by: Jesse Brandeburg 
---

 drivers/net/ethernet/intel/e1000/e1000_main.c |8 +---
 1 files changed, 1 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index cf480b5..97b46ba 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -4716,8 +4716,6 @@ static int __e1000_shutdown(struct pci_dev *pdev, bool *enable_wake)
 
netif_device_detach(netdev);
 
-   mutex_lock(&adapter->mutex);
-
if (netif_running(netdev)) {
WARN_ON(test_bit(__E1000_RESETTING, &adapter->flags));
e1000_down(adapter);
@@ -4725,10 +4723,8 @@ static int __e1000_shutdown(struct pci_dev *pdev, bool *enable_wake)
 
 #ifdef CONFIG_PM
retval = pci_save_state(pdev);
-   if (retval) {
-   mutex_unlock(&adapter->mutex);
+   if (retval)
return retval;
-   }
 #endif
 
status = er32(STATUS);
@@ -4783,8 +4779,6 @@ static int __e1000_shutdown(struct pci_dev *pdev, bool *enable_wake)
if (netif_running(netdev))
e1000_free_irq(adapter);
 
-   mutex_unlock(&adapter->mutex);
-
pci_disable_device(pdev);
 
return 0;



Re: [E1000-devel] [PATCH] e1000e : Avoid wrong check on TX hang

2011-11-18 Thread Flavio Leitner
On Fri, 18 Nov 2011 08:53:00 -0800
Jesse Brandeburg  wrote:

> On Thu, 17 Nov 2011 17:46:46 -0800
> Flavio Leitner  wrote:
> 
> > On Fri, 18 Nov 2011 09:37:12 +0800
> > Michael Wang  wrote:
> > 
> > > From: Michael Wang 
> > > 
> > > Descriptors may not have been written back yet when checking for
> > > a TX hang with the FLAG2_DMA_BURST flag on.
> > > So when we detect a hang, we just flush the descriptors and check
> > > again once.
> > > 
> > > Signed-off-by: Michael Wang 
> > 
> > Jesse,
> > This is tested and fixes the issue I've reported in the other
> > thread: [E1000-devel] 82571EB: Detected Hardware Unit Hang
> > http://sourceforge.net/mailarchive/forum.php?thread_name=20111014140426.3d576173%40asterix.rh&forum_name=e1000-devel
> > 
> > Signed-off-by: Flavio Leitner 
> 
> Flavio/Michael, thanks for working on this. The patch itself seems
> okay, but it does increase the time to detect a TX hang, doesn't it?

Yes, that is correct. The specific models with the FLAG2_DMA_BURST
flag will wait another watchdog round for the detection.  We thought
about scheduling the watchdog again after a short delay to reduce
the impact, but we don't know how much time the write-back takes to
finish. There is the interrupt path, but then the fix would be
rather large and complex, I think.

> I'm okay with the patch functionality because you're implementing
> (effectively, if not a little indirectly) the fix our hardware
> engineer suggested which was two writes to the FPD bit.

Yes, we found out about the FPD register, and I wrote a patch that
created a workqueue periodically writing the FPD register. It fixed
the issue as well. It was more of an experiment to confirm the root
cause without touching the driver's work flow.

The proposed patch avoids the write in the normal case to keep
performance up. Otherwise, if we wrote to FPD each time before the
watchdog runs, we could have up to 4 descriptors being transferred
for no good reason.

> We can test the patch in our lab here, Jeff Kirsher will push it
> upstream when it completes testing.

Ok, sounds like a plan to me.
Thanks for reviewing it.
fbl



Re: [E1000-devel] [BUG] e1000: possible deadlock scenario caught by lockdep

2011-11-18 Thread Jesse Brandeburg
CC'd netdev, and e1000-devel

On Thu, 17 Nov 2011 17:27:00 -0800
Steven Rostedt  wrote:

> I hit the following lockdep splat:
> 
> ==
> [ INFO: possible circular locking dependency detected ]
> 3.2.0-rc2-test+ #14
> ---
> reboot/2316 is trying to acquire lock:
>  ((&(&adapter->watchdog_task)->work)){+.+...}, at: [] 
> wait_on_work+0x0/0xac
> 
> but task is already holding lock:
>  (&adapter->mutex){+.+...}, at: [] 
> __e1000_shutdown+0x56/0x1f5
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (&adapter->mutex){+.+...}:
>[] lock_acquire+0x103/0x158
>[] __mutex_lock_common+0x6a/0x441
>[] mutex_lock_nested+0x1b/0x1d
>[] e1000_watchdog+0x56/0x4a4
>[] process_one_work+0x1ef/0x3e0
>[] worker_thread+0xda/0x15e
>[] kthread+0x9f/0xa7
>[] kernel_thread_helper+0x4/0x10
> 
> -> #0 ((&(&adapter->watchdog_task)->work)){+.+...}:
>[] __lock_acquire+0xa29/0xd06
>[] lock_acquire+0x103/0x158
>[] wait_on_work+0x3d/0xac
>[] __cancel_work_timer+0xb9/0xff
>[] cancel_delayed_work_sync+0x12/0x14
>[] e1000_down_and_stop+0x2e/0x4a
>[] e1000_down+0x116/0x176
>[] __e1000_shutdown+0x83/0x1f5
>[] e1000_shutdown+0x1a/0x43
>[] pci_device_shutdown+0x29/0x3d
>[] device_shutdown+0xbe/0xf9
>[] kernel_restart_prepare+0x31/0x38
>[] kernel_restart+0x14/0x51
>[] sys_reboot+0x157/0x1b0
>[] system_call_fastpath+0x16/0x1b
> 
> other info that might help us debug this:
> 
>  Possible unsafe locking scenario:
> 
>CPU0CPU1
>
>   lock(&adapter->mutex);
>lock((&(&adapter->watchdog_task)->work));
>lock(&adapter->mutex);
>   lock((&(&adapter->watchdog_task)->work));
> 
>  *** DEADLOCK ***
> 
> 2 locks held by reboot/2316:
>  #0:  (reboot_mutex){+.+.+.}, at: [] sys_reboot+0x9f/0x1b0
>  #1:  (&adapter->mutex){+.+...}, at: [] 
> __e1000_shutdown+0x56/0x1f5
> 
> stack backtrace:
> Pid: 2316, comm: reboot Not tainted 3.2.0-rc2-test+ #14
> Call Trace:
>  [] print_circular_bug+0x1f8/0x209
>  [] __lock_acquire+0xa29/0xd06
>  [] ? wait_on_cpu_work+0x94/0x94
>  [] lock_acquire+0x103/0x158
>  [] ? wait_on_cpu_work+0x94/0x94
>  [] ? trace_preempt_on+0x2a/0x2f
>  [] wait_on_work+0x3d/0xac
>  [] ? wait_on_cpu_work+0x94/0x94
>  [] __cancel_work_timer+0xb9/0xff
>  [] cancel_delayed_work_sync+0x12/0x14
>  [] e1000_down_and_stop+0x2e/0x4a
>  [] e1000_down+0x116/0x176
>  [] __e1000_shutdown+0x83/0x1f5
>  [] ? _raw_spin_unlock+0x33/0x56
>  [] ? device_shutdown+0x40/0xf9
>  [] e1000_shutdown+0x1a/0x43
>  [] ? sub_preempt_count+0xa1/0xb4
>  [] pci_device_shutdown+0x29/0x3d
>  [] device_shutdown+0xbe/0xf9
>  [] kernel_restart_prepare+0x31/0x38
>  [] kernel_restart+0x14/0x51
>  [] sys_reboot+0x157/0x1b0
>  [] ? hrtimer_cancel+0x17/0x24
>  [] ? do_nanosleep+0x74/0xac
>  [] ? trace_hardirqs_off_thunk+0x3a/0x3c
>  [] ? error_sti+0x5/0x6
>  [] ? time_hardirqs_off+0x2a/0x2f
>  [] ? trace_hardirqs_on_thunk+0x3a/0x3f
>  [] ? retint_swapgs+0x13/0x1b
>  [] ? retint_swapgs+0x13/0x1b
>  [] ? trace_hardirqs_on_caller+0x12d/0x164
>  [] ? audit_syscall_entry+0x11c/0x148
>  [] ? trace_hardirqs_on_thunk+0x3a/0x3f
>  [] system_call_fastpath+0x16/0x1b
> 
> 
> The issue comes from two recent commits:
> 
> commit a4010afef585b7142eb605e3a6e4210c0e1b2957
> Author: Jesse Brandeburg 
> Date:   Wed Oct 5 07:24:41 2011 +
> e1000: convert hardware management from timers to threads
> 
> and
> 
> commit 0ef4eedc2e98edd51cd106e1f6a27178622b7e57
> Author: Jesse Brandeburg 
> Date:   Wed Oct 5 07:24:51 2011 +
> e1000: convert to private mutex from rtnl
> 
> 
> What we have is on __e1000_shutdown():
> 
>   mutex_lock(&adapter->mutex);
> 
>   if (netif_running(netdev)) {
>   WARN_ON(test_bit(__E1000_RESETTING, &adapter->flags));
>   e1000_down(adapter);
>   }
> 
> but e1000_down() calls: e1000_down_and_stop():
> 
> static void e1000_down_and_stop(struct e1000_adapter *adapter)
> {
>   set_bit(__E1000_DOWN, &adapter->flags);
>   cancel_work_sync(&adapter->reset_task);
>   cancel_delayed_work_sync(&adapter->watchdog_task);
>   cancel_delayed_work_sync(&adapter->phy_info_task);
>   cancel_delayed_work_sync(&adapter->fifo_stall_task);
> }
> 
> 
> Here you see that we are calling 
> cancel_delayed_work_sync(&adapter->watchdog_task);
> 
> The problem is that adapter->watchdog_task grabs the mutex &adapter->mutex.
> 
> If the work has started and it blocked on that mutex, the
> cancel_delayed_work_sync() will block indefinitely and we have a
> deadlock.
> 
> Not sure what's the best way around this. Can we call e1000_down()
> without grabbing the adapter->mutex?

Thanks for the

Re: [E1000-devel] [PATCH] e1000e : Avoid wrong check on TX hang

2011-11-18 Thread Jesse Brandeburg
On Thu, 17 Nov 2011 17:46:46 -0800
Flavio Leitner  wrote:

> On Fri, 18 Nov 2011 09:37:12 +0800
> Michael Wang  wrote:
> 
> > From: Michael Wang 
> > 
> > Descriptors may not have been written back yet when checking for a
> > TX hang with the FLAG2_DMA_BURST flag on.
> > So when we detect a hang, we just flush the descriptors and check
> > again once.
> > 
> > Signed-off-by: Michael Wang 
> 
> Jesse,
> This is tested and fixes the issue I've reported in the other thread:
> [E1000-devel] 82571EB: Detected Hardware Unit Hang
> http://sourceforge.net/mailarchive/forum.php?thread_name=20111014140426.3d576173%40asterix.rh&forum_name=e1000-devel
> 
> Signed-off-by: Flavio Leitner 

Flavio/Michael, thanks for working on this. The patch itself seems
okay, but it does increase the time to detect a TX hang, doesn't it?

I'm okay with the patch functionality because you're implementing
(effectively, if not a little indirectly) the fix our hardware engineer
suggested which was two writes to the FPD bit.

We can test the patch in our lab here, Jeff Kirsher will push it
upstream when it completes testing.

Reviewed-by: Jesse Brandeburg 
