Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2011-10-21 Thread Michael Wang
On 10/19/2011 08:16 PM, Flavio Leitner wrote:
 On Wed, 19 Oct 2011 12:49:48 +0800
 wangyun <wang...@linux.vnet.ibm.com> wrote:

 Hi, Flavio

 I am new to the community and am currently working on the e1000e driver.
 I found something strange in this issue; please check below.

 Thanks,
 Michael Wang

 On 10/18/2011 10:42 PM, Flavio Leitner wrote:
 On Mon, 17 Oct 2011 11:48:22 -0700
 Jesse Brandeburg <jesse.brandeb...@intel.com> wrote:

 On Fri, 14 Oct 2011 10:04:26 -0700
 Flavio Leitner <f...@redhat.com> wrote:

 Hi,

 I have received a few reports so far of 82571EB models hitting the
 Detected Hardware Unit Hang issue after upgrading the kernel.

 Further debugging with an instrumented kernel revealed that the
 socket buffer time stamp matches with the last time e1000_xmit_frame()
 was called. Also that the time stamp of e1000_clean_tx_irq() last run
 is prior to the one in socket buffer.

 However, ~1 second later, an interrupt is fired and the old entry
 is found. Sometimes, the scheduled print_hang_task dumps the
 information _after_ the old entry is sent (shows empty ring),
 indicating that the HW TX unit isn't really stuck and apparently
 just missed the signal to initiate the transmission.

 Order of events:
  (1) skb is pushed down
  (2) e1000_xmit_frame() is called
  (3) ring is filled with one entry
  (4) TDT is updated
  (5) nothing happens for a little more than 1 second
  (6) interrupt is fired
  (7) e1000_clean_tx_irq() is called
  (8) finds the entry not ready with an old time stamp,
      schedules print_hang_task and stops the TX queue
  (9) print_hang_task runs and dumps the info, but the old entry is now sent
  (10) apparently the TX queue is back

 Flavio, thanks for the detailed info, please be sure to supply us the
 bugzilla number.

 It was buried at the end of the first email:
 https://bugzilla.redhat.com/show_bug.cgi?id=746272

 TDH is probably not moving due to the writeback threshold settings in
 TXDCTL.  netperf UDP_RR test is likely a good way to test this.

 Yeah, makes sense. I haven't heard about new events after removing
 the flag FLAG2_DMA_BURST.  Unfortunately, I don't have access to the exact
 same hardware and I haven't reproduced the issue in-house yet with another
 82571EB. See below about interface statistics from sar.
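
A rough sketch of why a write-back threshold could make a single descriptor look stuck, as suggested above. The bit positions (PTHRESH in bits 5:0, HTHRESH in bits 13:8, WTHRESH in bits 21:16), the helper names and the example register value used below are assumptions for illustration and are not taken from this thread. The batching rule is a toy model only; real hardware has other flush conditions that it ignores.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed TXDCTL layout: PTHRESH 5:0, HTHRESH 13:8, WTHRESH 21:16. */
static unsigned int txdctl_pthresh(uint32_t txdctl)
{
    return txdctl & 0x3f;
}

static unsigned int txdctl_hthresh(uint32_t txdctl)
{
    return (txdctl >> 8) & 0x3f;
}

static unsigned int txdctl_wthresh(uint32_t txdctl)
{
    return (txdctl >> 16) & 0x3f;
}

/*
 * Toy model: completed descriptors are written back only once at least
 * WTHRESH of them are waiting. A WTHRESH of 0 or 1 means every completed
 * descriptor is written back immediately.
 */
static bool writeback_due(unsigned int completed, uint32_t txdctl)
{
    unsigned int wthresh = txdctl_wthresh(txdctl);

    return completed > 0 && completed >= wthresh;
}
```

With a hypothetical register value of (5 << 16) | (1 << 8) | 0x1f, one completed descriptor is not enough to trigger a write-back in this model, which would leave next_to_watch.status at 0 until more traffic arrives or something else flushes it.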


 I don't think the sequence is quite what you said.  We are going to
 work with the hardware team to get a sequence that works right, and we
 should have a fix for you soon.

 Yeah, the sequence might not be exact, but it gives us a good idea of
 what could be happening.

 There are two events right after another:

 Oct  9 05:45:23  kernel:   TDH                  <48>
 Oct  9 05:45:23  kernel:   TDT                  <49>
 Oct  9 05:45:23  kernel:   next_to_use          <49>
 Oct  9 05:45:23  kernel:   next_to_clean        <48>
 Oct  9 05:45:23  kernel: buffer_info[next_to_clean]:
 Oct  9 05:45:23  kernel:   time_stamp           <102338ca6>
 Oct  9 05:45:23  kernel:   next_to_watch        <48>
 Oct  9 05:45:23  kernel:   jiffies              <102338dc1>
 Oct  9 05:45:23  kernel:   next_to_watch.status <0>
 Oct  9 05:45:23  kernel: MAC Status             <80383>
 Oct  9 05:45:23  kernel: PHY Status             <792d>
 Oct  9 05:45:23  kernel: PHY 1000BASE-T Status  <3800>
 Oct  9 05:45:23  kernel: PHY Extended Status    <3000>
 Oct  9 05:45:23  kernel: PCI Status             <10>
 Oct  9 05:51:54  kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang:
 Oct  9 05:51:54  kernel:   TDH                  <55>
 Oct  9 05:51:54  kernel:   TDT                  <56>
 Oct  9 05:51:54  kernel:   next_to_use          <56>
 Oct  9 05:51:54  kernel:   next_to_clean        <55>
 Oct  9 05:51:54  kernel: buffer_info[next_to_clean]:
 Oct  9 05:51:54  kernel:   time_stamp           <102350986>
 Oct  9 05:51:54  kernel:   next_to_watch        <55>
 Oct  9 05:51:54  kernel:   jiffies              <102350b07>
 Oct  9 05:51:54  kernel:   next_to_watch.status <0>
 Oct  9 05:51:54  kernel: MAC Status             <80383>
 Oct  9 05:51:54  kernel: PHY Status             <792d>
 Oct  9 05:51:54  kernel: PHY 1000BASE-T Status  <3800>
 Oct  9 05:51:54  kernel: PHY Extended Status    <3000>
 Oct  9 05:51:54  kernel: PCI Status             <10>
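
As a small illustration of how to read the two dumps above: in both, TDT is exactly one ahead of TDH, which fits the "ring is filled with one entry" step. The helper below is a sketch; its name and the ring size of 256 used in the example are assumptions, and the register values are taken as hexadecimal.

```c
#include <stdint.h>

/*
 * Descriptors the driver has queued (tail, TDT) that the hardware has not
 * yet moved past (head, TDH) on a ring of ring_size entries.
 * Both indices are expected to be smaller than ring_size.
 */
static uint32_t tx_ring_pending(uint32_t tdh, uint32_t tdt, uint32_t ring_size)
{
    return (tdt + ring_size - tdh) % ring_size;
}
```

For the first dump, tx_ring_pending(0x48, 0x49, 256) gives 1; the same holds for the second dump with 0x55 and 0x56.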

 I see the judgement of hang is:

 time_after(jiffies, tx_ring->buffer_info[i].time_stamp +
            (adapter->tx_timeout_factor * HZ))

 which means a hang is reported when the current jiffies minus the buffer's
 time stamp exceeds (adapter->tx_timeout_factor * HZ).

 And I see that tx_timeout_factor will be at least 1, so on x86
 (jiffies - time_stamp) should be over 1000, but here it looks to be only
 around 300.

 Could you please check the HZ number of your platform?

 sure, adapter->tx_timeout_factor * HZ = 0xfa (250 decimal)
 That data came from a customer using kernel-xen, so HZ is 250.
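
To make the arithmetic in this exchange concrete, here is a minimal sketch of the time part of the check only. The function names are hypothetical, and 64-bit integers stand in for the kernel's unsigned long jiffies. The deltas from the two dumps are 0x102338dc1 - 0x102338ca6 = 0x11b (283) and 0x102350b07 - 0x102350986 = 0x181 (385) jiffies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is after b", in the style of the kernel's time_after(). */
static bool jiffies_after(uint64_t a, uint64_t b)
{
    return (int64_t)(b - a) < 0;
}

/*
 * Time part of the hang check quoted above: true once more than
 * tx_timeout_factor * hz jiffies have passed since time_stamp.
 */
static bool tx_hang_suspected(uint64_t now, uint64_t time_stamp,
                              unsigned int tx_timeout_factor, unsigned int hz)
{
    return jiffies_after(now, time_stamp + (uint64_t)tx_timeout_factor * hz);
}
```

With HZ = 250 and a factor of 1, both deltas exceed the threshold of 250 jiffies (about 1.1 s and 1.5 s), so the check fires. With HZ = 1000 neither delta would reach the threshold, which is why a value around 300 looked too small.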

 Here is the debugging patch used:
 http://people.redhat.com/~fleitner/linux-kernel-test.patch

 The idea was to capture all the relevant values at the time
 of the problem. (The print_hang_task is scheduled and sometimes
 it shows timestamp=0, TDH=TDT because the packet is already sent)

 This is the full output with debugging patch applied:
 Oct 11 02:03:52 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang:
 Oct 11 02:03:52 kernel:   TDH                  <25>
 Oct 11 02:03:52 kernel:   TDT                  <26>
 Oct 11 

Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

2011-10-21 Thread Flavio Leitner
On Fri, 21 Oct 2011 14:15:12 +0800
Michael Wang <wang...@linux.vnet.ibm.com> wrote:


[E1000-devel] PTP query

2011-10-21 Thread Rayagond K
Hi All,

I just went through igb_main.c driver to understand the PTP implementation.

If I am not wrong, only the following things are handled in the igb_main.c file:
a. enabling/disabling the tx/rx time stamp function through ioctl.
b. passing the tx/rx time stamp value to the upper layer through the skb's
skb_shinfo struct.
c. initialization of system time.

So I am wondering where system time/clock/frequency adjustment is handled in
this driver.
Does this driver handle system time/clock/frequency adjustment or not?

IIRC, we should handle system time/clock/frequency adjustment using a separate
clock driver, right?

Please let me know.

-- 
wwr
Rayagond K
--
The demand for IT networking professionals continues to grow, and the
demand for specialized networking skills is growing even more rapidly.
Take a complimentary Learning@Cisco Self-Assessment and learn 
about Cisco certifications, training, and career opportunities. 
http://p.sf.net/sfu/cisco-dev2dev
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired


Re: [E1000-devel] intel card question

2011-10-21 Thread Brandeburg, Jesse
Yes, it should work just fine, and the driver in the kernel should work without 
any installation of a driver from sourceforge or intel.com (just make sure it 
is enabled in the kernel you're using)

-----Original Message-----
From: Ashley Buckley [mailto:ashley.buck...@covisia.com] 
Sent: Friday, October 21, 2011 12:50 PM
To: 'e1000-de...@lists.sf.net'
Subject: [E1000-devel] intel card question

Hello,

I was given this contact by the Intel tech support team. I am looking to use an 
Intel X520 Fiber Optic Card -PCI Express - 10GBase-X - Internal - Low-profile  
part #: E10G42BTDA and am wondering if this is compatible with Linux Kernel 3.0?

Thanks,
Ashley Buckley

Inside Sales & Marketing Associate
Covisia Solutions, Inc.
781-895-5249



[E1000-devel] Looking for comparison on various Intel 1G chipsets

2011-10-21 Thread Ben Greear
Hello!

There is a large number of Intel NIC chipsets out there, and we're
looking for performance characteristics.

Are there any performance comparisons between the various chipsets?

For instance: 82571, 82574, 82575, 82576, i350, 82580

If no hard data, which ones are (should be?) the best performers?

For what it's worth, we're seeing dodgy results on some 82576 NICs
on the 3.0.6+ kernel, but sometimes they work great, so we're not
100% sure that the issue is with the NICs.

Thanks,
Ben

-- 
Ben Greear gree...@candelatech.com
Candela Technologies Inc  http://www.candelatech.com




Re: [E1000-devel] Looking for comparison on various Intel 1G chipsets

2011-10-21 Thread Ben Greear
On 10/21/2011 03:49 PM, Wyborny, Carolyn wrote:


 -----Original Message-----
 From: Ben Greear [mailto:gree...@candelatech.com]
 Sent: Friday, October 21, 2011 3:31 PM
 To: e1000-devel list
 Subject: [E1000-devel] Looking for comparison on various Intel 1G
 chipsets

 Hello!

 There is a large number of Intel NIC chipsets out there, and we're
 looking for performance characteristics.

 Are there any performance comparisons between the various chipsets?

 For instance: 82571, 82574, 82575, 82576, i350, 82580

 If no hard data, which ones are (should be?) the best performers?

 For what it's worth, we're seeing dodgy results on some 82576 NICs
 on the 3.0.6+ kernel, but sometimes they work great, so we're not
 100% sure that the issue is with the NICs.

 Thanks,
 Ben

 --
 Ben Greear <gree...@candelatech.com>
 Candela Technologies Inc  http://www.candelatech.com


 

 Hello Ben,

 Yes, we should have some of that data around.  I'll dig it up and send it 
 out.  I've also gotten some reports on some switch sensitivity with the 
 82576.  Let me know if that might be what you're seeing too and I'll add you 
 to the list on that issue we're trying to repro.

We are directly connecting one NIC to another.  Sometimes we get CRC failures,
but a cold boot 'fixes' it.  A warm boot does not.  ethtool -t sometimes
wedges the NIC.

Sometimes it works great...and it's often difficult to tell which NIC is to
blame, but it seems that when we see CRC errors, the error is with the
receiving NIC (cold booting that machine cures the problem).

This is with 82576 multiport NICs from Silicom, at least mostly.  It might be
that just a few of our NICs in the lab have seen a bit too much abuse, so I'm
not sure if it's really a fundamental flaw in igb or the 82576 chipset.
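
One way to see which side is counting CRC errors is to read the per-interface rx_crc_errors counter that Linux exposes under /sys/class/net on each machine. The sketch below assumes that sysfs layout; the function names are hypothetical and the interface name is supplied by the caller.

```c
#include <stdio.h>

/*
 * Build the sysfs path of the rx_crc_errors counter for ifname.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int crc_counter_path(char *buf, size_t len, const char *ifname)
{
    int n = snprintf(buf, len,
                     "/sys/class/net/%s/statistics/rx_crc_errors", ifname);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read the counter for ifname into *count.
 * Returns 0 on success, -1 if the file is missing or unreadable.
 */
static int read_rx_crc_errors(const char *ifname, unsigned long long *count)
{
    char path[128];
    FILE *f;
    int ok;

    if (crc_counter_path(path, sizeof(path), ifname) != 0)
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    ok = (fscanf(f, "%llu", count) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Sampling the counter on both hosts before and after a test run shows which receiver saw the bad frames, though it does not by itself say whether the sender or the receiver corrupted them.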

Anyway, I ordered some new 4-port NICs with the i350 and 82580 chips, and
we'll see how these perform with our particular applications.  It would be
great to see some of your own performance numbers as well.

Thanks,
Ben



 Thanks,

 Carolyn

 Carolyn Wyborny
 Linux Development
 LAN Access Division
 Intel Corporation



-- 
Ben Greear gree...@candelatech.com
Candela Technologies Inc  http://www.candelatech.com

