Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On 10/19/2011 08:16 PM, Flavio Leitner wrote:
> On Wed, 19 Oct 2011 12:49:48 +0800 wangyun <wang...@linux.vnet.ibm.com> wrote:
>> Hi, Flavio
>>
>> I am new to join the community, work on e1000e driver currently,
>> And I found a thing strange in this issue, please check below.
>>
>> Thanks,
>> Michael Wang
>>
>> On 10/18/2011 10:42 PM, Flavio Leitner wrote:
>>> On Mon, 17 Oct 2011 11:48:22 -0700 Jesse Brandeburg <jesse.brandeb...@intel.com> wrote:
>>>> On Fri, 14 Oct 2011 10:04:26 -0700 Flavio Leitner <f...@redhat.com> wrote:
>>>>> Hi,
>>>>>
>>>>> I got few reports so far that 82571EB models are having the
>>>>> Detected Hardware Unit Hang issue after upgrading the kernel.
>>>>>
>>>>> Further debugging with an instrumented kernel revealed that the
>>>>> socket buffer time stamp matches with the last time
>>>>> e1000_xmit_frame() was called. Also that the time stamp of
>>>>> e1000_clean_tx_irq() last run is prior to the one in socket buffer.
>>>>>
>>>>> However, ~1 second later, an interrupt is fired and the old entry
>>>>> is found. Sometimes, the scheduled print_hang_task dumps the
>>>>> information _after_ the old entry is sent (shows empty ring),
>>>>> indicating that the HW TX unit isn't really stuck and apparently
>>>>> just missed the signal to initiate the transmission.
>>>>>
>>>>> Order of events:
>>>>> (1) skb is pushed down
>>>>> (2) e1000_xmit_frame() is called
>>>>> (3) ring is filled with one entry
>>>>> (4) TDT is updated
>>>>> (5) nothing happens for little more than 1 second
>>>>> (6) interrupt is fired
>>>>> (7) e1000_clean_tx_irq() is called
>>>>> (8) finds the entry not ready with an old time stamp,
>>>>>     schedules print_hang_task and stops the TX queue.
>>>>> (9) print_hang_task runs, dump the info but the old entry is now sent
>>>>> (10) apparently the TX queue is back.
>>>>
>>>> Flavio, thanks for the detailed info, please be sure to supply us the
>>>> bugzilla number.
>>>
>>> It was buried in the end of the first email:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=746272
>>>
>>>> TDH is probably not moving due to the writeback threshold settings in
>>>> TXDCTL. netperf UDP_RR test is likely a good way to test this.
>>>
>>> Yeah, makes sense.
>>> I haven't heard about new events after had removed the flag
>>> FLAG2_DMA_BURST. Unfortunately, I don't have access to the exact
>>> same hardware and I haven't reproduced the issue in-house yet with
>>> another 82571EB. See below about interface statistics from sar.
>>>
>>>> I don't think the sequence is quite what you said. We are going to
>>>> work with the hardware team to get a sequence that works right, and
>>>> we should have a fix for you soon.
>>>
>>> Yeah, the sequence might not be exact, but gives us a good idea of
>>> what could be happening.
>>>
>>> There are two events right after another:
>>>
>>> Oct 9 05:45:23 kernel: TDH <48>
>>> Oct 9 05:45:23 kernel: TDT <49>
>>> Oct 9 05:45:23 kernel: next_to_use <49>
>>> Oct 9 05:45:23 kernel: next_to_clean <48>
>>> Oct 9 05:45:23 kernel: buffer_info[next_to_clean]:
>>> Oct 9 05:45:23 kernel: time_stamp <102338ca6>
>>> Oct 9 05:45:23 kernel: next_to_watch <48>
>>> Oct 9 05:45:23 kernel: jiffies <102338dc1>
>>> Oct 9 05:45:23 kernel: next_to_watch.status <0>
>>> Oct 9 05:45:23 kernel: MAC Status <80383>
>>> Oct 9 05:45:23 kernel: PHY Status <792d>
>>> Oct 9 05:45:23 kernel: PHY 1000BASE-T Status <3800>
>>> Oct 9 05:45:23 kernel: PHY Extended Status <3000>
>>> Oct 9 05:45:23 kernel: PCI Status <10>
>>> Oct 9 05:51:54 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang:
>>> Oct 9 05:51:54 kernel: TDH <55>
>>> Oct 9 05:51:54 kernel: TDT <56>
>>> Oct 9 05:51:54 kernel: next_to_use <56>
>>> Oct 9 05:51:54 kernel: next_to_clean <55>
>>> Oct 9 05:51:54 kernel: buffer_info[next_to_clean]:
>>> Oct 9 05:51:54 kernel: time_stamp <102350986>
>>> Oct 9 05:51:54 kernel: next_to_watch <55>
>>> Oct 9 05:51:54 kernel: jiffies <102350b07>
>>> Oct 9 05:51:54 kernel: next_to_watch.status <0>
>>> Oct 9 05:51:54 kernel: MAC Status <80383>
>>> Oct 9 05:51:54 kernel: PHY Status <792d>
>>> Oct 9 05:51:54 kernel: PHY 1000BASE-T Status <3800>
>>> Oct 9 05:51:54 kernel: PHY Extended Status <3000>
>>> Oct 9 05:51:54 kernel: PCI Status <10>
>>
>> I see the judgement of hang is:
>>
>> time_after(jiffies, tx_ring->buffer_info[i].time_stamp +
>>            (adapter->tx_timeout_factor * HZ))
>>
>> which means the hang happened when current jiffies minus buffer's
>> time stamp is over (adapter->tx_timeout_factor * HZ).
>> And I see the tx_timeout_factor will at least be 1, so on x86 the
>> (jiffies-time_stamp) should over 1000, but here looks only around 300.
>>
>> Could you please check the HZ number of your platform?
>
> sure, adapter->tx_timeout_factor * HZ = 0xfa/250d
>
> That data came from a customer using kernel-xen, so HZ is 250.
>
> Here is the debugging patch used:
> http://people.redhat.com/~fleitner/linux-kernel-test.patch
>
> The idea was to capture all the relevant values at the time
> of the problem. (The print_hang_task is scheduled and sometimes
> it shows timestamp=0, TDH=TDT because the packet is already sent)
>
> This is the full output with debugging patch applied:
>
> Oct 11 02:03:52 kernel: e1000e :22:00.1: eth7: Detected Hardware Unit Hang:
> Oct 11 02:03:52 kernel: TDH <25>
> Oct 11 02:03:52 kernel: TDT <26>
> Oct 11
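
The hang test quoted in this thread can be replayed on the two logged events to see why a delta of "around 300" jiffies is enough to trigger it. The following is only a sketch in Python, not driver code: `time_after` is re-implemented here to mirror the kernel macro's wraparound-safe comparison, and `hang_detected` and `JIFFIES_BITS` are illustrative names. It assumes a 64-bit unsigned long (the logged jiffies values do not fit in 32 bits) and tx_timeout_factor = 1, which follows from the reported product 0xfa with HZ = 250.

```python
# Sketch only: replays the e1000e hang check on values from the log dumps above.
# The kernel prints time_stamp and jiffies in hexadecimal.

JIFFIES_BITS = 64  # assumption: unsigned long is 64 bits on the reporting host


def time_after(a: int, b: int, bits: int = JIFFIES_BITS) -> bool:
    """Wraparound-safe 'a is later than b', like the kernel's time_after().

    The kernel evaluates (long)(b - a) < 0; here the same test is done by
    checking the sign bit of the difference taken modulo 2**bits.
    """
    diff = (b - a) & ((1 << bits) - 1)
    return (diff >> (bits - 1)) == 1


def hang_detected(jiffies: int, time_stamp: int,
                  tx_timeout_factor: int, hz: int) -> bool:
    """The condition quoted in the thread, with explicit parameters."""
    return time_after(jiffies, time_stamp + tx_timeout_factor * hz)


# (jiffies, time_stamp) pairs copied from the two events in the log.
EVENTS = [
    (0x102338DC1, 0x102338CA6),
    (0x102350B07, 0x102350986),
]

if __name__ == "__main__":
    for jiffies, time_stamp in EVENTS:
        delta = jiffies - time_stamp
        print(delta,
              hang_detected(jiffies, time_stamp, 1, 250),    # HZ=250 (kernel-xen)
              hang_detected(jiffies, time_stamp, 1, 1000))   # HZ=1000 for comparison
```

The two deltas are 283 and 385 jiffies. Both exceed the 250-jiffy threshold that applies with HZ = 250, so the check fires, whereas neither would exceed a 1000-jiffy threshold.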
[E1000-devel] PTP query
Hi All,

I just went through the igb_main.c driver to understand the PTP implementation. If I am not wrong, only the following things are handled in igb_main.c:

a. enabling/disabling the tx/rx time stamp function through ioctl.
b. passing the tx/rx time stamp value to the upper layer through the skb skb_shinfo struct.
c. initialization of system time.

So I am wondering where system time/clock/frequency adjustment is handled in this driver? Does this driver handle system time/clock/frequency adjustment or not?

IIRC, we should handle system time/clock/frequency adjustment using a separate clock driver, right?

Please let me know.

--
wwr
Rayagond K

--
The demand for IT networking professionals continues to grow, and the demand for specialized networking skills is growing even more rapidly. Take a complimentary Learning@Cisco Self-Assessment and learn about Cisco certifications, training, and career opportunities.
http://p.sf.net/sfu/cisco-dev2dev
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired
Re: [E1000-devel] intel card question
Yes, it should work just fine, and the driver in the kernel should work without any installation of a driver from sourceforge or intel.com (just make sure it is enabled in the kernel you're using)

-----Original Message-----
From: Ashley Buckley [mailto:ashley.buck...@covisia.com]
Sent: Friday, October 21, 2011 12:50 PM
To: 'e1000-de...@lists.sf.net'
Subject: [E1000-devel] intel card question

Hello,

I was given this contact by the Intel tech support team. I am looking to use an Intel X520 Fiber Optic Card -PCI Express - 10GBase-X - Internal - Low-profile part #: E10G42BTDA and am wondering if this is compatible with Linux Kernel 3.0?

Thanks,

Ashley Buckley
Inside Sales Marketing Associate
Covisia Solutions, Inc.
781-895-5249
[E1000-devel] Looking for comparison on various Intel 1G chipsets
Hello!

There are a large quantity of Intel NIC chipsets out there, and we're looking for performance characteristics. Is there any performance comparisons between the various chipsets?

For instance: 82571, 82574, 82575, 82576, i350, 82580

If no hard data, which ones are (should be?) the best performers?

For what it's worth, we're seeing dodgy results on some 82576 NICs on the 3.0.6+ kernel, but sometimes they work great, so we're not 100% sure that the issue is with the NICs.

Thanks,
Ben

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
Re: [E1000-devel] Looking for comparison on various Intel 1G chipsets
On 10/21/2011 03:49 PM, Wyborny, Carolyn wrote:
>> -----Original Message-----
>> From: Ben Greear [mailto:gree...@candelatech.com]
>> Sent: Friday, October 21, 2011 3:31 PM
>> To: e1000-devel list
>> Subject: [E1000-devel] Looking for comparison on various Intel 1G chipsets
>>
>> Hello!
>>
>> There are a large quantity of Intel NIC chipsets out there, and we're
>> looking for performance characteristics. Is there any performance
>> comparisons between the various chipsets?
>>
>> For instance: 82571, 82574, 82575, 82576, i350, 82580
>>
>> If no hard data, which ones are (should be?) the best performers?
>>
>> For what it's worth, we're seeing dodgy results on some 82576 NICs on
>> the 3.0.6+ kernel, but sometimes they work great, so we're not 100%
>> sure that the issue is with the NICs.
>>
>> Thanks,
>> Ben
>>
>> --
>> Ben Greear <gree...@candelatech.com>
>> Candela Technologies Inc http://www.candelatech.com
>
> Hello Ben,
>
> Yes, we should have some of that data around. I'll dig it up and send
> it out. I've also gotten some reports on some switch sensitivity with
> the 82576. Let me know if that might be what you're seeing too and I'll
> add you to the list on that issue we're trying repro.

We are directly connecting one NIC to another. Sometimes we get crc failures, but a cold boot 'fixes' it. A warm boot does not. ethtool -t sometimes wedges the NIC.

Sometimes it works great...and it's often difficult to tell which NIC is to blame, but it seems that when we see CRC errors, the error is with the receiving NIC (cold booting that machine cures the problem).
This is with 82576 multiport nics from Silicom, at least mostly. Might be that just a few of our NICs in the lab have seen a bit too much abuse, so I'm not sure if it's really a fundamental flaw in igb or the 82576 chipset.

Anyway, I ordered some new 4-port NICs with the i350 and 82580 chips, and we'll see how these perform with our particular applications.

It would be great to see some of your own performance numbers as well.

Thanks,
Ben

> Thanks,
>
> Carolyn
>
> Carolyn Wyborny
> Linux Development
> LAN Access Division
> Intel Corporation

--
Ben Greear <gree...@candelatech.com>
Candela Technologies Inc http://www.candelatech.com