On 10/19/2011 08:16 PM, Flavio Leitner wrote: > On Wed, 19 Oct 2011 12:49:48 +0800 > wangyun<wang...@linux.vnet.ibm.com> wrote: > >> Hi, Flavio >> >> I am new to join the community, work on e1000e driver currently, >> And I found a thing strange in this issue, please check below. >> >> Thanks, >> Michael Wang >> >> On 10/18/2011 10:42 PM, Flavio Leitner wrote: >>> On Mon, 17 Oct 2011 11:48:22 -0700 >>> Jesse Brandeburg<jesse.brandeb...@intel.com> wrote: >>> >>>> On Fri, 14 Oct 2011 10:04:26 -0700 >>>> Flavio Leitner<f...@redhat.com> wrote: >>>> >>>>> Hi, >>>>> >>>>> I got few reports so far that 82571EB models are having the >>>>> "Detected Hardware Unit Hang" issue after upgrading the kernel. >>>>> >>>>> Further debugging with an instrumented kernel revealed that the >>>>> socket buffer time stamp matches with the last time e1000_xmit_frame() >>>>> was called. Also that the time stamp of e1000_clean_tx_irq() last run >>>>> is prior to the one in socket buffer. >>>>> >>>>> However, ~1 second later, an interrupt is fired and the old entry >>>>> is found. Sometimes, the scheduled print_hang_task dumps the >>>>> information _after_ the old entry is sent (shows empty ring), >>>>> indicating that the HW TX unit isn't really stuck and apparently >>>>> just missed the signal to initiate the transmission. >>>>> >>>>> Order of events: >>>>> (1) skb is pushed down >>>>> (2) e1000_xmit_frame() is called >>>>> (3) ring is filled with one entry >>>>> (4) TDT is updated >>>>>> (5) nothing happens for little more than 1 second >>>>> (6) interrupt is fired >>>>> (7) e1000_clean_tx_irq() is called >>>>> (8) finds the entry not ready with an old time stamp, >>>>> schedules print_hang_task and stops the TX queue. >>>>> (9) print_hang_task runs, dump the info but the old entry is now sent >>>>> (10) apparently the TX queue is back. >>>> Flavio, thanks for the detailed info, please be sure to supply us the >>>> bugzilla number. >>>> >>> It was buried in the end of the first email: >>> https://bugzilla.redhat.com/show_bug.cgi?id=746272 >>> >>>> TDH is probably not moving due to the writeback threshold settings in >>>> TXDCTL. netperf UDP_RR test is likely a good way to test this. >>>> >>> Yeah, makes sense. I haven't heard about new events after had removed >>> the flag FLAG2_DMA_BURST. Unfortunately, I don't have access to the exact >>> same hardware and I haven't reproduced the issue in-house yet with another >>> 82571EB. See below about interface statistics from sar. >>> >>> >>>> I don't think the sequence is quite what you said. We are going to >>>> work with the hardware team to get a sequence that works right, and we >>>> should have a fix for you soon. >>> Yeah, the sequence might not be exact, but gives us a good idea of >>> what could be happening. >>> >>> There are two events right after another: >>> >>> Oct 9 05:45:23 kernel: TDH<48> >>> Oct 9 05:45:23 kernel: TDT<49> >>> Oct 9 05:45:23 kernel: next_to_use<49> >>> Oct 9 05:45:23 kernel: next_to_clean<48> >>> Oct 9 05:45:23 kernel: buffer_info[next_to_clean]: >>> Oct 9 05:45:23 kernel: time_stamp<102338ca6> >>> Oct 9 05:45:23 kernel: next_to_watch<48> >>> Oct 9 05:45:23 kernel: jiffies<102338dc1> >>> Oct 9 05:45:23 kernel: next_to_watch.status<0> >>> Oct 9 05:45:23 kernel: MAC Status<80383> >>> Oct 9 05:45:23 kernel: PHY Status<792d> >>> Oct 9 05:45:23 kernel: PHY 1000BASE-T Status<3800> >>> Oct 9 05:45:23 kernel: PHY Extended Status<3000> >>> Oct 9 05:45:23 kernel: PCI Status<10> >>> Oct 9 05:51:54 kernel: e1000e 0000:22:00.1: eth7: Detected Hardware Unit >>> Hang: >>> Oct 9 05:51:54 kernel: TDH<55> >>> Oct 9 05:51:54 kernel: TDT<56> >>> Oct 9 05:51:54 kernel: next_to_use<56> >>> Oct 9 05:51:54 kernel: next_to_clean<55> >>> Oct 9 05:51:54 kernel: buffer_info[next_to_clean]: >>> Oct 9 05:51:54 kernel: time_stamp<102350986> >>> Oct 9 05:51:54 kernel: next_to_watch<55> >>> Oct 9 05:51:54 kernel: jiffies<102350b07> >>> Oct 9 05:51:54 kernel: next_to_watch.status<0> >>> Oct 9 05:51:54 kernel: MAC Status<80383> >>> Oct 9 05:51:54 kernel: PHY Status<792d> >>> Oct 9 05:51:54 kernel: PHY 1000BASE-T Status<3800> >>> Oct 9 05:51:54 kernel: PHY Extended Status<3000> >>> Oct 9 05:51:54 kernel: PCI Status<10> >>> >> I see the judgement of hang is: >> >> time_after(jiffies, tx_ring->buffer_info[i].time_stamp + >> (adapter->tx_timeout_factor * HZ)) >> >> which means the hang happened when current jiffies minus buffer's time >> stamp is over >> (adapter->tx_timeout_factor * HZ). >> >> And I see the tx_timeout_factor will at least be 1, so on x86 the >> (jiffies-time_stamp) should >> over 1000, but here looks only around 300. >> >> Could you please check the HZ number of your platform? >> > sure, adapter->tx_timeout_factor * HZ = 0xfa/250d > That data came from a customer using kernel-xen, so HZ is 250. > > Here is the debugging patch used: > http://people.redhat.com/~fleitner/linux-kernel-test.patch > > The idea was to capture all the relevant values at the time > of the problem. (The print_hang_task is scheduled and sometimes > it shows timestamp=0, TDH=TDT because the packet is already sent) > > This is the full output with debugging patch applied: > Oct 11 02:03:52 kernel: e1000e 0000:22:00.1: eth7: Detected Hardware Unit > Hang: > Oct 11 02:03:52 kernel: TDH<25> > Oct 11 02:03:52 kernel: TDT<26> > Oct 11 02:03:52 kernel: next_to_use<26> > Oct 11 02:03:52 kernel: next_to_clean<25> > Oct 11 02:03:52 kernel: buffer_info[next_to_clean]: > Oct 11 02:03:52 kernel: time_stamp<100b2aa22> > Oct 11 02:03:52 kernel: next_to_watch<25> > Oct 11 02:03:52 kernel: jiffies<100b2ab25> > Oct 11 02:03:52 kernel: next_to_watch.status<0> > Oct 11 02:03:52 kernel: stored_i =<25> > Oct 11 02:03:52 kernel: stored_first =<25> > Oct 11 02:03:52 kernel: stamp =<100b2aa22> > Oct 11 02:03:52 kernel: factor =<fa> > Oct 11 02:03:52 kernel: last_clean =<100b2aa1a> > Oct 11 02:03:52 kernel: last_tx =<100b2aa22> > Oct 11 02:03:52 kernel: count =<0>/<100>
The tx_ring->count seems not the default value which should be 256, could you please check if you have change this number to 100? Thanks, Michael Wang > Oct 11 02:03:52 kernel: next_to_use cached =<26> > Oct 11 02:03:52 kernel: e1000e 0000:22:00.1: eth7: Detected Hardware Unit > Hang: > Oct 11 02:03:52 kernel: TDH<25> > Oct 11 02:03:52 kernel: TDT<26> > Oct 11 02:03:52 kernel: next_to_use<26> > Oct 11 02:03:52 kernel: next_to_clean<25> > Oct 11 02:03:52 kernel: buffer_info[next_to_clean]: > Oct 11 02:03:52 kernel: time_stamp<100b2aa22> > Oct 11 02:03:52 kernel: next_to_watch<25> > Oct 11 02:03:52 kernel: jiffies<100b2ab26> > Oct 11 02:03:52 kernel: next_to_watch.status<0> > Oct 11 02:03:52 kernel: MAC Status<80383> > Oct 11 02:03:52 kernel: PHY Status<792d> > Oct 11 02:03:52 kernel: PHY 1000BASE-T Status<3800> > Oct 11 02:03:52 kernel: PHY Extended Status<3000> > Oct 11 02:03:52 kernel: PCI Status<10> > > I hope that helps. Let me know if you need anything else. > thanks, > fbl ------------------------------------------------------------------------------ The demand for IT networking professionals continues to grow, and the demand for specialized networking skills is growing even more rapidly. Take a complimentary Learning@Cisco Self-Assessment and learn about Cisco certifications, training, and career opportunities. http://p.sf.net/sfu/cisco-dev2dev _______________________________________________ E1000-devel mailing list E1000-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/e1000-devel To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired