On Fri, 21 Oct 2011 14:15:12 +0800
Michael Wang <wang...@linux.vnet.ibm.com> wrote:

> On 10/19/2011 08:16 PM, Flavio Leitner wrote:
> > On Wed, 19 Oct 2011 12:49:48 +0800
> > wangyun<wang...@linux.vnet.ibm.com>  wrote:
> >
> >> Hi, Flavio
> >>
> >> I am new to join the community, work on e1000e driver currently,
> >> And I found a thing strange in this issue, please check below.
> >>
> >> Thanks,
> >> Michael Wang
> >>
> >> On 10/18/2011 10:42 PM, Flavio Leitner wrote:
> >>> On Mon, 17 Oct 2011 11:48:22 -0700
> >>> Jesse Brandeburg<jesse.brandeb...@intel.com>   wrote:
> >>>
> >>>> On Fri, 14 Oct 2011 10:04:26 -0700
> >>>> Flavio Leitner<f...@redhat.com>   wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I got few reports so far that 82571EB models are having the
> >>>>> "Detected Hardware Unit Hang" issue after upgrading the kernel.
> >>>>>
> >>>>> Further debugging with an instrumented kernel revealed that the
> >>>>> socket buffer time stamp matches with the last time e1000_xmit_frame()
> >>>>> was called. Also that the time stamp of e1000_clean_tx_irq() last run
> >>>>> is prior to the one in socket buffer.
> >>>>>
> >>>>> However, ~1 second later, an interrupt is fired and the old entry
> >>>>> is found. Sometimes, the scheduled print_hang_task dumps the
> >>>>> information _after_ the old entry is sent (shows empty ring),
> >>>>> indicating that the HW TX unit isn't really stuck and apparently
> >>>>> just missed the signal to initiate the transmission.
> >>>>>
> >>>>> Order of events:
> >>>>>    (1) skb is pushed down
> >>>>>    (2) e1000_xmit_frame() is called
> >>>>>    (3) ring is filled with one entry
> >>>>>    (4) TDT is updated
> >>>>>> (5) nothing happens for little more than 1 second
> >>>>>    (6) interrupt is fired
> >>>>>    (7) e1000_clean_tx_irq() is called
> >>>>>    (8) finds the entry not ready with an old time stamp,
> >>>>>        schedules print_hang_task and stops the TX queue.
> >>>>>    (9) print_hang_task runs, dump the info but the old entry is now sent
> >>>>> (10) apparently the TX queue is back.
> >>>> Flavio, thanks for the detailed info, please be sure to supply us the
> >>>> bugzilla number.
> >>>>
> >>> It was buried in the end of the first email:
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=746272
> >>>
> >>>> TDH is probably not moving due to the writeback threshold settings in
> >>>> TXDCTL.  netperf UDP_RR test is likely a good way to test this.
> >>>>
> >>> Yeah, makes sense. I haven't heard about new events after had removed
> >>> the flag FLAG2_DMA_BURST.  Unfortunately, I don't have access to the exact
> >>> same hardware and I haven't reproduced the issue in-house yet with another
> >>> 82571EB. See below about interface statistics from sar.
> >>>
> >>>
> >>>> I don't think the sequence is quite what you said.  We are going to
> >>>> work with the hardware team to get a sequence that works right, and we
> >>>> should have a fix for you soon.
> >>> Yeah, the sequence might not be exact, but gives us a good idea of
> >>> what could be happening.
> >>>
> >>> There are two events right after another:
> >>>
> >>> Oct  9 05:45:23  kernel:   TDH<48>
> >>> Oct  9 05:45:23  kernel:   TDT<49>
> >>> Oct  9 05:45:23  kernel:   next_to_use<49>
> >>> Oct  9 05:45:23  kernel:   next_to_clean<48>
> >>> Oct  9 05:45:23  kernel: buffer_info[next_to_clean]:
> >>> Oct  9 05:45:23  kernel:   time_stamp<102338ca6>
> >>> Oct  9 05:45:23  kernel:   next_to_watch<48>
> >>> Oct  9 05:45:23  kernel:   jiffies<102338dc1>
> >>> Oct  9 05:45:23  kernel:   next_to_watch.status<0>
> >>> Oct  9 05:45:23  kernel: MAC Status<80383>
> >>> Oct  9 05:45:23  kernel: PHY Status<792d>
> >>> Oct  9 05:45:23  kernel: PHY 1000BASE-T Status<3800>
> >>> Oct  9 05:45:23  kernel: PHY Extended Status<3000>
> >>> Oct  9 05:45:23  kernel: PCI Status<10>
> >>> Oct  9 05:51:54  kernel: e1000e 0000:22:00.1: eth7: Detected Hardware 
> >>> Unit Hang:
> >>> Oct  9 05:51:54  kernel:   TDH<55>
> >>> Oct  9 05:51:54  kernel:   TDT<56>
> >>> Oct  9 05:51:54  kernel:   next_to_use<56>
> >>> Oct  9 05:51:54  kernel:   next_to_clean<55>
> >>> Oct  9 05:51:54  kernel: buffer_info[next_to_clean]:
> >>> Oct  9 05:51:54  kernel:   time_stamp<102350986>
> >>> Oct  9 05:51:54  kernel:   next_to_watch<55>
> >>> Oct  9 05:51:54  kernel:   jiffies<102350b07>
> >>> Oct  9 05:51:54  kernel:   next_to_watch.status<0>
> >>> Oct  9 05:51:54  kernel: MAC Status<80383>
> >>> Oct  9 05:51:54  kernel: PHY Status<792d>
> >>> Oct  9 05:51:54  kernel: PHY 1000BASE-T Status<3800>
> >>> Oct  9 05:51:54  kernel: PHY Extended Status<3000>
> >>> Oct  9 05:51:54  kernel: PCI Status<10>
> >>>
> >> I see the judgement of hang is:
> >>
> >> time_after(jiffies, tx_ring->buffer_info[i].time_stamp +
> >> (adapter->tx_timeout_factor * HZ))
> >>
> >> which means the hang happened when current jiffies minus buffer's time
> >> stamp is over
> >> (adapter->tx_timeout_factor * HZ).
> >>
> >> And I see the tx_timeout_factor will at least be 1, so on x86 the
> >> (jiffies-time_stamp) should
> >> over 1000, but here looks only around 300.
> >>
> >> Could you please check the HZ number of your platform?
> >>
> > sure, adapter->tx_timeout_factor * HZ = 0xfa/250d
> > That data came from a customer using kernel-xen, so HZ is 250.
> >
> > Here is the debugging patch used:
> > http://people.redhat.com/~fleitner/linux-kernel-test.patch
> >
> > The idea was to capture all the relevant values at the time
> > of the problem. (The print_hang_task is scheduled and sometimes
> > it shows timestamp=0, TDH=TDT because the packet is already sent)
> >
> > This is the full output with debugging patch applied:
> > Oct 11 02:03:52 kernel: e1000e 0000:22:00.1: eth7: Detected Hardware Unit 
> > Hang:
> > Oct 11 02:03:52 kernel:   TDH<25>
> > Oct 11 02:03:52 kernel:   TDT<26>
> > Oct 11 02:03:52 kernel:   next_to_use<26>
> > Oct 11 02:03:52 kernel:   next_to_clean<25>
> > Oct 11 02:03:52 kernel: buffer_info[next_to_clean]:
> > Oct 11 02:03:52 kernel:   time_stamp<100b2aa22>
> > Oct 11 02:03:52 kernel:   next_to_watch<25>
> > Oct 11 02:03:52 kernel:   jiffies<100b2ab25>
> > Oct 11 02:03:52 kernel:   next_to_watch.status<0>
> > Oct 11 02:03:52 kernel:   stored_i =<25>
> > Oct 11 02:03:52 kernel:   stored_first =<25>
> > Oct 11 02:03:52 kernel:   stamp =<100b2aa22>
> > Oct 11 02:03:52 kernel:   factor =<fa>
> > Oct 11 02:03:52 kernel:   last_clean =<100b2aa1a>
> > Oct 11 02:03:52 kernel:   last_tx =<100b2aa22>
> > Oct 11 02:03:52 kernel:   count =<0>/<100>
> 
> The tx_ring->count seems not the default value which should be 256,
> could you please check if you have change this number to 100?
> 

If you look at the debugging patch mentioned above, you will see
that it prints in hexa, so that is actually 0x100 which is the 256d
you are looking for.

To be clear, the driver reproducing the issue is the one shipped
in 2.6.18-274.el5 (Red Hat Enterprise Linux 5 Update 7) with no
other patches.  In order to get more details at the time of the
event, we had applied the patch mentioned above, nothing else.

thanks,
fbl

------------------------------------------------------------------------------
The demand for IT networking professionals continues to grow, and the
demand for specialized networking skills is growing even more rapidly.
Take a complimentary Learning@Cisco Self-Assessment and learn 
about Cisco certifications, training, and career opportunities. 
http://p.sf.net/sfu/cisco-dev2dev
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit 
http://communities.intel.com/community/wired

Reply via email to