Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

Michael Wang Thu, 20 Oct 2011 23:18:22 -0700

On 10/19/2011 08:16 PM, Flavio Leitner wrote:
> On Wed, 19 Oct 2011 12:49:48 +0800
> wangyun<wang...@linux.vnet.ibm.com>  wrote:
>
>> Hi, Flavio
>>
>> I am new to join the community, work on e1000e driver currently,
>> And I found a thing strange in this issue, please check below.
>>
>> Thanks,
>> Michael Wang
>>
>> On 10/18/2011 10:42 PM, Flavio Leitner wrote:
>>> On Mon, 17 Oct 2011 11:48:22 -0700
>>> Jesse Brandeburg<jesse.brandeb...@intel.com>   wrote:
>>>
>>>> On Fri, 14 Oct 2011 10:04:26 -0700
>>>> Flavio Leitner<f...@redhat.com>   wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I got few reports so far that 82571EB models are having the
>>>>> "Detected Hardware Unit Hang" issue after upgrading the kernel.
>>>>>
>>>>> Further debugging with an instrumented kernel revealed that the
>>>>> socket buffer time stamp matches with the last time e1000_xmit_frame()
>>>>> was called. Also that the time stamp of e1000_clean_tx_irq() last run
>>>>> is prior to the one in socket buffer.
>>>>>
>>>>> However, ~1 second later, an interrupt is fired and the old entry
>>>>> is found. Sometimes, the scheduled print_hang_task dumps the
>>>>> information _after_ the old entry is sent (shows empty ring),
>>>>> indicating that the HW TX unit isn't really stuck and apparently
>>>>> just missed the signal to initiate the transmission.
>>>>>
>>>>> Order of events:
>>>>>    (1) skb is pushed down
>>>>>    (2) e1000_xmit_frame() is called
>>>>>    (3) ring is filled with one entry
>>>>>    (4) TDT is updated
>>>>>> (5) nothing happens for little more than 1 second
>>>>>    (6) interrupt is fired
>>>>>    (7) e1000_clean_tx_irq() is called
>>>>>    (8) finds the entry not ready with an old time stamp,
>>>>>        schedules print_hang_task and stops the TX queue.
>>>>>    (9) print_hang_task runs, dump the info but the old entry is now sent
>>>>> (10) apparently the TX queue is back.
>>>> Flavio, thanks for the detailed info, please be sure to supply us the
>>>> bugzilla number.
>>>>
>>> It was buried in the end of the first email:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=746272
>>>
>>>> TDH is probably not moving due to the writeback threshold settings in
>>>> TXDCTL.  netperf UDP_RR test is likely a good way to test this.
>>>>
>>> Yeah, makes sense. I haven't heard about new events after had removed
>>> the flag FLAG2_DMA_BURST.  Unfortunately, I don't have access to the exact
>>> same hardware and I haven't reproduced the issue in-house yet with another
>>> 82571EB. See below about interface statistics from sar.
>>>
>>>
>>>> I don't think the sequence is quite what you said.  We are going to
>>>> work with the hardware team to get a sequence that works right, and we
>>>> should have a fix for you soon.
>>> Yeah, the sequence might not be exact, but gives us a good idea of
>>> what could be happening.
>>>
>>> There are two events right after another:
>>>
>>> Oct  9 05:45:23  kernel:   TDH<48>
>>> Oct  9 05:45:23  kernel:   TDT<49>
>>> Oct  9 05:45:23  kernel:   next_to_use<49>
>>> Oct  9 05:45:23  kernel:   next_to_clean<48>
>>> Oct  9 05:45:23  kernel: buffer_info[next_to_clean]:
>>> Oct  9 05:45:23  kernel:   time_stamp<102338ca6>
>>> Oct  9 05:45:23  kernel:   next_to_watch<48>
>>> Oct  9 05:45:23  kernel:   jiffies<102338dc1>
>>> Oct  9 05:45:23  kernel:   next_to_watch.status<0>
>>> Oct  9 05:45:23  kernel: MAC Status<80383>
>>> Oct  9 05:45:23  kernel: PHY Status<792d>
>>> Oct  9 05:45:23  kernel: PHY 1000BASE-T Status<3800>
>>> Oct  9 05:45:23  kernel: PHY Extended Status<3000>
>>> Oct  9 05:45:23  kernel: PCI Status<10>
>>> Oct  9 05:51:54  kernel: e1000e 0000:22:00.1: eth7: Detected Hardware Unit 
>>> Hang:
>>> Oct  9 05:51:54  kernel:   TDH<55>
>>> Oct  9 05:51:54  kernel:   TDT<56>
>>> Oct  9 05:51:54  kernel:   next_to_use<56>
>>> Oct  9 05:51:54  kernel:   next_to_clean<55>
>>> Oct  9 05:51:54  kernel: buffer_info[next_to_clean]:
>>> Oct  9 05:51:54  kernel:   time_stamp<102350986>
>>> Oct  9 05:51:54  kernel:   next_to_watch<55>
>>> Oct  9 05:51:54  kernel:   jiffies<102350b07>
>>> Oct  9 05:51:54  kernel:   next_to_watch.status<0>
>>> Oct  9 05:51:54  kernel: MAC Status<80383>
>>> Oct  9 05:51:54  kernel: PHY Status<792d>
>>> Oct  9 05:51:54  kernel: PHY 1000BASE-T Status<3800>
>>> Oct  9 05:51:54  kernel: PHY Extended Status<3000>
>>> Oct  9 05:51:54  kernel: PCI Status<10>
>>>
>> I see the judgement of hang is:
>>
>> time_after(jiffies, tx_ring->buffer_info[i].time_stamp +
>> (adapter->tx_timeout_factor * HZ))
>>
>> which means the hang happened when current jiffies minus buffer's time
>> stamp is over
>> (adapter->tx_timeout_factor * HZ).
>>
>> And I see the tx_timeout_factor will at least be 1, so on x86 the
>> (jiffies-time_stamp) should
>> over 1000, but here looks only around 300.
>>
>> Could you please check the HZ number of your platform?
>>
> sure, adapter->tx_timeout_factor * HZ = 0xfa/250d
> That data came from a customer using kernel-xen, so HZ is 250.
>
> Here is the debugging patch used:
> http://people.redhat.com/~fleitner/linux-kernel-test.patch
>
> The idea was to capture all the relevant values at the time
> of the problem. (The print_hang_task is scheduled and sometimes
> it shows timestamp=0, TDH=TDT because the packet is already sent)
>
> This is the full output with debugging patch applied:
> Oct 11 02:03:52 kernel: e1000e 0000:22:00.1: eth7: Detected Hardware Unit 
> Hang:
> Oct 11 02:03:52 kernel:   TDH<25>
> Oct 11 02:03:52 kernel:   TDT<26>
> Oct 11 02:03:52 kernel:   next_to_use<26>
> Oct 11 02:03:52 kernel:   next_to_clean<25>
> Oct 11 02:03:52 kernel: buffer_info[next_to_clean]:
> Oct 11 02:03:52 kernel:   time_stamp<100b2aa22>
> Oct 11 02:03:52 kernel:   next_to_watch<25>
> Oct 11 02:03:52 kernel:   jiffies<100b2ab25>
> Oct 11 02:03:52 kernel:   next_to_watch.status<0>
> Oct 11 02:03:52 kernel:   stored_i =<25>
> Oct 11 02:03:52 kernel:   stored_first =<25>
> Oct 11 02:03:52 kernel:   stamp =<100b2aa22>
> Oct 11 02:03:52 kernel:   factor =<fa>
> Oct 11 02:03:52 kernel:   last_clean =<100b2aa1a>
> Oct 11 02:03:52 kernel:   last_tx =<100b2aa22>
> Oct 11 02:03:52 kernel:   count =<0>/<100>


The tx_ring->count seems not the default value which should be 256,
could you please check if you have change this number to 100?

Thanks,
Michael Wang

> Oct 11 02:03:52 kernel:   next_to_use cached =<26>
> Oct 11 02:03:52 kernel: e1000e 0000:22:00.1: eth7: Detected Hardware Unit 
> Hang:
> Oct 11 02:03:52 kernel:   TDH<25>
> Oct 11 02:03:52 kernel:   TDT<26>
> Oct 11 02:03:52 kernel:   next_to_use<26>
> Oct 11 02:03:52 kernel:   next_to_clean<25>
> Oct 11 02:03:52 kernel: buffer_info[next_to_clean]:
> Oct 11 02:03:52 kernel:   time_stamp<100b2aa22>
> Oct 11 02:03:52 kernel:   next_to_watch<25>
> Oct 11 02:03:52 kernel:   jiffies<100b2ab26>
> Oct 11 02:03:52 kernel:   next_to_watch.status<0>
> Oct 11 02:03:52 kernel: MAC Status<80383>
> Oct 11 02:03:52 kernel: PHY Status<792d>
> Oct 11 02:03:52 kernel: PHY 1000BASE-T Status<3800>
> Oct 11 02:03:52 kernel: PHY Extended Status<3000>
> Oct 11 02:03:52 kernel: PCI Status<10>
>
> I hope that helps.  Let me know if you need anything else.
> thanks,
> fbl


------------------------------------------------------------------------------
The demand for IT networking professionals continues to grow, and the
demand for specialized networking skills is growing even more rapidly.
Take a complimentary Learning@Cisco Self-Assessment and learn 
about Cisco certifications, training, and career opportunities. 
http://p.sf.net/sfu/cisco-dev2dev
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit 
http://communities.intel.com/community/wired

Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

Reply via email to