Re: [E1000-devel] igb transmit queue timeout

Alexander Duyck Thu, 25 Jan 2018 08:14:58 -0800

I agree with Todd, we need way more information on this.

For example, if we had the dmesg we could tell if the Tx hang message
is being reported or not. If not it might point to a problem with the
interrupts on the device. If I recall correctly the igb driver should
be generating an interrupt every 2 seconds on each of its TxRx
interrupt vectors. If you were to run 'watch -d "grep enp1s0f0-TxRx
/proc/interrupts"' what you should see is all of the interrupt vectors
increment by at least 1 every 2 seconds. If you don't see that, it
could be a sign of a problem in the kernel's interrupt handling logic;
we have seen that kind of issue with Xen in the past.
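
If watching the counters by eye is impractical, the same check can be
scripted. What follows is only a rough sketch in Python: the function
names and the 5 second wait (chosen to leave some margin over the 2
seconds mentioned above) are illustrative assumptions, not anything
taken from the igb driver. It takes two snapshots of /proc/interrupts
and lists the matching vectors whose counters did not move.

```python
import re
import time


def read_vector_counts(text, pattern):
    """Sum the per-CPU counters of every /proc/interrupts line matching pattern.

    Returns a dict keyed by the last field on the line (the vector name).
    """
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        if not re.search(pattern, line):
            continue
        fields = line.split()
        per_cpu = fields[1:1 + ncpus]  # fields[0] is the IRQ number, e.g. "45:"
        counts[fields[-1]] = sum(int(n) for n in per_cpu if n.isdigit())
    return counts


def stalled_vectors(before, after):
    """Return the vectors whose total count did not increase between snapshots."""
    return sorted(name for name, n in before.items() if after.get(name, n) <= n)


def check(pattern, interval=5, path="/proc/interrupts"):
    """Snapshot the counters twice, interval seconds apart, and report stalls."""
    with open(path) as f:
        before = read_vector_counts(f.read(), pattern)
    time.sleep(interval)
    with open(path) as f:
        after = read_vector_counts(f.read(), pattern)
    return stalled_vectors(before, after)
```

Calling check("enp1s0f0-TxRx") on the affected host returns the names
of any vectors that stayed flat over the interval; an empty list means
every vector fired at least once.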


Thanks.

- Alex

On Wed, Jan 24, 2018 at 2:08 PM, Fujinaka, Todd <[email protected]> wrote:
> There's really not enough information here. Ideally you would send us the 
> dmesg of when it fails, and a register dump before and after.
>
> I would suggest opening a bug on sourceforge and attaching the dmesg & 
> register dumps to the bug. Don't just copy them into the bug because that's 
> much harder to read.
>
> We haven't heard of many issues with the 82576 like this, so you may also 
> want to ask Supermicro for help, but it also looks like your hardware is EOL.
>
> Todd Fujinaka
> Software Application Engineer
> Datacenter Engineering Group
> Intel Corporation
> [email protected]
>
>
> -----Original Message-----
> From: Kojedzinszky Richárd [mailto:[email protected]]
> Sent: Wednesday, January 24, 2018 1:44 AM
> To: [email protected]
> Subject: [E1000-devel] igb transmit queue timeout
>
> Dear maintainers,
>
> We have a xen virtualization environment, with 6 nearly identical nodes, 
> Supermicro X8DTU boards.
>
> We run debian stretch on them; the xen hypervisor and linux kernel are from 
> debian stretch, latest at the time of writing.
>
> Unfortunately, we are facing an issue where randomly our igb devices stop 
> working, with the error message:
>
> NETDEV WATCHDOG: enp1s0f0 (igb): transmit queue 0 timed out
>
> And while the driver tries to recover/reset the adapter, it does not succeed. 
> Even shutting down the interface and bringing it back up does not help; a 
> reboot is required to restore normal operation.
>
> The servers are connected to our switch with two interfaces, the problem 
> happens randomly on either one.
>
> We have tried to disable msi interrupts, but that did not help.
>
> Unfortunately, we cannot reproduce the problem, I mean it happens randomly, 
> frequently, but we cannot explicitly trigger it. It did happen on nearly all 
> our nodes, so I assume it is not a hardware problem.
>
> Our kernel/xen versions:
>
> # uname -a
> Linux node-3.cloud-b.dravanet.net 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64 GNU/Linux
> # xl info
> host                   : x
> release                : 4.9.0-5-amd64
> version                : #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04)
> machine                : x86_64
> nr_cpus                : 8
> max_cpu_id             : 23
> nr_nodes               : 2
> cores_per_socket       : 4
> threads_per_core       : 1
> cpu_mhz                : 3066
> hw_caps                : b7ebfbff:029ee3ff:2c100800:00000001:00000000:00000000:00000000:00000100
> virt_caps              : hvm hvm_directio
> total_memory           : 196599
> free_memory            : 94364
> sharing_freed_memory   : 0
> sharing_used_memory    : 0
> outstanding_claims     : 0
> free_cpus              : 0
> xen_major              : 4
> xen_minor              : 8
> xen_extra              : .3-pre
> xen_version            : 4.8.3-pre
> xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
> xen_scheduler          : credit
> xen_pagesize           : 4096
> platform_params        : virt_start=0xffff800000000000
> xen_changeset          :
> xen_commandline        : placeholder dom0_mem=4096M gnttab_max_frames=256
> cc_compiler            : gcc (Debian 6.3.0-18) 6.3.0 20170516
> cc_compile_by          : ijackson
> cc_compile_domain      : chiark.greenend.org.uk
> cc_compile_date        : Sat Nov 25 11:30:34 UTC 2017
> build_id               : 23ac95af74d2e3f84c90068ae674c34e764649e7
> xend_config_format     : 4
>
> What else could we try to resolve this issue?
>
> Thanks in advance,
>
> Kojedzinszky Richárd
> Euronet Magyarorszag Informatika Zrt.
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most engaging tech 
> sites, Slashdot.org! http://sdm.link/slashdot 
> _______________________________________________
> E1000-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/e1000-devel
> To learn more about Intel® Ethernet, visit 
> http://communities.intel.com/community/wired
