I agree with Todd; we need a lot more information on this. For example, if we had the dmesg output we could tell whether the Tx hang message is being reported. If it is not, that might point to a problem with the interrupts on the device.

If I recall correctly, the igb driver should be generating an interrupt every 2 seconds on each of its TxRx interrupt vectors. If you run 'watch -d "grep enp1s0f0-TxRx /proc/interrupts"' you should see every one of those interrupt vectors increment by at least 1 every 2 seconds. If you don't see that, it could be a sign of a problem in the kernel's interrupt handling logic, which is something we have seen with Xen in the past.
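If watching the counters by eye is awkward, the same check can be scripted. This is only a rough sketch of the idea, not something taken from the driver: it sums the per-CPU counts for each enp1s0f0-TxRx line in /proc/interrupts, waits, and reports any vector whose total did not go up. The function names and the 5 second sampling interval are my own assumptions (the interval is just chosen to be comfortably longer than the 2 seconds mentioned above), and it assumes the usual /proc/interrupts layout with a CPU header row and the vector name as the last field.

```python
import re
import time


def parse_interrupts(text, pattern):
    """Return {vector_name: total_count} for /proc/interrupts lines
    whose last field (the vector name) matches the regex `pattern`."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        fields = line.split()
        # Need the IRQ label, one count per CPU, and at least a name.
        if len(fields) < ncpus + 2:
            continue
        name = fields[-1]
        if not re.search(pattern, name):
            continue
        per_cpu = fields[1:1 + ncpus]
        if not all(f.isdigit() for f in per_cpu):
            continue
        counts[name] = sum(int(f) for f in per_cpu)
    return counts


def stalled_vectors(before, after, min_delta=1):
    """Return the sorted names of vectors that grew by less than min_delta."""
    return sorted(
        name for name, old in before.items()
        if after.get(name, 0) - old < min_delta
    )


def check_once(iface="enp1s0f0", interval=5.0):
    """Sample /proc/interrupts twice and print vectors that did not move.
    The default interval is an assumption, not a value from the driver."""
    pattern = re.escape(iface) + r"-TxRx"
    with open("/proc/interrupts") as f:
        before = parse_interrupts(f.read(), pattern)
    time.sleep(interval)
    with open("/proc/interrupts") as f:
        after = parse_interrupts(f.read(), pattern)
    stalled = stalled_vectors(before, after)
    if not before:
        print("no %s-TxRx vectors found" % iface)
    elif stalled:
        print("no interrupts seen on: " + ", ".join(stalled))
    else:
        print("all %d vectors are incrementing" % len(before))
    return stalled
```

Calling check_once() in dom0 while the interface is working, and again after it has hung, should show whether the vectors stop incrementing once the problem occurs.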
Thanks.

- Alex

On Wed, Jan 24, 2018 at 2:08 PM, Fujinaka, Todd <todd.fujin...@intel.com> wrote:
> There's really not enough information here. Ideally you would send us the
> dmesg of when it fails, and a register dump before and after.
>
> I would suggest opening a bug on sourceforge and attaching the dmesg &
> register dumps to the bug. Don't just copy them into the bug because that's
> much harder to read.
>
> We haven't heard of many issues with the 82576 like this, so you may also
> want to ask Supermicro for help, but it also looks like your hardware is EOL.
>
> Todd Fujinaka
> Software Application Engineer
> Datacenter Engineering Group
> Intel Corporation
> todd.fujin...@intel.com
>
>
> -----Original Message-----
> From: Kojedzinszky Richárd [mailto:kojedzinszky.rich...@euronetrt.hu]
> Sent: Wednesday, January 24, 2018 1:44 AM
> To: e1000-devel@lists.sourceforge.net
> Subject: [E1000-devel] igb transmit queue timeout
>
> Dear maintainers,
>
> We have a xen virtualization environment, with 6 nearly identical nodes,
> Supermicro X8DTU boards.
>
> We run debian stretch on them, the xen hypervisor and linux kernel is from
> debian stretch, latest at the time of writing.
>
> Unfortunately, we are facing an issue where randomly our igb devices stop
> working, with the error message:
>
> NETDEV WATCHDOG: enp1s0f0 (igb): transmit queue 0 timed out
>
> And while the driver tries to recover/reset the adapter, it does not succeed.
> Shutting down the interface and then bringing it back even does not help, a
> reboot is required to restore normal operation.
>
> The servers are connected to our switch with two interfaces, the problem
> happens randomly on either one.
>
> We have tried to disable msi interrupts, but that did not help.
>
> Unfortunately, we cannot reproduce the problem, I mean it happens randomly,
> frequently, but we cannot explicitly trigger it. It did happen on nearly all
> our nodes, so I assume it is not a hardware problem.
>
> Our kernel/xen versions:
>
> # uname -a
> Linux node-3.cloud-b.dravanet.net 4.9.0-5-amd64 #1 SMP Debian
> 4.9.65-3+deb9u2 (2018-01-04) x86_64 GNU/Linux
> # xl info
> host                  : x
> release               : 4.9.0-5-amd64
> version               : #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04)
> machine               : x86_64
> nr_cpus               : 8
> max_cpu_id            : 23
> nr_nodes              : 2
> cores_per_socket      : 4
> threads_per_core      : 1
> cpu_mhz               : 3066
> hw_caps               :
> b7ebfbff:029ee3ff:2c100800:00000001:00000000:00000000:00000000:00000100
> virt_caps             : hvm hvm_directio
> total_memory          : 196599
> free_memory           : 94364
> sharing_freed_memory  : 0
> sharing_used_memory   : 0
> outstanding_claims    : 0
> free_cpus             : 0
> xen_major             : 4
> xen_minor             : 8
> xen_extra             : .3-pre
> xen_version           : 4.8.3-pre
> xen_caps              : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32
> hvm-3.0-x86_32p hvm-3.0-x86_64
> xen_scheduler         : credit
> xen_pagesize          : 4096
> platform_params       : virt_start=0xffff800000000000
> xen_changeset         :
> xen_commandline       : placeholder dom0_mem=4096M gnttab_max_frames=256
> cc_compiler           : gcc (Debian 6.3.0-18) 6.3.0 20170516
> cc_compile_by         : ijackson
> cc_compile_domain     : chiark.greenend.org.uk
> cc_compile_date       : Sat Nov 25 11:30:34 UTC 2017
> build_id              : 23ac95af74d2e3f84c90068ae674c34e764649e7
> xend_config_format    : 4
>
> What else could we try to resolve this issue?
>
> Thanks in advance,
>
> Kojedzinszky Richárd
> Euronet Magyarorszag Informatika Zrt.
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most engaging tech
> sites, Slashdot.org!
> http://sdm.link/slashdot
> _______________________________________________
> E1000-devel mailing list
> E1000-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/e1000-devel
> To learn more about Intel® Ethernet, visit
> http://communities.intel.com/community/wired