It's difficult to troubleshoot a problem that's erratic and hard to replicate. The next time it happens, a dump of the registers, along with the status from the link partner, would help. I wonder whether something on the other side (is it a switch?) sees something odd and disables the port, and a reboot simply takes long enough for the port to come back.
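As a rough illustration of what could be captured on the host side while the interface is still wedged, here is a minimal Python sketch (Python 3.7 or later assumed). The function names `diagnostic_commands` and `collect` are hypothetical, and the choice of commands is an assumption about what is useful, not a list requested in this thread. The switch-side port status still has to be pulled on the link partner itself.

```python
import subprocess


def diagnostic_commands(iface):
    """Return the commands to capture before rebooting (an assumed, illustrative set)."""
    return [
        ["ethtool", iface],                    # link state, speed, duplex as the NIC reports it
        ["ethtool", "-i", iface],              # driver name and version bound to the interface
        ["ethtool", "-d", iface],              # register dump
        ["ethtool", "-S", iface],              # NIC and per-queue statistics
        ["ip", "-s", "link", "show", iface],   # kernel-side interface counters
        ["dmesg"],                             # kernel ring buffer, including the watchdog trace
    ]


def collect(iface, run=None):
    """Run every diagnostic command and return {command string: output}.

    `run` is injectable so the function can be exercised without a real NIC.
    """
    if run is None:
        def run(cmd):
            result = subprocess.run(cmd, capture_output=True, text=True, check=False)
            return result.stdout + result.stderr

    report = {}
    for cmd in diagnostic_commands(iface):
        key = " ".join(cmd)
        try:
            report[key] = run(cmd)
        except OSError as exc:
            # The tool is missing or not executable; record that instead of aborting.
            report[key] = "failed: %s" % exc
    return report
```

Calling `collect("eth0")` as root and writing the resulting dictionary to local disk before rebooting would preserve the state for later comparison against a healthy machine.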
Todd Fujinaka
Software Application Engineer
Networking Division (ND)
Intel Corporation
todd.fujin...@intel.com
(503) 712-4565

-----Original Message-----
From: Brandon Whaley [mailto:redkr...@gmail.com]
Sent: Monday, June 08, 2015 10:27 AM
To: e1000-devel@lists.sourceforge.net
Subject: [E1000-devel] igb driver sometimes stops responding after dkms build

I use dkms to build the igb driver after new kernel installs on my fleet of servers using the following commands after every yum update:

dkms build -m igb -v 5.2.18
dkms install -m igb -v 5.2.18

About once a month, one of my boxes (a different one each time) will stop responding after this. Nothing I do is able to recover network connectivity short of a reboot (not loading/unloading the driver, restarting networking, etc.) and since these are production machines, it causes some downtime for us. Below is what you see in the syslog when the event occurs:

Jun 8 12:29:19 localhost kernel: [147419.444969] ------------[ cut here ]------------
Jun 8 12:29:19 localhost kernel: [147419.444978] WARNING: at net/sched/sch_generic.c:267 dev_watchdog+0x26b/0x280() (Tainted: G --------------- T)
Jun 8 12:29:19 localhost kernel: [147419.444981] Hardware name: X9DRL-3F/iF
Jun 8 12:29:19 localhost kernel: [147419.444982] NETDEV WATCHDOG: eth0 (igb): transmit queue 0 timed out
Jun 8 12:29:19 localhost kernel: [147419.444984] Modules linked in: mpt3sas mpt2sas raid_class mptctl mptbase ip6t_rt ipt_addrtype xt_policy aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 aes_generic cbc kcare(U) vzethdev pio_kaio pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop simfs vziolimit vzdquota ip6t_REJECT xfrm6_mode_tunnel xfrm4_mode_tunnel nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_netlink xt_comment nfsd ip6_tunnel ip_vs ipip xt_NFQUEUE xt_pkttype ecryptfs(T) ip_gre ip_tunnel ipt_MASQUERADE nf_nat_irc xt_helper nf_conntrack_irc nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6t_LOG xt_connlimit xt_recent pppoatm atm vzrst vzcpt nfs lockd fscache auth_rpcgss nfs_acl sunrpc xfrm4_mode_transport xfrm6_mode_transport ccm authenc esp6 ah6 cnic uio xfrm4_tunnel tunnel4 ipcomp6 xfrm6_tunnel tunnel6 ipcomp xfrm_ipcomp esp4 ah4 af_key arc4 ecb ppp_mppe ppp_deflate zlib_deflate ppp_async ppp_generic slhc crc_ccitt fuse tun xt_MARK xt_mark vzevent autofs4 vznetdev vzmon vzdev ipt
Jun 8 12:29:19 localhost kernel: _REDIRECT xt_owner nf_nat_ftp nf_conntrack_ftp iptable_nat nf_nat xt_state xt_length xt_hl xt_tcpmss xt_TCPMSS xt_multiport xt_limit nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ipt_LOG xt_DSCP xt_dscp ipt_REJECT iptable_mangle xt_set iptable_filter iptable_raw ip_tables ip6table_mangle ip6table_filter ip6table_raw ip6_tables ipv6 ip_set_hash_ip ip_set nfnetlink iTCO_wdt iTCO_vendor_support ipmi_devintf ipmi_si ipmi_msghandler acpi_pad e1000e(U) ses enclosure sg igb(U) dca i2c_algo_bit ptp pps_core sb_edac edac_core i2c_i801 i2c_core lpc_ich mfd_core shpchp tcp_htcp ext4 jbd2 mbcache sd_mod crc_t10dif isci libsas scsi_transport_sas ahci megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Jun 8 12:29:19 localhost kernel: [147419.445094] Pid: 0, comm: swapper veid: 0 Tainted: G --------------- T 2.6.32-042stab108.2 #1
Jun 8 12:29:19 localhost kernel: [147419.445096] Call Trace:
Jun 8 12:29:19 localhost kernel: [147419.445098] <IRQ> [<ffffffff8107b827>] ? warn_slowpath_common+0x87/0xc0
Jun 8 12:29:19 localhost kernel: [147419.445109] [<ffffffff8107b916>] ? warn_slowpath_fmt+0x46/0x50
Jun 8 12:29:19 localhost kernel: [147419.445112] [<ffffffff8148fccb>] ? dev_watchdog+0x26b/0x280
Jun 8 12:29:19 localhost kernel: [147419.445117] [<ffffffff81015009>] ? sched_clock+0x9/0x10
Jun 8 12:29:19 localhost kernel: [147419.445123] [<ffffffff8108f6cc>] ? run_timer_softirq+0x1bc/0x380
Jun 8 12:29:19 localhost kernel: [147419.445126] [<ffffffff8148fa60>] ? dev_watchdog+0x0/0x280
Jun 8 12:29:19 localhost kernel: [147419.445130] [<ffffffff81034ddd>] ? lapic_next_event+0x1d/0x30
Jun 8 12:29:19 localhost kernel: [147419.445134] [<ffffffff81084c7d>] ? __do_softirq+0x10d/0x250
Jun 8 12:29:19 localhost kernel: [147419.445139] [<ffffffff8100c48c>] ? call_softirq+0x1c/0x30
Jun 8 12:29:19 localhost kernel: [147419.445142] [<ffffffff810102b5>] ? do_softirq+0x65/0xa0
Jun 8 12:29:19 localhost kernel: [147419.445145] [<ffffffff81084a9d>] ? irq_exit+0xcd/0xd0
Jun 8 12:29:19 localhost kernel: [147419.445149] [<ffffffff8153f44a>] ? smp_apic_timer_interrupt+0x4a/0x60
Jun 8 12:29:19 localhost kernel: [147419.445152] [<ffffffff8100bc93>] ? apic_timer_interrupt+0x13/0x20
Jun 8 12:29:19 localhost kernel: [147419.445154] <EOI> [<ffffffff812fa20e>] ? intel_idle+0xde/0x170
Jun 8 12:29:19 localhost kernel: [147419.445160] [<ffffffff812fa1f1>] ? intel_idle+0xc1/0x170
Jun 8 12:29:19 localhost kernel: [147419.445166] [<ffffffff81435d27>] ? cpuidle_idle_call+0xa7/0x140
Jun 8 12:29:19 localhost kernel: [147419.445170] [<ffffffff8100a026>] ? cpu_idle+0xb6/0x110
Jun 8 12:29:19 localhost kernel: [147419.445174] [<ffffffff8152df04>] ? start_secondary+0x2be/0x301
Jun 8 12:29:19 localhost kernel: [147419.445179] ---[ end trace 2179b48f00e92658 ]---
Jun 8 12:29:19 localhost kernel: [147419.445180] Tainting kernel with flag 0x9
Jun 8 12:29:19 localhost kernel: [147419.445182] Pid: 0, comm: swapper veid: 0 Tainted: G --------------- T 2.6.32-042stab108.2 #1
Jun 8 12:29:19 localhost kernel: [147419.445184] Call Trace:
Jun 8 12:29:19 localhost kernel: [147419.445185] <IRQ> [<ffffffff8107b6b1>] ? add_taint+0x71/0x80
Jun 8 12:29:19 localhost kernel: [147419.445190] [<ffffffff8107b834>] ? warn_slowpath_common+0x94/0xc0
Jun 8 12:29:19 localhost kernel: [147419.445193] [<ffffffff8107b916>] ? warn_slowpath_fmt+0x46/0x50
Jun 8 12:29:19 localhost kernel: [147419.445196] [<ffffffff8148fccb>] ? dev_watchdog+0x26b/0x280
Jun 8 12:29:19 localhost kernel: [147419.445199] [<ffffffff81015009>] ? sched_clock+0x9/0x10
Jun 8 12:29:19 localhost kernel: [147419.445204] [<ffffffff8108f6cc>] ? run_timer_softirq+0x1bc/0x380
Jun 8 12:29:19 localhost kernel: [147419.445206] [<ffffffff8148fa60>] ? dev_watchdog+0x0/0x280
Jun 8 12:29:19 localhost kernel: [147419.445209] [<ffffffff81034ddd>] ? lapic_next_event+0x1d/0x30
Jun 8 12:29:19 localhost kernel: [147419.445213] [<ffffffff81084c7d>] ? __do_softirq+0x10d/0x250
Jun 8 12:29:19 localhost kernel: [147419.445217] [<ffffffff8100c48c>] ? call_softirq+0x1c/0x30
Jun 8 12:29:19 localhost kernel: [147419.445219] [<ffffffff810102b5>] ? do_softirq+0x65/0xa0
Jun 8 12:29:19 localhost kernel: [147419.445222] [<ffffffff81084a9d>] ? irq_exit+0xcd/0xd0
Jun 8 12:29:19 localhost kernel: [147419.445225] [<ffffffff8153f44a>] ? smp_apic_timer_interrupt+0x4a/0x60
Jun 8 12:29:19 localhost kernel: [147419.445227] [<ffffffff8100bc93>] ? apic_timer_interrupt+0x13/0x20
Jun 8 12:29:19 localhost kernel: [147419.445229] <EOI> [<ffffffff812fa20e>] ? intel_idle+0xde/0x170
Jun 8 12:29:19 localhost kernel: [147419.445233] [<ffffffff812fa1f1>] ? intel_idle+0xc1/0x170
Jun 8 12:29:19 localhost kernel: [147419.445238] [<ffffffff81435d27>] ? cpuidle_idle_call+0xa7/0x140
Jun 8 12:29:19 localhost kernel: [147419.445241] [<ffffffff8100a026>] ? cpu_idle+0xb6/0x110
Jun 8 12:29:19 localhost kernel: [147419.445243] [<ffffffff8152df04>] ? start_secondary+0x2be/0x301
Jun 8 12:29:27 localhost kernel: [147427.448304] igb 0000:03:00.0: eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Jun 8 12:29:45 localhost kernel: [147445.478853] igb 0000:03:00.0: eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Jun 8 12:29:59 localhost kernel: [147459.496009] igb 0000:03:00.0: eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Jun 8 12:30:13 localhost kernel: [147473.459481] igb 0000:03:00.0: eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Jun 8 12:30:27 localhost kernel: [147487.460716] igb 0000:03:00.0: eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Jun 8 12:30:45 localhost kernel: [147505.498234] igb 0000:03:00.0: eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Jun 8 12:31:03 localhost kernel: [147523.513873] igb 0000:03:00.0: eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

Is there any more information I can provide the next time this happens? If trends continue I should see it again in 3-5 weeks and will be able to collect necessary info then. Unfortunately I've yet to find a way to replicate this on demand.
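Since the event cannot be reproduced on demand, one option is to watch the syslog for the two signatures in the excerpt above: the `NETDEV WATCHDOG` timeout and the run of `NIC Link is Up` messages that follows it at intervals of roughly 14 to 18 seconds. The Python sketch below is only an illustration. The function names `summarize` and `link_up_gaps` are hypothetical, and the regular expressions assume the exact message layout shown in this log, which may differ on other kernels or driver versions.

```python
import re

# Patterns matched against the message formats seen in the syslog excerpt above.
WATCHDOG = re.compile(
    r"NETDEV WATCHDOG: (\S+) \((\w+)\): transmit queue (\d+) timed out"
)
LINK_UP = re.compile(
    r"\[\s*(\d+\.\d+)\] igb \S+ (\S+): igb: \S+ NIC Link is Up"
)


def summarize(lines):
    """Return (watchdog events, {interface: [kernel timestamps of link-up messages]})."""
    watchdogs = []
    link_up_times = {}
    for line in lines:
        match = WATCHDOG.search(line)
        if match:
            iface, driver, queue = match.group(1), match.group(2), int(match.group(3))
            watchdogs.append((iface, driver, queue))
            continue
        match = LINK_UP.search(line)
        if match:
            link_up_times.setdefault(match.group(2), []).append(float(match.group(1)))
    return watchdogs, link_up_times


def link_up_gaps(times):
    """Return the seconds elapsed between consecutive link-up messages."""
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

Feeding the lines of `/var/log/messages` to `summarize` and alerting when a watchdog event is followed by several closely spaced link-up messages would give a chance to capture state before rebooting. Repeated link-up messages with no link-down in between would be consistent with the adapter being reset over and over, though the log alone does not prove that.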