Hi,

Allow me to add my grain of salt to this; sorry if it's too long.

We have a system exhibiting the same behaviour, and it is certainly not DRBD-related, as it doesn't use DRBD.
It does run Xen 4.4.2, though, with kernel 3.19.3, and we have seen the same behaviour with 3.14.36 and 3.18.10. Xen and the kernel are compiled in-house.

Symptoms: the following shows up in the logs before the machine either simply crashes or becomes unresponsive network-wise. A few instances:

Apr 7 15:33:10 kernel: [459694.670896] ------------[ cut here ]------------
Apr 7 15:33:10 kernel: [459694.670921] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x165/0x20d()
Apr 7 15:33:10 kernel: [459694.670924] NETDEV WATCHDOG: eth0 (igb): transmit queue 6 timed out
Apr 7 15:33:10 agrippa kernel: [459694.670927] Modules linked in: st nfsv3 nfsv4 xt_physdev br_netfilter autofs4 parport_pc af_packet ppdev lp xen_acpi_processor parport xen_netback xen_blkback xen_gntalloc fuse rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc bridge stp llc ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp iptable_filter iptable_mangle ip_tables x_tables ipmi_devintf iTCO_wdt iTCO_vendor_support coretemp pcspkr i2c_i801 evbug evdev joydev lpc_ich i7core_edac ioatdma edac_core i5500_temp rtc_cmos ipmi_si tpm_tis ipmi_msghandler tpm button processor usbkbd usbmouse usbhid sg igb i2c_algo_bit ehci_pci uhci_hcd i2c_core ehci_hcd usbcore ixgbe dca e1000e hwmon ptp pps_core mdio megaraid_sas
Apr 7 15:33:10 agrippa kernel: [459694.671006] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.19.3-i686-64-smp #1
Apr 7 15:33:10 agrippa kernel: [459694.671009] Hardware name: Supermicro X8DTU-6+/X8DTU-6+, BIOS 2.1b 11/15/2011

For 3.14.33:

Mar 15 05:45:43 agrippa kernel: [1612129.231013] ------------[ cut here ]------------
Mar 15 05:45:43 agrippa kernel: [1612129.231024] WARNING: CPU: 14 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x161/0x216()
Mar 15 05:45:43 agrippa kernel: [1612129.231027] NETDEV WATCHDOG: eth0 (igb): transmit queue 2 timed out
Mar 15 05:45:43 agrippa kernel: [1612129.231029] Modules linked in: x25 appletalk ipx p8023 p8022 psnap rose netrom ax25 ipt_MASQUERADE xt_state iptable_mangle xt_physdev st ipt_REJECT bridge stp llc nfsv3 nfsv4 autofs4 parport_pc ppdev lp parport af_packet xen_acpi_processor xen_netback xen_blkback xen_gntalloc fuse rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc ipv6 iptable_filter xt_nat xt_tcpudp iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables ipmi_devintf iTCO_wdt iTCO_vendor_support coretemp microcode pcspkr evbug evdev joydev i2c_i801 lpc_ich i7core_edac ioatdma edac_core rtc_cmos tpm_tis tpm ipmi_si ipmi_msghandler button processor usbkbd usbmouse usbhid sg ehci_pci igb uhci_hcd ehci_hcd i2c_algo_bit i2c_core usbcore ixgbe e1000e dca hwmon ptp pps_core megaraid_sas mdio
Mar 15 05:45:43 agrippa kernel: [1612129.231118] CPU: 14 PID: 0 Comm: swapper/14 Not tainted 3.14.33-i686-64-smp #1
Mar 15 05:45:43 agrippa kernel: [1612129.231120] Hardware name: Supermicro X8DTU-6+/X8DTU-6+, BIOS 2.1b 11/15/2011

This happens around once or twice a week, which is very annoying. We are still investigating whether it is related specifically to the onboard Intel Gigabit adapter or not.

Board Manufacturer: Supermicro, Product Name: X8DTU-6+

~# modinfo igb
filename: /lib/modules/3.19.3-i686-64-smp/kernel/drivers/net/ethernet/intel/igb/igb.ko
version: 5.2.15-k
license: GPL
description: Intel(R) Gigabit Ethernet Network Driver
author: Intel Corporation, <[email protected]>
[..%<..snip..%<..]

HTH,
jf

* Igor Novgorodov <[email protected]> [20150410 15:32]:
> I'd second previous poster - try another kernel, long-term-supported
> 3.18 preferably.
> DRBD by itself is rock solid, your problems with timeouts lie elsewhere.
> I've been running DRBD server pairs on custom built 3.14 LTS kernel
> with a year+ uptimes, no problems at all.
>
> And don't use Intel's out-of-kernel drivers if you don't clearly see
> you need it. In-kernel ones very good
> and will surely work more stable.
>
> So, concluding:
> 1. LTS kernel
> 2. In-kernel DRBD
> 3. In-kernel drivers
> 4. Maybe try to stress-test the system without Xen, only DRBD stuff
>    (run fio tool in random read-write for a couple of days on DRBD
>    device).
> 5. Don't use unstable distributions, like Jessie, it's better to
>    backport the needed stuff to Wheezy if it's not already in
>    wheezy-backports repo.
>
> On 09/04/15 23:07, Alan Evetts wrote:
> >Hi there,
> >
> >I am reaching out because we have been trying to find stability in our move
> >to DRBD as it is amazing in concept, but have struggled for 6 months of
> >time. I am going to just lay out everything we are doing, as the problem
> >starts and stops when we introduce/remove DRBD from the picture. Obviously,
> >these setups get complicated so hopefully this isn’t too much information
> >here.
> >
> >What we are trying to do is have a pair of Dell R610 machines, each running
> >DRBD and xen with about 8 DRBD partitions, each master running half of the
> >Xen virtual machines.
> >
> >Seems, between 1 and 20 days we always receive a kernel panic on 1 machine,
> >which will often drag down the second machine. Details of the most recent
> >panic are below.
> >
> >In order to rule out problems we have:
> > - Replace both Dell R610 (have 4 now total, all the same problem)
> > - Upgraded to Debian Jessie from Debian Wheezy
> > - Running xen-hypervisor-4.4-amd64, drbd debian version 8.9.2~rc1-2,
> >   kernel 3.16.0-4
> > - Switched from the on-board broadcom NICs to Intel E1G44HTBLK 4 port
> >   PCI-e NIC
> > - Upgraded to igb kernel module 5.2.17 and rebuilt it into the initrd
> >   as well
> >
> >
> >The 2 servers both have lots of resources (64 gigs of ram, quad xeon 2.4, 6
> >* 1 TB drives in a raid 10). There is a cross over cable on ETH3 for DRBD,
> >each drbd instance runs on its own port on ETH3. The Xen config runs on a
> >bridge.
> >
> >The problem has more or less been the same as we’ve moved through all of the
> >hardware and software versions over the past 6 months. It rotates between
> >the servers.
> >
> >I am hoping someone can spot a problem in our config, or guide us on what to
> >try from here. All 4 dell machines have been patched and had the
> >diagnostics ran on them without issue.
> >
> >The problem. One of the machines will have a transit queue time-out on an
> >interface (oddly, not necessarily the drbd interface - but usually). From
> >there, a panic, and the NIC will start going up and down. This then starts
> >to drive the load up, the machines soon become unresponsive over shell.
> >Connected over the dRAC remote access port, sooner or later we see errors
> >about the drives not responding, I think this is from the load but I do not
> >know for sure. From this point the machine will sometimes drag down its
> >paired DRBD machine, and sometimes not. The one with the crash needs a hard
> >reboot at this point.
> >
> >We love DRBD, its simplicity and functionality but it introduces these
> >often crashes which are not worth it. Hoping someone can spot an error we
> >are doing here, or have ideas on what to try.
> >
> >Thanks in advance for any help.. and FYI this crashed used to happen in the
> >broadcom queue, now its the intel queue, and only when we have drbd enabled.
> >
> >
> >
> >Apr 9 03:39:17 v2 kernel: [141714.850432] ------------[ cut here ]------------
> >Apr 9 03:39:17 v2 kernel: [141714.850521] WARNING: CPU: 0 PID: 0 at /build/linux-y7bjb0/linux-3.16.7-ckt4/net/sched/sch_generic.c:264 dev_watchdog+0x236/0x240()
> >Apr 9 03:39:17 v2 kernel: [141714.850527] NETDEV WATCHDOG: eth1 (igb): transmit queue 0 timed out
> >Apr 9 03:39:17 v2 kernel: [141714.850531] Modules linked in: xt_tcpudp xt_physdev iptable_filter ip_tables x_tables xen_netback xen_blkback nfnetlink_queue nfnetlink_log nfnetlink bluetooth 6lowpan_iphc rfkill xen_gntdev xen_evtchn xenfs xen_privcmd nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc bridge stp llc ttm drm_kms_helper joydev drm i2c_algo_bit i2c_core pcspkr wmi iTCO_wdt iTCO_vendor_support psmouse dcdbas serio_raw evdev tpm_tis tpm lpc_ich mfd_core acpi_power_meter button coretemp i7core_edac edac_core shpchp processor thermal_sys loop ipmi_watchdog ipmi_si ipmi_poweroff ipmi_devintf ipmi_msghandler drbd lru_cache libcrc32c autofs4 ext4 crc16 mbcache jbd2 dm_mod sg sd_mod crc_t10dif crct10dif_generic sr_mod cdrom ses crct10dif_common enclosure ata_generic hid_generic usbhid hid crc32c_intel ata_piix ehci_pci uhci_hcd libata igb(O) megaraid_sas ehci_hcd scsi_mod usbcore dca ptp usb_common pps_core
> >Apr 9 03:39:17 v2 kernel: [141714.850609] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 3.16.0-4-amd64 #1 Debian 3.16.7-ckt4-3
> >Apr 9 03:39:17 v2 kernel: [141714.850613] Hardware name: Dell Inc. PowerEdge R610/0XDN97, BIOS 6.4.0 07/23/2013
> >Apr 9 03:39:17 v2 kernel: [141714.850617] 0000000000000009 ffffffff815096a7 ffff880079e03e28 ffffffff810676f7
> >Apr 9 03:39:17 v2 kernel: [141714.850622] 0000000000000000 ffff880079e03e78 0000000000000010 0000000000000000
> >Apr 9 03:39:17 v2 kernel: [141714.850626] ffff8800445c8000 ffffffff8106775c ffffffff81777270 ffffffff00000030
> >Apr 9 03:39:17 v2 kernel: [141714.850631] Call Trace:
> >Apr 9 03:39:17 v2 kernel: [141714.850635] <IRQ> [<ffffffff815096a7>] ? dump_stack+0x41/0x51
> >Apr 9 03:39:17 v2 kernel: [141714.850652] [<ffffffff810676f7>] ? warn_slowpath_common+0x77/0x90
> >Apr 9 03:39:17 v2 kernel: [141714.850660] [<ffffffff8106775c>] ? warn_slowpath_fmt+0x4c/0x50
> >Apr 9 03:39:17 v2 kernel: [141714.850669] [<ffffffff81074647>] ? mod_timer+0x127/0x1e0
> >Apr 9 03:39:17 v2 kernel: [141714.850676] [<ffffffff8143ce76>] ? dev_watchdog+0x236/0x240
> >Apr 9 03:39:17 v2 kernel: [141714.850681] [<ffffffff8143cc40>] ? dev_graft_qdisc+0x70/0x70
> >Apr 9 03:39:17 v2 kernel: [141714.850686] [<ffffffff810729b1>] ? call_timer_fn+0x31/0x100
> >Apr 9 03:39:17 v2 kernel: [141714.850691] [<ffffffff8143cc40>] ? dev_graft_qdisc+0x70/0x70
> >Apr 9 03:39:17 v2 kernel: [141714.850698] [<ffffffff81073fe9>] ? run_timer_softirq+0x209/0x2f0
> >Apr 9 03:39:17 v2 kernel: [141714.850704] [<ffffffff8106c591>] ? __do_softirq+0xf1/0x290
> >Apr 9 03:39:17 v2 kernel: [141714.850709] [<ffffffff8106c965>] ? irq_exit+0x95/0xa0
> >Apr 9 03:39:17 v2 kernel: [141714.850718] [<ffffffff813579c5>] ? xen_evtchn_do_upcall+0x35/0x50
> >Apr 9 03:39:17 v2 kernel: [141714.850725] [<ffffffff8151141e>] ? xen_do_hypervisor_callback+0x1e/0x30
> >Apr 9 03:39:17 v2 kernel: [141714.850728] <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> >Apr 9 03:39:17 v2 kernel: [141714.850737] [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> >Apr 9 03:39:17 v2 kernel: [141714.850746] [<ffffffff81009e0c>] ? xen_safe_halt+0xc/0x20
> >Apr 9 03:39:17 v2 kernel: [141714.850756] [<ffffffff8101c959>] ? default_idle+0x19/0xb0
> >Apr 9 03:39:17 v2 kernel: [141714.850764] [<ffffffff810a7dc0>] ? cpu_startup_entry+0x340/0x400
> >Apr 9 03:39:17 v2 kernel: [141714.850770] [<ffffffff81902071>] ? start_kernel+0x492/0x49d
> >Apr 9 03:39:17 v2 kernel: [141714.850775] [<ffffffff81901a04>] ? set_init_arg+0x4e/0x4e
> >Apr 9 03:39:17 v2 kernel: [141714.850781] [<ffffffff81903f64>] ? xen_start_kernel+0x569/0x573
> >Apr 9 03:39:17 v2 kernel: [141714.850785] ---[ end trace ee11063cf033829a ]---
> >Apr 9 03:39:17 v2 kernel: [141714.871945] br1: port 1(eth1) entered disabled state
> >Apr 9 03:39:20 v2 kernel: [141718.210743] igb 0000:05:00.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> >Apr 9 03:39:20 v2 kernel: [141718.210913] br1: port 1(eth1) entered forwarding state
> >Apr 9 03:39:20 v2 kernel: [141718.210923] br1: port 1(eth1) entered forwarding state
> >Apr 9 03:39:26 v2 kernel: [141723.863194] br1: port 1(eth1) entered disabled state
> >Apr 9 03:39:30 v2 kernel: [141727.650897] igb 0000:05:00.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> >Apr 9 03:39:30 v2 kernel: [141727.651040] br1: port 1(eth1) entered forwarding state
> >Apr 9 03:39:30 v2 kernel: [141727.651053] br1: port 1(eth1) entered forwarding state
> >Apr 9 03:39:31 v2 kernel: [141728.890509] ata1: lost interrupt (Status 0x50)
> >Apr 9 03:39:31 v2 kernel: [141728.890560] sr 1:0:0:0: CDB:
> >Apr 9 03:39:31 v2 kernel: [141728.890563] Get event status notification: 4a 01 00 00 10 00 00 00 08 00
> >Apr 9 03:39:31 v2 kernel: [141728.890630] ata1: hard resetting link
> >Apr 9 03:39:31 v2 kernel: [141729.366592] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> >Apr 9 03:39:32 v2 kernel: [141729.406749] ata1.00: configured for UDMA/100
> >Apr 9 03:39:32 v2 kernel: [141729.408192] ata1: EH complete
> >Apr 9 03:39:35 v2 kernel: [141732.711653] br1: port 1(eth1) entered disabled state
> >Apr 9 03:39:37 v2 kernel: [141734.678485] drbd s3: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> >Apr 9 03:39:37 v2 kernel: [141734.678846] drbd s3: asender terminated
> >Apr 9 03:39:37 v2 kernel: [141734.678852] drbd s3: Terminating drbd_a_s3
> >Apr 9 03:39:37 v2 kernel: [141734.678956] drbd s3: Connection closed
> >Apr 9 03:39:37 v2 kernel: [141734.678972] drbd s3: conn( NetworkFailure -> Unconnected )
> >Apr 9 03:39:37 v2 kernel: [141734.678974] drbd s3: receiver terminated
> >Apr 9 03:39:37 v2 kernel: [141734.678976] drbd s3: Restarting receiver thread
> >Apr 9 03:39:37 v2 kernel: [141734.678977] drbd s3: receiver (re)started
> >Apr 9 03:39:37 v2 kernel: [141734.678987] drbd s3: conn( Unconnected -> WFConnection )
> >Apr 9 03:39:38 v2 kernel: [141735.718898] igb 0000:05:00.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> >Apr 9 03:39:38 v2 kernel: [141735.719086] br1: port 1(eth1) entered forwarding state
> >Apr 9 03:39:38 v2 kernel: [141735.719095] br1: port 1(eth1) entered forwarding state
> >Apr 9 03:39:39 v2 kernel: [141737.154575] drbd s4: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> >Apr 9 03:39:39 v2 kernel: [141737.154671] block drbd1: new current UUID 461FF401E0489AAB:9279A3BA4A3A710B:0E977CC4BB5727A9:0E967CC4BB5727A9
> >Apr 9 03:39:39 v2 kernel: [141737.154921] drbd s4: asender terminated
> >Apr 9 03:39:39 v2 kernel: [141737.154928] drbd s4: Terminating drbd_a_s4
> >Apr 9 03:39:39 v2 kernel: [141737.155289] drbd s4: Connection closed
> >Apr 9 03:39:39 v2 kernel: [141737.155579] drbd s4: conn( NetworkFailure -> Unconnected )
> >Apr 9 03:39:39 v2 kernel: [141737.155583] drbd s4: receiver terminated
> >Apr 9 03:39:39 v2 kernel: [141737.155585] drbd s4: Restarting receiver thread
> >Apr 9 03:39:39 v2 kernel: [141737.155586] drbd s4: receiver (re)started
> >Apr 9 03:39:39 v2 kernel: [141737.155601] drbd s4: conn( Unconnected -> WFConnection )
> >Apr 9 03:39:41 v2 kernel: [141738.458578] drbd n5: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> >Apr 9 03:39:41 v2 kernel: [141738.458671] block drbd8: new current UUID 808265F24E5A3F21:B63FFF468380B383:240D9C7D536ACB97:240C9C7D536ACB97
> >Apr 9 03:39:41 v2 kernel: [141738.458885] drbd n5: asender terminated
> >Apr 9 03:39:41 v2 kernel: [141738.458893] drbd n5: Terminating drbd_a_n5
> >Apr 9 03:39:41 v2 kernel: [141738.459160] drbd n5: Connection closed
> >Apr 9 03:39:41 v2 kernel: [141738.459316] drbd n5: conn( NetworkFailure -> Unconnected )
> >Apr 9 03:39:41 v2 kernel: [141738.459319] drbd n5: receiver terminated
> >Apr 9 03:39:41 v2 kernel: [141738.459321] drbd n5: Restarting receiver thread
> >Apr 9 03:39:41 v2 kernel: [141738.459322] drbd n5: receiver (re)started
> >Apr 9 03:39:41 v2 kernel: [141738.459336] drbd n5: conn( Unconnected -> WFConnection )
> >Apr 9 03:39:44 v2 kernel: [141742.202552] drbd r1: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> >Apr 9 03:39:44 v2 kernel: [141742.202913] drbd r1: asender terminated
> >Apr 9 03:39:44 v2 kernel: [141742.202920] drbd r1: Terminating drbd_a_r1
> >Apr 9 03:39:44 v2 kernel: [141742.203023] drbd r1: Connection closed
> >Apr 9 03:39:44 v2 kernel: [141742.203039] drbd r1: conn( NetworkFailure -> Unconnected )
> >Apr 9 03:39:44 v2 kernel: [141742.203041] drbd r1: receiver terminated
> >Apr 9 03:39:44 v2 kernel: [141742.203043] drbd r1: Restarting receiver thread
> >Apr 9 03:39:44 v2 kernel: [141742.203044] drbd r1: receiver (re)started
> >Apr 9 03:39:44 v2 kernel: [141742.203054] drbd r1: conn( Unconnected -> WFConnection )
> >
> >
> >
> >Etc.
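
To compare how often each interface and transmit queue is hit across machines and kernels, the watchdog lines quoted in this thread could be tallied with something like the rough sketch below. It only assumes the "NETDEV WATCHDOG: <iface> (<driver>): transmit queue <n> timed out" message format seen in the logs above, and that each kernel message sits on one line (i.e. the raw syslog file, not the mail-wrapped copy). The function name is made up for illustration.

```python
import re
from collections import Counter

# Message format as it appears in the logs in this thread, e.g.
# "NETDEV WATCHDOG: eth0 (igb): transmit queue 6 timed out"
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def tally_watchdog_timeouts(lines):
    """Count watchdog timeouts per (interface, driver, queue).

    `lines` is any iterable of log lines, for example an open file.
    Lines that do not contain a watchdog message are ignored.
    """
    counts = Counter()
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            key = (match["iface"], match["driver"], int(match["queue"]))
            counts[key] += 1
    return counts


# Two of the watchdog lines posted above, plus an unrelated link-up line.
sample = [
    "Apr 7 15:33:10 kernel: [459694.670924] NETDEV WATCHDOG: eth0 (igb): "
    "transmit queue 6 timed out",
    "Mar 15 05:45:43 agrippa kernel: [1612129.231027] NETDEV WATCHDOG: eth0 (igb): "
    "transmit queue 2 timed out",
    "Apr 9 03:39:20 v2 kernel: [141718.210743] igb 0000:05:00.1 eth1: igb: eth1 "
    "NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None",
]

for (iface, driver, queue), n in sorted(tally_watchdog_timeouts(sample).items()):
    print(f"{iface} ({driver}) queue {queue}: {n}")
```

Passing an open log file instead of `sample` (the path depends on the local syslog setup) would give the same per-queue breakdown for a whole machine, which may help show whether the timeouts cluster on one NIC or queue.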
> >
> >_______________________________________________
> >drbd-user mailing list
> >[email protected]
> >http://lists.linbit.com/mailman/listinfo/drbd-user
