Hi,

Allow me to add my grain of salt to this; sorry if it's too long.
We have a system exhibiting the same behaviour, and it is certainly not
DRBD-related, as that system doesn't use DRBD.

It runs Xen 4.4.2 though with kernel 3.19.3.
And we have seen the same behaviour with 3.14.36 and 3.18.10.
Xen and kernel are in-house compiled.

Symptoms: the kernel logs a transmit queue timeout before the machine either
simply crashes or becomes unresponsive network-wise. A few instances:

Apr  7 15:33:10  kernel: [459694.670896] ------------[ cut here ]------------
Apr  7 15:33:10  kernel: [459694.670921] WARNING: CPU: 0 PID: 0 at 
net/sched/sch_generic.c:303 dev_watchdog+0x165/0x20d()
Apr  7 15:33:10  kernel: [459694.670924] NETDEV WATCHDOG: eth0 (igb): transmit 
queue 6 timed out
Apr  7 15:33:10 agrippa kernel: [459694.670927] Modules linked in: st nfsv3 
nfsv4 xt_physdev br_netfilter autofs4 parport_pc af_packet ppdev lp 
xen_acpi_processor parport xen_netback xen_blkback xen_gntalloc fuse 
rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache 
sunrpc bridge stp llc ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat 
nf_nat_ipv4 nf_nat ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 
xt_state nf_conntrack xt_tcpudp iptable_filter iptable_mangle ip_tables 
x_tables ipmi_devintf iTCO_wdt iTCO_vendor_support coretemp pcspkr i2c_i801 
evbug evdev joydev lpc_ich i7core_edac ioatdma edac_core i5500_temp rtc_cmos 
ipmi_si tpm_tis ipmi_msghandler tpm button processor usbkbd usbmouse usbhid sg 
igb i2c_algo_bit ehci_pci uhci_hcd i2c_core ehci_hcd usbcore ixgbe dca e1000e 
hwmon ptp pps_core mdio megaraid_sas
Apr  7 15:33:10 agrippa kernel: [459694.671006] CPU: 0 PID: 0 Comm: swapper/0 
Not tainted 3.19.3-i686-64-smp #1
Apr  7 15:33:10 agrippa kernel: [459694.671009] Hardware name: Supermicro 
X8DTU-6+/X8DTU-6+, BIOS 2.1b       11/15/2011


For 3.14.33:

Mar 15 05:45:43 agrippa kernel: [1612129.231013] ------------[ cut here 
]------------
Mar 15 05:45:43 agrippa kernel: [1612129.231024] WARNING: CPU: 14 PID: 0 at 
net/sched/sch_generic.c:264 dev_watchdog+0x161/0x216()
Mar 15 05:45:43 agrippa kernel: [1612129.231027] NETDEV WATCHDOG: eth0 (igb): 
transmit queue 2 timed out
Mar 15 05:45:43 agrippa kernel: [1612129.231029] Modules linked in: x25 
appletalk ipx p8023 p8022 psnap rose netrom ax25 ipt_MASQUERADE xt_state 
iptable_mangle xt_physdev st ipt_REJECT bridge stp llc nfsv3 nfsv4 autofs4 
parport_pc ppdev lp parport af_packet xen_acpi_processor xen_netback 
xen_blkback xen_gntalloc fuse rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry 
nfs_acl nfs lockd fscache sunrpc ipv6 iptable_filter xt_nat xt_tcpudp 
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack 
ip_tables x_tables ipmi_devintf iTCO_wdt iTCO_vendor_support coretemp microcode 
pcspkr evbug evdev joydev i2c_i801 lpc_ich i7core_edac ioatdma edac_core 
rtc_cmos tpm_tis tpm ipmi_si ipmi_msghandler button processor usbkbd usbmouse 
usbhid sg ehci_pci igb uhci_hcd ehci_hcd i2c_algo_bit i2c_core usbcore ixgbe 
e1000e dca hwmon ptp pps_core megaraid_sas mdio
Mar 15 05:45:43 agrippa kernel: [1612129.231118] CPU: 14 PID: 0 Comm: 
swapper/14 Not tainted 3.14.33-i686-64-smp #1
Mar 15 05:45:43 agrippa kernel: [1612129.231120] Hardware name: Supermicro 
X8DTU-6+/X8DTU-6+, BIOS 2.1b       11/15/2011


This happens around once or twice a week, which is very annoying. We
are still investigating whether it is related specifically to the
onboard Intel Gigabit adapter or not.
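
To help judge whether the timeouts follow one adapter, the watchdog messages
can be tallied per interface, driver and queue. The following is only a
minimal sketch: the function name count_watchdog_timeouts is made up for
illustration, and the pattern assumes the message wording seen in the logs
in this thread.

```python
import re
from collections import Counter

# Pattern for the watchdog message as it appears in the logs in this thread,
# e.g. "NETDEV WATCHDOG: eth0 (igb): transmit queue 6 timed out".
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def count_watchdog_timeouts(log_text):
    """Count timeouts per (interface, driver, queue) in a block of log text.

    Hypothetical helper, not part of any existing tool.
    """
    # Collapse all whitespace so that messages wrapped across lines
    # (as in the mail above) still match; unwrapped syslog is unaffected.
    flat = " ".join(log_text.split())
    counts = Counter()
    for match in WATCHDOG_RE.finditer(flat):
        key = (match["iface"], match["driver"], int(match["queue"]))
        counts[key] += 1
    return counts
```

Feeding it the contents of the kernel log (wherever syslog writes it on the
machine in question) gives a quick view of whether the same interface and
driver show up every time, or whether the affected queue moves around.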

Board Manufacturer: Supermicro, Product Name: X8DTU-6+

~# modinfo igb
filename:       /lib/modules/3.19.3-i686-64-smp/kernel/drivers/net/ethernet/intel/igb/igb.ko
version:        5.2.15-k
license:        GPL
description:    Intel(R) Gigabit Ethernet Network Driver
author:         Intel Corporation, <[email protected]>
[..%<..snip..%<..]

HTH,
jf


* Igor Novgorodov <[email protected]> [20150410 15:32]:
> I'd second previous poster - try another kernel, long-term-supported
> 3.18 preferably.
> DRBD by itself is rock solid, your problems with timeouts lie elsewhere.
> I've been running DRBD server pairs on custom built 3.14 LTS kernel
> with a year+ uptimes, no problems at all.
> 
> And don't use Intel's out-of-kernel drivers unless you clearly see
> you need them. The in-kernel ones are very good
> and will surely be more stable.
> 
> So, concluding:
> 1. LTS kernel
> 2. In-kernel DRBD
> 3. In-kernel drivers
> 4. Maybe try to stress-test the system without Xen, only DRBD stuff
> (run fio tool in random read-write for a couple of days on DRBD
> device).
> 5. Don't use unstable distributions, like Jessie, it's better to
> backport the needed stuff to Wheezy if it's not already in
> wheezy-backports repo.
> 
> On 09/04/15 23:07, Alan Evetts wrote:
> >Hi there,
> >
> >I am reaching out because we have been trying to find stability in our move 
> >to DRBD as it is amazing in concept, but have struggled for 6 months of 
> >time.  I am going to just lay out everything we are doing, as the problem 
> >starts and stops when we introduce/remove DRBD from the picture.  Obviously, 
> >these setups get complicated so hopefully this isn’t too much information 
> >here.
> >
> >What we are trying to do is have a pair of Dell R610 machines, each running 
> >DRBD and xen with about 8 DRBD partitions, each master running half of the 
> >Xen virtual machines.
> >
> >Seems, between 1 and 20 days we always receive a kernel panic on 1 machine, 
> >which will often drag down the second machine.  Details of the most recent 
> >panic are below.
> >
> >In order to rule out problems we have:
> >     - Replaced both Dell R610 (have 4 now total, all the same problem)
> >     - Upgraded to Debian Jessie  from Debian Wheezy
> >     - Running  xen-hypervisor-4.4-amd64,  drbd debian version 8.9.2~rc1-2, 
> > kernel  3.16.0-4
> >     - Switched from the on-board broadcom NICs to Intel E1G44HTBLK  4 port 
> > PCI-e NIC
> >     - Upgraded to igb kernel module 5.2.17 and rebuilt it into the initrd 
> > as well
> >
> >
> >The 2 servers both have lots of resources (64 gigs of ram, quad xeon 2.4, 6 
> >* 1 TB drives in a raid 10).  There is a cross over cable on ETH3 for DRBD, 
> >each drbd instance runs on its own port on ETH3.  The Xen config runs on a 
> >bridge.
> >
> >The problem has more or less been the same as we’ve moved through all of the 
> >hardware and software versions over the past 6 months.  It rotates between 
> >the servers.
> >
> >I am hoping someone can spot a problem in our config, or guide us on what to 
> >try from here.  All 4 dell machines have been patched and had the 
> >diagnostics ran on them without issue.
> >
> >The problem.  One of the machines will have a transmit queue time-out on an 
> >interface (oddly, not necessarily the drbd interface - but usually).   From 
> >there, a panic, and the NIC will start going up and down.  This then starts 
> >to drive the load up, the machines soon become unresponsive over shell.  
> >Connected over the dRAC remote access port, sooner or later we see errors 
> >about the drives not responding, I think this is from the load but I do not 
> >know for sure.  From this point the machine will sometimes drag down its 
> >paired DRBD machine, and sometimes not.  The one with the crash needs a hard 
> >reboot at this point.
> >
> >We love DRBD, its simplicity and functionality, but it introduces these 
> >frequent crashes, which are not worth it.  Hoping someone can spot an error 
> >we are making here, or has ideas on what to try.
> >
> >Thanks in advance for any help.  And FYI, this crash used to happen in the 
> >broadcom queue, now it's the intel queue, and only when we have drbd enabled.
> >
> >
> >
> >Apr  9 03:39:17 v2 kernel: [141714.850432] ------------[ cut here 
> >]------------
> >Apr  9 03:39:17 v2 kernel: [141714.850521] WARNING: CPU: 0 PID: 0 at 
> >/build/linux-y7bjb0/linux-3.16.7-ckt4/net/sched/sch_generic.c:264 
> >dev_watchdog+0x236/0x240()
> >Apr  9 03:39:17 v2 kernel: [141714.850527] NETDEV WATCHDOG: eth1 (igb): 
> >transmit queue 0 timed out
> >Apr  9 03:39:17 v2 kernel: [141714.850531] Modules linked in: xt_tcpudp 
> >xt_physdev iptable_filter ip_tables x_tables xen_netback xen_blkback 
> >nfnetlink_queue nfnetlink_log nfnetlink bluetooth 6lowpan_iphc rfkill 
> >xen_gntdev xen_evtchn xenfs xen_privcmd nfsd auth_rpcgss oid_registry 
> >nfs_acl nfs lockd fscache sunrpc bridge stp llc ttm drm_kms_helper joydev 
> >drm i2c_algo_bit i2c_core pcspkr wmi iTCO_wdt iTCO_vendor_support psmouse 
> >dcdbas serio_raw evdev tpm_tis tpm lpc_ich mfd_core acpi_power_meter button 
> >coretemp i7core_edac edac_core shpchp processor thermal_sys loop 
> >ipmi_watchdog ipmi_si ipmi_poweroff ipmi_devintf ipmi_msghandler drbd 
> >lru_cache libcrc32c autofs4 ext4 crc16 mbcache jbd2 dm_mod sg sd_mod 
> >crc_t10dif crct10dif_generic sr_mod cdrom ses crct10dif_common enclosure 
> >ata_generic hid_generic usbhid hid crc32c_intel ata_piix ehci_pci uhci_hcd 
> >libata igb(O) megaraid_sas ehci_hcd scsi_mod usbcore dca ptp usb_common 
> >pps_core
> >Apr  9 03:39:17 v2 kernel: [141714.850609] CPU: 0 PID: 0 Comm: swapper/0 
> >Tainted: G           O  3.16.0-4-amd64 #1 Debian 3.16.7-ckt4-3
> >Apr  9 03:39:17 v2 kernel: [141714.850613] Hardware name: Dell Inc. 
> >PowerEdge R610/0XDN97, BIOS 6.4.0 07/23/2013
> >Apr  9 03:39:17 v2 kernel: [141714.850617]  0000000000000009 
> >ffffffff815096a7 ffff880079e03e28 ffffffff810676f7
> >Apr  9 03:39:17 v2 kernel: [141714.850622]  0000000000000000 
> >ffff880079e03e78 0000000000000010 0000000000000000
> >Apr  9 03:39:17 v2 kernel: [141714.850626]  ffff8800445c8000 
> >ffffffff8106775c ffffffff81777270 ffffffff00000030
> >Apr  9 03:39:17 v2 kernel: [141714.850631] Call Trace:
> >Apr  9 03:39:17 v2 kernel: [141714.850635]  <IRQ>  [<ffffffff815096a7>] ? 
> >dump_stack+0x41/0x51
> >Apr  9 03:39:17 v2 kernel: [141714.850652]  [<ffffffff810676f7>] ? 
> >warn_slowpath_common+0x77/0x90
> >Apr  9 03:39:17 v2 kernel: [141714.850660]  [<ffffffff8106775c>] ? 
> >warn_slowpath_fmt+0x4c/0x50
> >Apr  9 03:39:17 v2 kernel: [141714.850669]  [<ffffffff81074647>] ? 
> >mod_timer+0x127/0x1e0
> >Apr  9 03:39:17 v2 kernel: [141714.850676]  [<ffffffff8143ce76>] ? 
> >dev_watchdog+0x236/0x240
> >Apr  9 03:39:17 v2 kernel: [141714.850681]  [<ffffffff8143cc40>] ? 
> >dev_graft_qdisc+0x70/0x70
> >Apr  9 03:39:17 v2 kernel: [141714.850686]  [<ffffffff810729b1>] ? 
> >call_timer_fn+0x31/0x100
> >Apr  9 03:39:17 v2 kernel: [141714.850691]  [<ffffffff8143cc40>] ? 
> >dev_graft_qdisc+0x70/0x70
> >Apr  9 03:39:17 v2 kernel: [141714.850698]  [<ffffffff81073fe9>] ? 
> >run_timer_softirq+0x209/0x2f0
> >Apr  9 03:39:17 v2 kernel: [141714.850704]  [<ffffffff8106c591>] ? 
> >__do_softirq+0xf1/0x290
> >Apr  9 03:39:17 v2 kernel: [141714.850709]  [<ffffffff8106c965>] ? 
> >irq_exit+0x95/0xa0
> >Apr  9 03:39:17 v2 kernel: [141714.850718]  [<ffffffff813579c5>] ? 
> >xen_evtchn_do_upcall+0x35/0x50
> >Apr  9 03:39:17 v2 kernel: [141714.850725]  [<ffffffff8151141e>] ? 
> >xen_do_hypervisor_callback+0x1e/0x30
> >Apr  9 03:39:17 v2 kernel: [141714.850728]  <EOI>  [<ffffffff810013aa>] ? 
> >xen_hypercall_sched_op+0xa/0x20
> >Apr  9 03:39:17 v2 kernel: [141714.850737]  [<ffffffff810013aa>] ? 
> >xen_hypercall_sched_op+0xa/0x20
> >Apr  9 03:39:17 v2 kernel: [141714.850746]  [<ffffffff81009e0c>] ? 
> >xen_safe_halt+0xc/0x20
> >Apr  9 03:39:17 v2 kernel: [141714.850756]  [<ffffffff8101c959>] ? 
> >default_idle+0x19/0xb0
> >Apr  9 03:39:17 v2 kernel: [141714.850764]  [<ffffffff810a7dc0>] ? 
> >cpu_startup_entry+0x340/0x400
> >Apr  9 03:39:17 v2 kernel: [141714.850770]  [<ffffffff81902071>] ? 
> >start_kernel+0x492/0x49d
> >Apr  9 03:39:17 v2 kernel: [141714.850775]  [<ffffffff81901a04>] ? 
> >set_init_arg+0x4e/0x4e
> >Apr  9 03:39:17 v2 kernel: [141714.850781]  [<ffffffff81903f64>] ? 
> >xen_start_kernel+0x569/0x573
> >Apr  9 03:39:17 v2 kernel: [141714.850785] ---[ end trace ee11063cf033829a 
> >]---
> >Apr  9 03:39:17 v2 kernel: [141714.871945] br1: port 1(eth1) entered 
> >disabled state
> >Apr  9 03:39:20 v2 kernel: [141718.210743] igb 0000:05:00.1 eth1: igb: eth1 
> >NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> >Apr  9 03:39:20 v2 kernel: [141718.210913] br1: port 1(eth1) entered 
> >forwarding state
> >Apr  9 03:39:20 v2 kernel: [141718.210923] br1: port 1(eth1) entered 
> >forwarding state
> >Apr  9 03:39:26 v2 kernel: [141723.863194] br1: port 1(eth1) entered 
> >disabled state
> >Apr  9 03:39:30 v2 kernel: [141727.650897] igb 0000:05:00.1 eth1: igb: eth1 
> >NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> >Apr  9 03:39:30 v2 kernel: [141727.651040] br1: port 1(eth1) entered 
> >forwarding state
> >Apr  9 03:39:30 v2 kernel: [141727.651053] br1: port 1(eth1) entered 
> >forwarding state
> >Apr  9 03:39:31 v2 kernel: [141728.890509] ata1: lost interrupt (Status 0x50)
> >Apr  9 03:39:31 v2 kernel: [141728.890560] sr 1:0:0:0: CDB:
> >Apr  9 03:39:31 v2 kernel: [141728.890563] Get event status notification: 4a 
> >01 00 00 10 00 00 00 08 00
> >Apr  9 03:39:31 v2 kernel: [141728.890630] ata1: hard resetting link
> >Apr  9 03:39:31 v2 kernel: [141729.366592] ata1: SATA link up 1.5 Gbps 
> >(SStatus 113 SControl 300)
> >Apr  9 03:39:32 v2 kernel: [141729.406749] ata1.00: configured for UDMA/100
> >Apr  9 03:39:32 v2 kernel: [141729.408192] ata1: EH complete
> >Apr  9 03:39:35 v2 kernel: [141732.711653] br1: port 1(eth1) entered 
> >disabled state
> >Apr  9 03:39:37 v2 kernel: [141734.678485] drbd s3: peer( Primary -> Unknown 
> >) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> >Apr  9 03:39:37 v2 kernel: [141734.678846] drbd s3: asender terminated
> >Apr  9 03:39:37 v2 kernel: [141734.678852] drbd s3: Terminating drbd_a_s3
> >Apr  9 03:39:37 v2 kernel: [141734.678956] drbd s3: Connection closed
> >Apr  9 03:39:37 v2 kernel: [141734.678972] drbd s3: conn( NetworkFailure -> 
> >Unconnected )
> >Apr  9 03:39:37 v2 kernel: [141734.678974] drbd s3: receiver terminated
> >Apr  9 03:39:37 v2 kernel: [141734.678976] drbd s3: Restarting receiver 
> >thread
> >Apr  9 03:39:37 v2 kernel: [141734.678977] drbd s3: receiver (re)started
> >Apr  9 03:39:37 v2 kernel: [141734.678987] drbd s3: conn( Unconnected -> 
> >WFConnection )
> >Apr  9 03:39:38 v2 kernel: [141735.718898] igb 0000:05:00.1 eth1: igb: eth1 
> >NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> >Apr  9 03:39:38 v2 kernel: [141735.719086] br1: port 1(eth1) entered 
> >forwarding state
> >Apr  9 03:39:38 v2 kernel: [141735.719095] br1: port 1(eth1) entered 
> >forwarding state
> >Apr  9 03:39:39 v2 kernel: [141737.154575] drbd s4: peer( Secondary -> 
> >Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> >Apr  9 03:39:39 v2 kernel: [141737.154671] block drbd1: new current UUID 
> >461FF401E0489AAB:9279A3BA4A3A710B:0E977CC4BB5727A9:0E967CC4BB5727A9
> >Apr  9 03:39:39 v2 kernel: [141737.154921] drbd s4: asender terminated
> >Apr  9 03:39:39 v2 kernel: [141737.154928] drbd s4: Terminating drbd_a_s4
> >Apr  9 03:39:39 v2 kernel: [141737.155289] drbd s4: Connection closed
> >Apr  9 03:39:39 v2 kernel: [141737.155579] drbd s4: conn( NetworkFailure -> 
> >Unconnected )
> >Apr  9 03:39:39 v2 kernel: [141737.155583] drbd s4: receiver terminated
> >Apr  9 03:39:39 v2 kernel: [141737.155585] drbd s4: Restarting receiver 
> >thread
> >Apr  9 03:39:39 v2 kernel: [141737.155586] drbd s4: receiver (re)started
> >Apr  9 03:39:39 v2 kernel: [141737.155601] drbd s4: conn( Unconnected -> 
> >WFConnection )
> >Apr  9 03:39:41 v2 kernel: [141738.458578] drbd n5: peer( Secondary -> 
> >Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> >Apr  9 03:39:41 v2 kernel: [141738.458671] block drbd8: new current UUID 
> >808265F24E5A3F21:B63FFF468380B383:240D9C7D536ACB97:240C9C7D536ACB97
> >Apr  9 03:39:41 v2 kernel: [141738.458885] drbd n5: asender terminated
> >Apr  9 03:39:41 v2 kernel: [141738.458893] drbd n5: Terminating drbd_a_n5
> >Apr  9 03:39:41 v2 kernel: [141738.459160] drbd n5: Connection closed
> >Apr  9 03:39:41 v2 kernel: [141738.459316] drbd n5: conn( NetworkFailure -> 
> >Unconnected )
> >Apr  9 03:39:41 v2 kernel: [141738.459319] drbd n5: receiver terminated
> >Apr  9 03:39:41 v2 kernel: [141738.459321] drbd n5: Restarting receiver 
> >thread
> >Apr  9 03:39:41 v2 kernel: [141738.459322] drbd n5: receiver (re)started
> >Apr  9 03:39:41 v2 kernel: [141738.459336] drbd n5: conn( Unconnected -> 
> >WFConnection )
> >Apr  9 03:39:44 v2 kernel: [141742.202552] drbd r1: peer( Primary -> Unknown 
> >) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> >Apr  9 03:39:44 v2 kernel: [141742.202913] drbd r1: asender terminated
> >Apr  9 03:39:44 v2 kernel: [141742.202920] drbd r1: Terminating drbd_a_r1
> >Apr  9 03:39:44 v2 kernel: [141742.203023] drbd r1: Connection closed
> >Apr  9 03:39:44 v2 kernel: [141742.203039] drbd r1: conn( NetworkFailure -> 
> >Unconnected )
> >Apr  9 03:39:44 v2 kernel: [141742.203041] drbd r1: receiver terminated
> >Apr  9 03:39:44 v2 kernel: [141742.203043] drbd r1: Restarting receiver 
> >thread
> >Apr  9 03:39:44 v2 kernel: [141742.203044] drbd r1: receiver (re)started
> >Apr  9 03:39:44 v2 kernel: [141742.203054] drbd r1: conn( Unconnected -> 
> >WFConnection )
> >
> >
> >
> >Etc.
> >
> >
> >
> >
> >_______________________________________________
> >drbd-user mailing list
> >[email protected]
> >http://lists.linbit.com/mailman/listinfo/drbd-user
> 