We've had issues with NIC offloads affecting the stability of many different TCP-based applications, so I tend to unconditionally disable all NIC offloads on any machine that transmits data I care about. I've never seen this hurt performance, and it almost always improves things for TCP-based applications like DRBD. It could also explain why the newer kernels seem affected, as they've likely implemented or enabled more of the native offloading in different NICs.

Every one of our DRBD clusters has all its NIC offloads disabled, and we've had no issues with them at all.

You could try the same and see how it works for you, since the problem seems to be NIC-related in your case.

I know there have been a lot of recent issues, not just with DRBD, where NIC offloading breaks applications.
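For reference, a minimal sketch of how we disable the common offloads with ethtool (the interface name eth0 is a placeholder; check what your NIC actually supports first, and note that these settings do not survive a reboot unless you persist them in your network configuration, e.g. a post-up line in /etc/network/interfaces):

```shell
# Show current offload settings for the interface (eth0 is a placeholder)
ethtool -k eth0

# Disable the usual suspects: TCP segmentation offload, generic
# segmentation offload, generic/large receive offload, and TX/RX
# checksum offload. Features the NIC does not support are simply
# reported as "Cannot change".
ethtool -K eth0 tso off gso off gro off lro off tx off rx off
```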

Cheers,

Joel


On 04/10/2015 03:27 PM, Jean-Francois Malouin wrote:
Hi,

Allow me to add my two cents to this; sorry if it's too long.
We have a system exhibiting the same behaviour, and it is certainly not
DRBD-related, as that system doesn't use DRBD.

It runs Xen 4.4.2 though with kernel 3.19.3.
And we have seen the same behaviour with 3.14.36 and 3.18.10.
Xen and kernel are in-house compiled.

Symptoms: the machine logs NETDEV WATCHDOG transmit-queue timeouts before
either simply crashing or becoming unresponsive network-wise. A few instances:

Apr  7 15:33:10  kernel: [459694.670896] ------------[ cut here ]------------
Apr  7 15:33:10  kernel: [459694.670921] WARNING: CPU: 0 PID: 0 at 
net/sched/sch_generic.c:303 dev_watchdog+0x165/0x20d()
Apr  7 15:33:10  kernel: [459694.670924] NETDEV WATCHDOG: eth0 (igb): transmit 
queue 6 timed out
Apr  7 15:33:10 agrippa kernel: [459694.670927] Modules linked in: st nfsv3 
nfsv4 xt_physdev br_netfilter autofs4 parport_pc af_packet ppdev lp 
xen_acpi_processor parport xen_netback xen_blkback xen_gntalloc fuse 
rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache 
sunrpc bridge stp llc ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat 
nf_nat_ipv4 nf_nat ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 
xt_state nf_conntrack xt_tcpudp iptable_filter iptable_mangle ip_tables 
x_tables ipmi_devintf iTCO_wdt iTCO_vendor_support coretemp pcspkr i2c_i801 
evbug evdev joydev lpc_ich i7core_edac ioatdma edac_core i5500_temp rtc_cmos 
ipmi_si tpm_tis ipmi_msghandler tpm button processor usbkbd usbmouse usbhid sg 
igb i2c_algo_bit ehci_pci uhci_hcd i2c_core ehci_hcd usbcore ixgbe dca e1000e 
hwmon ptp pps_core mdio megaraid_sas
Apr  7 15:33:10 agrippa kernel: [459694.671006] CPU: 0 PID: 0 Comm: swapper/0 
Not tainted 3.19.3-i686-64-smp #1
Apr  7 15:33:10 agrippa kernel: [459694.671009] Hardware name: Supermicro 
X8DTU-6+/X8DTU-6+, BIOS 2.1b       11/15/2011


For 3.14.33:

Mar 15 05:45:43 agrippa kernel: [1612129.231013] ------------[ cut here 
]------------
Mar 15 05:45:43 agrippa kernel: [1612129.231024] WARNING: CPU: 14 PID: 0 at 
net/sched/sch_generic.c:264 dev_watchdog+0x161/0x216()
Mar 15 05:45:43 agrippa kernel: [1612129.231027] NETDEV WATCHDOG: eth0 (igb): 
transmit queue 2 timed out
Mar 15 05:45:43 agrippa kernel: [1612129.231029] Modules linked in: x25 
appletalk ipx p8023 p8022 psnap rose netrom ax25 ipt_MASQUERADE xt_state 
iptable_mangle xt_physdev st ipt_REJECT bridge stp llc nfsv3 nfsv4 autofs4 
parport_pc ppdev lp parport af_packet xen_acpi_processor xen_netback 
xen_blkback xen_gntalloc fuse rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry 
nfs_acl nfs lockd fscache sunrpc ipv6 iptable_filter xt_nat xt_tcpudp 
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack 
ip_tables x_tables ipmi_devintf iTCO_wdt iTCO_vendor_support coretemp microcode 
pcspkr evbug evdev joydev i2c_i801 lpc_ich i7core_edac ioatdma edac_core 
rtc_cmos tpm_tis tpm ipmi_si ipmi_msghandler button processor usbkbd usbmouse 
usbhid sg ehci_pci igb uhci_hcd ehci_hcd i2c_algo_bit i2c_core usbcore ixgbe 
e1000e dca hwmon ptp pps_core megaraid_sas mdio
Mar 15 05:45:43 agrippa kernel: [1612129.231118] CPU: 14 PID: 0 Comm: 
swapper/14 Not tainted 3.14.33-i686-64-smp #1
Mar 15 05:45:43 agrippa kernel: [1612129.231120] Hardware name: Supermicro 
X8DTU-6+/X8DTU-6+, BIOS 2.1b       11/15/2011


This happens around once or twice a week, which is extremely annoying. We
are still investigating whether it is related specifically to the onboard
Intel Gigabit adapter.

Board Manufacturer: Supermicro, Product Name: X8DTU-6+

~# modinfo igb
filename:
/lib/modules/3.19.3-i686-64-smp/kernel/drivers/net/ethernet/intel/igb/igb.ko
version:        5.2.15-k
license:        GPL
description:    Intel(R) Gigabit Ethernet Network Driver
author:         Intel Corporation, <[email protected]>
[..%<..snip..%<..]

HTH,
jf


* Igor Novgorodov <[email protected]> [20150410 15:32]:
I'd second the previous poster: try another kernel, preferably the
long-term-supported 3.18.
DRBD by itself is rock solid; your problems with timeouts lie elsewhere.
I've been running DRBD server pairs on a custom-built 3.14 LTS kernel
with year-plus uptimes, no problems at all.

And don't use Intel's out-of-kernel drivers unless you clearly see
that you need them. The in-kernel ones are very good
and will surely be more stable.

So, concluding:
1. LTS kernel
2. In-kernel DRBD
3. In-kernel drivers
4. Maybe stress-test the system without Xen, only the DRBD stuff
(run the fio tool in random read-write mode for a couple of days on the
DRBD device).
5. Don't use unstable distributions like Jessie; it's better to
backport the needed stuff to Wheezy if it's not already in the
wheezy-backports repo.
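As a sketch of point 4, a fio invocation for a multi-day random read/write burn-in against a DRBD device (the device path /dev/drbd0 and all the sizing numbers are placeholders to adjust for your setup, and this writes directly to the device, so only run it on a resource whose data you can destroy):

```shell
# 70/30 random 4k read/write mix, direct I/O, for 2 days (172800 s).
# WARNING: destructive; do not point this at a device holding live data.
fio --name=drbd-burnin \
    --filename=/dev/drbd0 \
    --direct=1 --ioengine=libaio \
    --rw=randrw --rwmixread=70 \
    --bs=4k --iodepth=16 --numjobs=4 \
    --time_based --runtime=172800
```

If the NIC or the box falls over under this load with Xen out of the picture, that points away from the hypervisor and toward the driver or hardware.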

On 09/04/15 23:07, Alan Evetts wrote:
Hi there,

I am reaching out because we have been trying to find stability in our move to 
DRBD, which is amazing in concept, but we have struggled for six months. I am 
going to lay out everything we are doing, as the problem starts and stops when 
we introduce or remove DRBD from the picture. Obviously these setups get 
complicated, so hopefully this isn't too much information.

What we are trying to do is have a pair of Dell R610 machines, each running 
DRBD and xen with about 8 DRBD partitions, each master running half of the Xen 
virtual machines.

Somewhere between 1 and 20 days in, we always get a kernel panic on one 
machine, which will often drag down the second machine. Details of the most 
recent panic are below.

In order to rule out problems we have:
        - Replaced both Dell R610s (we have 4 in total now, all with the same problem)
        - Upgraded from Debian Wheezy to Debian Jessie
        - Running xen-hypervisor-4.4-amd64, DRBD Debian version 8.9.2~rc1-2, 
kernel 3.16.0-4
        - Switched from the on-board Broadcom NICs to an Intel E1G44HTBLK 4-port 
PCI-e NIC
        - Upgraded to igb kernel module 5.2.17 and rebuilt it into the initrd 
as well


The two servers both have plenty of resources (64 GB of RAM, quad Xeon 2.4 GHz, 
6 x 1 TB drives in a RAID 10). There is a crossover cable on eth3 for DRBD; 
each DRBD instance runs on its own port on eth3. The Xen config runs on a bridge.

The problem has stayed more or less the same as we've moved through all of the 
hardware and software versions over the past six months. It rotates between the 
servers.

I am hoping someone can spot a problem in our config, or guide us on what to 
try from here. All 4 Dell machines have been patched and have had the 
diagnostics run on them without issue.

The problem: one of the machines will have a transmit queue time-out on an 
interface (oddly, not necessarily the DRBD interface, but usually). From 
there, a panic, and the NIC will start going up and down. This drives the 
load up, and the machines soon become unresponsive over shell. Connected 
over the DRAC remote access port, sooner or later we see errors about the 
drives not responding; I think this is from the load, but I do not know for 
sure. From this point the machine will sometimes drag down its paired DRBD 
machine, and sometimes not. The one with the crash needs a hard reboot at this 
point.

We love DRBD for its simplicity and functionality, but it introduces these 
frequent crashes, which are not worth it. Hoping someone can spot an error we 
are making here, or has ideas on what to try.

Thanks in advance for any help. FYI, this crash used to happen in the 
Broadcom queue; now it's the Intel queue, and only when we have DRBD enabled.



Apr  9 03:39:17 v2 kernel: [141714.850432] ------------[ cut here ]------------
Apr  9 03:39:17 v2 kernel: [141714.850521] WARNING: CPU: 0 PID: 0 at 
/build/linux-y7bjb0/linux-3.16.7-ckt4/net/sched/sch_generic.c:264 
dev_watchdog+0x236/0x240()
Apr  9 03:39:17 v2 kernel: [141714.850527] NETDEV WATCHDOG: eth1 (igb): 
transmit queue 0 timed out
Apr  9 03:39:17 v2 kernel: [141714.850531] Modules linked in: xt_tcpudp 
xt_physdev iptable_filter ip_tables x_tables xen_netback xen_blkback 
nfnetlink_queue nfnetlink_log nfnetlink bluetooth 6lowpan_iphc rfkill 
xen_gntdev xen_evtchn xenfs xen_privcmd nfsd auth_rpcgss oid_registry nfs_acl 
nfs lockd fscache sunrpc bridge stp llc ttm drm_kms_helper joydev drm 
i2c_algo_bit i2c_core pcspkr wmi iTCO_wdt iTCO_vendor_support psmouse dcdbas 
serio_raw evdev tpm_tis tpm lpc_ich mfd_core acpi_power_meter button coretemp 
i7core_edac edac_core shpchp processor thermal_sys loop ipmi_watchdog ipmi_si 
ipmi_poweroff ipmi_devintf ipmi_msghandler drbd lru_cache libcrc32c autofs4 
ext4 crc16 mbcache jbd2 dm_mod sg sd_mod crc_t10dif crct10dif_generic sr_mod 
cdrom ses crct10dif_common enclosure ata_generic hid_generic usbhid hid 
crc32c_intel ata_piix ehci_pci uhci_hcd libata igb(O) megaraid_sas ehci_hcd 
scsi_mod usbcore dca ptp usb_common pps_core
Apr  9 03:39:17 v2 kernel: [141714.850609] CPU: 0 PID: 0 Comm: swapper/0 
Tainted: G           O  3.16.0-4-amd64 #1 Debian 3.16.7-ckt4-3
Apr  9 03:39:17 v2 kernel: [141714.850613] Hardware name: Dell Inc. PowerEdge 
R610/0XDN97, BIOS 6.4.0 07/23/2013
Apr  9 03:39:17 v2 kernel: [141714.850617]  0000000000000009 ffffffff815096a7 
ffff880079e03e28 ffffffff810676f7
Apr  9 03:39:17 v2 kernel: [141714.850622]  0000000000000000 ffff880079e03e78 
0000000000000010 0000000000000000
Apr  9 03:39:17 v2 kernel: [141714.850626]  ffff8800445c8000 ffffffff8106775c 
ffffffff81777270 ffffffff00000030
Apr  9 03:39:17 v2 kernel: [141714.850631] Call Trace:
Apr  9 03:39:17 v2 kernel: [141714.850635]  <IRQ>  [<ffffffff815096a7>] ? 
dump_stack+0x41/0x51
Apr  9 03:39:17 v2 kernel: [141714.850652]  [<ffffffff810676f7>] ? 
warn_slowpath_common+0x77/0x90
Apr  9 03:39:17 v2 kernel: [141714.850660]  [<ffffffff8106775c>] ? 
warn_slowpath_fmt+0x4c/0x50
Apr  9 03:39:17 v2 kernel: [141714.850669]  [<ffffffff81074647>] ? 
mod_timer+0x127/0x1e0
Apr  9 03:39:17 v2 kernel: [141714.850676]  [<ffffffff8143ce76>] ? 
dev_watchdog+0x236/0x240
Apr  9 03:39:17 v2 kernel: [141714.850681]  [<ffffffff8143cc40>] ? 
dev_graft_qdisc+0x70/0x70
Apr  9 03:39:17 v2 kernel: [141714.850686]  [<ffffffff810729b1>] ? 
call_timer_fn+0x31/0x100
Apr  9 03:39:17 v2 kernel: [141714.850691]  [<ffffffff8143cc40>] ? 
dev_graft_qdisc+0x70/0x70
Apr  9 03:39:17 v2 kernel: [141714.850698]  [<ffffffff81073fe9>] ? 
run_timer_softirq+0x209/0x2f0
Apr  9 03:39:17 v2 kernel: [141714.850704]  [<ffffffff8106c591>] ? 
__do_softirq+0xf1/0x290
Apr  9 03:39:17 v2 kernel: [141714.850709]  [<ffffffff8106c965>] ? 
irq_exit+0x95/0xa0
Apr  9 03:39:17 v2 kernel: [141714.850718]  [<ffffffff813579c5>] ? 
xen_evtchn_do_upcall+0x35/0x50
Apr  9 03:39:17 v2 kernel: [141714.850725]  [<ffffffff8151141e>] ? 
xen_do_hypervisor_callback+0x1e/0x30
Apr  9 03:39:17 v2 kernel: [141714.850728]  <EOI>  [<ffffffff810013aa>] ? 
xen_hypercall_sched_op+0xa/0x20
Apr  9 03:39:17 v2 kernel: [141714.850737]  [<ffffffff810013aa>] ? 
xen_hypercall_sched_op+0xa/0x20
Apr  9 03:39:17 v2 kernel: [141714.850746]  [<ffffffff81009e0c>] ? 
xen_safe_halt+0xc/0x20
Apr  9 03:39:17 v2 kernel: [141714.850756]  [<ffffffff8101c959>] ? 
default_idle+0x19/0xb0
Apr  9 03:39:17 v2 kernel: [141714.850764]  [<ffffffff810a7dc0>] ? 
cpu_startup_entry+0x340/0x400
Apr  9 03:39:17 v2 kernel: [141714.850770]  [<ffffffff81902071>] ? 
start_kernel+0x492/0x49d
Apr  9 03:39:17 v2 kernel: [141714.850775]  [<ffffffff81901a04>] ? 
set_init_arg+0x4e/0x4e
Apr  9 03:39:17 v2 kernel: [141714.850781]  [<ffffffff81903f64>] ? 
xen_start_kernel+0x569/0x573
Apr  9 03:39:17 v2 kernel: [141714.850785] ---[ end trace ee11063cf033829a ]---
Apr  9 03:39:17 v2 kernel: [141714.871945] br1: port 1(eth1) entered disabled 
state
Apr  9 03:39:20 v2 kernel: [141718.210743] igb 0000:05:00.1 eth1: igb: eth1 NIC 
Link is Up 1000 Mbps Full Duplex, Flow Control: None
Apr  9 03:39:20 v2 kernel: [141718.210913] br1: port 1(eth1) entered forwarding 
state
Apr  9 03:39:20 v2 kernel: [141718.210923] br1: port 1(eth1) entered forwarding 
state
Apr  9 03:39:26 v2 kernel: [141723.863194] br1: port 1(eth1) entered disabled 
state
Apr  9 03:39:30 v2 kernel: [141727.650897] igb 0000:05:00.1 eth1: igb: eth1 NIC 
Link is Up 1000 Mbps Full Duplex, Flow Control: None
Apr  9 03:39:30 v2 kernel: [141727.651040] br1: port 1(eth1) entered forwarding 
state
Apr  9 03:39:30 v2 kernel: [141727.651053] br1: port 1(eth1) entered forwarding 
state
Apr  9 03:39:31 v2 kernel: [141728.890509] ata1: lost interrupt (Status 0x50)
Apr  9 03:39:31 v2 kernel: [141728.890560] sr 1:0:0:0: CDB:
Apr  9 03:39:31 v2 kernel: [141728.890563] Get event status notification: 4a 01 
00 00 10 00 00 00 08 00
Apr  9 03:39:31 v2 kernel: [141728.890630] ata1: hard resetting link
Apr  9 03:39:31 v2 kernel: [141729.366592] ata1: SATA link up 1.5 Gbps (SStatus 
113 SControl 300)
Apr  9 03:39:32 v2 kernel: [141729.406749] ata1.00: configured for UDMA/100
Apr  9 03:39:32 v2 kernel: [141729.408192] ata1: EH complete
Apr  9 03:39:35 v2 kernel: [141732.711653] br1: port 1(eth1) entered disabled 
state
Apr  9 03:39:37 v2 kernel: [141734.678485] drbd s3: peer( Primary -> Unknown ) conn( 
Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Apr  9 03:39:37 v2 kernel: [141734.678846] drbd s3: asender terminated
Apr  9 03:39:37 v2 kernel: [141734.678852] drbd s3: Terminating drbd_a_s3
Apr  9 03:39:37 v2 kernel: [141734.678956] drbd s3: Connection closed
Apr  9 03:39:37 v2 kernel: [141734.678972] drbd s3: conn( NetworkFailure -> 
Unconnected )
Apr  9 03:39:37 v2 kernel: [141734.678974] drbd s3: receiver terminated
Apr  9 03:39:37 v2 kernel: [141734.678976] drbd s3: Restarting receiver thread
Apr  9 03:39:37 v2 kernel: [141734.678977] drbd s3: receiver (re)started
Apr  9 03:39:37 v2 kernel: [141734.678987] drbd s3: conn( Unconnected -> 
WFConnection )
Apr  9 03:39:38 v2 kernel: [141735.718898] igb 0000:05:00.1 eth1: igb: eth1 NIC 
Link is Up 1000 Mbps Full Duplex, Flow Control: None
Apr  9 03:39:38 v2 kernel: [141735.719086] br1: port 1(eth1) entered forwarding 
state
Apr  9 03:39:38 v2 kernel: [141735.719095] br1: port 1(eth1) entered forwarding 
state
Apr  9 03:39:39 v2 kernel: [141737.154575] drbd s4: peer( Secondary -> Unknown ) 
conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Apr  9 03:39:39 v2 kernel: [141737.154671] block drbd1: new current UUID 
461FF401E0489AAB:9279A3BA4A3A710B:0E977CC4BB5727A9:0E967CC4BB5727A9
Apr  9 03:39:39 v2 kernel: [141737.154921] drbd s4: asender terminated
Apr  9 03:39:39 v2 kernel: [141737.154928] drbd s4: Terminating drbd_a_s4
Apr  9 03:39:39 v2 kernel: [141737.155289] drbd s4: Connection closed
Apr  9 03:39:39 v2 kernel: [141737.155579] drbd s4: conn( NetworkFailure -> 
Unconnected )
Apr  9 03:39:39 v2 kernel: [141737.155583] drbd s4: receiver terminated
Apr  9 03:39:39 v2 kernel: [141737.155585] drbd s4: Restarting receiver thread
Apr  9 03:39:39 v2 kernel: [141737.155586] drbd s4: receiver (re)started
Apr  9 03:39:39 v2 kernel: [141737.155601] drbd s4: conn( Unconnected -> 
WFConnection )
Apr  9 03:39:41 v2 kernel: [141738.458578] drbd n5: peer( Secondary -> Unknown ) 
conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Apr  9 03:39:41 v2 kernel: [141738.458671] block drbd8: new current UUID 
808265F24E5A3F21:B63FFF468380B383:240D9C7D536ACB97:240C9C7D536ACB97
Apr  9 03:39:41 v2 kernel: [141738.458885] drbd n5: asender terminated
Apr  9 03:39:41 v2 kernel: [141738.458893] drbd n5: Terminating drbd_a_n5
Apr  9 03:39:41 v2 kernel: [141738.459160] drbd n5: Connection closed
Apr  9 03:39:41 v2 kernel: [141738.459316] drbd n5: conn( NetworkFailure -> 
Unconnected )
Apr  9 03:39:41 v2 kernel: [141738.459319] drbd n5: receiver terminated
Apr  9 03:39:41 v2 kernel: [141738.459321] drbd n5: Restarting receiver thread
Apr  9 03:39:41 v2 kernel: [141738.459322] drbd n5: receiver (re)started
Apr  9 03:39:41 v2 kernel: [141738.459336] drbd n5: conn( Unconnected -> 
WFConnection )
Apr  9 03:39:44 v2 kernel: [141742.202552] drbd r1: peer( Primary -> Unknown ) conn( 
Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Apr  9 03:39:44 v2 kernel: [141742.202913] drbd r1: asender terminated
Apr  9 03:39:44 v2 kernel: [141742.202920] drbd r1: Terminating drbd_a_r1
Apr  9 03:39:44 v2 kernel: [141742.203023] drbd r1: Connection closed
Apr  9 03:39:44 v2 kernel: [141742.203039] drbd r1: conn( NetworkFailure -> 
Unconnected )
Apr  9 03:39:44 v2 kernel: [141742.203041] drbd r1: receiver terminated
Apr  9 03:39:44 v2 kernel: [141742.203043] drbd r1: Restarting receiver thread
Apr  9 03:39:44 v2 kernel: [141742.203044] drbd r1: receiver (re)started
Apr  9 03:39:44 v2 kernel: [141742.203054] drbd r1: conn( Unconnected -> 
WFConnection )



Etc.




_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
