Hello,

We have a two-node setup running CentOS 5.4, Xen 3.0 (CentOS RPMs) and DRBD 8.3.2 (also a CentOS RPM). Both servers are Dell PowerEdge 1950s with two quad-core Xeon processors and 32GB of memory. The network card used by DRBD is an Intel 82571EB Gigabit Ethernet card (e1000 driver). The two nodes are connected directly with a crossover cable.

DRBD is configured with a single resource (drbd0), on top of which I have created an LVM volume group that is split into two LVs. Both LVs are mapped to my Xen VM (PV) as the sda and sdb disks.

Recently, we've had issues where the node in Primary state, which is the one running the VM, locks up and throws a kernel panic. This appears to be related to DRBD and/or the network stack, because the problem does not occur if we disconnect the DRBD resource.

Even worse, the problem occurs very quickly after we connect the DRBD resource, either during resynchronization after being out of sync for a while or during normal syncing operations. No errors show up on the network interface (ifconfig, ethtool).

One thing to note is that the kernel panic seems to complain about checksum functions, so that might be related (see below).
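Since the panic points at checksum code, one thing that could be looked at is which offloads are enabled on the replication link. The following is only a minimal sketch: the helper name `offloads_on` is made up for illustration, and `eth1` is an assumed interface name, since the actual one is not stated here. It filters the output of `ethtool -k` down to the features reported as "on".

```shell
# Print the offload features that `ethtool -k` reports as enabled.
# Reads ethtool output on stdin, so it also works on a saved capture.
offloads_on() {
    awk -F': ' '$2 ~ /^on/ { sub(/^[ \t]+/, "", $1); print $1 }'
}

# Example (eth1 is a hypothetical name for the DRBD replication interface):
#   ethtool -k eth1 | offloads_on
```

Comparing this list on both nodes might at least show whether the checksum offload settings differ between them.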

Here is the relevant information:

# rpm -qa | grep -e xen -e drbd
drbd83-8.3.2-6.el5_3
kmod-drbd83-xen-8.3.2-6.el5_3
xen-3.0.3-94.el5
kernel-xen-2.6.18-164.el5
xen-libs-3.0.3-94.el5

# cat /etc/drbd.conf
global {
  usage-count no;
}

common {
  protocol C;

  syncer {
    rate 33M;
    verify-alg crc32c;
    al-extents 1801;
  }
  net {
    cram-hmac-alg sha1;
    max-epoch-size 8192;
    max-buffers 8192;
  }

  disk {
    on-io-error detach;
    no-disk-flushes;
    no-disk-barrier;
    no-md-flushes;
  }
}

resource drbd0 {
  device /dev/drbd0;
  disk /dev/sda6;
  flexible-meta-disk internal;
  on node1 {
    address 10.11.1.1:7788;
  }
  on node2 {
    address 10.11.1.2:7788;
  }
}

### Kernel Panic ###
Unable to handle kernel paging request at ffff880011e3cc64 RIP:
 [<ffffffff80212bad>] csum_partial+0x56/0x4bc
PGD ed8067 PUD ed9067 PMD f69067 PTE 0
Oops: 0000 [1] SMP

last sysfs file: /class/scsi_host/host0/proc_name
CPU 0

Modules linked in: xt_physdev netconsole drbd(U) netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat bridge ipv6 xfrm_nalgo crypto_api xt_tcpudp xt_state ip_conntrack_irc xt_conntrack ip_conntrack_ftp xt_mac xt_length xt_limit xt_multiport ipt_ULOG ipt_TCPMSS ipt_TOS ipt_ttl ipt_owner ipt_REJECT ipt_ecn ipt_LOG ipt_recent ip_conntrack iptable_mangle iptable_filter ip_tables nfnetlink x_tables autofs4 dm_mirror dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc lp parport joydev ide_cd e1000e cdrom serial_core i5000_edac edac_mc bnx2 serio_raw pcspkr sg dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ata_piix libata shpchp megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd

Pid: 12887, comm: drbd0_receiver Tainted: G      2.6.18-128.1.16.el5xen #1
RIP: e030:[<ffffffff80212bad>]  [<ffffffff80212bad>] csum_partial+0x56/0x4bc
RSP: e02b:ffff88000c347718  EFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880010ced500
RDX: 00000000000000e7 RSI: 000000000000039c RDI: ffff880011e3cc64
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000025b85e7c R11: 0000000000000002 R12: 0000000000000028
R13: 0000000000000028 R14: ffff88001c56f7b0 R15: 0000000025b85e7c
FS:  00002b391e123f60(0000) GS:ffffffff805ba000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process drbd0_receiver (pid: 12887, threadinfo ffff88000c346000, task ffff88001c207820)
Stack:  000000000000039c 00000000000005b4 ffffffff8023d496 ffff88001e7e48d8
 0000001400000000 ffff8800000003c4 ffff88001c56f7b0 ffff88001e7e48d8
 ffff88001e7e48ec ffff88000c3478e8

Call Trace:
 [<ffffffff8023d496>] skb_checksum+0x11b/0x260
 [<ffffffff80411472>] skb_checksum_help+0x71/0xd0
 [<ffffffff8853f33e>] :iptable_nat:ip_nat_fn+0x56/0x1c3
 [<ffffffff8853f6cf>] :iptable_nat:ip_nat_local_fn+0x32/0xb7
 [<ffffffff8023550c>] nf_iterate+0x41/0x7d
 [<ffffffff8042f004>] dst_output+0x0/0xe
 [<ffffffff80258b28>] nf_hook_slow+0x58/0xbc
 [<ffffffff8042f004>] dst_output+0x0/0xe
 [<ffffffff802359ab>] ip_queue_xmit+0x41c/0x48c
 [<ffffffff8022c1cb>] local_bh_enable+0x9/0xa5
 [<ffffffff8020b6b7>] kmem_cache_alloc+0x62/0x6d
 [<ffffffff8023668d>] alloc_skb_from_cache+0x74/0x13c
 [<ffffffff80222a0b>] tcp_transmit_skb+0x62f/0x667
 [<ffffffff8043903a>] tcp_retransmit_skb+0x53d/0x638
 [<ffffffff80439353>] tcp_xmit_retransmit_queue+0x21e/0x2bb
 [<ffffffff80225cff>] tcp_ack+0x1705/0x1879
 [<ffffffff8021c6b1>] tcp_rcv_established+0x804/0x925
 [<ffffffff80263710>] schedule_timeout+0x1e/0xad
 [<ffffffff8023cef3>] tcp_v4_do_rcv+0x2a/0x2fa
 [<ffffffff8040bbfe>] sk_wait_data+0xac/0xbf
 [<ffffffff8029b018>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80434f71>] tcp_prequeue_process+0x65/0x78
 [<ffffffff8021dd39>] tcp_recvmsg+0x492/0xb1f
 [<ffffffff80233102>] sock_common_recvmsg+0x2d/0x43
 [<ffffffff80233102>] sock_common_recvmsg+0x2d/0x43
 [<ffffffff80231c18>] sock_recvmsg+0x101/0x120
 [<ffffffff80231c18>] sock_recvmsg+0x101/0x120
 [<ffffffff8029b018>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80343366>] swiotlb_map_sg+0xf7/0x205
 [<ffffffff880b563c>] :megaraid_sas:megasas_make_sgl64+0x78/0xa9
 [<ffffffff880b61bc>] :megaraid_sas:megasas_queue_command+0x343/0x3ed
 [<ffffffff884e119f>] :drbd:drbd_recv+0x7b/0x109
 [<ffffffff884e53b2>] :drbd:receive_DataRequest+0x3b/0x655
 [<ffffffff884e1c4b>] :drbd:drbdd+0x77/0x152
 [<ffffffff884e4870>] :drbd:drbdd_init+0xea/0x1dc
 [<ffffffff884f432a>] :drbd:drbd_thread_setup+0xa2/0x18b
 [<ffffffff80260b2c>] child_rip+0xa/0x12
 [<ffffffff884f4288>] :drbd:drbd_thread_setup+0x0/0x18b
 [<ffffffff80260b22>] child_rip+0x0/0x12


Code: 44 8b 0f ff ca 83 ee 04 48 83 c7 04 4d 01 c8 41 89 d2 41 89

RIP  [<ffffffff80212bad>] csum_partial+0x56/0x4bc
 RSP <ffff88000c347718>
CR2: ffff880011e3cc64

Kernel panic - not syncing: Fatal exception
#######
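To make it easier to see which modules sit in the crashing code path, the trace can be reduced to the module names that appear in it. This is just a rough sketch: `trace_modules` is a made-up helper name, and it assumes the oops has been saved to a text file (the file name `oops.txt` in the example is hypothetical) and that module frames look like ` [<addr>] :module:symbol+off/len`, as they do in the trace above.

```shell
# List the modules that appear in an oops call trace, in order of first
# appearance. Core-kernel frames (no ":module:" prefix) are skipped.
trace_modules() {
    sed -n 's/^.*\] :\([^:]*\):.*$/\1/p' | awk '!seen[$0]++'
}

# Example (oops.txt is a hypothetical file holding the saved panic text):
#   trace_modules < oops.txt
```

Run against the call trace above, this should list iptable_nat, megaraid_sas and drbd, which suggests the NAT hook is being traversed on the DRBD connection's transmit path.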


Any ideas on how to diagnose this properly and eventually find the culprit?


Regards,
--
Jean-François Chevrette [iWeb]
