Your message dated Tue, 21 Feb 2012 04:24:43 -0600
with message-id <20120221102443.GA28089@burratino>
and subject line Re: [squeeze] Kernel bug seems to occur on ocfs2+drbd in
pri-pri
has caused the Debian Bug report #616726,
regarding DRBD+OCFS2: reproducible BUGs and GPFs on heavy load
to be marked as done.
This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.
(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)
--
616726: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=616726
Debian Bug Tracking System
Contact [email protected] with problems
--- Begin Message ---
Package: linux-image-2.6.35.6
Version: 2.6.35.6-10.00.Custom
Severity: important
Hello.
First of all, this is my first bug report to Debian, so I am sorry if I have done
something wrong; just tell me what needs fixing.
I have two Dell 2950 servers that I am trying to use as an email cluster.
I run OCFS2 on top of DRBD. Both nodes reboot under heavy load every time.
I am reporting this against linux-image-2.6.35.6, but that is not strictly accurate: I see
the same problem on 2.6.26 (stable) and 2.6.32 (testing). I only tried the latest kernel to
be sure.
I tried ocfs2-tools from stable and from testing: the nodes reboot. I tried DRBD8 from
backports, then the native DRBD on 2.6.32, and then compiled DRBD-8.3.8 from source with
2.6.35-6: the nodes reboot.
So I think this is kernel related, but I could well be wrong. I am not sure what
causes these reboots.
What I do:
1) Create a DRBD md on both nodes
drbdadm create-md drbd0
2) Sync it
drbdadm -- --overwrite-data-of-peer primary drbd0
drbdsetup /dev/drbd0 syncer -r 110M
3) Make both primary
drbdadm primary drbd0
4) Make FS
mkfs.ocfs2 -L ocfs2_drbd -N 2 -T mail --fs-feature-level=max-features /dev/drbd0
5) Mount it on both nodes
mount /var/spool/dovecot
(fstab options - nodev,noauto,noatime,data=writeback)
6) Make folders for test
mkdir /var/spool/dovecot/iozone1
mkdir /var/spool/dovecot/iozone2
7) Start IO test on both nodes in different folders
iozone -RK -t 4 -s 10g -i 0 -i 1 -i 2 -b /tmp/`hostname`.xls
8) I always get a reboot after 30-180 min. Sometimes there is a stack trace and a halt, but
not every time.
The OCFS2 partition seems to work fine under normal use.
P.S. If I was wrong to report this from a sid-like system, just tell me. This bug is
easily reproducible on stable or testing.
-- System Information:
Debian Release: squeeze/sid
APT prefers testing
APT policy: (500, 'testing')
Architecture: amd64 (x86_64)
Kernel: Linux 2.6.35.6 (SMP w/4 CPU cores)
Locale: LANG=ru_RU.UTF-8, LC_CTYPE=ru_RU.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Versions of packages linux-image-2.6.35.6 depends on:
ii coreutils 8.5-1 GNU core utilities
ii debconf [debconf-2.0] 1.5.35 Debian configuration management sy
linux-image-2.6.35.6 recommends no packages.
Versions of packages linux-image-2.6.35.6 suggests:
pn fdutils <none> (no description available)
pn ksymoops <none> (no description available)
pn linux-doc-2.6.35.6 | linux-so <none> (no description available)
pn linux-image-2.6.35.6-dbg <none> (no description available)
-- debconf information:
linux-image-2.6.35.6/postinst/old-dir-initrd-link-2.6.35.6: true
linux-image-2.6.35.6/prerm/removing-running-kernel-2.6.35.6: true
linux-image-2.6.35.6/preinst/abort-overwrite-2.6.35.6:
linux-image-2.6.35.6/postinst/old-system-map-link-2.6.35.6: true
linux-image-2.6.35.6/preinst/already-running-this-2.6.35.6:
linux-image-2.6.35.6/preinst/overwriting-modules-2.6.35.6: true
linux-image-2.6.35.6/postinst/depmod-error-initrd-2.6.35.6: false
linux-image-2.6.35.6/postinst/kimage-is-a-directory:
linux-image-2.6.35.6/preinst/failed-to-move-modules-2.6.35.6:
linux-image-2.6.35.6/postinst/depmod-error-2.6.35.6: false
node:
ip_port = 7777
ip_address = 192.168.1.1
number = 0
name = mail01.fxclub.org
cluster = ocfs2
node:
ip_port = 7777
ip_address = 192.168.1.2
number = 1
name = mail02.fxclub.org
cluster = ocfs2
cluster:
node_count = 2
name = ocfs2
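As a quick sanity check of the O2CB cluster layout quoted above, the following is a minimal Python sketch and not part of the original report. It assumes only the stanza shape visible here (a `node:` or `cluster:` header followed by `key = value` lines); the names `parse_stanzas` and `check_node_count` are hypothetical helpers, not part of ocfs2-tools.

```python
def parse_stanzas(text):
    """Parse "node:" / "cluster:" stanzas followed by "key = value"
    lines into a list of (stanza_type, params) tuples."""
    stanzas = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith(":") and "=" not in line:
            # A stanza header such as "node:" starts a new parameter set.
            stanzas.append((line[:-1], {}))
        elif "=" in line and stanzas:
            key, value = line.split("=", 1)
            stanzas[-1][1][key.strip()] = value.strip()
    return stanzas


def check_node_count(stanzas):
    """Return (declared, actual): node_count from the cluster stanza and
    the number of node stanzas actually present."""
    nodes = [params for kind, params in stanzas if kind == "node"]
    clusters = [params for kind, params in stanzas if kind == "cluster"]
    declared = int(clusters[0]["node_count"]) if clusters else None
    return declared, len(nodes)


# The configuration exactly as quoted in the report.
CLUSTER_CONF = """\
node:
ip_port = 7777
ip_address = 192.168.1.1
number = 0
name = mail01.fxclub.org
cluster = ocfs2
node:
ip_port = 7777
ip_address = 192.168.1.2
number = 1
name = mail02.fxclub.org
cluster = ocfs2
cluster:
node_count = 2
name = ocfs2
"""

stanzas = parse_stanzas(CLUSTER_CONF)
declared, actual = check_node_count(stanzas)
print("declared nodes:", declared, "node stanzas found:", actual)
```

A mismatch between the two numbers would point at a copy-and-paste error in the configuration rather than at the kernel.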
resource drbd0 {
on mail01.fxclub.org {
device /dev/drbd0;
disk /dev/sda9;
address 192.168.1.1:7789;
meta-disk internal;
}
on mail02.fxclub.org {
device /dev/drbd0;
disk /dev/sda9;
address 192.168.1.2:7789;
meta-disk internal;
}
}
global {
usage-count yes;
# minor-count dialog-refresh disable-ip-verification
}
common {
protocol C;
handlers {
# What should be done in case the node is primary, degraded (=no connection) and has inconsistent data.
#pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
#pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /sbin/ifconfig eth1 down";
# The node is currently primary, but lost the after split brain auto recovery procedure. As a consequence it should go away.
#pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
#pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /sbin/ifconfig eth1 down";
#local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
#outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
# fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
#split-brain "/usr/lib/drbd/notify-split-brain.sh root";
# out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
# before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
# after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
}
startup {
wfc-timeout 60;
degr-wfc-timeout 30;
outdated-wfc-timeout 15;
become-primary-on both;
# wait-after-sb;
}
disk {
fencing resource-and-stonith;
# RAID WITH BBU ONLY!!!
no-disk-flushes;
no-md-flushes;
no-disk-barrier;
# on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
# no-disk-drain no-md-flushes max-bio-bvecs
}
net {
cram-hmac-alg sha1;
shared-secret "password";
allow-two-primaries;
ping-timeout 20;
#after-sb-0pri discard-zero-changes;
#after-sb-1pri discard-secondary;
#after-sb-2pri disconnect;
data-integrity-alg sha1;
# Tuning
max-buffers 8000;
max-epoch-size 8000;
sndbuf-size 0;
# snd.buf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
# max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
# after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
}
syncer {
# MegaBYTE! Not Bit.
rate 40M;
al-extents 3389;
# rate after al-extents use-rle cpu-mask verify-alg csums-alg
}
}
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold = 31
Network idle timeout: 15000
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Not active
Stable:
Message from syslogd@mail02 at Sep 16 09:03:19 ...
kernel:[92182.173794] ------------[ cut here ]------------
Message from syslogd@mail02 at Sep 16 09:03:19 ...
kernel:[92182.173872] invalid opcode: 0000 [#1] SMP
Message from syslogd@mail02 at Sep 16 09:03:19 ...
kernel:[92182.173899] last sysfs file: /sys/module/ocfs2/refcnt
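The console excerpts in this report interleave a syslogd broadcast header with each kernel line, which makes them hard to read. The following minimal Python sketch, not part of the original report, pairs each header with the kernel line that follows it. It assumes the two-line shape seen here; `unwrap_console_messages` is a hypothetical name, and wrapped continuation lines (such as the tail of a `Code:` dump) are deliberately ignored.

```python
import re

# "Message from syslogd@<host> at <date> ..."
HEADER_RE = re.compile(r"^Message from syslogd@(\S+) at (.+?) \.\.\.$")
# "kernel:[<seconds since boot>] <message>"
KERNEL_RE = re.compile(r"^kernel:\[\s*(\d+\.\d+)\]\s?(.*)$")


def unwrap_console_messages(text):
    """Return a list of (host, wall_clock, uptime_seconds, message)
    tuples, one per header/kernel line pair."""
    records = []
    pending = None  # (host, wall_clock) from the most recent header
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            pending = header.groups()
            continue
        kernel = KERNEL_RE.match(line)
        if kernel and pending:
            host, when = pending
            records.append((host, when, float(kernel.group(1)), kernel.group(2)))
        # Anything else (labels, shell prompts, wrapped lines) is skipped.
        pending = None
    return records


SAMPLE = """\
Message from syslogd@mail02 at Sep 16 09:03:19 ...
kernel:[92182.173794] ------------[ cut here ]------------
Message from syslogd@mail02 at Sep 16 09:03:19 ...
kernel:[92182.173872] invalid opcode: 0000 [#1] SMP
Message from syslogd@mail02 at Sep 16 09:03:19 ...
kernel:[92182.173899] last sysfs file: /sys/module/ocfs2/refcnt
"""

records = unwrap_console_messages(SAMPLE)
for host, when, uptime, message in records:
    print(host, when, uptime, message)
```

Only emergency-level lines are broadcast to the console, so this view is always a subset of what the full kernel log contains.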
Testing:
Message from syslogd@mail01 at Sep 16 15:18:37 ...
kernel:[ 1432.310479] ------------[ cut here ]------------
Message from syslogd@mail01 at Sep 16 15:18:37 ...
kernel:[ 1432.310648] invalid opcode: 0000 [#1] SMP
Message from syslogd@mail01 at Sep 16 15:18:37 ...
kernel:[ 1432.310801] last sysfs file: /sys/fs/o2cb/interface_revision
Message from syslogd@mail01 at Sep 16 15:18:37 ...
kernel:[ 1432.312251] Stack:
Message from syslogd@mail01 at Sep 16 15:18:37 ...
kernel:[ 1432.312251] Call Trace:
Message from syslogd@mail01 at Sep 16 15:18:37 ...
kernel:[ 1432.312251] Code: 83 c3 08 48 83 3b 00 eb ec 48 83 fd 10 0f 86 89 00
00 00 48 89 ef e8 b9 e8 ff ff 48 89 c7 48 8b 00 84 c0 78 13 66 a9 00 c0 75 04
<0f> 0b eb fe 5b 5d 41 5c e9 94 58 fd ff 48 8b 4c 24 18 4c 8b 4f
Testing: 2.6.35 + DRBD 8.3.8
mail01:/usr/local/sbin# mount /var/spool/dovecot
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.451479] ------------[ cut here ]------------
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.451530] invalid opcode: 0000 [#1] SMP
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.451557] last sysfs file: /sys/module/drbd/parameters/cn_idx
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.452451] Stack:
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.452623] Call Trace:
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.452841] Code: c5 10 48 83 7d 00 00 eb e6 48 83 fb 10 0f 86 80 00
00 00 48 89 df e8 a9 f0 ff ff 48 89 c6 48 8b 00 84 c0 78 16 66 a9 00 c0 75 04
<0f> 0b eb fe 5b 5d 41 5c 48 89 f7 e9 7d 75 fd ff 48 8b 4c 24 18
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.461099] general protection fault: 0000 [#2] SMP
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.461269] last sysfs file: /sys/module/drbd/parameters/cn_idx
mail01:/usr/local/sbin#
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.465065] Stack:
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.465065] Call Trace:
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.465065] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c
8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18
<48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1
[55921.451479] ------------[ cut here ]------------
[55921.451506] kernel BUG at mm/slub.c:2834!
[55921.451530] invalid opcode: 0000 [#1] SMP
[55921.451557] last sysfs file: /sys/module/drbd/parameters/cn_idx
[55921.451584] CPU 1
[55921.451589] Modules linked in: ocfs2 jbd2 quota_tree drbd xt_multiport
sha1_generic hmac lru_cache cn xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4
xt_state nf_conntrack iptable_filter ip_tables x_tables ocfs2_dlmfs
ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs ext2 loop
snd_pcm i5000_edac edac_core i5k_amb snd_timer processor snd evdev button
rng_core shpchp soundcore snd_page_alloc tpm_tis pci_hotplug psmouse dcdbas tpm
pcspkr tpm_bios serio_raw ext3 jbd mbcache ide_cd_mod uhci_hcd cdrom
ata_generic ata_piix libata ses sd_mod enclosure crc_t10dif ehci_hcd
megaraid_sas piix ide_core usbcore scsi_mod nls_base bnx2 thermal thermal_sys
[last unloaded: drbd]
[55921.451964]
[55921.451984] Pid: 2995, comm: udevd Not tainted 2.6.35.6 #1 0NH278/PowerEdge
2950
[55921.452027] RIP: 0010:[<ffffffff810df05d>] [<ffffffff810df05d>]
kfree+0x5b/0xc8
[55921.452076] RSP: 0018:ffff88012aa61d58 EFLAGS: 00010246
[55921.452102] RAX: 0200000000000400 RBX: ffff880100000001 RCX: 0000000000000002
[55921.452131] RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000001
[55921.452160] RBP: ffff8800375d8f00 R08: 0000000000000000 R09: 0000000000000000
[55921.452189] R10: ffff88012bce1070 R11: ffff8800375d8f00 R12: ffffffff810f061e
[55921.452219] R13: 0000000018000040 R14: ffff88012c375cf0 R15: ffff88012bce1070
[55921.452248] FS: 00007f7646a967a0(0000) GS:ffff880001a40000(0000)
knlGS:0000000000000000
[55921.452293] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[55921.452319] CR2: 00007f7646a9c000 CR3: 000000012d245000 CR4: 00000000000006e0
[55921.452349] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[55921.452378] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[55921.452407] Process udevd (pid: 2995, threadinfo ffff88012aa60000, task
ffff880121f4d890)
[55921.452451] Stack:
[55921.452471] 0000000000000000 ffff8800375d8f00 ffff88012bce1070
ffffffff810f061e
[55921.452505] <0> ffff880108000080 000000002bce1070 ffff88012c3759d0
ffff880100000001
[55921.452556] <0> 0000029d0000029d ffff8800375d8fa0 ffff88012f8a4900
ffff8800375d8f00
[55921.452623] Call Trace:
[55921.452647] [<ffffffff810f061e>] ? vfs_rename+0x3d3/0x3e4
[55921.452674] [<ffffffff810f1c78>] ? sys_renameat+0x1aa/0x22b
[55921.452702] [<ffffffff810d13ab>] ? free_pages_and_swap_cache+0x53/0x6e
[55921.452732] [<ffffffff810c83fb>] ? tlb_finish_mmu+0x2a/0x33
[55921.452759] [<ffffffff810c8470>] ? remove_vma+0x6c/0x74
[55921.452786] [<ffffffff810c95d8>] ? do_munmap+0x307/0x329
[55921.452814] [<ffffffff810089c2>] ? system_call_fastpath+0x16/0x1b
[55921.452841] Code: c5 10 48 83 7d 00 00 eb e6 48 83 fb 10 0f 86 80 00 00 00
48 89 df e8 a9 f0 ff ff 48 89 c6 48 8b 00 84 c0 78 16 66 a9 00 c0 75 04 <0f> 0b
eb fe 5b 5d 41 5c 48 89 f7 e9 7d 75 fd ff 48 8b 4c 24 18
[55921.453030] RIP [<ffffffff810df05d>] kfree+0x5b/0xc8
[55921.453057] RSP <ffff88012aa61d58>
[55921.453437] ---[ end trace 3f96fca7c9cbfb03 ]---
[55921.454368] JBD: Ignoring recovery information on journal
[55921.461099] general protection fault: 0000 [#2] SMP
[55921.461269] last sysfs file: /sys/module/drbd/parameters/cn_idx
[55921.461338] CPU 1
[55921.461385] Modules linked in: ocfs2 jbd2 quota_tree drbd xt_multiport
sha1_generic hmac lru_cache cn xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4
xt_state nf_conntrack iptable_filter ip_tables x_tables ocfs2_dlmfs
ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs ext2 loop
snd_pcm i5000_edac edac_core i5k_amb snd_timer processor snd evdev button
rng_core shpchp soundcore snd_page_alloc tpm_tis pci_hotplug psmouse dcdbas tpm
pcspkr tpm_bios serio_raw ext3 jbd mbcache ide_cd_mod uhci_hcd cdrom
ata_generic ata_piix libata ses sd_mod enclosure crc_t10dif ehci_hcd
megaraid_sas piix ide_core usbcore scsi_mod nls_base bnx2 thermal thermal_sys
[last unloaded: drbd]
[55921.464840]
[55921.464902] Pid: 9281, comm: mount.ocfs2 Tainted: G D 2.6.35.6 #1
0NH278/PowerEdge 2950
[55921.464990] RIP: 0010:[<ffffffff810dffaa>] [<ffffffff810dffaa>]
__kmalloc+0xd3/0x136
[55921.465065] RSP: 0018:ffff880103e21ba8 EFLAGS: 00010006
[55921.465065] RAX: 0000000000000000 RBX: 0800000000000000 RCX: ffffffffa0449421
[55921.465065] RDX: 0000000000000000 RSI: ffff88012cfaf000 RDI: 0000000000000004
[55921.465065] RBP: ffffffff81625520 R08: ffff880001a524d0 R09: 0000000000000000
[55921.465065] R10: ffff88012cfaf260 R11: ffff88012ca24420 R12: 000000000000000a
[55921.465065] R13: 00000000000080d0 R14: 00000000000080d0 R15: 0000000000000246
[55921.465065] FS: 00007fee60afe720(0000) GS:ffff880001a40000(0000)
knlGS:0000000000000000
[55921.465065] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[55921.465065] CR2: 00007f764630ab8c CR3: 000000012eae3000 CR4: 00000000000006e0
[55921.465065] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[55921.465065] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[55921.465065] Process mount.ocfs2 (pid: 9281, threadinfo ffff880103e20000,
task ffff88012ca24420)
[55921.465065] Stack:
[55921.465065] 0000000000000000 ffffffffa0449421 ffff88012cfaf108
ffff88012cfaf000
[55921.465065] <0> ffff88012cfaf000 ffff88012cfaf000 ffff88012aa2e000
ffff88012ca24420
[55921.465065] <0> 0000000000000200 ffffffffa0449421 0000000000000000
ffffffffa044ccec
[55921.465065] Call Trace:
[55921.465065] [<ffffffffa0449421>] ? ocfs2_compute_replay_slots+0x31/0x10f
[ocfs2]
[55921.465065] [<ffffffffa0449421>] ? ocfs2_compute_replay_slots+0x31/0x10f
[ocfs2]
[55921.465065] [<ffffffffa044ccec>] ? ocfs2_journal_load+0x1d0/0x2b1 [ocfs2]
[55921.465065] [<ffffffffa0473525>] ? ocfs2_fill_super+0x19a2/0x2101 [ocfs2]
[55921.465065] [<ffffffff8118aa8f>] ? snprintf+0x36/0x3b
[55921.465065] [<ffffffff810e9f9e>] ? get_sb_bdev+0x137/0x19a
[55921.465065] [<ffffffffa0471b83>] ? ocfs2_fill_super+0x0/0x2101 [ocfs2]
[55921.465065] [<ffffffff810e9675>] ? vfs_kern_mount+0xa6/0x196
[55921.465065] [<ffffffff810e97c4>] ? do_kern_mount+0x49/0xe7
[55921.465065] [<ffffffff810fdabb>] ? do_mount+0x75c/0x7d6
[55921.465065] [<ffffffff810d829a>] ? alloc_pages_current+0x9f/0xc2
[55921.465065] [<ffffffff810fdbbd>] ? sys_mount+0x88/0xc3
[55921.465065] [<ffffffff810089c2>] ? system_call_fastpath+0x16/0x1b
[55921.465065] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b 04
25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b
04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1
[55921.465065] RIP [<ffffffff810dffaa>] __kmalloc+0xd3/0x136
[55921.465065] RSP <ffff880103e21ba8>
[55921.465065] ---[ end trace 3f96fca7c9cbfb04 ]---
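In the `Code:` lines of the two oopses above, the byte in angle brackets marks where the faulting instruction starts. In the first oops it is `<0f> 0b`, the x86 UD2 instruction that the kernel's BUG() macro emits, which is why "invalid opcode" appears together with "kernel BUG at mm/slub.c:2834!". The second oops starts at `<48> 8b 04 03`, an ordinary memory load, which fits a general protection fault on a bad pointer instead. The following minimal Python sketch, not part of the original report, extracts those bytes; `faulting_bytes` and `is_ud2` are hypothetical helper names, and the input is assumed to be one unwrapped `Code:` line.

```python
def faulting_bytes(code_line, count=2):
    """Return `count` bytes starting at the <xx> marker of an oops
    "Code:" line, or b"" if no marker is present."""
    tokens = code_line.split("Code:", 1)[-1].split()
    for i, tok in enumerate(tokens):
        if tok.startswith("<") and tok.endswith(">"):
            marked = [tok.strip("<>")] + tokens[i + 1:i + count]
            return bytes(int(b, 16) for b in marked)
    return b""


def is_ud2(code_line):
    """True if the faulting instruction is UD2 (0f 0b), i.e. a BUG()."""
    return faulting_bytes(code_line) == b"\x0f\x0b"


# Tails of the two Code: lines from the oopses above, unwrapped.
BUG_CODE = "Code: 84 c0 78 16 66 a9 00 c0 75 04 <0f> 0b eb fe 5b 5d 41 5c"
GPF_CODE = "Code: 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 11"

print("first oops is BUG():", is_ud2(BUG_CODE))
print("second oops is BUG():", is_ud2(GPF_CODE))
print("second oops bytes:", faulting_bytes(GPF_CODE, 4).hex())
```

For a full disassembly, the kernel source tree ships `scripts/decodecode`, which accepts the complete `Code:` line on standard input.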
[55941.839304] o2net: accepted connection from node mail02.fxclub.org (num 1)
at 192.168.1.2:7777
[55946.003594] o2dlm: Node 1 joins domain E4B99C68B65449068DC403326917DC29
[55946.003673] o2dlm: Nodes in domain E4B99C68B65449068DC403326917DC29: 0 1
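To compare oopses across kernels without reading every register dump, the key fields of the log above can be pulled out mechanically. The following is a minimal Python sketch, not part of the original report. It assumes unwrapped dmesg lines and treats each `---[ end trace ... ]---` line as the end of one oops; `summarize_oopses` is a hypothetical name, and an oops without an end marker is dropped.

```python
import re

BUG_AT_RE = re.compile(r"kernel BUG at (\S+?)!?$")
FAULT_RE = re.compile(r"(invalid opcode|general protection fault): \w+ \[#(\d+)\]")
COMM_RE = re.compile(r"Pid: (\d+), comm: (\S+)")
# Matches the closing "RIP [<addr>] symbol+off/len" line, not "RIP: 0010:...".
RIP_RE = re.compile(r"RIP\s+\[<[0-9a-f]+>\]\s+(\S+)")


def summarize_oopses(dmesg_text):
    """Return one dict per oops with the fields that could be found."""
    summaries, current = [], {}
    for line in dmesg_text.splitlines():
        if "---[ end trace" in line:
            if current:
                summaries.append(current)
            current = {}
            continue
        m = BUG_AT_RE.search(line)
        if m:
            current["bug_at"] = m.group(1)
        m = FAULT_RE.search(line)
        if m:
            current["fault"] = m.group(1)
            current["number"] = int(m.group(2))
        m = COMM_RE.search(line)
        if m:
            current["pid"] = int(m.group(1))
            current["comm"] = m.group(2)
        m = RIP_RE.search(line)
        if m:
            current["rip"] = m.group(1)
    return summaries


# Selected lines from the log above, with wrapped lines rejoined.
DMESG = """\
[55921.451506] kernel BUG at mm/slub.c:2834!
[55921.451530] invalid opcode: 0000 [#1] SMP
[55921.451984] Pid: 2995, comm: udevd Not tainted 2.6.35.6 #1 0NH278/PowerEdge 2950
[55921.453030] RIP [<ffffffff810df05d>] kfree+0x5b/0xc8
[55921.453437] ---[ end trace 3f96fca7c9cbfb03 ]---
[55921.461099] general protection fault: 0000 [#2] SMP
[55921.464902] Pid: 9281, comm: mount.ocfs2 Tainted: G D 2.6.35.6 #1 0NH278/PowerEdge 2950
[55921.465065] RIP [<ffffffff810dffaa>] __kmalloc+0xd3/0x136
[55921.465065] ---[ end trace 3f96fca7c9cbfb04 ]---
"""

oopses = summarize_oopses(DMESG)
for oops in oopses:
    print(oops)
```

Both oopses land in the slab allocator (`kfree` and `__kmalloc`) while running unrelated processes, which the summary makes easy to see at a glance.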
Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.645448] general protection fault: 0000 [#3] SMP
Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.645615] last sysfs file: /sys/module/drbd/parameters/cn_idx
Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.649409] Stack:
Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.649409] Call Trace:
Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.649409] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c
8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18
<48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1
--- End Message ---
--- Begin Message ---
Version: 2.6.32-41
tags 616726 - unreproducible
quit
Tim Stoop wrote:
> We're currently using the linux-image-2.6.32-5-amd64 package
> (2.6.32-41) and we haven't seen the problem since. So it looks like
> it's solved.
Thanks, both. Marking accordingly.
--- End Message ---