Hi Alexander,

Try increasing the value of "Heartbeat dead threshold". 5 minutes could be acceptable for preliminary testing under load. This will allow you to assess the problem before the node(s) die.
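
For reference, here is a rough sketch of the arithmetic behind that suggestion. It assumes the default 2-second disk heartbeat interval and the relation given in the OCFS2 documentation, timeout in seconds = (threshold - 1) * 2; the function names are only illustrative and are not part of any o2cb tool.

```python
# Assumption: o2cb writes a disk heartbeat every 2 seconds (the default),
# and a node is declared dead after (threshold - 1) missed intervals.
HEARTBEAT_INTERVAL_SECS = 2


def threshold_to_seconds(threshold: int) -> int:
    """Seconds without a heartbeat before a node is considered dead."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def seconds_to_threshold(timeout_secs: int) -> int:
    """Smallest threshold whose timeout is at least timeout_secs."""
    # Ceiling division, so odd timeouts round up to the next interval.
    intervals = -(-timeout_secs // HEARTBEAT_INTERVAL_SECS)
    return intervals + 1


if __name__ == "__main__":
    print(threshold_to_seconds(31))   # the current setting from the status output
    print(seconds_to_threshold(300))  # the 5 minutes suggested above
```

Under those assumptions the current threshold of 31 corresponds to 60 seconds, and 5 minutes corresponds to a threshold of 151. The value is normally set through O2CB_HEARTBEAT_THRESHOLD in the o2cb configuration and should be the same on every node; please check the documentation for your ocfs2-tools version before relying on this.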

Best regards,
Karim

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Alexander Mestiashvili
Sent: Sunday, February 01, 2009 7:31 PM
To: [email protected]
Subject: [Ocfs2-users] ocfs2 hosts reboot under load

Hello,

I have trouble with my 4-node ocfs2 cluster. Hosts reboot under load.
The hardware is 4 Dell 1850 servers connected via a 100M network. Storage is RAID 5 connected with Fibre Channel.
I ran bonnie++ simultaneously on two hosts for testing.
On the second host (host8) I got the messages below in kern.log. The first one (host7) rebooted at:

Jan 30 16:23:48 host8 kernel: o2net: connection to node host7 (num 0) at 192.168.0.27:7777 has been idle for 30.0 seconds, shutting it down.

mount | grep ocfs
ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
/dev/sda on /shared type ocfs2 (rw,_netdev,heartbeat=local)

The command I used:
bonnie++ -d /shared/ocfs2_nutch8/ -u root -s 0 -n 100:100m:10k:100

Jan 30 16:23:48 host8 kernel: o2net: connection to node host7 (num 0) at 192.168.0.27:7777 has been idle for 30.0 seconds, shutting it down.
Jan 30 16:23:48 host8 kernel: (0,0):o2net_idle_timer:1498 here are some times that might help debug the situation: (tmr 1233328998.315538 now 1233329028.313246 dr 1233328998.315530 adv 1233328998.315541:1233328998.315541 func (fa7e1976:502) 1233328900.631572:1233328900.631582)
Jan 30 16:23:48 host8 kernel: o2net: no longer connected to node host7 (num 0) at 192.168.0.27:7777
Jan 30 16:23:48 host8 kernel: (16132,0):dlm_do_master_request:1335 ERROR: link to 0 went down!
Jan 30 16:23:48 host8 kernel: (16132,0):dlm_get_lock_resource:912 ERROR: status = -112
Jan 30 16:23:55 host8 kernel: (2616,1):o2dlm_eviction_cb:258 o2dlm has evicted node 0 from group DE9BC917EFB247458EF221C2167F6CC1
Jan 30 16:23:58 host8 kernel: (16132,0):dlm_restart_lock_mastery:1218 ERROR: node down! 0
Jan 30 16:23:58 host8 kernel: (16132,0):dlm_wait_for_lock_mastery:1035 ERROR: status = -11
Jan 30 16:24:00 host8 kernel: (16132,0):dlm_get_lock_resource:893 DE9BC917EFB247458EF221C2167F6CC1:N0000000009f618da: at least one node (0) to recover before lock mastery can begin
Jan 30 16:24:22 host8 last message repeated 2 times
Jan 30 16:25:18 host8 kernel: o2net: connected to node host7 (num 0) at 192.168.0.27:7777
Jan 30 16:25:18 host8 kernel: ocfs2_dlm: Node 0 joins domain DE9BC917EFB247458EF221C2167F6CC1
Jan 30 16:25:18 host8 kernel: ocfs2_dlm: Nodes in domain ("DE9BC917EFB247458EF221C2167F6CC1"): 0 1 2 3
Jan 30 16:42:11 host8 kernel: INFO: task kswapd0:207 blocked for more than 120 seconds.
Jan 30 16:42:11 host8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 30 16:42:11 host8 kernel: kswapd0 D 0000000000000100 0 207 2
Jan 30 16:42:11 host8 kernel: ffff88012dd09cf0 0000000000000046 ffff88012e2dc148 ffffffff8021e03f
Jan 30 16:42:11 host8 kernel: ffff88012fbd7340 ffff88012faf46a0 ffff88012fbd7600 0000000000000001
Jan 30 16:42:11 host8 kernel: 0000000000000286 0000000000000003 ffff88012dd09cf0 ffffffff8021ec30
Jan 30 16:42:11 host8 kernel: Call Trace:
Jan 30 16:42:11 host8 kernel: [<ffffffff8021e03f>] 0xffffffff8021e03f
Jan 30 16:42:11 host8 kernel: [<ffffffff8021ec30>] 0xffffffff8021ec30
Jan 30 16:42:11 host8 kernel: [<ffffffffa01ceb1b>] 0xffffffffa01ceb1b
Jan 30 16:42:11 host8 kernel: [<ffffffff8023b605>] 0xffffffff8023b605
Jan 30 16:42:11 host8 kernel: [<ffffffff8028bbe0>] 0xffffffff8028bbe0
Jan 30 16:42:11 host8 kernel: [<ffffffff8028c201>] 0xffffffff8028c201
Jan 30 16:42:11 host8 kernel: [<ffffffff8028c469>] 0xffffffff8028c469
Jan 30 16:42:11 host8 kernel: [<ffffffff8025d7d8>] 0xffffffff8025d7d8
Jan 30 16:42:11 host8 kernel: [<ffffffff8025df2b>] 0xffffffff8025df2b
Jan 30 16:42:11 host8 kernel: [<ffffffff8025cb00>] 0xffffffff8025cb00
Jan 30 16:42:11 host8 kernel: [<ffffffff80414d37>] 0xffffffff80414d37
Jan 30 16:42:11 host8 kernel: [<ffffffff8023b605>] 0xffffffff8023b605
Jan 30 16:42:11 host8 kernel: [<ffffffff8025dbea>] 0xffffffff8025dbea
Jan 30 16:42:11 host8 kernel: [<ffffffff8023b4de>] 0xffffffff8023b4de
Jan 30 16:42:11 host8 kernel: [<ffffffff80225a29>] 0xffffffff80225a29
Jan 30 16:42:11 host8 kernel: [<ffffffff80203c79>] 0xffffffff80203c79
Jan 30 16:42:11 host8 kernel: [<ffffffff8023b497>] 0xffffffff8023b497
Jan 30 16:42:11 host8 kernel: [<ffffffff80203c6f>] 0xffffffff80203c6f
Jan 30 16:42:11 host8 kernel:
Jan 30 16:42:11 host8 kernel: INFO: task bonnie++:16132 blocked for more than 120 seconds.
Jan 30 16:42:11 host8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 30 16:42:11 host8 kernel: bonnie++ D 0000000102fe0588 0 16132 2991
Jan 30 16:42:11 host8 kernel: ffff88010022f888 0000000000000086 0000000000000000 ffff880123438748
Jan 30 16:42:11 host8 kernel: ffff88012fbd6cf0 ffff88012fa7a6a0 ffff88012fbd6fb0 000000012f402380
Jan 30 16:42:11 host8 kernel: 0000000000000003 0000000000000001 0000000000000000 0000000000000000
Jan 30 16:42:11 host8 kernel: Call Trace:
Jan 30 16:42:11 host8 kernel: [<ffffffff80415f99>] 0xffffffff80415f99
Jan 30 16:42:11 host8 kernel: [<ffffffffa01d4040>] 0xffffffffa01d4040
Jan 30 16:42:11 host8 kernel: [<ffffffffa01c78a5>] 0xffffffffa01c78a5
Jan 30 16:42:11 host8 kernel: [<ffffffff80299508>] 0xffffffff80299508
Jan 30 16:42:11 host8 kernel: [<ffffffffa01c9524>] 0xffffffffa01c9524
Jan 30 16:42:11 host8 kernel: [<ffffffffa01b60fb>] 0xffffffffa01b60fb
Jan 30 16:42:11 host8 kernel: [<ffffffffa01be69d>] 0xffffffffa01be69d
Jan 30 16:42:11 host8 kernel: [<ffffffff80415e90>] 0xffffffff80415e90
Jan 30 16:42:11 host8 kernel: [<ffffffffa01b8215>] 0xffffffffa01b8215
Jan 30 16:42:11 host8 kernel: [<ffffffff80254305>] 0xffffffff80254305
Jan 30 16:42:11 host8 kernel: [<ffffffffa01eada0>] 0xffffffffa01eada0
Jan 30 16:42:11 host8 kernel: [<ffffffffa01eada0>] 0xffffffffa01eada0
Jan 30 16:42:11 host8 kernel: [<ffffffff8022dd99>] 0xffffffff8022dd99
Jan 30 16:42:11 host8 kernel: [<ffffffff80253441>] 0xffffffff80253441
Jan 30 16:42:11 host8 kernel: [<ffffffff8028c750>] 0xffffffff8028c750
Jan 30 16:42:11 host8 kernel: [<ffffffff80254cfa>] 0xffffffff80254cfa
Jan 30 16:42:11 host8 kernel: [<ffffffffa01bd79f>] 0xffffffffa01bd79f
Jan 30 16:42:11 host8 kernel: [<ffffffff80254e17>] 0xffffffff80254e17
Jan 30 16:42:11 host8 kernel: [<ffffffffa01cc468>] 0xffffffffa01cc468
Jan 30 16:42:11 host8 kernel: [<ffffffffa01c6724>] 0xffffffffa01c6724
Jan 30 16:42:11 host8 kernel: [<ffffffff80279227>] 0xffffffff80279227
Jan 30 16:42:11 host8 kernel: [<ffffffff80277c41>] 0xffffffff80277c41
Jan 30 16:42:11 host8 kernel: [<ffffffff8023b605>] 0xffffffff8023b605
Jan 30 16:42:11 host8 kernel: [<ffffffff8028d122>] 0xffffffff8028d122
Jan 30 16:42:11 host8 kernel: [<ffffffffa01cbfe6>] 0xffffffffa01cbfe6
Jan 30 16:42:11 host8 kernel: [<ffffffff80279984>] 0xffffffff80279984
Jan 30 16:42:11 host8 kernel: [<ffffffff80279e0c>] 0xffffffff80279e0c
Jan 30 16:42:11 host8 kernel: [<ffffffff80202d9b>] 0xffffffff80202d9b
Jan 30 16:42:11 host8 kernel:
Jan 31 09:50:10 host8 kernel: INFO: task kswapd0:207 blocked for more than 120 seconds.
Jan 31 09:50:10 host8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 31 09:50:10 host8 kernel: kswapd0 D 0000000000000080 0 207 2
Jan 31 09:50:10 host8 kernel: ffff88012dd09cf0 0000000000000046 ffff88012e2dc148 ffffffff8021e03f
Jan 31 09:50:10 host8 kernel: ffff88012fbd7340 ffff88012faf46a0 ffff88012fbd7600 0000000000000001
Jan 31 09:50:10 host8 kernel: 0000000000000286 0000000000000003 ffff88012dd09cf0 ffffffff8021ec30
Jan 31 09:50:10 host8 kernel: Call Trace:
Jan 31 09:50:10 host8 kernel: [<ffffffff8021e03f>] 0xffffffff8021e03f
Jan 31 09:50:10 host8 kernel: [<ffffffff8021ec30>] 0xffffffff8021ec30
Jan 31 09:50:10 host8 kernel: [<ffffffffa01ceb1b>] 0xffffffffa01ceb1b
Jan 31 09:50:10 host8 kernel: [<ffffffff8023b605>] 0xffffffff8023b605
Jan 31 09:50:10 host8 kernel: [<ffffffff8028bbe0>] 0xffffffff8028bbe0
Jan 31 09:50:10 host8 kernel: [<ffffffff8028c201>] 0xffffffff8028c201
Jan 31 09:50:10 host8 kernel: [<ffffffff8028c469>] 0xffffffff8028c469
Jan 31 09:50:10 host8 kernel: [<ffffffff8025d7d8>] 0xffffffff8025d7d8
Jan 31 09:50:10 host8 kernel: [<ffffffff8025df2b>] 0xffffffff8025df2b
Jan 31 09:50:10 host8 kernel: [<ffffffff8025cb00>] 0xffffffff8025cb00
Jan 31 09:50:10 host8 kernel: [<ffffffff80414d37>] 0xffffffff80414d37
Jan 31 09:50:10 host8 kernel: [<ffffffff8023b605>] 0xffffffff8023b605
Jan 31 09:50:10 host8 kernel: [<ffffffff8025dbea>] 0xffffffff8025dbea
Jan 31 09:50:10 host8 kernel: [<ffffffff8023b4de>] 0xffffffff8023b4de
Jan 31 09:50:10 host8 kernel: [<ffffffff80225a29>] 0xffffffff80225a29
Jan 31 09:50:10 host8 kernel: [<ffffffff80203c79>] 0xffffffff80203c79
Jan 31 09:50:10 host8 kernel: [<ffffffff8023b497>] 0xffffffff8023b497
Jan 31 09:50:10 host8 kernel: [<ffffffff80203c6f>] 0xffffffff80203c6f
Jan 31 09:50:10 host8 kernel:
Jan 31 09:50:10 host8 kernel: INFO: task bonnie++:20292 blocked for more than 120 seconds.
Jan 31 09:50:10 host8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 31 09:50:10 host8 kernel: bonnie++ D ffff88005a533000 0 20292 2991
Jan 31 09:50:10 host8 kernel: ffff880075a83cb8 0000000000000086 0000000000000000 ffff880030cd7c10
Jan 31 09:50:10 host8 kernel: ffff880001410cf0 ffff88012f0946a0 ffff880001410fb0 00000001a01cf3cb
Jan 31 09:50:10 host8 kernel: 000000000ba734ca ffff8800a57510c8 000000000025a000 ffff88003e8243a0
Jan 31 09:50:10 host8 kernel: Call Trace:
Jan 31 09:50:10 host8 kernel: [<ffffffff80415f99>] 0xffffffff80415f99
Jan 31 09:50:10 host8 kernel: [<ffffffffa01d4040>] 0xffffffffa01d4040
Jan 31 09:50:10 host8 kernel: [<ffffffffa01d9258>] 0xffffffffa01d9258
Jan 31 09:50:10 host8 kernel: [<ffffffffa01d9975>] 0xffffffffa01d9975
Jan 31 09:50:10 host8 kernel: [<ffffffff802805dd>] 0xffffffff802805dd
Jan 31 09:50:10 host8 kernel: [<ffffffff80281d90>] 0xffffffff80281d90
Jan 31 09:50:10 host8 kernel: [<ffffffff8028405d>] 0xffffffff8028405d
Jan 31 09:50:10 host8 kernel: [<ffffffff8028d122>] 0xffffffff8028d122
Jan 31 09:50:10 host8 kernel: [<ffffffffa01cbfe6>] 0xffffffffa01cbfe6
Jan 31 09:50:10 host8 kernel: [<ffffffff80277a2b>] 0xffffffff80277a2b
Jan 31 09:50:10 host8 kernel: [<ffffffff80202d9b>] 0xffffffff80202d9b
Jan 31 09:50:10 host8 kernel:

The kernel version is vanilla 2.6.27.13 + atop + grsecurity patches.
The ocfs2-tools version is 1.4.1-1.

Here are the timeouts:

# /etc/init.d/o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster nutch: Online
Heartbeat dead threshold = 31
Network idle timeout: 30000
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Active

What can I adjust? Or maybe I should use an older kernel?

Thanks in advance.

_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
