Karim Alkhayer wrote:
> Hi Alexander,
> Try to increase the value of "Heartbeat dead threshold"
> 5 minutes could be acceptable for the preliminary testing under load
> This will allow you to assess the problem before the node(s) die
>
> Best regards,
> Karim
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Alexander
> Mestiashvili
> Sent: Sunday, February 01, 2009 7:31 PM
> To: [email protected]
> Subject: [Ocfs2-users] ocfs2 hosts reboot under load
>
> Hello, I have troubles with my 4 node ocfs2 cluster. Hosts reboot under
> load.
>
> hardware is 4 dell 1850 servers connected via 100M network.
> storage is raid 5 connected with fiber channel.
> I ran bonnie++ simultaneously on two hosts for testing.
> On the second host (host 8) I got such messages in kern.log.
>
> first one (host 7) rebooted at
> Jan 30 16:23:48 host8 kernel: o2net: connection to node host7 (num 0) at 192.168.0.27:7777 has been idle for 30.0 seconds, shutting it down.
>
>
> mount | grep ocfs
> ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
> /dev/sda on /shared type ocfs2 (rw,_netdev,heartbeat=local)
>
> command I used: bonnie++ -d /shared/ocfs2_nutch8/ -u root -s 0 -n 100:100m:10k:100
>
> Jan 30 16:23:48 host8 kernel: o2net: connection to node host7 (num 0) at 192.168.0.27:7777 has been idle for 30.0 seconds, shutting it down.
> Jan 30 16:23:48 host8 kernel: (0,0):o2net_idle_timer:1498 here are some times that might help debug the situation: (tmr 1233328998.315538 now 1233329028.313246 dr 1233328998.315530 adv 1233328998.315541:1233328998.315541 func (fa7e1976:502) 1233328900.631572:1233328900.631582)
> Jan 30 16:23:48 host8 kernel: o2net: no longer connected to node host7 (num 0) at 192.168.0.27:7777
> Jan 30 16:23:48 host8 kernel: (16132,0):dlm_do_master_request:1335 ERROR: link to 0 went down!
> Jan 30 16:23:48 host8 kernel: (16132,0):dlm_get_lock_resource:912 ERROR: status = -112
> Jan 30 16:23:55 host8 kernel: (2616,1):o2dlm_eviction_cb:258 o2dlm has evicted node 0 from group DE9BC917EFB247458EF221C2167F6CC1
> Jan 30 16:23:58 host8 kernel: (16132,0):dlm_restart_lock_mastery:1218 ERROR: node down! 0
> Jan 30 16:23:58 host8 kernel: (16132,0):dlm_wait_for_lock_mastery:1035 ERROR: status = -11
> Jan 30 16:24:00 host8 kernel: (16132,0):dlm_get_lock_resource:893 DE9BC917EFB247458EF221C2167F6CC1:N0000000009f618da: at least one node (0) to recover before lock mastery can begin
> Jan 30 16:24:22 host8 last message repeated 2 times
> Jan 30 16:25:18 host8 kernel: o2net: connected to node host7 (num 0) at 192.168.0.27:7777
> Jan 30 16:25:18 host8 kernel: ocfs2_dlm: Node 0 joins domain DE9BC917EFB247458EF221C2167F6CC1
> Jan 30 16:25:18 host8 kernel: ocfs2_dlm: Nodes in domain ("DE9BC917EFB247458EF221C2167F6CC1"): 0 1 2 3
> Jan 30 16:42:11 host8 kernel: INFO: task kswapd0:207 blocked for more than 120 seconds.
> Jan 30 16:42:11 host8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jan 30 16:42:11 host8 kernel: kswapd0 D 0000000000000100 0 207 > 2 > Jan 30 16:42:11 host8 kernel: ffff88012dd09cf0 0000000000000046 > ffff88012e2dc148 ffffffff8021e03f > Jan 30 16:42:11 host8 kernel: ffff88012fbd7340 ffff88012faf46a0 > ffff88012fbd7600 0000000000000001 > Jan 30 16:42:11 host8 kernel: 0000000000000286 0000000000000003 > ffff88012dd09cf0 ffffffff8021ec30 > Jan 30 16:42:11 host8 kernel: Call Trace: > Jan 30 16:42:11 host8 kernel: [<ffffffff8021e03f>] 0xffffffff8021e03f > Jan 30 16:42:11 host8 kernel: [<ffffffff8021ec30>] 0xffffffff8021ec30 > Jan 30 16:42:11 host8 kernel: [<ffffffffa01ceb1b>] 0xffffffffa01ceb1b > Jan 30 16:42:11 host8 kernel: [<ffffffff8023b605>] 0xffffffff8023b605 > Jan 30 16:42:11 host8 kernel: [<ffffffff8028bbe0>] 0xffffffff8028bbe0 > Jan 30 16:42:11 host8 kernel: [<ffffffff8028c201>] 0xffffffff8028c201 > Jan 30 16:42:11 host8 kernel: [<ffffffff8028c469>] 0xffffffff8028c469 > Jan 30 16:42:11 host8 kernel: [<ffffffff8025d7d8>] 0xffffffff8025d7d8 > Jan 30 16:42:11 host8 kernel: [<ffffffff8025df2b>] 0xffffffff8025df2b > Jan 30 16:42:11 host8 kernel: [<ffffffff8025cb00>] 0xffffffff8025cb00 > Jan 30 16:42:11 host8 kernel: [<ffffffff80414d37>] 0xffffffff80414d37 > Jan 30 16:42:11 host8 kernel: [<ffffffff8023b605>] 0xffffffff8023b605 > Jan 30 16:42:11 host8 kernel: [<ffffffff8025dbea>] 0xffffffff8025dbea > Jan 30 16:42:11 host8 kernel: [<ffffffff8023b4de>] 0xffffffff8023b4de > Jan 30 16:42:11 host8 kernel: [<ffffffff80225a29>] 0xffffffff80225a29 > Jan 30 16:42:11 host8 kernel: [<ffffffff80203c79>] 0xffffffff80203c79 > Jan 30 16:42:11 host8 kernel: [<ffffffff8023b497>] 0xffffffff8023b497 > Jan 30 16:42:11 host8 kernel: [<ffffffff80203c6f>] 0xffffffff80203c6f > Jan 30 16:42:11 host8 kernel: > Jan 30 16:42:11 host8 kernel: INFO: task bonnie++:16132 blocked for more > than 120 seconds. > Jan 30 16:42:11 host8 kernel: "echo 0 > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> Jan 30 16:42:11 host8 kernel: bonnie++ D 0000000102fe0588 0 16132 > 2991 > Jan 30 16:42:11 host8 kernel: ffff88010022f888 0000000000000086 > 0000000000000000 ffff880123438748 > Jan 30 16:42:11 host8 kernel: ffff88012fbd6cf0 ffff88012fa7a6a0 > ffff88012fbd6fb0 000000012f402380 > Jan 30 16:42:11 host8 kernel: 0000000000000003 0000000000000001 > 0000000000000000 0000000000000000 > Jan 30 16:42:11 host8 kernel: Call Trace: > Jan 30 16:42:11 host8 kernel: [<ffffffff80415f99>] 0xffffffff80415f99 > Jan 30 16:42:11 host8 kernel: [<ffffffffa01d4040>] 0xffffffffa01d4040 > Jan 30 16:42:11 host8 kernel: [<ffffffffa01c78a5>] 0xffffffffa01c78a5 > Jan 30 16:42:11 host8 kernel: [<ffffffff80299508>] 0xffffffff80299508 > Jan 30 16:42:11 host8 kernel: [<ffffffffa01c9524>] 0xffffffffa01c9524 > Jan 30 16:42:11 host8 kernel: [<ffffffffa01b60fb>] 0xffffffffa01b60fb > Jan 30 16:42:11 host8 kernel: [<ffffffffa01be69d>] 0xffffffffa01be69d > Jan 30 16:42:11 host8 kernel: [<ffffffff80415e90>] 0xffffffff80415e90 > Jan 30 16:42:11 host8 kernel: [<ffffffffa01b8215>] 0xffffffffa01b8215 > Jan 30 16:42:11 host8 kernel: [<ffffffff80254305>] 0xffffffff80254305 > Jan 30 16:42:11 host8 kernel: [<ffffffffa01eada0>] 0xffffffffa01eada0 > Jan 30 16:42:11 host8 kernel: [<ffffffffa01eada0>] 0xffffffffa01eada0 > Jan 30 16:42:11 host8 kernel: [<ffffffff8022dd99>] 0xffffffff8022dd99 > Jan 30 16:42:11 host8 kernel: [<ffffffff80253441>] 0xffffffff80253441 > Jan 30 16:42:11 host8 kernel: [<ffffffff8028c750>] 0xffffffff8028c750 > Jan 30 16:42:11 host8 kernel: [<ffffffff80254cfa>] 0xffffffff80254cfa > Jan 30 16:42:11 host8 kernel: [<ffffffffa01bd79f>] 0xffffffffa01bd79f > Jan 30 16:42:11 host8 kernel: [<ffffffff80254e17>] 0xffffffff80254e17 > Jan 30 16:42:11 host8 kernel: [<ffffffffa01cc468>] 0xffffffffa01cc468 > Jan 30 16:42:11 host8 kernel: [<ffffffffa01c6724>] 0xffffffffa01c6724 > Jan 30 16:42:11 host8 kernel: [<ffffffff80279227>] 0xffffffff80279227 > Jan 30 16:42:11 host8 kernel: [<ffffffff80277c41>] 
0xffffffff80277c41 > Jan 30 16:42:11 host8 kernel: [<ffffffff8023b605>] 0xffffffff8023b605 > Jan 30 16:42:11 host8 kernel: [<ffffffff8028d122>] 0xffffffff8028d122 > Jan 30 16:42:11 host8 kernel: [<ffffffffa01cbfe6>] 0xffffffffa01cbfe6 > Jan 30 16:42:11 host8 kernel: [<ffffffff80279984>] 0xffffffff80279984 > Jan 30 16:42:11 host8 kernel: [<ffffffff80279e0c>] 0xffffffff80279e0c > Jan 30 16:42:11 host8 kernel: [<ffffffff80202d9b>] 0xffffffff80202d9b > Jan 30 16:42:11 host8 kernel: > Jan 31 09:50:10 host8 kernel: INFO: task kswapd0:207 blocked for more than > 120 seconds. > Jan 31 09:50:10 host8 kernel: "echo 0 > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > Jan 31 09:50:10 host8 kernel: kswapd0 D 0000000000000080 0 207 > 2 > Jan 31 09:50:10 host8 kernel: ffff88012dd09cf0 0000000000000046 > ffff88012e2dc148 ffffffff8021e03f > Jan 31 09:50:10 host8 kernel: ffff88012fbd7340 ffff88012faf46a0 > ffff88012fbd7600 0000000000000001 > Jan 31 09:50:10 host8 kernel: 0000000000000286 0000000000000003 > ffff88012dd09cf0 ffffffff8021ec30 > Jan 31 09:50:10 host8 kernel: Call Trace: > Jan 31 09:50:10 host8 kernel: [<ffffffff8021e03f>] 0xffffffff8021e03f > Jan 31 09:50:10 host8 kernel: [<ffffffff8021ec30>] 0xffffffff8021ec30 > Jan 31 09:50:10 host8 kernel: [<ffffffffa01ceb1b>] 0xffffffffa01ceb1b > Jan 31 09:50:10 host8 kernel: [<ffffffff8023b605>] 0xffffffff8023b605 > Jan 31 09:50:10 host8 kernel: [<ffffffff8028bbe0>] 0xffffffff8028bbe0 > Jan 31 09:50:10 host8 kernel: [<ffffffff8028c201>] 0xffffffff8028c201 > Jan 31 09:50:10 host8 kernel: [<ffffffff8028c469>] 0xffffffff8028c469 > Jan 31 09:50:10 host8 kernel: [<ffffffff8025d7d8>] 0xffffffff8025d7d8 > Jan 31 09:50:10 host8 kernel: [<ffffffff8025df2b>] 0xffffffff8025df2b > Jan 31 09:50:10 host8 kernel: [<ffffffff8025cb00>] 0xffffffff8025cb00 > Jan 31 09:50:10 host8 kernel: [<ffffffff80414d37>] 0xffffffff80414d37 > Jan 31 09:50:10 host8 kernel: [<ffffffff8023b605>] 0xffffffff8023b605 > Jan 31 09:50:10 host8 kernel: 
[<ffffffff8025dbea>] 0xffffffff8025dbea > Jan 31 09:50:10 host8 kernel: [<ffffffff8023b4de>] 0xffffffff8023b4de > Jan 31 09:50:10 host8 kernel: [<ffffffff80225a29>] 0xffffffff80225a29 > Jan 31 09:50:10 host8 kernel: [<ffffffff80203c79>] 0xffffffff80203c79 > Jan 31 09:50:10 host8 kernel: [<ffffffff8023b497>] 0xffffffff8023b497 > Jan 31 09:50:10 host8 kernel: [<ffffffff80203c6f>] 0xffffffff80203c6f > Jan 31 09:50:10 host8 kernel: > Jan 31 09:50:10 host8 kernel: INFO: task bonnie++:20292 blocked for more > than 120 seconds. > Jan 31 09:50:10 host8 kernel: "echo 0 > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > Jan 31 09:50:10 host8 kernel: bonnie++ D ffff88005a533000 0 20292 > 2991 > Jan 31 09:50:10 host8 kernel: ffff880075a83cb8 0000000000000086 > 0000000000000000 ffff880030cd7c10 > Jan 31 09:50:10 host8 kernel: ffff880001410cf0 ffff88012f0946a0 > ffff880001410fb0 00000001a01cf3cb > Jan 31 09:50:10 host8 kernel: 000000000ba734ca ffff8800a57510c8 > 000000000025a000 ffff88003e8243a0 > Jan 31 09:50:10 host8 kernel: Call Trace: > Jan 31 09:50:10 host8 kernel: [<ffffffff80415f99>] 0xffffffff80415f99 > Jan 31 09:50:10 host8 kernel: [<ffffffffa01d4040>] 0xffffffffa01d4040 > Jan 31 09:50:10 host8 kernel: [<ffffffffa01d9258>] 0xffffffffa01d9258 > Jan 31 09:50:10 host8 kernel: [<ffffffffa01d9975>] 0xffffffffa01d9975 > Jan 31 09:50:10 host8 kernel: [<ffffffff802805dd>] 0xffffffff802805dd > Jan 31 09:50:10 host8 kernel: [<ffffffff80281d90>] 0xffffffff80281d90 > Jan 31 09:50:10 host8 kernel: [<ffffffff8028405d>] 0xffffffff8028405d > Jan 31 09:50:10 host8 kernel: [<ffffffff8028d122>] 0xffffffff8028d122 > Jan 31 09:50:10 host8 kernel: [<ffffffffa01cbfe6>] 0xffffffffa01cbfe6 > Jan 31 09:50:10 host8 kernel: [<ffffffff80277a2b>] 0xffffffff80277a2b > Jan 31 09:50:10 host8 kernel: [<ffffffff80202d9b>] 0xffffffff80202d9b > Jan 31 09:50:10 host8 kernel: > > kernel version is vanilla 2.6.27.13 + atop + grsecurity patches > ocfs-tools version is 1.4.1-1 > > here is 
timeouts:
> #/etc/init.d/o2cb status
> Driver for "configfs": Loaded
> Filesystem "configfs": Mounted
> Stack glue driver: Loaded
> Stack plugin "o2cb": Loaded
> Driver for "ocfs2_dlmfs": Loaded
> Filesystem "ocfs2_dlmfs": Mounted
> Checking O2CB cluster nutch: Online
> Heartbeat dead threshold = 31
> Network idle timeout: 30000
> Network keepalive delay: 2000
> Network reconnect delay: 2000
> Checking O2CB heartbeat: Active
>
> what can I adjust? or maybe I should use an older kernel?
> Thanks in advance.
>
>
> _______________________________________________
> Ocfs2-users mailing list
> [email protected]
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>

Thanks Karim, I have changed the timeouts, and the hosts don't reboot any more.
Heartbeat dead threshold = 136
Network idle timeout: 160000
Network keepalive delay: 2000
Network reconnect delay: 2000

But I still have very high iowait, and I see kernel call traces caused by blocked tasks. ocfs2 is also very slow with large numbers of small files. Is there any way to increase ocfs2 performance for small files?

Alex
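
A note on what these numbers mean in seconds. The sketch below is not from the thread; it assumes the usual o2cb convention that the disk heartbeat runs every 2 seconds, so a dead threshold of N corresponds to roughly (N - 1) * 2 seconds, and that the network timeouts are in milliseconds (the 30000 setting matches the "idle for 30.0 seconds" log message above). The function names and the parsing helper are hypothetical, written only to illustrate the arithmetic.

```python
import re

# Assumption: o2cb writes a disk heartbeat every 2 seconds.
HEARTBEAT_INTERVAL_S = 2


def threshold_to_seconds(threshold: int) -> int:
    """Approximate seconds without a heartbeat before a node is declared dead."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


def seconds_to_threshold(seconds: int) -> int:
    """Inverse of the above: the threshold value for a desired timeout."""
    return seconds // HEARTBEAT_INTERVAL_S + 1


def parse_o2cb_status(text: str) -> dict:
    """Collect the numeric 'name = value' / 'name: value' lines from status text."""
    settings = {}
    for line in text.splitlines():
        match = re.match(r"\s*([A-Za-z ]+?)\s*[=:]\s*(\d+)\s*$", line)
        if match:
            settings[match.group(1)] = int(match.group(2))
    return settings


# Values copied from the message above.
NEW_STATUS = """\
Heartbeat dead threshold = 136
Network idle timeout: 160000
Network keepalive delay: 2000
Network reconnect delay: 2000
"""

if __name__ == "__main__":
    new = parse_o2cb_status(NEW_STATUS)
    print("new disk heartbeat timeout (s):",
          threshold_to_seconds(new["Heartbeat dead threshold"]))
    print("new network idle timeout (s):", new["Network idle timeout"] / 1000)
    print("old disk heartbeat timeout (s):", threshold_to_seconds(31))
    print("threshold for 5 minutes:", seconds_to_threshold(300))
```

Under those assumptions, the original threshold of 31 is about 60 seconds, the new value of 136 is about 270 seconds, and the 5 minutes Karim suggested would be a threshold of 151.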
