Hard to say anything with the info provided. If you are on 1.2.1, upgrade to 1.2.3.
Then file a bug report rcr

Peter Santos wrote:
> Folks,
>
> I'm trying to piece together what happened during a recent event where
> our 3 node RAC cluster had problems. It appears that all 3 nodes
> restarted, which is likely to occur if all 3 nodes cannot communicate
> with the shared ocfs2 storage.
>
> I did find out from our SA that this happened while he was replacing
> a failed drive on the storage and the storage was in a degraded mode.
> I'm trying to understand whether the 3 nodes had a difficult time
> accessing the shared ocfs2 volume or whether it was a tcp connectivity
> issue. Nobody is currently using the cluster, so it should have been
> idle from a user perspective.
>
> prompt># cat /etc/fstab | grep ocfs2
>
> /dev/sdb1 /ocfs2 ocfs2 _netdev,datavolume,nointr 0 0
> /dev/sdb2 /backups ocfs2 _netdev,datavolume,nointr 0 0
>
> We have 2 ocfs2 volumes: one is for the voting and ocr files, while
> the other is to be used as shared storage for backups of archivelog
> files etc.
>
> /var/log/messages
>
> NODE1 (dbo1)
> ========================================================================================================
> Nov 15 17:12:49 dbo1 kernel: (13,3):o2hb_write_timeout:270 ERROR: Heartbeat write timeout to device sdb2 after 12000 milliseconds
> Nov 15 17:12:49 dbo1 kernel: Heartbeat thread (13) printing last 24 blocking operations (cur = 13):
> Nov 16 05:44:58 dbo1 syslogd 1.4.1: restart.
>
> NODE2 (dbo2)
> ========================================================================================================
>
> Nov 15 17:12:57 dbo2 kernel: o2net: connection to node dbo1 (num 0) at 192.168.134.140:7777 has been idle for 10 seconds, shutting it down.
> Nov 15 17:12:57 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1163628767.826089 now 1163628777.825614 dr 1163628767.826070 adv 1163628767.826104:1163628767.826105 func (f0735f96:506) 1163454320.893701:1163454320.893708)
> Nov 15 17:12:57 dbo2 kernel: o2net: no longer connected to node dbo1 (num 0) at 192.168.134.140:7777
> Nov 15 17:12:59 dbo2 kernel: o2net: connection to node dbo3 (num 2) at 192.168.134.142:7777 has been idle for 10 seconds, shutting it down.
> Nov 15 17:12:59 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1163628769.44144 now 1163628779.43640 dr 1163628769.44123 adv 1163628769.44159:1163628769.44160 func (f7e0383f:504) 1163540424.444236:1163540424.444248)
> Nov 15 17:12:59 dbo2 kernel: o2net: no longer connected to node dbo3 (num 2) at 192.168.134.142:7777
> Nov 15 17:32:37 dbo2 -- MARK --
> Nov 15 17:33:03 dbo2 kernel: (11,1):o2quo_make_decision:121 ERROR: fencing this node because it is only connected to 1 nodes and 2 is needed to make a quorum out of 3 heartbeating nodes
> Nov 15 17:33:03 dbo2 kernel: (11,1):o2hb_stop_all_regions:1889 ERROR: stopping heartbeat on all active regions.
> Nov 15 17:33:03 dbo2 kernel: Kernel panic: ocfs2 is very sorry to be fencing this system by panicing
> Nov 15 17:33:03 dbo2 kernel:
>
> NODE3 (dbo3)
> ========================================================================================================
> Nov 15 17:12:49 dbo3 kernel: (13,3):o2hb_write_timeout:270 ERROR: Heartbeat write timeout to device sdb2 after 12000 milliseconds
> Nov 15 17:12:49 dbo3 kernel: Heartbeat thread (13) printing last 24 blocking operations (cur = 11):
> Nov 16 10:45:32 dbo3 syslogd 1.4.1: restart.
>
> any help is greatly appreciated (BTW, I've read the ocfs2 user guide).
>
> thanks
> -peter
>
> _______________________________________________
> Ocfs2-users mailing list
> [email protected]
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
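
For reference, the arithmetic behind the o2quo_make_decision message on dbo2 ("only connected to 1 nodes and 2 is needed to make a quorum out of 3 heartbeating nodes") can be sketched as a plain majority rule. This is only an illustration under assumptions, not the actual ocfs2 code: the function names majority_quorum and must_fence are made up here, the count of 1 is assumed to include the node itself (dbo2 had dropped its connections to both dbo1 and dbo3), and the real o2quo logic may handle ties in even-sized clusters differently.

```python
# Illustrative sketch only: a simple majority rule that reproduces the
# numbers in the dbo2 log line. Function names are hypothetical and this
# is not the ocfs2 implementation.

def majority_quorum(heartbeating: int) -> int:
    """Smallest node count that is a strict majority of heartbeating nodes."""
    return heartbeating // 2 + 1


def must_fence(connected: int, heartbeating: int) -> bool:
    """True when a node sees fewer nodes (assumed to include itself) than the quorum."""
    return connected < majority_quorum(heartbeating)


if __name__ == "__main__":
    # Values taken from the log: 3 heartbeating nodes, dbo2 connected to 1.
    print("quorum needed:", majority_quorum(3))
    print("dbo2 must fence:", must_fence(1, 3))
```

Under this reading, a node that can still reach one peer (2 connected out of 3) would stay up, while a node that has lost both network connections fences itself even if it can still heartbeat to disk.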
