Re: [Ocfs2-users] re: o2hb_write_timeout:270 ERROR: Heartbeat write timeout

Sunil Mushran Thu, 16 Nov 2006 11:53:27 -0800

On nodes db01 and db03, the heartbeat timed out at 17:12:49. However,
the nodes did not fully panic: the network was shut down, but the hb
thread was still going strong for some reason.
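
For reference, the "after 12000 milliseconds" in the o2hb_write_timeout
message is consistent with the usual relationship between
O2CB_HEARTBEAT_THRESHOLD and the disk heartbeat timeout. The sketch below
assumes a 2-second heartbeat interval and a threshold of 7, which is my
recollection of the OCFS2 1.2 defaults and is not stated anywhere in this
thread; the function names are made up for illustration.

```python
# Assumption: the heartbeat thread writes to disk every 2 seconds.
HEARTBEAT_INTERVAL_MS = 2000


def write_timeout_ms(threshold: int) -> int:
    """Timeout implied by a given O2CB_HEARTBEAT_THRESHOLD value."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_MS


def threshold_for_timeout(timeout_secs: int) -> int:
    """Threshold needed to get a desired timeout, given in seconds."""
    return timeout_secs // 2 + 1


if __name__ == "__main__":
    # Assumed threshold of 7 gives the 12000 ms seen in the logs.
    print(write_timeout_ms(7))
    # A more forgiving 60-second timeout would need a larger threshold.
    print(threshold_for_timeout(60))
```

If the storage is expected to stall for longer than this during degraded
operation, raising the threshold is the usual knob to look at.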


Within 10 secs of that, by 17:12:59, db02 detected loss of network
connectivity with both nodes db01 and db03. However, it was still
seeing the nodes' heartbeats on disk and assumed that they were alive.
As per quorum rules, it panicked.
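
The quorum rule can be sketched roughly as follows. This is a simplified
model inferred from the o2quo_make_decision message in the log ("only
connected to 1 nodes and 2 is needed to make a quorum out of 3
heartbeating nodes"), not the kernel code itself. The function name is
hypothetical, the connected count is assumed to include the node itself,
and the tie-break for even-sized clusters is my recollection rather than
something shown in this thread.

```python
def should_fence(heartbeating: int, connected: int,
                 lowest_reachable: bool = True) -> bool:
    """Return True if this node should fence itself.

    heartbeating     -- nodes seen heartbeating on disk (including self)
    connected        -- nodes reachable over the network (including self)
    lowest_reachable -- whether the lowest-numbered heartbeating node is
                        reachable (only matters for even-sized clusters)
    """
    if heartbeating % 2 == 1:
        # Odd number of nodes: we need a strict majority.
        quorum = (heartbeating + 1) // 2
        return connected < quorum
    # Even number of nodes: half is enough, but on an exact tie the half
    # that can reach the lowest-numbered node survives (assumption).
    quorum = heartbeating // 2
    if connected < quorum:
        return True
    return connected == quorum and not lowest_reachable


if __name__ == "__main__":
    # db02's situation: 3 nodes heartbeating on disk, only itself connected.
    print(should_fence(heartbeating=3, connected=1))
```

With 3 heartbeating nodes and only 1 connected, the quorum of 2 is not
met, which matches the fencing message on that node.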

So the question is: what was happening on nodes db01 and db03 after
17:12:49?

Peter Santos wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Folks,
        
I'm trying to piece together what happened during a recent event where
our 3-node RAC cluster had problems. It appears that all 3 nodes
restarted, which is likely to occur if all 3 nodes cannot communicate
with the shared ocfs2 storage.

I did find out from our SA that this happened while he was replacing a
failed drive on the storage and the storage was in degraded mode. I'm
trying to understand whether the 3 nodes had a difficult time accessing
the shared ocfs2 volume or whether it was a TCP connectivity issue.
Nobody is currently using the cluster, so it should have been idle from
a user perspective.


prompt># cat /etc/fstab | grep ocfs2

/dev/sdb1  /ocfs2       ocfs2      _netdev,datavolume,nointr  0 0
/dev/sdb2  /backups     ocfs2      _netdev,datavolume,nointr  0 0

We have 2 ocfs2 volumes: one is for the voting and OCR files, while the
other is to be used as shared storage for backups of archivelog files
etc.


/var/log/messages


NODE1 (dbo1)
========================================================================================================
Nov 15 17:12:49 dbo1 kernel: (13,3):o2hb_write_timeout:270 ERROR: Heartbeat write timeout to device sdb2 after 12000 milliseconds
Nov 15 17:12:49 dbo1 kernel: Heartbeat thread (13) printing last 24 blocking operations (cur = 13):
Nov 16 05:44:58 dbo1 syslogd 1.4.1: restart.


NODE2 (dbo2)
========================================================================================================

Nov 15 17:12:57 dbo2 kernel: o2net: connection to node dbo1 (num 0) at 192.168.134.140:7777 has been idle for 10 seconds, shutting it down.
Nov 15 17:12:57 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1163628767.826089 now 1163628777.825614 dr 1163628767.826070 adv 1163628767.826104:1163628767.826105 func (f0735f96:506) 1163454320.893701:1163454320.893708)
Nov 15 17:12:57 dbo2 kernel: o2net: no longer connected to node dbo1 (num 0) at 192.168.134.140:7777
Nov 15 17:12:59 dbo2 kernel: o2net: connection to node dbo3 (num 2) at 192.168.134.142:7777 has been idle for 10 seconds, shutting it down.
Nov 15 17:12:59 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1163628769.44144 now 1163628779.43640 dr 1163628769.44123 adv 1163628769.44159:1163628769.44160 func (f7e0383f:504) 1163540424.444236:1163540424.444248)
Nov 15 17:12:59 dbo2 kernel: o2net: no longer connected to node dbo3 (num 2) at 192.168.134.142:7777
Nov 15 17:32:37 dbo2 -- MARK --
Nov 15 17:33:03 dbo2 kernel: (11,1):o2quo_make_decision:121 ERROR: fencing this node because it is only connected to 1 nodes and 2 is needed to make a quorum out of 3 heartbeating nodes
Nov 15 17:33:03 dbo2 kernel: (11,1):o2hb_stop_all_regions:1889 ERROR: stopping heartbeat on all active regions.
Nov 15 17:33:03 dbo2 kernel: Kernel panic: ocfs2 is very sorry to be fencing this system by panicing
Nov 15 17:33:03 dbo2 kernel:

NODE3 (dbo3)
========================================================================================================
Nov 15 17:12:49 dbo3 kernel: (13,3):o2hb_write_timeout:270 ERROR: Heartbeat write timeout to device sdb2 after 12000 milliseconds
Nov 15 17:12:49 dbo3 kernel: Heartbeat thread (13) printing last 24 blocking operations (cur = 11):
Nov 16 10:45:32 dbo3 syslogd 1.4.1: restart.
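
The tmr/now pairs in the o2net_idle_timer lines from dbo2 above can be
decoded to confirm the 10-second idle window. A rough sketch follows; the
helper names are made up for illustration. It assumes the values are
printed as seconds and microseconds with no zero padding, which is my
inference from the five-digit fractions such as ".44144", so the two
fields are parsed as separate integers rather than as one decimal number.

```python
from datetime import datetime, timezone


def to_usec(stamp: str) -> int:
    """Convert 'sec.usec' (usec possibly unpadded) to microseconds."""
    sec, usec = stamp.split(".")
    return int(sec) * 1_000_000 + int(usec)


def idle_usec(tmr: str, now: str) -> int:
    """Microseconds between the timer being armed and it firing."""
    return to_usec(now) - to_usec(tmr)


# Values copied from the two o2net_idle_timer lines on dbo2.
dbo1_idle = idle_usec("1163628767.826089", "1163628777.825614")
dbo3_idle = idle_usec("1163628769.44144", "1163628779.43640")

# Wall-clock time of the first event, in UTC.
first_event_utc = datetime.fromtimestamp(1163628777, tz=timezone.utc)

if __name__ == "__main__":
    print(f"dbo1 idle: {dbo1_idle / 1_000_000:.6f} s")
    print(f"dbo3 idle: {dbo3_idle / 1_000_000:.6f} s")
    print(f"first idle timeout at {first_event_utc.isoformat()}")
```

Both connections had been idle for just under 10 seconds when the timer
fired, and the first event falls at 22:12:57 UTC, which lines up with the
17:12:57 syslog timestamp if the servers log in UTC-5.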


any help is greatly appreciated (BTW, I've read the ocfs2 user guide).

thanks
- -peter

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFXKstoyy5QBCjoT0RAv82AJ9cAGUON4K2/ixbB3NxTtjL/yORlACeJFvH
RVxoqk930affeEnK3yw5SIU=
=eqqi
-----END PGP SIGNATURE-----

_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users

