It's a #define in the code at the moment. The timeout is triggered if the o2net code does not receive a valid message within the timeout interval. That is, a message that is valid from the point of view of the o2net layer, not the operating system.
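Roughly, the behaviour looks like the sketch below. This is a simplified Python model and not the o2net source: `IDLE_TIMEOUT_SECS`, `Connection`, `on_message` and `is_idle` are names made up for illustration, and the 10-second value is simply the one reported in the log message quoted further down.

```python
# Simplified model of an idle timer. All names are hypothetical and are
# NOT taken from the o2net code.

# Fixed at build time in this sketch, standing in for the "#define".
IDLE_TIMEOUT_SECS = 10


class Connection:
    """Tracks when the last valid message arrived on one node connection."""

    def __init__(self, now):
        self.last_valid_msg = now

    def on_message(self, now, valid):
        # Only messages accepted by the messaging layer itself reset the
        # timer. Traffic that merely reaches the OS network stack does not.
        if valid:
            self.last_valid_msg = now

    def is_idle(self, now):
        # True once no valid message has been seen for the whole interval,
        # at which point the connection would be shut down.
        return now - self.last_valid_msg >= IDLE_TIMEOUT_SECS
```

In this model a node that still answers an OS-level ping but sends nothing the messaging layer accepts is still treated as idle once the interval has passed.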
Andy

On Tue, 2006-11-28 at 12:00 -0500, Peter Santos wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> So are you saying that the "10 second" message is currently not configurable?
>
> Also, does this message result because of a failure to ping the heartbeat device or
> a network ping against the host's network card?
>
> - -peter
>
>
> Sunil Mushran wrote:
> > As ocfs2 heartbeats on the same device, unplugging a different device on the
> > storage should not affect ocfs2 as long as the ios are completing. But the logs
> > indicate otherwise. HB ios are erroring out.
> >
> > The o2net message is the tcp connect message. We will be providing a way
> > to configure that too.
> >
> > Peter Santos wrote:
> > Sunil,
> >
> > after trying to chase this down, I think one of our sa's might have restarted the storage without
> > notifying anyone.
> >
> > Similarly, today a disk that was not in use was re-initialized and caused everything to come down. I don't
> > know if this is an issue with ocfs2 or ( old_storage + our sa doing this incorrectly).
> >
> > The idea was to re-initialize a disk that was not being used (sdc) and not have it affect
> > the ocfs2 storage (sdb).
> >
> > After the re-initialization completed, I noticed that all 3 nodes weren't working and this was
> > what I found on dbo3
> >
> > =======================================================================================================================
> >
> > Nov 21 11:40:36 dbo3 kernel: o2net: connection to node dbo2 (num 1) at 192.168.134.141:7777 has been idle for 10 seconds, shutting it down.
> >
> > Nov 21 11:40:36 dbo3 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1164127226.293816 now 1164127236.291931 dr 1164127226.293797 adv 1164127226.293818:1164127226.293819 func (a77953f3:2) 1164124426.747626:1164124426.747628)
> >
> > Nov 21 11:40:36 dbo3 kernel: o2net: no longer connected to node dbo2 (num 1) at 192.168.134.141:7777
> >
> > Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> > Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 591502543
> > Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> > ...
> > Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> > Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 591502568
> > Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
> > Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
> > Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> > Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 1983
> > Nov 21 11:41:11 dbo3 kernel: (6614,0):o2hb_bio_end_io:332 ERROR: IO Error -5
> > Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
> > Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 3921780
> > Nov 21 11:41:11 dbo3 kernel: (6614,0):o2hb_bio_end_io:332 ERROR: IO Error -5
> > Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
> > Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
> > ...
> > Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
> > Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954 ERROR: status = -5
> > Nov 21 11:41:11 dbo3 su: pam_unix2: session finished for user oracle, service su
> > Nov 21 11:41:11 dbo3 logger: Oracle CSSD failure 134.
> > Nov 21 11:45:07 dbo3 syslogd 1.4.1: restart.
> >
> > I'm curious about the message
> > "o2net: connection to node dbo2 (num 1) at 192.168.134.141:7777 has been idle for 10 seconds, shutting it down."
> >
> > I have increased my O2CB_HEARTBEAT_THRESHOLD to 61, but where is this message getting "10 seconds" from?
> > Also, this message is displayed because dbo2 was not able to check into the heartbeat filesystem, right?
> >
> > -peter
> >
> > Sunil Mushran wrote:
> >
> >>>> On nodes db01 and db03 hb timed-out at 17:12:49. However, the nodes
> >>>> did not fully panic. As in, the network was shutdown but the hb thread
> >>>> was still going strong for some reason.
> >>>>
> >>>> Within 10 secs of that, by 17:12:59, db02 detected loss of network
> >>>> connectivity with both nodes db01 and db03. However, it was still
> >>>> seeing the nodes hb on disk and assumed that they were alive. As per
> >>>> quorum rules, it panicked.
> >>>>
> >>>> So the qs is: what was happening on nodes db01 and db03 after 17:12:49?
> >>>>
> >>>> Peter Santos wrote:
> >>>> Folks,
> >>>> I'm trying to piece together what happened during a recent event where
> >>>> our 3 node RAC cluster had problems.
> >>>> It appears that all 3 nodes restarted .. which is likely to occur if all 3 nodes cannot communicate with the
> >>>> shared ocfs2 storage.
> >>>>
> >>>> I did find out from our SA, that this happened during the time he was replacing a failed drive on the storage
> >>>> and the storage was in a degraded mode.
> >>>> I'm trying to understand if the 3 nodes had a difficult time accessing
> >>>> the shared ocfs2 volume or was it a tcp connectivity issue. There is nobody currently using the cluster ..so
> >>>> it should have been idle from a user perspective.
> >>>>
> >>>> prompt># cat /etc/fstab | grep ocfs2
> >>>>
> >>>> /dev/sdb1 /ocfs2 ocfs2 _netdev,datavolume,nointr 0 0
> >>>> /dev/sdb2 /backups ocfs2 _netdev,datavolume,nointr 0 0
> >>>>
> >>>> we have 2 ocfs2 volumes.. one is for the voting and ocr files, while the other is to be used as a
> >>>> shared storage for backups of archivelog files etc.
> >>>>
> >>>> /var/log/messages
> >>>>
> >>>> NODE1 (dbo1)
> >>>> ========================================================================================================
> >>>>
> >>>> Nov 15 17:12:49 dbo1 kernel: (13,3):o2hb_write_timeout:270 ERROR: Heartbeat write timeout to device sdb2 after 12000 milliseconds
> >>>> Nov 15 17:12:49 dbo1 kernel: Heartbeat thread (13) printing last 24 blocking operations (cur = 13):
> >>>> Nov 16 05:44:58 dbo1 syslogd 1.4.1: restart.
> >>>>
> >>>> NODE2 (dbo2)
> >>>> ========================================================================================================
> >>>>
> >>>> Nov 15 17:12:57 dbo2 kernel: o2net: connection to node dbo1 (num 0) at 192.168.134.140:7777 has been idle for 10 seconds, shutting it down.
> >>>> Nov 15 17:12:57 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1163628767.826089 now 1163628777.825614 dr 1163628767.826070 adv 1163628767.826104:1163628767.826105 func (f0735f96:506) 1163454320.893701:1163454320.893708)
> >>>> Nov 15 17:12:57 dbo2 kernel: o2net: no longer connected to node dbo1 (num 0) at 192.168.134.140:7777
> >>>> Nov 15 17:12:59 dbo2 kernel: o2net: connection to node dbo3 (num 2) at 192.168.134.142:7777 has been idle for 10 seconds, shutting it down.
> >>>> Nov 15 17:12:59 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some times that might help debug the situation: (tmr 1163628769.44144 now 1163628779.43640 dr 1163628769.44123 adv 1163628769.44159:1163628769.44160 func (f7e0383f:504) 1163540424.444236:1163540424.444248)
> >>>> Nov 15 17:12:59 dbo2 kernel: o2net: no longer connected to node dbo3 (num 2) at 192.168.134.142:7777
> >>>> Nov 15 17:32:37 dbo2 -- MARK --
> >>>> Nov 15 17:33:03 dbo2 kernel: (11,1):o2quo_make_decision:121 ERROR: fencing this node because it is only connected to 1 nodes and 2 is needed to make a quorum out of 3 heartbeating nodes
> >>>> Nov 15 17:33:03 dbo2 kernel: (11,1):o2hb_stop_all_regions:1889 ERROR: stopping heartbeat on all active regions.
> >>>> Nov 15 17:33:03 dbo2 kernel: Kernel panic: ocfs2 is very sorry to be fencing this system by panicing
> >>>> Nov 15 17:33:03 dbo2 kernel:
> >>>>
> >>>> NODE3 (dbo3)
> >>>> ========================================================================================================
> >>>>
> >>>> Nov 15 17:12:49 dbo3 kernel: (13,3):o2hb_write_timeout:270 ERROR: Heartbeat write timeout to device sdb2 after 12000 milliseconds
> >>>> Nov 15 17:12:49 dbo3 kernel: Heartbeat thread (13) printing last 24 blocking operations (cur = 11):
> >>>> Nov 16 10:45:32 dbo3 syslogd 1.4.1: restart.
> >>>>
> >>>> any help is greatly appreciated (BTW, I've read the ocfs2 user guide).
> >>>>
> >>>> thanks
> >>>> -peter
> >>>>
> > _______________________________________________
> > Ocfs2-users mailing list
> > [email protected]
> > http://oss.oracle.com/mailman/listinfo/ocfs2-users
> >
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.1 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iD8DBQFFbGs/oyy5QBCjoT0RAs5YAJ9Rcks/NmKQ2iu4x8I4ZLcpp8wfxgCgmjWJ
> PUEYoxg/1p1XrcylzVnGo/Y=
> =aBV/
> -----END PGP SIGNATURE-----

-- 
Andy Phillips
Systems Architecture Manager, Betfair.com
Office: 0208 8348436
Betfair Ltd|Winslow Road|Hammersmith Embankment|London|W69HP
Company No. 5140986
