Disk controller or network controller? For the network, check the duplex mode and interface error counters, or try a separate crossover-cable connection for the heartbeat.
For disk, you can configure the timeout (the number of ticks lost before the system fences itself from the cluster).

----- Original Message -----
From: "Andrew Phillips" <[EMAIL PROTECTED]>
To: "enohi ibekwe" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[email protected]>
Sent: Wednesday, April 11, 2007 2:43 AM
Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic

> Do you see anything else odd in your system logs? For example "losing
> too many ticks"?
> We've traced our problem, which may be similar to yours, to a disk
> controller/firmware/driver that was blocking interrupts for various
> periods of time. We've tried a variety of ways to get it to play nice,
> but without much luck. If the system is unresponsive, or unable to
> handle packet transmission or reception for 10s (unless you use the
> 1.2.5 release) then you'll trigger the o2net_idle_timer shutdown.
>
> Andy
>
> On Wed, 2007-04-11 at 09:13 +0000, enohi ibekwe wrote:
> > Thanks for your help so far.
> >
> > My issue is the frequency at which node 0 gets fenced; it has happened at
> > least once a day in the last 2 days.
> >
> > More details:
> >
> > I am attempting to add a node (node 2) to an existing 2-node (node 0 and
> > node 1) cluster. All nodes are currently running SLES9 (2.6.5-7.283-bigsmp
> > i686) + ocfs 1.2.1-4.2. This is the ocfs package that ships with SLES9. Node
> > 2 is not part of the RAC cluster yet; I have only installed ocfs on it. I
> > can mount the ocfs file system on all nodes, and the ocfs file system is
> > accessible from all nodes.
> >
> > Node 0 is the node that always gets fenced, and it gets fenced very
> > frequently. Before I added the kernel.panic parameter, node 0 would get
> > fenced, panic and hang. Only a power reboot would make it responsive again.
> >
> > This is what happened this morning.
> >
> > I was remotely connected to node 0 via ssh. Then I suddenly lost the
> > connection. I tried to ssh again but node 0 refused the connection.
> >
> > Checking node 1 dmesg I found:
> > ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
> > o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been idle for
> > 10 seconds, shutting it down.
> > (0,3):o2net_idle_timer:1310 here are some times that might help debug the
> > situation: (tmr 1176207822.713473 now 1176207832.712008 dr 1176207822.713466
> > adv 1176207822.713475:1176207822.713476 func (1459c2a9:504)
> > 1176196519.600486:1176196519.600489)
> > o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777
> >
> > Checking node 2 dmesg I found:
> > ocfs2_dlm: Nodes in domain ("A7AE746FB3D34479A4B04C0535A0A341"): 0 1 2
> > o2net: connection to node ora1 (num 0) at 10.12.1.34:7777 has been idle for
> > 10 seconds, shutting it down.
> > (0,0):o2net_idle_timer:1310 here are some times that might help debug the
> > situation: (tmr 1176207823.774296 now 1176207833.772712 dr 1176207823.774293
> > adv 1176207823.774297:1176207823.774297 func (1459c2a9:504)
> > 1176196505.704238:1176196505.704240)
> > o2net: no longer connected to node ora1 (num 0) at 10.12.1.34:7777
> >
> > Since I had reboot on panic enabled on node 0, node 0 restarted. Checking
> > /var/log/messages I found:
> > Apr 10 09:39:50 ora1 kernel: (12,2):o2quo_make_decision:121 ERROR: fencing
> > this node because it is only connected to 1 nodes and 2 is needed to make a
> > quorum out of 3 heartbeating nodes
> > Apr 10 09:39:50 ora1 kernel: (12,2):o2hb_stop_all_regions:1909 ERROR:
> > stopping heartbeat on all active regions.
> > Apr 10 09:39:50 ora1 kernel: Kernel panic: ocfs2 is very sorry to be fencing
> > this system by panicing.
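The two figures in the logs above can be checked by hand. What follows is a minimal Python sketch, not OCFS2 code. It assumes that `tmr` is the time the idle timer was last armed and `now` is the time it fired (inferred from the log text, not confirmed here), and it assumes a plain majority rule for quorum with an odd number of heartbeating nodes. The function names `idle_gap` and `must_fence` are illustrative only.

```python
import re
from decimal import Decimal

# Idle-timer line copied from node 1's dmesg in the message above.
LOG = ("(0,3):o2net_idle_timer:1310 here are some times that might help debug the "
       "situation: (tmr 1176207822.713473 now 1176207832.712008 dr 1176207822.713466 "
       "adv 1176207822.713475:1176207822.713476 func (1459c2a9:504) "
       "1176196519.600486:1176196519.600489)")


def idle_gap(line):
    """Seconds between the 'tmr' and 'now' timestamps in an o2net_idle_timer line."""
    m = re.search(r"tmr (\d+\.\d+) now (\d+\.\d+)", line)
    if m is None:
        raise ValueError("no tmr/now timestamps found")
    # Decimal avoids float rounding on microsecond-resolution epoch times.
    return Decimal(m.group(2)) - Decimal(m.group(1))


def must_fence(connected, heartbeating):
    """Assumed majority rule: fence when fewer than a majority of nodes are reachable.

    Tie-breaking for an even node count is not modelled here.
    """
    needed = heartbeating // 2 + 1
    return connected < needed


print(idle_gap(LOG))      # gap between timer arm and expiry, in seconds
print(must_fence(1, 3))   # node 0's situation: connected to 1 of 3 nodes
```

Under these assumptions the gap comes out just under the 10 seconds named in the o2net message, and a node that can reach only 1 of 3 heartbeating nodes falls below the 2 needed for quorum, which matches the o2quo_make_decision line.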
> >
> > ----Original Message Follows----
> > From: "Alexei_Roudnev" <[EMAIL PROTECTED]>
> > To: "Jeff Mahoney" <[EMAIL PROTECTED]>,"enohi ibekwe" <[EMAIL PROTECTED]>
> > CC: <[email protected]>
> > Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
> > Date: Mon, 9 Apr 2007 11:00:30 -0700
> >
> > It's not an issue; it is really an OCFSv2 killer:
> > - in 99% of cases, it is not a split-brain condition but just a short
> > (20 - 30 seconds) network interruption. Systems can (in most cases) see
> > each other by network or through the voting disk, so they can communicate
> > one way or another;
> > - in 90% of cases the system has no pending IO activity, so it has no
> > reason to fence itself, at least until some IO happens on the OCFSv2 file
> > system. For example, if OCFSv2 is used for backups, it is active for 3
> > hours at night and at restore time only, and the server can remount it
> > without any fencing if it lost consensus.
> > - timeouts and other fencing parameters are badly designed, which makes
> > the problem worse. It can't work out of the box on most SAN networks (with
> > reconfiguration timeouts all about 30 seconds - 1 minute by default). For
> > example, NetApp cluster takeover takes about 20 seconds, and giveback
> > about 40 seconds - which kills OCFSv2 for 100% sure (with default settings).
> > STP timeout (in classical mode) is 40 seconds, which kills OCFSv2 for 100%
> > sure. Network switch reboot time is about 1 minute for most switches, which
> > kills OCFSv2 for 100% sure. Result - if I reboot the staging network switch,
> > I have all standalone servers working, all RAC clusters working, all other
> > servers working, and all OCFSv2 clusters fenced themselves.
> >
> > For me, I banned OCFSv2 from any usage except backup and archive logs, and
> > only with a cross-connection cable for heartbeat.
> > All other scenarios are catastrophic (cause overall cluster failure in many
> > cases). And all because of this fencing behavior.
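The argument above comes down to comparing each outage duration with the fencing timeout. Here is a rough Python sketch of that comparison. It uses the durations quoted in the message and the 10-second network idle timeout reported in the logs earlier in the thread. The 45-second value is a hypothetical raised timeout chosen only for illustration, and `outages_exceeding` is an illustrative name, not an OCFS2 tool.

```python
# Outage durations in seconds, as quoted in the message above.
OUTAGES = {
    "NetApp cluster takeover": 20,
    "NetApp cluster giveback": 40,
    "STP timeout (classical mode)": 40,
    "network switch reboot": 60,
}


def outages_exceeding(timeout_s, outages=OUTAGES):
    """Return the names of outages that last longer than the given timeout."""
    return sorted(name for name, secs in outages.items() if secs > timeout_s)


# 10 s: the idle timeout seen in the o2net log lines in this thread.
print(outages_exceeding(10))
# 45 s: a hypothetical larger timeout, for comparison only.
print(outages_exceeding(45))
```

With a 10-second timeout every listed event outlasts it. This sketch looks at the network idle timeout only and leaves the disk heartbeat out.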
> >
> > PS> SLES9 SP3 build 283 has a very stable OCFSv2, with one well-known
> > problem in buffer use - it doesn't release small buffers after a file is
> > created/deleted (so if you run a create file / remove file loop for a long
> > time, you will deplete system memory in approximately a few days). It is
> > not the case if files are big enough (Oracle backups, Oracle archive logs,
> > application home) but must be taken into account if you have more than
> > 100,000 - 1,000,000 files on OCFSv2 file system(s).
> >
> > But the fencing problem exists in all versions (a little better in modern
> > ones, because the developers added a configurable network timeout). If you
> > add the _one heartbeat interface only_ design and the _no serial heartbeat_
> > design, it really becomes a problem, and that's why I was thinking about
> > testing OCFSv2 in SLES10 with heartbeat2 (heartbeat2 has a very reliable
> > heartbeat and has external fencing, but unfortunately SLES10 is not
> > production ready yet for us, de facto).
> >
> > ----- Original Message -----
> > From: "Jeff Mahoney" <[EMAIL PROTECTED]>
> > To: "enohi ibekwe" <[EMAIL PROTECTED]>
> > Cc: <[email protected]>
> > Sent: Saturday, April 07, 2007 12:06 PM
> > Subject: Re: [Ocfs2-users] OCFS2 Fencing, then panic
> >
> > > -----BEGIN PGP SIGNED MESSAGE-----
> > > Hash: SHA1
> > >
> > > enohi ibekwe wrote:
> > > > Is this also an issue on SLES9?
> > > >
> > > > I see this exact issue on my SLES9 + ocfs 1.2.1-4.2 RAC cluster. I see
> > > > the error on the same box on the cluster.
> > >
> > > I'm not sure what you mean by "issue." This is designed behavior. When
> > > the cluster ends up in a split condition, one or more nodes will fence
> > > themselves.
> > >
> > > - -Jeff
> > >
> > > - --
> > > Jeff Mahoney
> > > SUSE Labs
> > > -----BEGIN PGP SIGNATURE-----
> > > Version: GnuPG v1.4.6 (GNU/Linux)
> > > Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org
> > >
> > > iD8DBQFGF+vDLPWxlyuTD7IRAuNPAJ9lZPLSaH7nOCNammYyW3bwC2Wj5wCgomUp
> > > zcRzcaedVAmk+AaJ/OFeddE=
> > > =8e6c
> > > -----END PGP SIGNATURE-----
> > >
> > > _______________________________________________
> > > Ocfs2-users mailing list
> > > [email protected]
> > > http://oss.oracle.com/mailman/listinfo/ocfs2-users
> --
> Andy Phillips
> Systems Architecture Manager, Betfair.com
