Hi Sunil,

my Lotus Notes choked on the table from Excel... So the two nodes have the
following node numbers:

Node      ocfs2   crs/css
byaz05    0       2
byaz10    1       1

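If both stacks really fence the higher node number in a two-node split (for CSS this is only the "afaik" assumption from your mail, not something I have confirmed), then the numbers above would explain why both nodes went down. A minimal Python sketch of that reasoning; the function names are made up for illustration and are not part of OCFS2 or CRS:

```python
# Node numbers as listed in the table above (ocfs2 vs. crs/css).
NODE_NUMBERS = {
    "ocfs2": {"byaz05": 0, "byaz10": 1},
    "css": {"byaz05": 2, "byaz10": 1},
}


def fenced_node(numbers):
    """Return the node with the higher number.

    Assumption: in a two-node split the stack fences the node with the
    higher node number. For CSS this is unconfirmed.
    """
    return max(numbers, key=numbers.get)


def predicted_victims(node_numbers):
    """Map each stack to the node it would fence under that assumption."""
    return {stack: fenced_node(nums) for stack, nums in node_numbers.items()}


if __name__ == "__main__":
    victims = predicted_victims(NODE_NUMBERS)
    for stack, node in victims.items():
        print(f"{stack} would fence {node}")
    if len(set(victims.values())) > 1:
        print("The stacks pick different nodes, so both nodes can go down.")
```

Under this assumption ocfs2 picks byaz10 and css picks byaz05, which would match what we saw in the test.
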
Greets,
Alex

>In such a situation, ocfs2 fences the higher node number. afaik,
>css does the same. What are the css node numbers for the two nodes?

>alexandra.strauss at bayerbbs.com wrote:
>>
>> Hello,
>>
>> I am writing in the hope that you can help me with my problem... We have
>> an issue here and opened an SR at Metalink, but so far we have received
>> no useful information towards solving it. The SR number is 6855815.994...
>>
>> We wanted to protect 9i Single-Instance Databases with 10g Clusterware
>> following the third-party-tool approach. There are no RAC databases
>> involved. But we want to achieve high availability, as the databases
>> are business-critical systems. We want the systems to be able to
>> relocate to another machine in case of failure to keep downtimes
>> low... To achieve this we want to use OCFS2 for the filesystem.
>> The relocation is done by a script with the help of CRS.
>>
>> So we took two systems (byaz05 and byaz10) and installed the following
>> software: 10g CRS (10.2.0.3), Oracle Software 9.2.0.8 and OCFS2 1.2.8
>>
>> We found the following Metalink notes and adjusted the heartbeat and
>> timeouts for OCFS2:
>>
>> Metalink Note 395878.1: Heartbeat/Voting/Quorum Related Timeout
>> Configuration for Linux, OCFS2, RAC Stack to avoid unnessary node
>> fencing, panic and reboot
>> Metalink Note 391771.1: OCFS2 - FREQUENTLY ASKED QUESTIONS (here, in
>> particular the section on fencing and quorum)
>> Metalink Note 434255.1: Common reasons for OCFS2 Kernel Panic or
>> Reboot Issues
>> Metalink Note 457423.1: OCFS2 Fencing, Network, and Disk Heartbeat
>> Timeout Configuration
>>
>> We have not changed the CRS/CSS default settings so far.
>>
>> During HA testing we observed unexpected behaviour of the system. We
>> deactivated the bond for the private interconnect and expected only one
>> node to go down. Instead, both nodes went down. It seems to me that
>> one node was rebooted by OCFS2 and the other one by CRS/CSS.
>>
>> Timestamp
>> ----------------------------------------------------------------------
>>
>> 10:21:06 bond1 disabled (eth1)
>> */var/log/messages byaz05*
>> Apr 25 10:21:06 byaz05 kernel: bonding: bond1: link status definitely down for interface eth1, disabling it
>> Apr 25 10:21:06 byaz05 kernel: bonding: bond1: making interface eth5 the new active one.
>>
>> 10:21:09 bond1 disabled (eth5)
>> */var/log/messages byaz05*
>> Apr 25 10:21:09 byaz05 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
>> Apr 25 10:21:09 byaz05 kernel: bonding: bond1: now running without any active interface !
>>
>> 10:21:23 o2net - no longer connected
>> */var/log/messages byaz05*
>> Apr 25 10:21:23 byaz05 kernel: o2net: no longer connected to node byaz10.bayer-ag.com (num 1) at 10.190.59.6:7777
>> */var/log/messages byaz10*
>> Apr 25 10:21:23 byaz10 kernel: o2net: no longer connected to node byaz05.bayer-ag.com (num 0) at 10.190.59.5:7777
>>
>> 10:21:27 CSSD failure 134
>> 10:21:29 Reboot initiated by CRS
>> */var/log/messages byaz05*
>> Apr 25 10:21:27 byaz05 logger: Oracle clsomon failed with fatal status 12.
>> Apr 25 10:21:27 byaz05 logger: Oracle CSSD failure 134.
>> Apr 25 10:21:27 byaz05 su(pam_unix)[25839]: session closed for user oracle
>> Apr 25 10:21:27 byaz05 logger: Oracle CRS failure. Rebooting for cluster integrity.
>> Apr 25 10:21:27 byaz05 kernel: md: stopping all md devices.
>> Apr 25 10:21:27 byaz05 kernel: md: md0 switched to read-only mode.
>> Apr 25 10:21:29 byaz05 logger: Oracle CRS failure. Rebooting for cluster integrity.
>> Apr 25 10:21:29 byaz05 kernel: e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
>> Apr 25 10:21:29 byaz05 logger: Oracle init script ceding reboot to sibling 27383.
>>
>> 10:21:58 Reboot initiated by OCFS2(?)
>> */var/log/messages byaz10*
>> Apr 25 10:21:58 byaz10 su(pam_unix)[4595]: session opened for user oracle by (uid=0)
>> Apr 25 10:21:58 byaz10 su(pam_unix)[4595]: session closed for user oracle
>> Apr 25 10:25:58 byaz10 syslogd 1.4.1: restart.
>> Apr 25 10:25:58 byaz10 syslog: syslogd startup succeeded
>> Apr 25 10:25:58 byaz10 kernel: klogd 1.4.1, log source = /proc/kmsg started.
>> Apr 25 10:25:58 byaz10 kernel: Bootdata ok (command line is ro root=/dev/vgroot/_)
>>
>>
>> We have assumed all along that this is a timing problem. But we don't
>> know which settings cause the problem or which steps would correct
>> it. Otherwise we will have to rework the complete concept for the
>> business-critical systems.
>> Can anyone help me?
>>
>> Regards,
>> Alexandra
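
Since the suspicion is a timing problem, it may help to see how many seconds lie between the events in the timeline quoted above. A small Python sketch, assuming the moment bond1 lost its last interface (10:21:09) as the reference point; the event labels are my own paraphrases of the log excerpts, and the helper name is hypothetical:

```python
from datetime import datetime

# Timestamps copied from the quoted timeline; labels are paraphrased.
EVENTS = [
    ("10:21:06", "byaz05: bond1 loses eth1, eth5 becomes active"),
    ("10:21:09", "byaz05: bond1 loses eth5, no active interface"),
    ("10:21:23", "both nodes: o2net no longer connected"),
    ("10:21:27", "byaz05: CSSD failure 134, CRS starts reboot"),
    ("10:21:58", "byaz10: last log entries before the restart"),
]

# Assumption: the interconnect counts as down from this moment on.
REFERENCE = "10:21:09"


def seconds_since(reference, events):
    """Return (offset_in_seconds, label) for each event relative to reference."""
    ref = datetime.strptime(reference, "%H:%M:%S")
    result = []
    for stamp, label in events:
        delta = datetime.strptime(stamp, "%H:%M:%S") - ref
        result.append((int(delta.total_seconds()), label))
    return result


if __name__ == "__main__":
    for offset, label in seconds_since(REFERENCE, EVENTS):
        print(f"{offset:+4d}s  {label}")
```

The offsets only describe what the logs show; they do not say which timeout triggered which reboot.
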
_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
