Hi Karim, Excellent information, many thanks indeed.
Best Regards
John

On Fri, 2009-06-05 at 00:24 +0300, Karim Alkhayer wrote:
> Hi John,
>
> When multiple systems/nodes have access to data via shared storage,
> the integrity of the data depends on inter-node communication ensuring
> that each node is aware when other nodes are writing data. When the
> coordination between the nodes fails, it results in a “split brain”
> condition: a situation in which two servers try to independently
> control the storage, potentially resulting in application failure or
> even corruption of critical data.
>
> I/O fencing is the method of choice (used by vendors' cluster
> frameworks, including OCFS2) for ensuring the integrity of critical
> information by preventing data corruption. It allows a set of systems
> to have temporary registrations with the disk and to coordinate a
> write-exclusive reservation with the disk containing the data. With
> I/O fencing, the cluster system ensures that errant nodes are “fenced”
> and do not have access to the shared storage, while the eligible
> node(s) continue to have access to the data, virtually eliminating the
> risk of data corruption.
>
> The quorum is the group of nodes in a cluster that is allowed to
> operate on the shared storage. When there is a failure in the cluster,
> nodes may be split into groups that can communicate within their
> groups and with the shared storage, but not between groups.
>
> O2QUO determines which group is allowed to continue and initiates
> fencing of the other group(s).
>
> Fencing is the act of forcefully removing a node from a cluster. A
> node with OCFS2 mounted will fence itself when it realizes that it
> does not have quorum in a degraded cluster. It does this so that other
> nodes won’t be stuck trying to access its resources. However, the
> resources do NOT get released.
>
> O2CB uses a node reset mechanism to fence; this, however, is causing
> the machine(s) to hang instead of handing over seamlessly.
> In OCFS2 1.4, Oracle has introduced a new fencing mechanism which no
> longer uses “panic” for fencing. Instead, by default, it uses
> "machine restart".
>
> In your case, taking the network down the way you’ve done is causing
> the servers to hang, including the mounted file system, which becomes
> locked until the OCFS cluster services are restarted.
>
> RAC handover fails due to exactly this problem: the file system is
> locked by another node which was kicked out of the cluster but is
> still occupying the file system.
>
> The healthy node will try to continue to work, but the databases
> hosted on the occupied file system will hang, and possibly the
> machine. At this time there is no solution but to:
>
> - Force a shutdown of the troublesome node(s)
> - Shut down the database processes
> - Restart the OCFS2 services
>
> Network failure resolution can be applied in a situation where you
> have set up network bonding for the interconnects, which is highly
> recommended.
>
> Best regards,
> Karim Alkhayer
>
> -----Original Message-----
> From: ocfs2-users-boun...@oss.oracle.com
> [mailto:ocfs2-users-boun...@oss.oracle.com] On Behalf Of John Murphy
> Sent: Thursday, June 04, 2009 10:15 PM
> To: ocfs2-users@oss.oracle.com
> Subject: [Ocfs2-users] OCFS2 v1.4 hangs
>
> I have four database servers in a high-availability, load-balancing
> configuration. Each machine has a mount to a common data source which
> is an OCFS2 v1.4 file system. While working on three of the servers, I
> restarted the IP network and found afterwards that the fourth machine
> had hung. I could not reboot and could not unmount the ocfs2
> partitions. I am pretty sure this was all caused by my taking down the
> network on all three of the remaining machines. Can anyone shed some
> light on this for me?
>
> Ironically, I have four machines in order to ensure reliability.
>
> TIA
>
> John
>
> --
> John Murphy
> Technical And Managing Director
> MANDAC Ltd
> Kandoy House
> 2 Fairview Strand
> Dublin 3
> p: +353 1 5143001
> m: +353 85 711 6844
> e: john.mur...@mandac.eu
> w: www.mandac.eu
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users@oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
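
The quorum and self-fencing behaviour described in Karim's reply can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a plain majority rule with a tie-break in favour of the group that contains the lowest node number, which may not match what O2QUO actually does in every case. The names has_quorum and nodes_to_fence are hypothetical and are not part of O2QUO, O2CB, or any OCFS2 tool.

```python
from typing import FrozenSet, Iterable, List


def has_quorum(group: FrozenSet[int], all_nodes: FrozenSet[int]) -> bool:
    """Return True if `group` may keep operating on the shared storage.

    Assumed rule (for illustration only): a group keeps quorum if it
    holds more than half of the cluster, or exactly half of the cluster
    including the lowest-numbered node.
    """
    total = len(all_nodes)
    reachable = len(group)
    if 2 * reachable > total:
        return True
    if 2 * reachable == total:
        return min(all_nodes) in group
    return False


def nodes_to_fence(groups: Iterable[FrozenSet[int]],
                   all_nodes: FrozenSet[int]) -> List[int]:
    """List every node that sits in a group without quorum."""
    fenced: List[int] = []
    for group in groups:
        if not has_quorum(group, all_nodes):
            fenced.extend(group)
    return sorted(fenced)


# Hypothetical four-node cluster, numbered 0..3.
cluster = frozenset({0, 1, 2, 3})

# One node loses the interconnect: only that node is fenced.
print(nodes_to_fence([frozenset({0, 1, 2}), frozenset({3})], cluster))

# Even split: the half without the lowest node number is fenced.
print(nodes_to_fence([frozenset({0, 1}), frozenset({2, 3})], cluster))

# Every node isolated: no group has quorum, so all nodes are fenced.
print(nodes_to_fence([frozenset({n}) for n in cluster], cluster))
```

Under this assumed rule, the last case shows why losing the interconnect on most nodes at once can take the whole cluster down rather than just the isolated machines, which is the argument for bonding the interconnects.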