> 1. What is your disk heartbeat timeout? If you are unsure,
> "cat /etc/sysconfig/o2cb".

31
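
For reference, a small sketch of how that number maps to seconds. It assumes the 31 above is the O2CB_HEARTBEAT_THRESHOLD value from /etc/sysconfig/o2cb, and that the usual OCFS2 rule applies: one disk heartbeat every 2 seconds, so timeout = (threshold - 1) * 2. The function names are my own, purely for illustration; they are not part of any o2cb tool.

```python
import re
from typing import Optional

# Assumption: o2cb writes one disk heartbeat every 2 seconds.
HEARTBEAT_INTERVAL_SECS = 2


def threshold_to_timeout_secs(threshold: int) -> int:
    """Convert an O2CB_HEARTBEAT_THRESHOLD value to a timeout in seconds."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def timeout_secs_to_threshold(timeout_secs: int) -> int:
    """Inverse mapping: the threshold needed for a desired timeout."""
    if timeout_secs < HEARTBEAT_INTERVAL_SECS:
        raise ValueError("timeout must be at least one heartbeat interval")
    return timeout_secs // HEARTBEAT_INTERVAL_SECS + 1


def parse_threshold(config_text: str) -> Optional[int]:
    """Hypothetical helper: pull the threshold out of the o2cb sysconfig text."""
    match = re.search(
        r'^\s*O2CB_HEARTBEAT_THRESHOLD\s*=\s*"?(\d+)',
        config_text,
        re.MULTILINE,
    )
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    sample = "O2CB_ENABLED=true\nO2CB_HEARTBEAT_THRESHOLD=31\n"
    threshold = parse_threshold(sample)
    print(threshold_to_timeout_secs(threshold))  # prints 60
```

Under those assumptions, 31 works out to 60 seconds, which is in the same range as the storage and network recovery times I list under question 6.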
>
> 2. What is your shared disk setup like? Fiber Channel, iscsi, AoE, etc.
> Provide as much detail as you can.

iSCSI on a NetApp cluster, software initiator. Tested on FibreChannel as well. System is SLES9 SP3.

>
> 3. Are you using some sort of multipathing? If so, provide details.

Embedded iSCSI multi-port support. I can test on FC and system multipath.

>
> 4. What is the cluster used for? Oracle database, mailserver, etc.

Oracle - archive logs and backups ONLY.
Other cluster (testing) - application binaries and configurations.

>
> 5. How many nodes in your cluster?

3 (2 RAC + 1 backup server)
2

>
> 6. Any other relevant information?

SAN convergence time is:
- on NetApp - 1 minute
- on Ethernet - 50 seconds
- on FibreChannel network - 1 minute (timeouts on HDS Solaris multipath, for example)

Network switch reboot time - about 40 seconds.

Events:
- rebooting one server - no problems.
- power outage (10 seconds) on the network switches, which caused both interfaces to go down - all servers in all clusters rebooted (by OCFSv2, 1 by Oracle CSS).
- problems noticed:
  * when I used the cluster for document storage (I tested it), CPU usage was high during heavy IO operations; after testing I decided to use a heartbeat cluster + ReiserFS instead.
  * when my Oracle server locked up memory (on a spinlock) so that the system froze for 30 seconds, it resulted in a damaged OCFS (1 time fatal, and 1 time repairable).
  * since we began to use OCFSv2 for low-IO file systems only, no big problems except fencing even when the system has no pending IO on it.

Wishes:
- clustered lvm2 (not evms - evms is too complicated and is really heavy overhead for 90% of tasks);
- online resize (at least if we have 1 node left in the system);
- multi-interface heartbeat;
- self-fencing ONLY if the system has pending IO (configurable);
- if the OCFSv2 cluster sees that ALL servers around cannot run the heartbeat (disk IO delay), there is no need to self-fence any of them until at least one can run the heartbeat on disk again.
  For now, if all servers lose access to the disk, they all (except 1) reboot; in reality, if they can see each other, they don't need to reboot because they can classify the failure as GLOBAL.
- emergency local mount mode.

>
> Again, feel free to mail me directly.
>
> Thanks
> Sunil
>
> _______________________________________________
> Ocfs2-users mailing list
> [email protected]
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>
