Greetings,

I have a 4-node Oracle RAC cluster sharing four OCFS2 v1.2 filesystems on RHEL5. Node 3 was taken down for maintenance and rebooted several times. During this time, the networking stack on the cluster interconnect had issues (after changing to an active-backup bonding mode) and suffered high packet loss, resulting in timeouts connecting to the cluster. After the networking changes were reverted (putting the bonding mode back to active-active) and the server was rebooted, I can join the cluster but can only mount 3 of the 4 OCFS2 filesystems:

[EMAIL PROTECTED] /]# mount /dev/mapper/limsp_archp1
mount.ocfs2: Unknown code B 0 while mounting /dev/mapper/limsp_archp1 on
/var/opt/oracle/oradata/limsp/arch. Check 'dmesg' for more information
on this error.
dmesg reports:

(17909,1):dlm_join_domain:1301 Timed out joining dlm domain 980E9BC11D2C458B9BC8BEACC1365CAC after 90400 msecs
ocfs2: Unmounting device (253,19) on (node 3)

The other nodes do not report anything for this filesystem during the failed join, but I do see successful domain joins for the other OCFS2 filesystems. I can ping the interconnect IPs between all 4 servers. I have rebooted several times and restarted the entire cluster stack, to no avail. The problem has persisted for the last 18 hours.

My initial thought is that there is a DLM resource lock that cannot be released, but I'm not exactly sure how to fix it (rebooting the other nodes is not the best option, as this is a high-availability production environment). I've tried to use the debugfs tools mentioned in the FAQ/User Guides, but they're very confusing and I'm not sure what I need to look for. I can see the disk device just fine on the server, and can browse the filesystem using ocfs2console; I just cannot join the domain to mount it. I would appreciate any advice anyone may have.

My details are:

[EMAIL PROTECTED] /]# uname -a
Linux ausracdb04.austin.ppdi.com 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:19 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

[EMAIL PROTECTED] /]# rpm -qa | grep -i ocfs2
ocfs2-2.6.18-53.el5-1.2.8-2.el5
ocfs2console-1.2.7-2.el5
ocfs2-tools-1.2.7-2.el5

[EMAIL PROTECTED] /]# cat /etc/ocfs2/cluster.conf
node:
        ip_port = 7777
        ip_address = 192.168.0.100
        number = 0
        name = ausracdb01
        cluster = racdb

node:
        ip_port = 7777
        ip_address = 192.168.0.101
        number = 1
        name = ausracdb02
        cluster = racdb

node:
        ip_port = 7777
        ip_address = 192.168.0.102
        number = 2
        name = ausracdb03
        cluster = racdb

node:
        ip_port = 7777
        ip_address = 192.168.0.106
        number = 3
        name = ausracdb04
        cluster = racdb

cluster:
        node_count = 4
        name = racdb

[EMAIL PROTECTED] /]# cat /etc/sysconfig/o2cb
#
# This is a configuration file for automatic startup of the O2CB
# driver. It is generated by running /etc/init.d/o2cb configure.
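For reference, the 32-hex-digit dlm domain name in that dmesg line is the filesystem UUID, so it can be pulled out and matched back to a device. A quick sketch (assuming GNU grep and ocfs2-tools' mounted.ocfs2; the dmesg line below is just the one quoted above):

```shell
# Extract the 32-hex-digit dlm domain name (the filesystem UUID)
# from the dmesg line quoted above.
dmesg_line="(17909,1):dlm_join_domain:1301 Timed out joining dlm domain 980E9BC11D2C458B9BC8BEACC1365CAC after 90400 msecs"
uuid=$(printf '%s\n' "$dmesg_line" | grep -o '[0-9A-F]\{32\}')
echo "$uuid"

# On a cluster node, this UUID could then be matched to its block
# device with mounted.ocfs2 -d (part of ocfs2-tools), e.g.:
#   mounted.ocfs2 -d | grep -i "$uuid"
```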
# Please use that method to modify this file
#

# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=racdb

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=61

# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
O2CB_IDLE_TIMEOUT_MS=60000

# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
O2CB_KEEPALIVE_DELAY_MS=

# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
O2CB_RECONNECT_DELAY_MS=

[EMAIL PROTECTED] /]# echo "stat " | debugfs.ocfs2 -n /dev/mapper/limsp_archp1
        Inode: 5   Mode: 0775   Generation: 1066067688 (0x3f8ae6e8)
        FS Generation: 1066067688 (0x3f8ae6e8)
        Type: Directory   Attr: 0x0   Flags: Valid System
        User: 503 (oracle)   Group: 505 (dba)   Size: 40960
        Links: 4   Clusters: 10
        ctime: 0x48e635d4 -- Fri Oct  3 10:10:12 2008
        atime: 0x48627838 -- Wed Jun 25 11:54:16 2008
        mtime: 0x48e635d4 -- Fri Oct  3 10:10:12 2008
        dtime: 0x0 -- Wed Dec 31 18:00:00 1969
        ctime_nsec: 0x3ad5b3d6 -- 987083734
        atime_nsec: 0x00000000 -- 0
        mtime_nsec: 0x3ad5b3d6 -- 987083734
        Last Extblk: 0
        Sub Alloc Slot: Global   Sub Alloc Bit: 1
        Tree Depth: 0   Count: 243   Next Free Rec: 10
        ## Offset   Clusters   Block#
        0  0        1          207
        1  1        1          485268
        2  2        1          2096789
        3  3        1          751454
        4  4        1          1782521
        5  5        1          2144728
        6  6        1          2145932
        7  7        1          1784169
        8  8        1          1601861
        9  9        1          2446400

[EMAIL PROTECTED] /]# echo "slotmap" | debugfs.ocfs2 -n /dev/mapper/limsp_archp1
        Slot#  Node#
            0      0
            1      1
            2      2

Slotmap for another filesystem that is correctly joined and mounted:

[EMAIL PROTECTED] /]# echo "slotmap" | debugfs.ocfs2 -n /dev/mapper/ph1pp1
        Slot#  Node#
            0      0
            1      1
            2      2
            3      3

I don't know if this is the correct command to look for "busy" locks
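Since the failed filesystem's slotmap only shows three entries, one thing worth checking is the slot count in the superblock. A small helper for that (an assumption on my part: that debugfs.ocfs2's "stats" output in the 1.2 tools prints a "Max Node Slots" line):

```shell
# slots_of reads debugfs.ocfs2 "stats" output on stdin and prints the
# "Max Node Slots" value (assumption: the 1.2 tools print that label).
slots_of() { awk -F': *' '/Max Node Slots/ { print $2 }' ; }

# On the box itself one would run:
#   echo "stats" | debugfs.ocfs2 -n /dev/mapper/limsp_archp1 | slots_of
# Sample run against canned output, to show the behaviour:
printf 'Label: arch\nMax Node Slots: 4\n' | slots_of
```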
(done from another node):

[EMAIL PROTECTED] ~]# echo "fs_locks" | debugfs.ocfs2 -n /dev/mapper/limsp_archp1 | grep -i busy
[EMAIL PROTECTED] ~]#

TIA,
Daniel
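To make that busy-lock check less error-prone when repeating it on each node, the grep can be wrapped in a tiny counting filter (a sketch, nothing OCFS2-specific; the "Busy" flag wording in the sample line is my assumption about fs_locks output):

```shell
# count_busy: count lines flagged busy in fs_locks output fed on stdin.
count_busy() { grep -c -i 'busy' ; }

# On each node one would run:
#   echo "fs_locks" | debugfs.ocfs2 -n /dev/mapper/limsp_archp1 | count_busy
# Canned sample to show the behaviour (one busy lockres out of two):
printf 'Lockres: M0001 Mode: PR Flags: Busy\nLockres: N0002 Mode: EX Flags:\n' | count_busy
```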
_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users