-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I'm not really sure how these other servers were set up. I believe disk images were used.
Now I seem to have a bigger problem. I restarted one of my nodes to see if I could clear up this mess, and now the restarted node won't mount the ocfs2 partitions, so the RAC cluster doesn't come up.

I restarted node3:

dbo3:~ # /etc/init.d/ocfs2 start
Starting Oracle Cluster File System (OCFS2) ocfs2_hb_ctl: Bad magic number in superblock while reading uuid
mount.ocfs2: Error when attempting to run /sbin/ocfs2_hb_ctl: "Operation not permitted"
ocfs2_hb_ctl: Bad magic number in superblock while reading uuid
mount.ocfs2: Error when attempting to run /sbin/ocfs2_hb_ctl: "Operation not permitted"

Both node1 and node2 have the /ocfs2 cluster partition mounted, but the mounted.ocfs2 -d command only shows /backups, which is also mounted on node1 and node2.

Any ideas on how I can get around this "Bad magic number in superblock" problem?

dbo1 and dbo2
====================================
dbo1:/ocfs2 # mount -t ocfs2
/dev/sdb1 on /ocfs2 type ocfs2 (rw,_netdev,datavolume,nointr,heartbeat=local)
/dev/sdb2 on /backups type ocfs2 (rw,_netdev,datavolume,nointr,heartbeat=local)

dbo1:~ # mounted.ocfs2 -d
Device      FS     UUID                                  Label
/dev/sdb2   ocfs2  f35379a7-07a7-4e87-b766-5ee42f595fbf  /backups

dbo2:/ocfs2/oracrs # mount -t ocfs2
/dev/sdb1 on /ocfs2 type ocfs2 (rw,_netdev,datavolume,nointr,heartbeat=local)
/dev/sdb2 on /backups type ocfs2 (rw,_netdev,datavolume,nointr,heartbeat=local)

dbo2:/ocfs2/oracrs # mounted.ocfs2 -d
Device      FS     UUID                                  Label
/dev/sdb2   ocfs2  f35379a7-07a7-4e87-b766-5ee42f595fbf  /backups

- -peter

Alexei_Roudnev wrote:
> Btw, upgrade kernel to #283; 282 had a serious bug in OCFSv2 (relating to
> simultaneous appends to the file).
>
> Another story - try to keep CRS and CSS files out of OCFSv2. The reason is that
> by keeping CRS files on OCFS, you de facto keep
> one cluster (CRS) depending on another (OCFS), which can influence CRS
> decisions in a faulty situation.
>
> (It's usually simple to create 2 more partitions or LUNs for OCRFile and
> CSSFile - 102MB and 22MB each).
>
> As for your case - these experiments could really have broken the heartbeat (did
> you allow access to the same disks from these new
> experimental servers?)
>
>
> ----- Original Message -----
> From: "Peter Santos" <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Friday, March 16, 2007 1:04 PM
> Subject: [Ocfs2-users] re: o2hb_do_disk_heartbeat:963 ERROR: Device "sdb1" another node is heartbeating in our slot!
>
>
> Folks,
>
> I'm trying to wrap my head around something that happened in our environment.
> Basically, we noticed the error in /var/log/messages with no other errors.
>
> "Mar 16 13:38:02 dbo3 kernel: (3712,3):o2hb_do_disk_heartbeat:963 ERROR: Device "sdb1": another node is heartbeating in our slot!"
> Usually there are a number of other errors, but this one was it.
>
> Our RAC cluster is made up of 3 nodes (dbo1, dbo2, dbo3) and they use ocfs2 for the ocr/voting file, but
> ASM is where the datafiles are located. This is suse9 kernel 282.
>
>
> A while back one of our SAs was trying to install ocfs2 on a couple of red-hat machines, and didn't properly
> configure ocfs2 to add the nodes. I believe he just copied directories and the /etc/ocfs2/cluster.conf file.
> Anyway, when he turned the machines on today, they were still misconfigured and I believe that is the
> cause of the "another node is heartbeating in our slot" error message. Would you agree?
> As I mentioned there are only 3 nodes in our cluster, but the /etc/cluster.conf file shows 6 and so does the
> following:
> [EMAIL PROTECTED]:/etc/ocfs2> ls /config/cluster/ocfs2/node/
> dbo1 dbo2 dbo3 dbo4 dbt3 dbt4
>
> So my question is, how do I permanently remove dbt3, dbt4 and dbo4? I checked out the ocfs2 guide, but it only
> has information on adding a node to both an online/offline cluster.
>
>
> More important is how the Oracle clusterware behaved. After this happened, my ASM and RDBMS instances stayed
> up. None of the machines rebooted. But the CRS daemon appears to be having issues.
> When I run "crsctl check crs" on all 3 nodes, I get the error "Cannot communicate with CRS" on all 3 nodes.
> The cssd log directory has a core file, yet I can log into all 3 database instances as if nothing happened.
> I suspect this is a bug?
>
> The CRSD log files reveal some sort of issue relating to problems writing to the ocr file, which is on ocfs2. But
> if there really was a problem, wouldn't ocfs2 have rebooted the machine? And when RAC has a problem accessing the ocfs2
> volume, there are usually a large number of io errors in the system log.
>
>
> Any insight is greatly appreciated.
>
> -peter
>
>
> alertdbo3.log
> =============
> 2007-03-16 13:38:25.471
> [crsd(4994)]CRS-1006:The OCR location /ocfs2/oracrs/ocr.crs is inaccessible. Details in /data/app/crs/oracle/product/10.2.0/crs/log/dbo3/crsd/crsd.log.
>
> 2007-03-16 13:38:43.377
> [client(13125)]CRS-1006:The OCR location /ocfs2/oracrs/ocr.crs is inaccessible. Details in /data/app/crs/oracle/product/10.2.0/crs/log/dbo3/client/css.log.
>
>
> crsd.log
> =============
> 2007-03-16 13:38:11.708: [ OCRCLI][1407371616]proac_set_value: Response message returned with failure keyname = [CRS.CUR.ora!ORACTAH!ORACTAH3!inst.REASON], retcode = 26
> 2007-03-16 13:38:11.710: [ OCRCLI][1417865568]proac_set_value: Response message returned with failure keyname = [CRS.CUR.ora!dbo3!LISTENER_DBO3!lsnr.REASON], retcode = 26
> 2007-03-16 13:38:24.159: [ OCRMSG][1407371616]prom_rpc: CLSC recv failure..ret code 7
> 2007-03-16 13:38:24.159: [ OCRMSG][1407371616]prom_rpc: possible OCR retry scenario
> 2007-03-16 13:38:24.159: [ COMMCRS][1417865568]clscsendx: (0xc80100) Physical connection (0xc7fa30) not active
> 2007-03-16 13:38:24.159: [ OCRMSG][1417865568]prom_rpc: CLSC send failure..ret code 11
> 2007-03-16 13:38:24.159: [ OCRMSG][1417865568]prom_rpc: possible OCR retry scenario
> 2007-03-16 13:38:25.036: [ OCRMAS][1182845280]th_master:13: I AM THE NEW OCR MASTER at incar 3. Node Number = 3
> 2007-03-16 13:38:25.046: [ OCRRAW][1182845280]proprioo: for disk 0 (/ocfs2/oracrs/ocr.crs), id match (1), my id set (1201294405,1028247821) total id sets (1), 1st set (1201294405,1028247821), 2nd set (0,0) my votes (2), total votes (2)
> 2007-03-16 13:38:25.102: [ OCRRAW][1182845280]rrecover:3: recovery required
> 2007-03-16 13:38:25.471: [ OCRRAW][1182845280]rtnode:3: invalid tnode 1085
> 2007-03-16 13:38:25.471: [ OCRRAW][1182845280]propropen:0: could not read tnode addrd=0
> 2007-03-16 13:38:25.471: [ OCRRAW][1182845280]proprseterror: Error in accessing physical storage [26] Marking context invalid.
> 2007-03-16 13:38:25.471: [ OCRUTL][1182845280]u_freem: INVALID PROU_BEGIN_MEMTAG for memory [99351708] Begin tag [99351170] Expected begin tag [5072426d]
> [ OCRMAS][1182845280]th_calc_av:8.1': Error reading key [SYSTEM.version.node_numbers.node3]
> 2007-03-16 13:38:25.471: [ OCRMAS][1182845280]th_master:9: Shutdown CacheMaster. prev AV [169869824] new calc av [169869824] my sv [169869824]
> 2007-03-16 13:38:39.932: [ CRSOCR][1438853472]0OCR api procr_open_key failed for key CRS.CUR. OCR error code = 3 OCR error msg:
> 2007-03-16 13:38:39.932: [ CRSOCR][1438853472][PANIC]0Failed to open key: CRS.CUR(File: caaocr.cpp, line: 472)
>
> * The cssd directory has a core file, but nothing in the ocssd.log file.
>
> _______________________________________________
> Ocfs2-users mailing list
> [email protected]
> http://oss.oracle.com/mailman/listinfo/ocfs2-users

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFF+wSRoyy5QBCjoT0RArwrAKCZSL8PckEtKv2g7gsHazL9eUWjVgCdHM2H
KjTEYZL/nxXn+UbMDCvETVI=
=eX2O
-----END PGP SIGNATURE-----
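
On the question in the quoted message about cluster.conf listing six nodes when the cluster only has three: a rough way to audit a copied cluster.conf is to list its node stanzas and flag names that are not expected, plus node numbers that appear more than once. The Python sketch below assumes the usual stanza layout (a "node:" or "cluster:" header followed by indented "key = value" lines). The sample text, IP addresses and node numbers are invented for illustration, and the helper names parse_cluster_conf and audit are hypothetical. It only reads text; it does not remove nodes or touch a running cluster, and whether a duplicated node number is what actually happened in this thread is not established.

```python
from collections import defaultdict

# Hypothetical sample: node names follow the thread, but the numbers and
# addresses are made up, and the duplicate number is deliberate.
SAMPLE = """\
node:
    ip_port = 7777
    ip_address = 192.0.2.11
    number = 0
    name = dbo1
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 192.0.2.12
    number = 1
    name = dbo2
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 192.0.2.13
    number = 2
    name = dbo3
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 192.0.2.14
    number = 2
    name = dbo4
    cluster = ocfs2

cluster:
    node_count = 4
    name = ocfs2
"""


def parse_cluster_conf(text):
    """Split the text into stanzas: a 'header:' line, then 'key = value' lines."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            current = {"_type": line[:-1]}
            stanzas.append(current)
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return stanzas


def audit(stanzas, expected_names):
    """Return (unexpected node names, node numbers used by more than one node)."""
    nodes = [s for s in stanzas if s["_type"] == "node"]
    by_number = defaultdict(list)
    for node in nodes:
        by_number[node.get("number")].append(node.get("name"))
    unexpected = sorted(n.get("name") for n in nodes
                        if n.get("name") not in expected_names)
    duplicates = {num: names for num, names in by_number.items()
                  if len(names) > 1}
    return unexpected, duplicates


if __name__ == "__main__":
    unexpected, duplicates = audit(parse_cluster_conf(SAMPLE),
                                   {"dbo1", "dbo2", "dbo3"})
    print("unexpected nodes:", unexpected)
    print("duplicate node numbers:", duplicates)
```

To check a real file, read it with open(path).read() and pass the text to parse_cluster_conf in place of SAMPLE, with the set of node names you actually expect.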
