[Ocfs2-users] re: o2hb_do_disk_heartbeat:963 ERROR: Device "sdb1" another node is heartbeating in our slot!

Peter Santos Fri, 16 Mar 2007 12:04:24 -0800

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Folks,


I'm trying to wrap my head around something that happened in our environment.
Basically, we noticed the error in /var/log/messages with no other errors.

"Mar 16 13:38:02 dbo3 kernel: (3712,3):o2hb_do_disk_heartbeat:963 ERROR: Device
"sdb1": another node is heartbeating in our slot!"
Usually there are a number of other errors, but this one was it.

Our RAC cluster is made up of 3 nodes (dbo1, dbo2, dbo3). They use ocfs2 for
the OCR/voting files, but the datafiles are located in ASM. This is suse9
kernel 282.


A while back one of our SAs was trying to install ocfs2 on a couple of Red Hat
machines and didn't properly configure ocfs2 when adding the nodes. I believe
he just copied directories and the /etc/ocfs2/cluster.conf file.
Anyway, when he turned the machines on today, they were still misconfigured,
and I believe that is the cause of the "another node is heartbeating in our
slot" message. Would you agree?

As I mentioned, there are only 3 nodes in our cluster, but the
/etc/ocfs2/cluster.conf file shows 6, and so does the following:
        [EMAIL PROTECTED]:/etc/ocfs2> ls /config/cluster/ocfs2/node/
        dbo1  dbo2  dbo3  dbo4  dbt3  dbt4
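
For cross-checking the two views, here is a rough Python sketch (not an ocfs2 tool, just an illustration). It assumes the usual stanza layout of cluster.conf ("node:" followed by indented "key = value" lines). The IP addresses and node numbers in SAMPLE_CONF are made up, and the helper names parse_cluster_conf and compare are hypothetical.

```python
# Hypothetical sample; real files live at /etc/ocfs2/cluster.conf.
# IP addresses and node numbers below are invented for illustration.
SAMPLE_CONF = """
node:
    ip_port = 7777
    ip_address = 192.168.0.1
    number = 0
    name = dbo1
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 192.168.0.9
    number = 4
    name = dbt3
    cluster = ocfs2

cluster:
    node_count = 2
    name = ocfs2
"""


def parse_cluster_conf(text):
    """Return one dict per 'node:' stanza found in the given text."""
    nodes, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith(":") and "=" not in line:
            # Stanza header: only 'node:' stanzas are collected.
            current = {} if line == "node:" else None
            if current is not None:
                nodes.append(current)
            continue
        if current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return nodes


def compare(conf_text, configfs_names, expected):
    """Report node names that are not part of the expected cluster."""
    conf_names = {n.get("name", "") for n in parse_cluster_conf(conf_text)}
    live = set(configfs_names)
    wanted = set(expected)
    return {
        "stale_in_conf": sorted(conf_names - wanted),
        "stale_in_configfs": sorted(live - wanted),
    }


if __name__ == "__main__":
    listed = ["dbo1", "dbo2", "dbo3", "dbo4", "dbt3", "dbt4"]
    print(compare(SAMPLE_CONF, listed, ["dbo1", "dbo2", "dbo3"]))
```

On a live node, the second argument could come from os.listdir("/config/cluster/ocfs2/node") and the first from reading /etc/ocfs2/cluster.conf. The sketch only reports differences; it does not change anything.
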

So my question is: how do I permanently remove dbt3, dbt4 and dbo4? I checked
out the ocfs2 guide, but it only has information on adding a node to an online
or offline cluster.


More important is how the Oracle clusterware behaved. After this happened,
my ASM and RDBMS instances stayed up. None of the machines rebooted. But the
CRS daemon appears to be having issues.

When I run "crsctl check crs", I get the error "Cannot communicate with CRS"
on all 3 nodes. The cssd log directory has a core file, yet I can log into all
3 database instances as if nothing happened.

I suspect this is a bug?

The CRSD log files reveal some sort of problem writing to the ocr file, which
is on ocfs2. But if there really was a problem, wouldn't ocfs2 have rebooted
the machine? And when RAC has a problem accessing the ocfs2 volume, there are
usually a large number of I/O errors in the system log.


Any insight is greatly appreciated.

- -peter


alertdbo3.log
=============
2007-03-16 13:38:25.471
[crsd(4994)]CRS-1006:The OCR location /ocfs2/oracrs/ocr.crs is inaccessible.
Details in /data/app/crs/oracle/product/10.2.0/crs/log/dbo3/crsd/crsd.log.

2007-03-16 13:38:43.377
[client(13125)]CRS-1006:The OCR location /ocfs2/oracrs/ocr.crs is inaccessible.
Details in /data/app/crs/oracle/product/10.2.0/crs/log/dbo3/client/css.log.


crsd.log
=============
2007-03-16 13:38:11.708: [  OCRCLI][1407371616]proac_set_value: Response 
message returned with failure keyname =
[CRS.CUR.ora!ORACTAH!ORACTAH3!inst.REASON], retcode = 26
2007-03-16 13:38:11.710: [  OCRCLI][1417865568]proac_set_value: Response 
message returned with failure keyname =
[CRS.CUR.ora!dbo3!LISTENER_DBO3!lsnr.REASON], retcode = 26
2007-03-16 13:38:24.159: [  OCRMSG][1407371616]prom_rpc: CLSC recv failure..ret 
code 7
2007-03-16 13:38:24.159: [  OCRMSG][1407371616]prom_rpc: possible OCR retry 
scenario
2007-03-16 13:38:24.159: [ COMMCRS][1417865568]clscsendx: (0xc80100) Physical 
connection (0xc7fa30) not active

2007-03-16 13:38:24.159: [  OCRMSG][1417865568]prom_rpc: CLSC send failure..ret 
code 11
2007-03-16 13:38:24.159: [  OCRMSG][1417865568]prom_rpc: possible OCR retry 
scenario
2007-03-16 13:38:25.036: [  OCRMAS][1182845280]th_master:13: I AM THE NEW OCR 
MASTER at incar 3. Node Number = 3
2007-03-16 13:38:25.046: [  OCRRAW][1182845280]proprioo: for disk 0 
(/ocfs2/oracrs/ocr.crs), id match (1), my id set
(1201294405,1028247821) total id sets (1), 1st set (1201294405,1028247821), 2nd 
set (0,0) my votes (2), total votes (2)
2007-03-16 13:38:25.102: [  OCRRAW][1182845280]rrecover:3: recovery required
2007-03-16 13:38:25.471: [  OCRRAW][1182845280]rtnode:3: invalid tnode 1085
2007-03-16 13:38:25.471: [  OCRRAW][1182845280]propropen:0: could not read 
tnode addrd=0
2007-03-16 13:38:25.471: [  OCRRAW][1182845280]proprseterror: Error in 
accessing physical storage [26] Marking context
invalid.
2007-03-16 13:38:25.471: [  OCRUTL][1182845280]u_freem: INVALID 
PROU_BEGIN_MEMTAG for memory [99351708] Begin tag
[99351170] Expected begin tag [5072426d]
[  OCRMAS][1182845280]th_calc_av:8.1': Error reading key 
[SYSTEM.version.node_numbers.node3]
2007-03-16 13:38:25.471: [  OCRMAS][1182845280]th_master:9: Shutdown
CacheMaster. prev AV [169869824] new calc av
[169869824] my sv [169869824]
2007-03-16 13:38:39.932: [  CRSOCR][1438853472]0OCR api procr_open_key failed
for key CRS.CUR. OCR error code = 3 OCR error msg:
2007-03-16 13:38:39.932: [  CRSOCR][1438853472][PANIC]0Failed to open key:
CRS.CUR(File: caaocr.cpp, line: 472)
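
To line these entries up against the 13:38:02 heartbeat error, a small Python sketch along the following lines might help. It assumes the "timestamp: [COMPONENT][thread]message" layout seen above and treats any line without a leading timestamp as a continuation of the previous entry. The name parse_crsd is hypothetical, and the sample lines are copied from the excerpt above.

```python
import re
from datetime import datetime

# Matches e.g. "2007-03-16 13:38:25.102: [  OCRRAW][1182845280]rrecover:3: ..."
STAMP = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}): \[\s*(\w+)\]\[(\d+)\](.*)$"
)

SAMPLE = [
    "2007-03-16 13:38:25.102: [  OCRRAW][1182845280]rrecover:3: recovery required",
    "2007-03-16 13:38:25.471: [  OCRRAW][1182845280]propropen:0: could not read ",
    "tnode addrd=0",
]


def parse_crsd(lines):
    """Group wrapped crsd.log lines into timestamped events."""
    events = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = STAMP.match(line)
        if match:
            events.append({
                "time": datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"),
                "component": match.group(2),
                "thread": match.group(3),
                "message": match.group(4).strip(),
            })
        elif events and line.strip():
            # No timestamp: assume this is the wrapped tail of the last entry.
            events[-1]["message"] += " " + line.strip()
    return events


if __name__ == "__main__":
    for event in parse_crsd(SAMPLE):
        print(event["time"].time(), event["component"], event["message"])
```
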


* The cssd directory has a core file, but nothing in the ocssd.log file.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFF+vg0oyy5QBCjoT0RAkemAJ9NSS2e9gndC62WErJlgr82aAwuZwCgjfk8
xFtWactcUf2LcoUKLexmaPQ=
=Av6M
-----END PGP SIGNATURE-----

_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
