[Ocfs2-users] server reboots due to heartbeat error - ocfs node

Khosrof Ohanian Thu, 28 Jun 2012 12:24:26 -0700

I have been trying to submit this message to the group for 3 days now.  I hope 
it works this time around.  Any help would be appreciated.


I am troubleshooting an issue with my development RAC server where node 1 will 
reboot due to a heartbeat timeout.  I am getting multiple errors from ocfs2 and 
wondering if my configuration is wrong or if my procedure is wrong.  The issue 
comes up when I am cloning the OCFS lun for the production system to a separate 
lun used for development.  I take standard precautions, yet node 1 on the 
two-node development cluster still reboots.

System design:
2 ocfs2 clusters (prod and development)
production = 3 nodes, development = 2 nodes
All systems are running Oracle Linux Server release 5.8
kernel:
[root@node1 ~]# uname -a
Linux node1.xxxxx.com 2.6.18-308.1.1.0.1.el5 #1 SMP Wed Mar 7 11:39:17 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
OCFS release ocfs2-2.6.18-308.1.1.0.1.el5-1.4.9-1.el5

Cluster node setup on the prod and development clusters:
Unique cluster names for each.
Nodes in the production cluster are numbered 1,2,3
Nodes in the development cluster are numbered 1,2
QUESTION 1:  Should the node numbers be unique across the two clusters, since I 
am cloning a LUN between them?  See the error below about another node 
heartbeating in our slot.

Procedure:
Shut down all processes using the luns to be cloned.
Unmount the luns to be cloned on both development nodes (devnode1, devnode2).
Synchronize the clone to the production lun on the SAN (EMC CLARiiON), then 
fracture the luns.
On devnode2 I run the following commands:
fsck.ocfs2 (fix errors)
tunefs.ocfs2 --label=dev.index /dev/path (set new label)
tunefs.ocfs2 --uuid-reset /dev/path (set random uuid)
On devnode1 I run:
sfdisk -R /dev/path (re-read partitions to grab new label and uuid)
Then I mount the volumes on both nodes.
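
For reference, the order of the steps above can be written down as a dry run.  
This is only a sketch: it prints the commands instead of executing them, the 
function name print_refresh_plan is made up for illustration, and /dev/path and 
dev.index are the placeholders already used above.  The SAN and mount steps 
appear only as comments because their exact commands are not given here.

```shell
#!/bin/sh
# Dry-run sketch: prints the refresh steps in order, runs nothing.
# DEV and LABEL default to the placeholders used in the procedure.
DEV="${DEV:-/dev/path}"
LABEL="${LABEL:-dev.index}"

# print_refresh_plan DEVICE LABEL (hypothetical helper name)
print_refresh_plan() {
    dev="$1"
    label="$2"
    echo "# on devnode1 and devnode2: stop processes, then"
    echo "umount $dev"
    echo "# on the SAN: synchronize the clone, then fracture it"
    echo "# on devnode2:"
    echo "fsck.ocfs2 $dev"
    echo "tunefs.ocfs2 --label=$label $dev"
    echo "tunefs.ocfs2 --uuid-reset $dev"
    echo "# on devnode1:"
    echo "sfdisk -R $dev"
    echo "# on both nodes: mount the volume again"
}

print_refresh_plan "$DEV" "$LABEL"
```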

Errors:
Both nodes constantly report the same error on the cloned luns.
kernel: (o2hb-D34207AE9F,4086,16):o2hb_do_disk_heartbeat:781 ERROR: Device "emcpowerj1": another node is heartbeating in our slot!
However, the error above does not cause any instability.
After I unmount the luns and start the clone, I get the following error for a 
few minutes:
(MpxTestDaemon  ,14513,8):o2hb_bio_end_io:241 ERROR: IO Error -5
kernel: (o2hb-023EBFE1B5,3945,8):o2hb_do_disk_heartbeat:772 ERROR: status = -5
After 3 minutes the system reboots, and the logs show the following:
kernel: (events/8,70,8):o2hb_write_timeout:176 ERROR: Heartbeat write timeout to device emcpowerl1 after 150000 milliseconds
(events/8,70,8):o2hb_stop_all_regions:2026 ERROR: stopping heartbeat on all active regions.
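
As a side note on the 150000 millisecond figure: assuming the usual o2cb 
relation, where the write timeout is (O2CB_HEARTBEAT_THRESHOLD - 1) times a 
2 second heartbeat interval, the value in the log would correspond to a 
threshold of 76.  The arithmetic can be sketched as below; the function names 
are made up for illustration, and the 2000 ms interval is an assumption worth 
checking against the installed version.

```shell
#!/bin/sh
# Interval between disk heartbeats in milliseconds (assumed to be 2000).
HB_INTERVAL_MS=2000

# Write timeout in ms for a given O2CB_HEARTBEAT_THRESHOLD value.
hb_timeout_ms() {
    echo $(( ($1 - 1) * HB_INTERVAL_MS ))
}

# Threshold value that yields a given write timeout in ms.
hb_threshold_for_ms() {
    echo $(( $1 / HB_INTERVAL_MS + 1 ))
}

hb_threshold_for_ms 150000   # prints 76
hb_timeout_ms 31             # prints 60000
```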

QUESTION 2:  Do I need to stop the heartbeat on the unmounted luns before the 
SAN unpresents them from the server?  I found a command as follows:
ocfs2_hb_ctl -K -d /dev/device
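
Whether that command is needed at all is the open question.  If it is, a guard 
along the following lines could make sure it is only ever suggested for a 
device that is no longer mounted.  This is a sketch under assumptions: 
is_mounted and plan_hb_stop are hypothetical helper names, the check relies on 
the device appearing by the same name in /proc/mounts, and the command is 
printed rather than executed.

```shell
#!/bin/sh
# is_mounted DEVICE [MOUNTS_FILE]
# Succeeds if DEVICE appears as a mount source in MOUNTS_FILE.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# plan_hb_stop DEVICE [MOUNTS_FILE]
# Prints (does not run) the heartbeat stop command, and refuses if the
# device is still mounted or the mounts file cannot be read.
plan_hb_stop() {
    mounts="${2:-/proc/mounts}"
    if [ ! -r "$mounts" ]; then
        echo "cannot read $mounts" >&2
        return 2
    fi
    if is_mounted "$1" "$mounts"; then
        echo "refusing: $1 is still mounted" >&2
        return 1
    fi
    echo "ocfs2_hb_ctl -K -d $1"
}
```

Calling plan_hb_stop /dev/device on a node prints the command only once the 
device no longer shows up as mounted there.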

QUESTION 3:  Am I doing anything else wrong in my procedure that would be 
causing the heartbeat issue and server reboot?

Thanks in advance for your time and replies.

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users
