The fencing is because the I/O write took more than 2 minutes. Since you have provided only a snippet of the logs, all I can say is that multipathd detecting the path failure and o2hb fencing are about 90 seconds apart. I don't see the "barely timed out" bit.

Check your multipath settings/configuration.
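As a rough sketch of the multipath knobs that control how long I/O can stall while a controller fails over (the vendor/product strings and values below are guesses for an EVA8000, not tested settings, so check them against HP's recommendations for the array):

# /etc/multipath.conf (illustrative only)
defaults {
        polling_interval        5
}
devices {
        device {
                vendor                  "HP"
                product                 "HSV2.*"        # assumed match for the EVA8000
                path_grouping_policy    group_by_prio
                path_checker            tur
                failback                immediate
                no_path_retry           12              # queue I/O ~60s if all paths drop, then fail instead of queueing forever
        }
}

On your question 3: the 120000 ms is not arbitrary. o2hb stamps the heartbeat every 2 seconds, so the timeout works out to (dead threshold - 1) * 2000 ms; your threshold of 61 gives (61 - 1) * 2000 = 120000 ms. If your worst-case failover window can exceed that, raise the threshold on every node, e.g. in /etc/sysconfig/o2cb (takes effect after the cluster stack is restarted):

# /etc/sysconfig/o2cb (value is an example only; all nodes must match)
O2CB_HEARTBEAT_THRESHOLD=61     # current: (61 - 1) * 2000 ms = 120000 ms
# e.g. 91 -> 180000 ms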
Daniel Keisling wrote:
> Greetings,
>
> I have 3 Oracle RAC clusters running on OCFS2 attached to an HP
> EVA8000 SAN with dm-multipath as my multipath provider. On Saturday,
> one of the EVA8000 controllers (active/active) rebooted. Out of my 3
> different clusters, at least one node each cluster rebooted with the
> following messages:
>
> Sep 20 01:19:38 ausracdb02 kernel: rport-0:0-5: blocked FC remote
> port time out: saving binding
> Sep 20 01:19:38 ausracdb02 kernel: lpfc 0000:0e:00.0: 0:0203 Devloss
> timeout on WWPN 50:0:1f:e1:50:b:32:88 NPort xb0c00 Data: x2000008 x7 x7
> Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code
> = 0x00010000
> Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds,
> sector 58618847
> Sep 20 01:19:38 ausracdb02 kernel: device-mapper: multipath: Failing
> path 65:32.
> Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code
> = 0x00010000
> Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds,
> sector 58618847
> Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code
> = 0x00010000
> Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds,
> sector 58618847
> Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code
> = 0x00020008
> Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds,
> sector 22227855
> Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code
> = 0x00020008
> Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds,
> sector 1735
> Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code
> = 0x00020008
> Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds,
> sector 58618943
> Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code
> = 0x00020008
> Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds,
> sector 58619935
> Sep 20 01:19:38 ausracdb02 kernel: rport-1:0-2: blocked FC remote
> port time out: saving binding
> Sep 20 01:19:38 ausracdb02 multipathd: sdaw: tur checker reports path
> is down
> Sep 20 01:19:38 ausracdb02 multipathd: checker failed path 67:0 in map
> limsp
> Sep 20 01:19:38 ausracdb02 kernel: lpfc 0000:0e:00.1: 1:0203 Devloss
> timeout on WWPN 50:0:1f:e1:50:b:32:89 NPort x150d00 Data: x2000008 x7 x6
> Sep 20 01:19:38 ausracdb02 kernel: sd 1:0:0:2: SCSI error: return code
> = 0x00010000
> Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sdap,
> sector 32776295
> <snip>
> Sep 20 01:21:03 ausracdb02 kernel: (27,1):o2hb_write_timeout:269
> ERROR: Heartbeat write timeout to device dm-11 after 120000 milliseconds
> Sep 20 01:21:03 ausracdb02 kernel: Heartbeat thread (27) printing last
> 24 blocking operations (cur = 19):
> Sep 20 01:21:03 ausracdb02 kernel: Heartbeat thread stuck at waiting
> for read completion, stuffing current time into that blocker (index 19)
> Sep 20 01:21:03 ausracdb02 kernel: Index 20: took 0 ms to do
> submit_bio for read
> Sep 20 01:21:03 ausracdb02 kernel: Index 21: took 0 ms to do waiting
> for read completion
> Sep 20 01:21:03 ausracdb02 kernel: Index 22: took 0 ms to do bio alloc
> write
> Sep 20 01:21:03 ausracdb02 kernel: Index 23: took 0 ms to do bio add
> page write
>
> It appears that the heartbeat thread just barely timed out, as the
> controller was in the process of coming back up. My questions are:
>
> 1) Why did only some nodes in each cluster reboot?
> 2) Why was there a timeout when the multipathing should have kept the
> filesystems up?
> 3) Is there a way to increase the heartbeat timeout above 120000
> milliseconds?
>
> My config:
>
> Kernel: 2.6.18-53.el5 x86_64 on RHEL 5.1
>
> OCFS2: ocfs2-2.6.18-53.el5-1.2.8-2.el5
>
> O2CB:
> [EMAIL PROTECTED] ~]# /etc/init.d/o2cb status
> Module "configfs": Loaded
> Filesystem "configfs": Mounted
> Module "ocfs2_nodemanager": Loaded
> Module "ocfs2_dlm": Loaded
> Module "ocfs2_dlmfs": Loaded
> Filesystem "ocfs2_dlmfs": Mounted
> Checking O2CB cluster racdb: Online
> Heartbeat dead threshold: 61
> Network idle timeout: 60000
> Network keepalive delay: 2000
> Network reconnect delay: 2000
> Checking O2CB heartbeat: Active
>
> Cluster1 has two nodes, Cluster2 has two nodes, and Cluster3 has four
> nodes. Cluster1 and Cluster2 had one node reboot while Cluster3 had
> two nodes reboot.
>
>
> TIA,
>
> Daniel

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users