IT IS NOT NORMAL. Something is wrong with your
storage, FC switch, or cards. Why, when you shut down one node, does the
second node experience I/O errors?
----- Original Message -----
Sent: Thursday, September 21, 2006 2:56 PM
Subject: [Ocfs2-users] ocfs2 fencing on reboot of 2nd node
I'm performing some testing with ocfs2 on 2 nodes with Red Hat AS4 Update 4
(x86_64) and multipath (included in the 2.6 kernel), and am running into some
issues when cleanly rebooting the 2nd node while the 1st node is still up.
So if I do the following on the 2nd node, the 1st node
does not fence itself:
/etc/init.d/ocfs2 stop
/etc/init.d/o2cb stop
wait more than 60 seconds
init 6
I get the following on the 1st node, but
everything is fine:
Sep 21 21:44:49 bbflgrid11 kernel: SCSI error : <0 0 0 12> return code = 0x20000
Sep 21 21:44:49 bbflgrid11 kernel: end_request: I/O error, dev sdm, sector 192785
Sep 21 21:44:49 bbflgrid11 kernel: device-mapper: dm-multipath: Failing path 8:192.
Sep 21 21:44:49 bbflgrid11 kernel: SCSI error : <0 0 0 14> return code = 0x20000
Sep 21 21:44:49 bbflgrid11 kernel: end_request: I/O error, dev sdo, sector 193297
Sep 21 21:44:49 bbflgrid11 kernel: device-mapper: dm-multipath: Failing path 8:224.
Sep 21 21:44:49 bbflgrid11 kernel: SCSI error : <0 0 0 13> return code = 0x20000
Sep 21 21:44:49 bbflgrid11 kernel: end_request: I/O error, dev sdn, sector 192785
Sep 21 21:44:49 bbflgrid11 kernel: device-mapper: dm-multipath: Failing path 8:208.
Sep 21 21:44:49 bbflgrid11 multipathd: 8:192: mark as failed
Sep 21 21:44:49 bbflgrid11 multipathd: mpath1: remaining active paths: 1
Sep 21 21:44:49 bbflgrid11 multipathd: 8:224: mark as failed
Sep 21 21:44:49 bbflgrid11 multipathd: mpath3: remaining active paths: 1
Sep 21 21:44:49 bbflgrid11 multipathd: 8:208: mark as failed
Sep 21 21:44:49 bbflgrid11 multipathd: mpath2: remaining active paths: 1
Sep 21 21:44:58 bbflgrid11 multipathd: 8:192: readsector0 checker reports path is up
Sep 21 21:44:58 bbflgrid11 multipathd: 8:192: reinstated
Sep 21 21:44:58 bbflgrid11 multipathd: mpath1: remaining active paths: 2
Sep 21 21:44:58 bbflgrid11 multipathd: 8:208: readsector0 checker reports path is up
Sep 21 21:44:58 bbflgrid11 multipathd: 8:208: reinstated
Sep 21 21:44:58 bbflgrid11 multipathd: mpath2: remaining active paths: 2
Sep 21 21:44:58 bbflgrid11 multipathd: 8:224: readsector0 checker reports path is up
Sep 21 21:44:58 bbflgrid11 multipathd: 8:224: reinstated
Sep 21 21:44:58 bbflgrid11 multipathd: mpath3: remaining active paths: 2
Sep 21 21:46:06 bbflgrid11 kernel: SCSI error : <1 0 0 11> return code = 0x20000
Sep 21 21:46:06 bbflgrid11 kernel: end_request: I/O error, dev sdaa, sector 1920
Sep 21 21:46:06 bbflgrid11 kernel: device-mapper: dm-multipath: Failing path 65:160.
Sep 21 21:46:06 bbflgrid11 multipathd: 65:160: mark as failed
Sep 21 21:46:06 bbflgrid11 multipathd: mpath0: remaining active paths: 1
Sep 21 21:46:06 bbflgrid11 multipathd: 65:160: readsector0 checker reports path is up
Sep 21 21:46:06 bbflgrid11 multipathd: 65:160: reinstated
Sep 21 21:46:06 bbflgrid11 multipathd: mpath0: remaining active paths: 2
Now if I do the following on the 2nd node, the 1st node fences itself (same
as above, except don't wait 60 seconds after o2cb stop):
/etc/init.d/ocfs2 stop
/etc/init.d/o2cb stop
init 6
Node 1 logs the following and fences itself. I have to power cycle the server
to get it back; it doesn't reboot or shut down, it just hangs:
Sep 21 21:28:00 bbflgrid11 kernel: SCSI error : <0 0 0 13> return code = 0x20000
Sep 21 21:28:00 bbflgrid11 kernel: end_request: I/O error, dev sdn, sector 192785
Sep 21 21:28:00 bbflgrid11 kernel: device-mapper: dm-multipath: Failing path 8:208.
Sep 21 21:28:00 bbflgrid11 multipathd: 8:208: mark as failed
Sep 21 21:28:00 bbflgrid11 multipathd: mpath2: remaining active paths: 1
Sep 21 21:28:00 bbflgrid11 kernel: SCSI error : <1 0 0 12> return code = 0x20000
Sep 21 21:28:00 bbflgrid11 kernel: end_request: I/O error, dev sdab, sector 192784
Sep 21 21:28:00 bbflgrid11 kernel: end_request: I/O error, dev sdab, sector 192786
Sep 21 21:28:00 bbflgrid11 kernel: device-mapper: dm-multipath: Failing path 65:176.
Sep 21 21:28:00 bbflgrid11 kernel: SCSI error : <1 0 0 13> return code = 0x20000
Sep 21 21:28:00 bbflgrid11 kernel: end_request: I/O error, dev sdac, sector 192785
Sep 21 21:28:00 bbflgrid11 kernel: device-mapper: dm-multipath: Failing path 65:192.
Sep 21 21:28:00 bbflgrid11 multipathd: 65:176: mark as failed
Sep 21 21:28:00 bbflgrid11 multipathd: mpath1: remaining active paths: 1
Sep 21 21:28:01 bbflgrid11 multipathd: 65:192: mark as failed
Sep 21 21:28:01 bbflgrid11 multipathd: mpath2: remaining active paths: 0
Sep 21 21:28:01 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:01 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:01 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:01 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:01 bbflgrid11 multipathd: 65:176: readsector0 checker reports path is up
Sep 21 21:28:01 bbflgrid11 multipathd: 65:176: reinstated
Sep 21 21:28:01 bbflgrid11 multipathd: mpath1: remaining active paths: 2
Sep 21 21:28:03 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:03 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:03 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:03 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:05 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:05 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:05 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:05 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:07 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:07 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:07 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:07 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:09 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:09 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:09 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5
Sep 21 21:28:09 bbflgrid11 kernel: (4912,1):o2hb_do_disk_heartbeat:973 ERROR: status = -5
Sep 21 21:28:09 bbflgrid11 multipathd: 8:208: readsector0 checker reports path is up
Sep 21 21:28:09 bbflgrid11 multipathd: 8:208: reinstated
Sep 21 21:28:09 bbflgrid11 multipathd: mpath2: remaining active paths: 1
Sep 21 21:28:10 bbflgrid11 multipathd: 65:192: readsector0 checker reports path is up
Sep 21 21:28:10 bbflgrid11 multipathd: 65:192: reinstated
Sep 21 21:28:10 bbflgrid11 multipathd: mpath2: remaining active paths: 2
...
Index 14: took 0 ms to do submit_bio for read
Index 15: took 0 ms to do waiting for read completion
(11,1):o2hb_stop_all_regions:1908 ERROR: stopping heartbeat on all active regions
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing
It seems that if I wait for node 1 to heartbeat to node 2, with o2cb down,
before rebooting, it's fine. But if I reboot before node 1 has had a chance
to heartbeat to node 2, with o2cb down, it panics.
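
One difference between the two logs above: in the first, every multipath map keeps at least one active path, while in the second mpath2 reports "remaining active paths: 0" at 21:28:01, right before the o2hb I/O errors begin. A rough sketch for spotting that pattern in a syslog excerpt follows. The function names and the regular expression are illustrative assumptions written against the line format shown in this message, not part of multipath-tools or ocfs2.

```python
import re

# Matches multipathd lines in the format seen in the logs above, e.g.
#   "Sep 21 21:28:01 bbflgrid11 multipathd: mpath2: remaining active paths: 0"
# (The pattern is an assumption based only on those sample lines.)
PATHS_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+multipathd:\s+"
    r"(?P<map>\S+):\s+remaining active paths:\s+(?P<count>\d+)"
)


def min_active_paths(lines):
    """Return {map_name: (lowest_count, timestamp_when_first_reached)}."""
    lowest = {}
    for line in lines:
        m = PATHS_RE.match(line.strip())
        if not m:
            continue  # kernel lines, path state changes, etc.
        name, count = m.group("map"), int(m.group("count"))
        if name not in lowest or count < lowest[name][0]:
            lowest[name] = (count, m.group("ts"))
    return lowest


def maps_that_lost_all_paths(lines):
    """Return the sorted names of maps that ever reported 0 active paths."""
    return sorted(
        name for name, (count, _) in min_active_paths(lines).items() if count == 0
    )


# A few lines copied from the second (fencing) log above.
sample = [
    "Sep 21 21:28:00 bbflgrid11 multipathd: mpath2: remaining active paths: 1",
    "Sep 21 21:28:00 bbflgrid11 multipathd: mpath1: remaining active paths: 1",
    "Sep 21 21:28:01 bbflgrid11 multipathd: 65:192: mark as failed",
    "Sep 21 21:28:01 bbflgrid11 multipathd: mpath2: remaining active paths: 0",
    "Sep 21 21:28:01 bbflgrid11 kernel: (4912,1):o2hb_bio_end_io:331 ERROR: IO Error -5",
    "Sep 21 21:28:01 bbflgrid11 multipathd: mpath1: remaining active paths: 2",
    "Sep 21 21:28:09 bbflgrid11 multipathd: mpath2: remaining active paths: 1",
]

if __name__ == "__main__":
    print(maps_that_lost_all_paths(sample))
```

Running the same functions over the first log should return an empty list, since no map there drops below 1. This only shows where the two runs differ; it does not by itself explain why the paths fail when the other node reboots.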
Shawn E. Ruff
Senior Oracle DBA
Fiberlink Communications