On Sep 14, 2009, at 10:25 PM, "McCulloch, Alan" <alan.mccull...@agresearch.co.nz> wrote:
Hi all,
Thanks for the responses.
After being dropped to the
# Filesystem repair
prompt (on account of “inode 27344909 has illegal blocks”) following a
warm reboot (via “reboot”), having found the SAN filesystem in read-only
mode yesterday morning (possibly because of an HBA fault on the SAN),
I ran
fsck -r /data
(Linux version 2.6.18-92.1.18.el5, Red Hat 4.1.2-42, ext3 filesystem).
This took a couple of hours or so, prompting me for various changes,
all of which I accepted. This appeared to complete OK, but then the
system would not boot, with the following error from the qla2xxx
driver:
.
.
qla2xxx 0000:05:0d.0: Mailbox command timeout occurred. Scheduling ISP abort.
qla2xxx 0000:05:0d.0: Mailbox command timeout occurred. Scheduling ISP abort.
.
etc
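For anyone wanting to see how often these timeouts show up in a saved
boot log, here is a minimal sketch in Python. The regular expression is
modelled only on the two log lines quoted above, so other qla2xxx
messages may not match; the function name count_mailbox_timeouts is
made up for illustration.

```python
import re
from collections import Counter

# Pattern based only on the log lines quoted in this thread (an assumption);
# it captures the PCI address that follows the driver name.
TIMEOUT_RE = re.compile(
    r"qla2xxx (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): "
    r"Mailbox command timeout occurred"
)


def count_mailbox_timeouts(log_text):
    """Return a Counter mapping PCI address to number of timeout messages."""
    return Counter(m.group("pci") for m in TIMEOUT_RE.finditer(log_text))
```

Feeding it the text of a saved log (for example the output of dmesg)
gives a per-HBA count, which may help tell a one-off glitch from a path
that keeps timing out.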
However, after powering down the system and cold-booting, the system
was able to boot up and mount the repaired filesystem without any
obvious damage, but with abnormal (not to mention scary-looking) boot
messages and ongoing warnings from multipath.
This morning (as I sort of expected) the filesystem had dropped back
down to read-only mode, but meanwhile the source of our woes was
identified: a fibre port on the SAN controller that was degraded but
not completely failed. As a result there had been no clean failover to
the twin controller, and a degraded virtual device was presented to
the O/S, with consequences for the filesystem.
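A filesystem that has quietly gone read-only can be spotted without
waiting for applications to complain. The following is a rough sketch
in Python that parses text in /proc/mounts format and lists ext3 mount
points carrying the "ro" option. It assumes the kernel reflects the
error remount in /proc/mounts; the function name read_only_mounts and
its fstypes parameter are made up for illustration.

```python
def read_only_mounts(mounts_text, fstypes=("ext3",)):
    """Return mount points of the given filesystem types mounted read-only.

    mounts_text is expected in /proc/mounts layout:
    device mount_point fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fstype, options = fields[1], fields[2], fields[3]
        # Options are comma-separated; match "ro" exactly, not as a substring.
        if fstype in fstypes and "ro" in options.split(","):
            result.append(mount_point)
    return result
```

Calling it on the contents of /proc/mounts, say from a cron job, would
return something like ["/data"] in the state described above and an
empty list when everything is mounted read-write.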
After that port and controller were quarantined, I did a cold
power-off reboot of the server, and this time there was a more
normal-looking boot and the filesystem came up normally without any
repair being requested.
(My hypothesis is that in this situation, i.e. when an ext3 filesystem
has put itself in read-only mode, a warm boot via reboot does not
cleanly remount the filesystem and apply the journal quite like a cold
power-off reboot does. I think it is likely that the lengthy session
of me answering “yes” to fsck’s interactive repair the first time
around simply applied all of the fixes that would automatically have
been done from the journal had I cold-rebooted in the first place.
However, that is only a hunch. But I will be making sure to do cold
power-off reboots in general in future.)
Another lesson is that a sophisticated system of twin SAN controllers
with failover does not protect against a situation where a device is
degrading rather than failing completely.
Thanks again for the responses, and sorry if my questions were a bit
basic, but I have been dropped in a little out of my depth with this
system.
I always prefer round-robin multipath over fail-over where possible,
as a degraded or failed path simply is not used. There is also the
twice-the-bandwidth factor when both paths are working, which is nice.
-Ross
_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos