by Chris Snook (from Red Hat Magazine)


When ext3 encounters possible corruption in filesystem metadata, it
aborts the journal and remounts the filesystem read-only to prevent
further damage to the metadata on disk. This can also be triggered by
I/O errors while reading metadata, even if the metadata on disk is
intact.
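
To confirm which filesystems are affected, it can help to list the ext3
filesystems that the kernel currently has mounted read-only. The
following is a minimal sketch, not part of the original article: the
function name list_ro_ext3 is hypothetical, and it assumes the standard
six-field layout of /proc/mounts.

```shell
# list_ro_ext3 [MOUNTS_FILE]
# Print the mount point of every ext3 filesystem whose mount options
# include "ro". Reads /proc/mounts unless another file is given.
list_ro_ext3() {
    mounts="${1:-/proc/mounts}"
    # Fields: device, mount point, fs type, options, dump, pass
    awk '$3 == "ext3" {
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $2; break }
    }' "$mounts"
}
```

Running list_ro_ext3 with no argument inspects the live system. A
filesystem that is listed here but is not marked "ro" in /etc/fstab was
most likely remounted by the kernel after an error.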

If filesystems on multiple disk arrays or accessed by multiple clients
are repeatedly becoming read-only in a SAN environment, the most common
cause is a SCSI timeout while the Fibre Channel HBA driver is handling
an RSCN event on the Fibre Channel fabric.

An RSCN (Registered State Change Notification) is generated whenever the
configuration of a Fibre Channel fabric changes, and is propagated to
any HBA that shares a zone with the device that changed state. RSCNs may
be generated when an HBA, switch, or LUN is added or removed, or when
the zoning of the fabric is changed.

Resolution:

Some cases of this behavior may be due to a known bug in the interaction
between NFS and ext3. Users experiencing this problem on NFS servers
should therefore update their kernel to at least version
2.6.9-42.0.2.EL. See the related Bugzilla entry:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=199172

The lpfc driver update in Red Hat Enterprise Linux 4 Update 4 includes a
change to RSCN handling which prevents this problem in many
environments. Users of Emulex HBAs experiencing this problem are advised
to update their kernel to at least version 2.6.9-42.EL. See the related
Bugzilla entry:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=179752

The lpfc and qla2xxx drivers also have configuration options which cause
the driver to handle RSCNs in a less invasive manner, which often
prevents timeouts during RSCN handling. These options must be set in the
/etc/modprobe.conf file:

options lpfc lpfc_use_adisc=1
options qla2xxx ql2xprocessrscn=1

After making these changes, the initrd must be rebuilt and the system
must be rebooted for the changes to take effect.
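
Before rebuilding the initrd, it is worth verifying that the option line
for the installed driver is actually active in the configuration file.
The sketch below is an illustration only: the helper name has_option is
hypothetical, and the mkinitrd invocation in the comment is the form
commonly used on Red Hat Enterprise Linux 4, which should be checked
against the local system before use.

```shell
# has_option FILE MODULE PARAM=VALUE
# Succeed if FILE contains an uncommented "options MODULE ..." line
# that sets PARAM=VALUE, possibly alongside other parameters.
has_option() {
    grep -Eq "^[[:space:]]*options[[:space:]]+$2[[:space:]]+(.*[[:space:]])?$3([[:space:]]|\$)" "$1"
}

# Typical use on a system with an Emulex HBA:
#   has_option /etc/modprobe.conf lpfc lpfc_use_adisc=1 || echo "not set"
#
# Rebuilding the initrd for the running kernel (run as root, after
# backing up the existing image). Left commented out because it
# overwrites a file in /boot:
#   mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)
```

Only the line for the driver that is actually in use is needed, so a
missing qla2xxx option on an Emulex-only system is not a problem.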

Recommendation:

This problem may be prevented or mitigated by applying SAN vendor
recommended configurations and firmware updates to HBAs, switches, and
disk arrays on the fabric, as well as recommended configurations and
updates to multipathing software. This particularly applies to timeout
and retry settings.
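
When comparing a system against vendor recommendations, a first step is
to see which SCSI command timeouts are currently in effect. The
following sketch assumes a 2.6 kernel that exposes the per-device
timeout (in seconds) at /sys/block/sdX/device/timeout. The function name
show_scsi_timeouts is hypothetical, and the appropriate values must come
from the SAN and multipathing vendors, not from this example.

```shell
# show_scsi_timeouts [SYSFS_ROOT]
# Print "<device> <timeout>" for every sd* block device that exposes a
# SCSI command timeout. SYSFS_ROOT defaults to /sys.
show_scsi_timeouts() {
    sysfs="${1:-/sys}"
    for f in "$sysfs"/block/sd*/device/timeout; do
        [ -r "$f" ] || continue       # glob matched nothing readable
        dev=${f#"$sysfs"/block/}      # strip the leading path
        dev=${dev%%/*}                # keep only the device name
        echo "$dev $(cat "$f")"
    done
}
```

Running show_scsi_timeouts with no argument prints one line per SCSI
disk, which makes it easy to spot paths that were left at a default
value.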

The architecture of Fibre Channel assumes that the fabric changes
infrequently, so RSCNs can be disruptive even on properly configured
fabrics. Events that generate RSCNs should be minimized, particularly
at times of high activity, because RSCN handling takes longer on a busy
fabric than on a mostly idle one.

In multipathed environments with separate fabrics for different paths,
zone changes to the fabrics should be made far apart in time. It is not
uncommon for complete handling of a zone change to take many minutes on
a busy fabric with many systems and LUNs. Performing zone changes
separately minimizes the risk of all paths timing out due to RSCN
handling.
