Hi Jeff,

Jeffrey Altman wrote:

AFS is a very stressful application for a file system.  If there are
bugs in the SAN AFS would be more likely to find them than other
applications.

*grin* Try telling that to my management! I just sent an email calling AFS an excellent network and storage diagnostic. We used to track reports of AFS issues against the resolved issues that turned out to be infrastructure related. I forget the exact figures, but in by far most cases AFS was not the culprit, only the one complaining. The effort to get away from AFS is renewed!

========================================
For the record, here's what I've been experiencing. The worst of the experience, as detailed below, was the impact on the creation of move and release clones, but not backup clones.

AFS IMPACT

We were running 1.4.1 with some patches. (Upgrading to 1.4.6 has been part of a thus far definitive fix for the 9585 issues.)

The primary difference between 1.4.1 and 1.4.6 is the bundling of
FSync calls which would significantly reduce the load on the
underlying file system.  (Robert Banz gave a good description of
the impact.)  If this change is permitting the SAN to perform its
operations with a reduced incident rate, that would imply that
there is still a problem in the SAN (or the connections between the
host machine and the SAN) but it is not being tickled (as often.)

Agreed.

The worst of the six month stretch occurred when the primary and secondary controller roles (9585 only thus far) were reversed as a consequence of SAN fabric rebuilds. For whatever reason, the time required to create volume clones for AFS 'vos release' and 'vos move' (using 'vos status' to audit clone time) increased from a typical several seconds to minutes, ten minutes, and in one case four hours. The RW volume is of course unwritable during the clone operation.

My conclusion:
The secondary controller, the cabling, or something else along
that data path is defective.

Thanks for the confirmation.

That's the growing conclusion of various vendors as well. We appear to be replacing the SAN fabric piece by piece, sucked along in the slipstream of "maybe this will work." Which is fine, but it's taken a month thus far, and I'm refusing to use the SAN until the timeout errors stop.

'vos remove' times on afflicted partitions were also affected, with increased time required to remove a volume.

I don't know why the creation of .backup clones was not similarly affected. For a given volume the create time/refresh time for a move clone or release clone might have been fifteen minutes, while the .backup clone created quickly and took only slightly longer than usual.

The data is not copied for a .backup until the data actually changes.

So I should have seen the same cloning behavior if I'd used 'backup -force' (or whatever it is) or removed the .backup and then run vos backup. I'll check my notes but don't recall documenting this. Pretty sure I tried, pretty sure I saw what you'd expect.

Is the code base for cloning then shared, as I speculated? (If you know offhand. I believe it is but haven't checked.)

With 'vos move' out of the picture I moved volumes by other means. Volumes not frequently or recently updated went by dump/restore alone. For the rest I used dump/restore followed by a synchronization tool, Unison, to create a new RW volume, then changed the mount point to point to the name of the new volume, then waited until the previous RW volume no longer showed any updates for a few days.
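
For anyone wanting to script the same workaround, here is a rough sketch of the command sequence, written as a Python function that only builds the command lines and does not run them. The function name and every argument (volume names, dump file, server, partition, mount directory) are hypothetical placeholders, and the Unison step is left as a comment because how it is invoked depends on where the old and new volumes are mounted. Treat it as an outline under those assumptions, not as the exact procedure used here.

```python
def plan_manual_move(volume, new_volume, dump_file,
                     dest_server, dest_partition, mount_dir):
    """Return the command lines for a dump/restore based volume move.

    Nothing is executed; all argument values are placeholders supplied
    by the caller.
    """
    return [
        # Dump the existing volume to a file.
        ["vos", "dump", "-id", volume, "-file", dump_file],
        # Restore the dump as a new RW volume on the destination.
        ["vos", "restore", "-server", dest_server,
         "-partition", dest_partition,
         "-name", new_volume, "-file", dump_file],
        # (For busy volumes, a Unison sync of old to new would run here.)
        # Repoint the mount point at the new volume name.
        ["fs", "rmmount", "-dir", mount_dir],
        ["fs", "mkmount", "-dir", mount_dir, "-vol", new_volume],
    ]
```

Printing the returned list gives something to review before anything touches a file server; the old RW volume stays in place until it has shown no updates for a while.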

(If anyone is interested in Unison let me know. I'm thinking of talking about it at Best Practices this year.)

The deadline for submissions is approaching fast.  Please submit your
talk.

Blzorp!  Thanks.  Completely forgot.

The USP continues to spew SCSI command timeouts.

Bad controller?  Bad cable?  Bad disk?

SCSI command timeouts are at a level far below AFS.  If an AFS service
requests a disk operation and that operation results in SCSI command
timeouts, there is something seriously wrong somewhere between the
SCSI controller and the disk.

No wonder you are getting lousy performance.

No kidding. It's been miserable trying to support AFS with unstable storage.

I'm seeing SCSI command timeouts and UFS log timeouts on LUNs used for vice partitions on the Hitachi USP, and was also seeing them on the 9585 until a recent configuration change.

UFS log timeouts are more evidence that the problem is somewhere
between UFS and the disk.

At first I thought this was load related, so I wrote scripts to generate a goodly load. It turns out that even with a one-second sleep between file create/write/close operations and between rm operations the SCSI command timeouts still occur, so it's not load but simply activity that turns up the timeouts.
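
A minimal sketch of that kind of low-rate activity test, in Python, looks something like the following. It is not the actual script; the function name, file naming, payload size, and the fsync call are all assumptions made for illustration. It creates, writes, and closes a handful of files with a pause after each, then removes them with the same pause.

```python
import os
import time


def exercise_partition(directory, count=5, pause=1.0, payload=b"x" * 4096):
    """Create/write/close `count` files, then remove them, pausing between ops.

    Returns the number of files that were created and removed.
    """
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"loadtest-{i:04d}.dat")
        with open(path, "wb") as fh:
            fh.write(payload)
            fh.flush()
            # Assumption: fsync so the write reaches the device instead of
            # sitting in the page cache.
            os.fsync(fh.fileno())
        paths.append(path)
        time.sleep(pause)  # idle gap between create/write/close operations
    for path in paths:
        os.remove(path)
        time.sleep(pause)  # idle gap between rm operations
    return len(paths)
```

Pointing `directory` at a scratch directory on the suspect LUN and watching the system log while it runs is enough to tell whether timeouts show up at this trickle of activity.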

And I bet the SAN admins are telling you that there is nothing wrong.
They are badly mistaken.

LOL! They're telling me that they're only seeing these issues on AFS file servers. (Except for a 'few instances' elsewhere, so "AFS obviously has a problem.")

Thanks Jeff.

More fuel for looking at the SAN.

Kim


Jeffrey Altman
