Hi Jeff,
Jeffrey Altman wrote:
AFS is a very stressful application for a file system. If there are
bugs in the SAN AFS would be more likely to find them than other
applications.
*grin* Try telling that to my management! I just sent an email calling
AFS an excellent network and storage diagnostic. We used to track
reports of AFS issues versus resolved issues that turned out to be
infrastructure related. I forget the exact figures, but by far AFS was
not implicated, only complaining. The effort to get away from AFS is
renewed!
========================================
For the record, here's what I've been experiencing. The worst of the
experience, as detailed below, was the impact on the creation of move
and release clones, but not backup clones.
AFS IMPACT
We were running 1.4.1 with some patches. (Upgrading to 1.4.6 has
been part of what so far looks like a definitive fix for the 9585
issues.)
The primary difference between 1.4.1 and 1.4.6 is the bundling of
FSync calls which would significantly reduce the load on the
underlying file system. (Robert Banz gave a good description of
the impact.) If this change is permitting the SAN to perform its
operations with a reduced incident rate, that would imply that
there is still a problem in the SAN (or the connections between the
host machine and the SAN) but it is not being tickled (as often).
Agreed.
The worst of the six-month stretch occurred when the primary and
secondary controller roles (9585 only thus far) were reversed as a
consequence of SAN fabric rebuilds. For whatever reason, the time
required to create volume clones for AFS 'vos release' and 'vos move'
(using 'vos status' to audit clone time) increased from a typical
several seconds to minutes, ten minutes, and in one case four hours.
The RW volume is of course unwritable during the clone operation.
My conclusion:
The secondary controller, the cabling, or something else along
that data path is defective.
Thanks for the confirmation.
That's the growing conclusion of various vendors as well. We appear to
be replacing the SAN fabric piece by piece, sucked along in the
slipstream of "maybe this will work." Which is fine, but it's taken a
month thus far, and I'm refusing to use the SAN until the timeout errors
stop.
'vos remove' on afflicted partitions was also affected; removing a
volume took noticeably longer than usual.
I don't know why the creation of .backup clones was not similarly
affected. For a given volume the create time/refresh time for a move
clone or release clone might have been fifteen minutes, while the
.backup clone created quickly and took only slightly longer than usual.
The data is not copied for a .backup until the data actually changes.
So I should have seen the same cloning behavior if I'd used 'backup
-force' (or whatever it is) or removed the .backup and then run 'vos
backup'. I'll check my notes but don't recall documenting this. Pretty
sure I tried, pretty sure I saw what you'd expect.
Is the code base for cloning then shared, as I speculated? (If you
know offhand. I believe it is but haven't checked.)
With 'vos move' out of the picture, I moved volumes with dump/restore.
For volumes not frequently or recently updated that was enough; for the
rest I followed the dump/restore with a synchronization tool, Unison,
to bring the new RW volume up to date, then changed the mount point to
the name of the new volume, then waited until the previous RW volume no
longer showed any updates for a few days.
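For anyone who wants to try the same workaround, it looks roughly like
the sketch below. All volume names, servers, partitions, and mount
points here are placeholders (not my actual setup), and the flags
should be checked against your AFS release before use:

```shell
# 1. Full dump of the RW volume, restored under a new name elsewhere.
vos dump -id vol.home -time 0 -file /var/tmp/vol.home.dump
vos restore -server fs2 -partition /vicepa -name vol.home.new \
    -file /var/tmp/vol.home.dump

# 2. For volumes still being updated, catch up changes made since the
#    dump with Unison (skip for quiet volumes).
unison /afs/.example.com/home.old /afs/.example.com/home.new -batch

# 3. Repoint the mount point at the new volume.
fs rmmount /afs/.example.com/home
fs mkmount /afs/.example.com/home vol.home.new

# 4. Watch the old volume for a few days, then remove it.
vos examine vol.home
vos remove -server fs1 -partition /vicepa -id vol.home
```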
(If anyone is interested in Unison let me know. I'm thinking of
talking about it at Best Practices this year.)
The deadline for submissions is approaching fast. Please submit your
talk.
Blzorp! Thanks. Completely forgot.
The USP continues to spew SCSI command timeouts.
Bad controller? Bad cable? Bad disk?
SCSI command timeouts are at a level far below AFS. If an AFS service
requests a disk operation and that operation results in SCSI command
timeouts, there is something seriously wrong somewhere between the
SCSI controller and the disk.
No wonder you are getting lousy performance.
No kidding. It's been miserable trying to support AFS with unstable
storage.
I'm seeing SCSI command timeouts and UFS log timeouts (on vice
partitions using the SAN for storage) on LUNs used for vicep's on the
Hitachi USP, and was seeing them also on the 9585 until a recent
configuration change.
UFS log timeouts are more evidence that the problem is somewhere
between UFS and the disk.
At first I thought this was load related, so I wrote scripts to
generate a goodly load. It turns out that even with a one second
sleep between file create/write/close operations and between rm
operations the SCSI command timeouts still occur, and that it's not
load but simply activity that turns up the timeouts.
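A minimal stand-in for that kind of probe script (not my actual one):
it does a create/write/close, sleeps a second, removes the file, sleeps
again, and repeats. Even this trickle of activity was enough to turn up
the timeouts:

```shell
#!/bin/sh
# Hypothetical activity probe: a slow drip of create/write/close and rm
# operations with a one-second pause between each, demonstrating that
# mere activity, not load, triggers the SCSI command timeouts.
dir=${1:-/tmp/vicep-probe.$$}   # point this at a SAN-backed vice partition
count=${2:-3}                   # number of create/remove cycles
mkdir -p "$dir"
i=0
while [ "$i" -lt "$count" ]; do
    # create/write/close a small file
    dd if=/dev/zero of="$dir/probe.$i" bs=1024 count=64 2>/dev/null
    sleep 1
    # remove it
    rm -f "$dir/probe.$i"
    sleep 1
    i=$((i + 1))
done
```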
And I bet the SAN admins are telling you that there is nothing wrong.
They are badly mistaken.
LOL! They're telling me that they're only seeing these issues on AFS
file servers. (Except for a 'few instances' elsewhere, so "AFS
obviously has a problem.")
Thanks Jeff.
More fuel for looking at the SAN.
Kim
Jeffrey Altman
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info