I've got a bunch of questions.  Even if you only have time to answer a
few of them, it will help us to narrow down the root cause.

On 2/10/07, Derek Harkness <[EMAIL PROTECTED]> wrote:
> I'm attempting to deploy/update a new AFS fileserver.  The new server is the
> first to be upgraded from Debian sarge, OpenAFS 1.3.xx, kernel 2.4 to Etch,
> 2.6.18, AFS 1.4.2, reiserfs and a new 7 terabyte XRaid.
>
> The upgrade went fine except that file writes to the new system are so slow
> the system is unusable.  On the server iostat shows a transfer rate of
> ~40KB/s and an iowait of 20 during AFS operations.  If I stop the fileserver and

First and foremost, do local volume package operations (e.g. the
salvager, vos backup, fileserver startup/shutdown, etc) run slowly, or
is it only stuff that involves Rx?  What about vos dump foo localhost
on the ailing fileserver?  The fact that iowait is going through the
roof may be indicative of an io subsystem problem, so eliminating
network/Rx problems at the top of the decision tree will be useful.
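To make that comparison quantitative, it may help to wall-clock each operation the same way. A minimal timing helper, sketched in Python (the `vos dump` argv in the docstring is only illustrative; substitute your own volume and server names):

```python
import subprocess, time

def time_cmd(argv):
    """Run a command, discard its output, and return elapsed seconds.

    Illustrative example (adjust volume/server names to your cell):
        time_cmd(["vos", "dump", "foo", "-server", "localhost",
                  "-file", "/dev/null"])
    """
    t0 = time.monotonic()
    subprocess.run(argv, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=True)
    return time.monotonic() - t0
```

Comparing the localhost dump time against the same dump pulled over the network helps separate the io subsystem from Rx.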

I'm not familiar with the Linux iostat utility, but if it supports
per-disk stats similar to the -x option on Solaris, or the -D
option on AIX, then please post some data while the problem is
occurring.
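For reference, Linux's `iostat -x` (from the sysstat package) gives roughly the Solaris-style per-device numbers, and the raw counters behind it live in /proc/diskstats. A sketch of pulling out the interesting fields (field order per the kernel's iostats documentation; the sample line below is fabricated):

```python
# Per-device counter names, in /proc/diskstats column order (fields 4-14).
FIELDS = ["reads", "reads_merged", "sectors_read", "ms_reading",
          "writes", "writes_merged", "sectors_written", "ms_writing",
          "in_flight", "ms_io", "weighted_ms_io"]

def parse_diskstats(text):
    """Parse /proc/diskstats-formatted text into {device: {field: value}}."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 14:
            continue
        stats[parts[2]] = dict(zip(FIELDS, map(int, parts[3:14])))
    return stats

# Fabricated sample line, just to show the shape of the output.
sample = "   8       0 sda 1200 30 96000 4500 800 10 64000 90000 2 9000 94500"
sda = parse_diskstats(sample)["sda"]
```

Sampling twice and dividing the delta in ms_io by the wall-clock interval gives a %busy figure, much like iostat's %util.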


> perform io directly on the XRaid I can read and write between
> 100MB/s-500MB/s.


A single fibre channel port (excepting 10Gb E-ports) can't transmit
500MB/s.  From what I've heard, Apple's FC RAID products only provide
a single 2Gb SFP per controller, and don't support FC multipathing.
So, you're limited to a max theoretical of ~203MB/s (less in AL mode).
Thus, I'm guessing your tests are, at least in some cases, only
stressing the page cache, rather than anything across the fabric (for
that matter, is there a fabric?).  In order to declare the storage
subsystem OK, we need to be sure you've tested every layer of the
storage stack.

Please tell us specifically what you did to verify "direct" io.  For example:

* Were you running some well-known benchmark suite?  If so, what
options did you pass?
* Did it involve one file or many?
* Were any fsync()s issued?
* Did it modify any filesystem metadata, or only file data?
* Was it single threaded or multi-threaded?
* How much data was read/written?
* How big were the files involved?
* Did you do anything to mitigate/bypass caching?
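One way to pin down several of those variables at once is a controlled write test: a single file, a fixed block size, an explicit fsync(), and (on Linux) O_DIRECT to bypass the page cache. A minimal sketch, in Python; the path and sizes are placeholders:

```python
import mmap, os, time

def timed_write(path, total_mb=64, block_kb=512, direct=True):
    """Write total_mb of zeros to path and return throughput in MB/s.

    With direct=True, O_DIRECT bypasses the page cache (Linux only;
    the buffer must be aligned, which mmap guarantees).  Some
    filesystems reject O_DIRECT, hence the fallback open.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if direct:
        flags |= getattr(os, "O_DIRECT", 0)
    try:
        fd = os.open(path, flags, 0o600)
    except OSError:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    buf = mmap.mmap(-1, block_kb * 1024)   # anonymous, page-aligned buffer
    t0 = time.monotonic()
    for _ in range(total_mb * 1024 // block_kb):
        os.write(fd, buf)
    os.fsync(fd)                           # force data out to the device
    os.close(fd)
    return total_mb / (time.monotonic() - t0)
```

Comparing direct=True against direct=False runs makes it obvious when a 500MB/s figure was really just the page cache talking.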

Other questions that might be useful:

* How deep are the tagged command queues for the Xserve RAID LUN(s)?
* Do all the disks pass surface scans?
* Are the disks and/or controllers reporting SMART events?
* If this stuff is fabric attached, have you looked at port error
counts, port performance data, etc?


> Does anyone have any suggestions on how I might troubleshoot this problem?
> So far I've checked network performance, io performance directly to the
> XRaid, and the reiserfs filesystem.  It all seems to be pointing me back to

How have you verified that "network performance" is ok?  What are the
ethernet port error counts like?  What are the packet retransmit rates
like?
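One cheap sanity check on Linux: the TCP counters in /proc/net/snmp give a system-wide retransmit rate. A sketch (the sample text is fabricated and abbreviated; the real file has many more columns):

```python
def tcp_retrans_rate(snmp_text):
    """Return RetransSegs/OutSegs from /proc/net/snmp-formatted text."""
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    header, values = rows[0], rows[1]   # header row, then counter row
    tcp = dict(zip(header, map(int, values)))
    return tcp["RetransSegs"] / tcp["OutSegs"]

# Fabricated, abbreviated sample for illustration.
sample = "Tcp: OutSegs RetransSegs\nTcp: 100000 250\n"
rate = tcp_retrans_rate(sample)   # 0.0025, i.e. 0.25%
```

A sustained rate much above a fraction of a percent tends to point at duplex mismatches, bad cabling, or congestion, all worth ruling out before blaming AFS.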

I don't know much of anything about Apple's storage line, but if they
have any sort of performance analysis and/or problem determination
tools, what do they say?

> the same problem: the AFS fileserver.

> Hardware:
> HP DL380
> 2x2.8GHz Hyperthreaded Xeon CPU
> 4 Gigs of RAM
> Gigabit ethernet
> MPTFusion fibre channel card
> Apple XRaid
>
> I've got 2 other identical boxes currently running AFS and working fine.  The
> only difference is the other boxes are running the old OS.


Are the machines running the older kernel still running 1.3.x?

Until we can better understand your testing methodology, I'd have to
say this could be a hardware problem, a kernel driver problem, an AFS
problem, or even a network problem.  We need more information to
narrow it down.

Regards,

--
Tom Keiser
[EMAIL PROTECTED]
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
