Thanks for the help, but I've identified the problem. It was an interaction between AFS, reiserfs, and the XRaid. The XRaid has a setting that allows the OS to flush the controller and disk cache. It appears that AFS and reiserfs were flushing the cache on each block, whereas JFS flushed the cache at file close. I disabled that setting and the performance of the XRaid jumped to exactly what it should have been. Thanks again for the help.

Derek Harkness
System Administrator
University of Michigan-Dearborn
(313) 593-5089


I've got a bunch of questions.  Even if you only have time to answer a
few of them, it will help us to narrow down the root cause.

First and foremost, do local volume package operations (e.g. the
salvager, vos backup, fileserver startup/shutdown, etc) run slowly, or
is it only stuff that involves Rx?  What about vos dump foo localhost
on the ailing fileserver?  The fact that iowait is going through the
roof may be indicative of an io subsystem problem, so eliminating
network/Rx problems at the top of the decision tree will be useful.

Salvager and vos backup run locally are slow. They don't drive iowait as high, but they certainly don't get the throughput they should.

Below is iostat output for my AFS partitions, taken while salvaging the /vicepa partition, which had 5 dead clone volumes that were deleted. I have 4 partitions on 2 RAID devices (sda, sdb): sda1 -> /vicepa, sda2 -> /vicepb, sdb1 -> /vicepc, sdb2 -> /vicepd. vicep[abc] are formatted with reiserfs and vicepd is formatted with JFS.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.05    0.00    0.00    1.95    0.00   98.00

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               5.19         0.00        31.14          0        156
sda2              0.00         0.00         0.00          0          0
sda1              7.78         0.00        31.14          0        156
sdb               0.00         0.00         0.00          0          0
sdb2              0.00         0.00         0.00          0          0
sdb1              0.00         0.00         0.00          0          0

I'm not familiar with the Linux iostat utility, but if it supports
per-disk stats similar to the -x option on Solaris, or the -D
option on AIX, then please post some data while the problem is
occurring.
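For what it's worth, the Linux sysstat iostat does support -x for extended per-device stats, and the raw counters it derives them from live in /proc/diskstats. A quick sketch of pulling them directly (the sd[a-z] device-name pattern is just an assumption for this box):

```shell
# Dump raw per-disk I/O counters from /proc/diskstats.
# Field 3 is the device name; field 4 = reads completed,
# field 8 = writes completed, field 13 = milliseconds spent doing I/O
# (the counter iostat's %util is computed from).
awk '$3 ~ /^sd[a-z]$/ {
    printf "%s: reads=%s writes=%s io_ms=%s\n", $3, $4, $8, $13
}' /proc/diskstats
```

Sampling it twice a few seconds apart and diffing the counters gives the same picture as iostat -x over that interval.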

* Were you running some well-known benchmark suite?  If so, what
options did you pass?

I went back and checked, and yes, my original test method was only hitting the page cache. Here are some new numbers.

Read test
hdparm -t /dev/sdb2
Buffered read: 354 MB in  3.01 seconds = 117.54 MB/sec

bonnie++ benchmark tool
Version  1.03
                    ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
thales           8G 40379  91 57465  24  3870   1  9779  22 97536  16 195.3   0

                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 25563  99 +++++ +++ 20961 100 24985 100 +++++ +++ 18984 100

thales,8G,40379,91,57465,24,3870,1,9779,22,97536,16,195.3,0,16,25563,99,+++++,+++,20961,100,24985,100,+++++,+++,18984,100

* Did it involve one file or many?

8 files

* Were any fsync()s issued?

Yes

* Did it modify any filesystem metadata, or only file data?

Only file data

* Was it single threaded or multi-threaded?
Single

* How much data was read/written?
About 8 gigs

* How big were the files involved?

1 Gig each

* Did you do anything to mitigate/bypass caching?
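(For a future run, one common way on Linux to take the page cache out of the read path is O_DIRECT, e.g. dd with iflag=direct. This is just a sketch of that technique, not what was run above; the filename is made up:)

```shell
# Write a 64 MB test file and force it out to disk.
dd if=/dev/zero of=ddtest.bin bs=1M count=64 conv=fsync 2>/dev/null

# Read it back with O_DIRECT (iflag=direct) so the read bypasses
# the page cache and actually hits the disk.
dd if=ddtest.bin of=/dev/null bs=1M iflag=direct

rm -f ddtest.bin
```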



Other questions that might be useful:

* How deep are the tagged command queues for the xserve lun(s)?

Don't know but I will check
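In case it helps, on Linux 2.6 the current depth for each SCSI device is exposed in sysfs (assuming the XRaid LUNs show up as sd* devices):

```shell
# Print the tagged command queue depth for each SCSI disk, if any.
for d in /sys/block/sd*/device/queue_depth; do
    if [ -r "$d" ]; then
        echo "$d = $(cat "$d")"
    fi
done
```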

* Do all the disks pass surface scans?

I will run a surface test overnight.

* Are the disks and/or controllers reporting SMART events?

No SMART events, and no other RAID events reported.

* If this stuff is fabric attached, have you looked at port error
counts, port performance data, etc?

The XRaid is directly connected to the server.

How have you verified that "network performance" is ok?  What are the
ethernet port error counts like?  What are the packet retransmit rates
like?

RX packets:1065818 errors:0 dropped:0 overruns:0 frame:0
TX packets:1257865 errors:0 dropped:0 overruns:0 carrier:0

NPtcp network bandwidth test reports ~90 Mbits/sec average throughput.
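The kernel's own TCP counters are another place to check retransmits; RetransSegs is the 13th field of the Tcp data line in /proc/net/snmp:

```shell
# /proc/net/snmp has a "Tcp:" header line followed by a "Tcp:" value
# line; field 13 of the value line is RetransSegs (segments
# retransmitted since boot).
awk '/^Tcp:/ { if (seen) print "TCP segments retransmitted:", $13; seen = 1 }' /proc/net/snmp
```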

I don't know much of anything about apple's storage line, but if they
have any sort of performance analysis and/or problem determination
tools, what do they say?

I will run them tonight.


The system with the problem is the AFS fileserver.

Hardware:
HP DL380
2x 2.8 GHz hyperthreaded Xeon CPUs
4 GB of RAM
Gigabit ethernet
MPTFusion fiber channel card
Apple XRaid

I've got 2 other identical boxes currently running AFS and working fine. The only
difference is that the other boxes are running an older OS.


Are the machines running the older kernel still running 1.3.x?

Yes, 1.3.81.

Until we can better understand your testing methodology, I'd have to
say this could be a hardware problem, a kernel driver problem, an AFS
problem, or even a network problem.  We need more information to
narrow it down.

Regards,

--
Tom Keiser
[EMAIL PROTECTED]
