Derek Harkness
System Administrator
University of Michigan-Dearborn
(313) 593-5089
I've got a bunch of questions. Even if you only have time to answer a few of them, it will help us narrow down the root cause.

First and foremost: do local volume package operations (e.g. the salvager, vos backup, fileserver startup/shutdown, etc.) run slowly, or is it only operations that involve Rx? What about "vos dump foo localhost" on the ailing fileserver? The fact that iowait is going through the roof may indicate an I/O subsystem problem, so eliminating network/Rx problems at the top of the decision tree will be useful.
Salvager and vos backup running locally are slow. They don't drive iowait as high, but they certainly don't get the throughput they should.
Below is iostat output for my AFS partitions; I was salvaging the /vicepa partition, which had 5 dead clone volumes that were deleted. I have 4 partitions on 2 RAID devices (sda, sdb): sda1 -> /vicepa, sda2 -> /vicepb, sdb1 -> /vicepc, sdb2 -> /vicepd. vicep[abc] are formatted with reiserfs and vicepd is formatted with jfs.
avg-cpu: %user %nice %system %iowait %steal %idle
0.05 0.00 0.00 1.95 0.00 98.00
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 5.19 0.00 31.14 0 156
sda2 0.00 0.00 0.00 0 0
sda1 7.78 0.00 31.14 0 156
sdb 0.00 0.00 0.00 0 0
sdb2 0.00 0.00 0.00 0 0
sdb1 0.00 0.00 0.00 0 0
I'm not familiar with the Linux iostat utility, but if it supports per-disk stats similar to the -x option on Solaris, or the -D option on AIX, then please post some data while the problem is occurring.

* Were you running some well-known benchmark suite? If so, what options did you pass?
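(For reference: if the sysstat iostat on the box doesn't support extended per-disk stats, the raw counters in /proc/diskstats carry the same information. This is just a sketch against the standard 2.6 diskstats field layout; device names will of course differ per machine.)

```shell
# Dump per-device I/O counters straight from /proc/diskstats.
# Field layout (2.6 kernels): $3=device, $4=reads completed,
# $7=ms spent reading, $8=writes completed, $11=ms spent writing,
# $13=ms spent doing I/O (the counter behind iostat's %util).
awk '{ printf "%-10s reads=%s read_ms=%s writes=%s write_ms=%s io_ms=%s\n",
       $3, $4, $7, $8, $11, $13 }' /proc/diskstats
```

Sampling it twice a few seconds apart and diffing the io_ms column gives a rough per-device utilization figure.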
I went back and checked, and yes, my original test method was only hitting the page cache. But here are some new numbers.
Read test:
  hdparm -t /dev/sdb2
  Buffered read: 354 MB in 3.01 seconds = 117.54 MB/sec

bonnie++ benchmark tool:
Version 1.03       ------Sequential Output------ --Sequential Input- --Random-
                   -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine       Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
thales          8G 40379  91 57465  24  3870   1  9779  22 97536  16 195.3   0
                   ------Sequential Create------ --------Random Create--------
                   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
             files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                16 25563  99 +++++ +++ 20961 100 24985 100 +++++ +++ 18984 100
thales,8G,40379,91,57465,24,3870,1,9779,22,97536,16,195.3,0,16,25563,99,+++++,+++,20961,100,24985,100,+++++,+++,18984,100
* Did it involve one file or many?
8 files
* Were any fsync()s issued?
Yes
* Did it modify any filesystem metadata, or only file data?
Only file data
* Was it single threaded or multi-threaded?
Single
* How much data was read/written?
About 8 gigs
* How big were the files involved?
1 Gig each
* Did you do anything to mitigate/bypass caching?
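(For reference, one way to keep the page cache out of a dd-style throughput test. TARGET and the file name here are placeholders, not paths from this thread; conv=fdatasync makes dd flush the data to disk before it reports a rate, so the cache can't inflate the number.)

```shell
# Sequential write test that syncs data to disk before reporting,
# so the page cache can't inflate the result.
# TARGET is a placeholder; point it at the vice partition under test.
TARGET=${TARGET:-/tmp}
dd if=/dev/zero of="$TARGET/ddtest" bs=1M count=64 conv=fdatasync
# For the read side, drop the cache first (as root) so reads hit the disk:
#   sync; echo 3 > /proc/sys/vm/drop_caches
#   dd if="$TARGET/ddtest" of=/dev/null bs=1M
rm -f "$TARGET/ddtest"
```

Using a test file several times larger than RAM (as the 8G bonnie++ run above does on a 4 GB box) achieves much the same thing for the sequential-read numbers.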
Other questions that might be useful:

* How deep are the tagged command queues for the Xserve LUN(s)?
Don't know but I will check
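(For what it's worth, on a 2.6 kernel the negotiated queue depths are visible in sysfs without any vendor tools. A sketch; the SCSI-level queue_depth file only exists for SCSI/FC devices, and the paths that match will vary by machine:)

```shell
# Block-layer request queue size (present for every block device):
for f in /sys/block/*/queue/nr_requests; do
    [ -e "$f" ] && echo "$f = $(cat "$f")"
done
# SCSI/FC tagged command queue depth, where the device exposes it:
for f in /sys/block/*/device/queue_depth; do
    [ -e "$f" ] && echo "$f = $(cat "$f")"
done
```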
* Do all the disks pass surface scans?
I will run a surface test over night.
* Are the disks and/or controllers reporting SMART events?
No smart events, no other raid events reported
* If this stuff is fabric attached, have you looked at port error counts, port performance data, etc?
The XRaid is directly connected to the server.
How have you verified that "network performance" is ok? What are the ethernet port error counts like? What are the packet retransmit rates like?
RX packets:1065818 errors:0 dropped:0 overruns:0 frame:0
TX packets:1257865 errors:0 dropped:0 overruns:0 carrier:0

NPtcp network bandwidth test reports ~90 Mbits/sec average throughput.
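(For reference, the same interface counters — plus TCP retransmits, which ifconfig doesn't show — can be pulled from /proc directly. Field positions below follow the standard 2.6 formats for /proc/net/dev and /proc/net/snmp:)

```shell
# Per-interface error/drop counters from /proc/net/dev.
# After stripping ':', $4=rx errs, $5=rx drop, $12=tx errs, $13=tx drop.
awk 'NR > 2 { gsub(/:/, " "); print $1, "rx_errs=" $4, "rx_drop=" $5,
              "tx_errs=" $12, "tx_drop=" $13 }' /proc/net/dev
# TCP segments retransmitted (RetransSegs, 13th field of the Tcp: line;
# the regex on $13 skips the header line, where that field is a label).
awk '/^Tcp:/ && $13 ~ /^[0-9]+$/ { print "tcp_retrans_segs=" $13 }' /proc/net/snmp
```

A steadily climbing RetransSegs count during the slow transfers would point at the network rather than the I/O stack.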
I don't know much of anything about apple's storage line, but if they have any sort of performance analysis and/or problem determination tools, what do they say?
I will run them tonight.
The machine with the problem is the AFS fileserver. Hardware:

HP DL380
2x 2.8 GHz Hyperthreaded Xeon CPUs
4 GB of RAM
Gigabit ethernet
MPT Fusion fiber channel card
Apple XRaid

I've got 2 other identical boxes currently running AFS and working fine. The only difference is the other boxes are running an older OS.

Are the machines running the older kernel still running 1.3.x?
Yes, 1.3.81.
Until we can better understand your testing methodology, I'd have to say this could be a hardware problem, a kernel driver problem, an AFS problem, or even a network problem. We need more information to narrow it down.

Regards,
--
Tom Keiser
[EMAIL PROTECTED]
