I've got a bunch of questions. Even if you only have time to answer a few of them, it will help us to narrow down the root cause.
On 2/10/07, Derek Harkness <[EMAIL PROTECTED]> wrote:
> I'm attempting to deploy/update a new AFS fileserver. The new server is the first to be upgraded from Debian sarge, OpenAFS 1.3.xx, and kernel 2.4 to Etch, kernel 2.6.18, OpenAFS 1.4.2, reiserfs, and a new 7-terabyte XRaid. The upgrade went fine, except that file writes to the new system are so slow the system is unusable. On the server, iostat shows a transfer rate of ~40KB/s and an iowait of 20 during AFS operations. If I stop the fileserver and
First and foremost, do local volume package operations (e.g. the salvager, vos backup, fileserver startup/shutdown, etc.) run slowly, or is it only stuff that involves Rx? What about vos dump foo localhost on the ailing fileserver? The fact that iowait is going through the roof may be indicative of an I/O subsystem problem, so eliminating network/Rx problems at the top of the decision tree will be useful. I'm not familiar with the Linux iostat utility, but if it supports per-disk stats similar to the -x option on Solaris, or the -D option on AIX, then please post some data while the problem is occurring.
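For reference, a sketch of how that data could be collected on the Linux box. The device path, volume name, and partition are placeholders, not details from the original report; Linux's iostat (from the sysstat package) does support -x, and the raw counters are also readable in /proc/diskstats if sysstat isn't installed.

```shell
# Record extended per-disk stats every 5 seconds while reproducing
# the slow write; raw counters are also in /proc/diskstats.
iostat -x 5 12 > /tmp/iostat-slow-write.log &
cat /proc/diskstats

# Time a volume dump over the loopback, which exercises the local disk
# path while keeping the physical network out of the picture.
# "user.example" and /vicepa are placeholder names.
time vos dump -id user.example -time 0 -server localhost \
    -partition /vicepa -file /dev/null -localauth
```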
> perform I/O directly on the XRaid, I can read and write between 100MB/s and 500MB/s.
A single fibre channel port (excepting 10Gb E-ports) can't transmit 500MB/s. From what I've heard, Apple's FC RAID products only provide a single 2Gb SFP per controller, and don't support FC multipathing. So, you're limited to a max theoretical of ~203MB/s (less in AL mode). Thus, I'm guessing your tests are, at least in some cases, only stressing the page cache, rather than anything across the fabric (for that matter, is there a fabric?).

In order to declare the storage subsystem OK, we need to be sure you've tested every layer of the storage stack. Please tell us specifically what you did to verify "direct" I/O. For example:

* Were you running some well-known benchmark suite? If so, what options did you pass?
* Did it involve one file or many?
* Were any fsync()s issued?
* Did it modify any filesystem metadata, or only file data?
* Was it single-threaded or multi-threaded?
* How much data was read/written?
* How big were the files involved?
* Did you do anything to mitigate/bypass caching?

Other questions that might be useful:

* How deep are the tagged command queues for the Xserve LUN(s)?
* Do all the disks pass surface scans?
* Are the disks and/or controllers reporting SMART events?
* If this stuff is fabric attached, have you looked at port error counts, port performance data, etc?
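To illustrate the caching concern, here is a hedged sketch of write tests that do and don't bypass the page cache. The mount point and sizes are placeholders, and GNU dd is assumed:

```shell
# Cached write: without a sync, much of this may land only in the
# page cache, inflating the apparent throughput.
dd if=/dev/zero of=/mnt/xraid/testfile bs=1M count=2048

# Same write, but flush the data to disk before dd reports its rate.
dd if=/dev/zero of=/mnt/xraid/testfile bs=1M count=2048 conv=fsync

# Bypass the page cache entirely with O_DIRECT
# (for the read side, use iflag=direct).
dd if=/dev/zero of=/mnt/xraid/testfile bs=1M count=2048 oflag=direct
```

If the first number is dramatically higher than the other two, the earlier 100MB/s-500MB/s figures were likely measuring memory, not the XRaid.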
> Does anyone have any suggestions on how I might troubleshoot this problem? So far I've checked network performance, I/O performance directly to the XRaid, and the reiserfs filesystem. It all seems to be pointing me back to
How have you verified that "network performance" is OK? What are the Ethernet port error counts like? What are the packet retransmit rates like? I don't know much of anything about Apple's storage line, but if they have any sort of performance analysis and/or problem determination tools, what do they say?
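Those counters can be pulled from standard Linux sources; a rough sketch, where eth0 is a placeholder interface name:

```shell
# Link-level health: RX/TX error and drop counters for the interface.
ip -s link show eth0

# TCP retransmits: the second Tcp: line of /proc/net/snmp includes a
# RetransSegs counter; "netstat -s | grep -i retrans" gives a labeled view.
awk '/^Tcp:/ { print }' /proc/net/snmp
```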
> some problem in the AFS fileserver.
>
> Hardware:
> HP DL380
> 2x 2.8GHz Hyperthreaded Xeon CPUs
> 4GB of RAM
> Gigabit Ethernet
> MPT Fusion fibre channel card
> Apple XRaid
>
> I've got 2 other identical boxes currently running AFS and working fine. The only difference is that the other boxes are running the old OS.
Are the machines running the older kernel still running 1.3.x?

Until we can better understand your testing methodology, I'd have to say this could be a hardware problem, a kernel driver problem, an AFS problem, or even a network problem. We need more information to narrow it down.

Regards,

--
Tom Keiser
[EMAIL PROTECTED]

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
