I just switched our old SuSE-based server to Gentoo (2.6.14-hardened-r7) and am experiencing some problems, the most annoying of which is abysmally bad NFS performance when the server is even moderately loaded:
| msbethke ~ $ time (touch x; rm x)
|
| real 0m59.841s
| user 0m0.008s
| sys 0m0.036s
This is on a client, with the server unpacking a 6 GB gzip file. The slowness is the same on SuSE- and Gentoo-based clients. The previous installation handled the same load without any problems, which I'd certainly expect from a dual Xeon @ 3 GHz with $ GB RAM, a Compaq SmartArray 642 U320 host adapter and some 200 GB in a RAID5, connected to the clients via GBit Ethernet.
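For the record, a small loop makes the metadata latency easier to quantify than a single touch/rm; this is just a sketch (the directory and iteration count are arbitrary, point it at the NFS mount to test):

```shell
#!/bin/sh
# Time N create/remove round trips in DIR; on NFS each pair costs at
# least one CREATE and one REMOVE RPC, so this isolates metadata latency.
DIR=${1:-/tmp}
N=${2:-10}
start=$(date +%s)
i=0
while [ "$i" -lt "$N" ]; do
    : > "$DIR/probe.$$"      # create an empty file
    rm "$DIR/probe.$$"       # and remove it again
    i=$((i + 1))
done
end=$(date +%s)
echo "$N create/remove pairs in $((end - start))s"
```

Run with and without the server-side unpack job and compare the totals.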
To test whether only open/create/remove operations are affected, I tried
dd:
| msbethke ~ $ time dd if=/dev/zero of=test bs=1M count=100
| dd: closing output file `test': Input/output error
|
| real 0m50.500s
| user 0m0.012s
| sys 0m1.136s
| msbethke ~ $ ll test
| -rw------- 1 msbethke users 104857600 2006-04-19 18:17 test
Definitely not good for GBit, but not too bad either, considering it will have taken half a minute just to open that file. The file is complete despite the I/O error, but the error is definitely related to server load; it never happens normally (and I normally get 9-11 s for the 100 MB).
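Since the error shows up at close time, one way to narrow it down might be to force the data out before dd's close() with conv=fsync (GNU dd); if the run is still slow at the end, the stall is in the COMMIT/close path rather than the writes themselves. A small-scale sketch, writing to a local scratch file just to show the invocation:

```shell
# conv=fsync flushes the written data to stable storage before dd
# closes the file, so a cheap close afterwards would point the finger
# at the final COMMIT rather than the data transfer.
dd if=/dev/zero of=/tmp/ddtest bs=1M count=10 conv=fsync
```

On the real mount one would of course write into the NFS directory instead of /tmp, with the 100 MB size from above.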
I have 16 nfsd processes running, but the problem occurs even if only a single client is active. nfsstat on the server shows a huge number of read operations (I've never used it before---is that too much for a server that's been running under very moderate load from half a dozen clients doing mostly word processing and programming?), but otherwise it looks fine to me:
| # nfsstat -s
| Server rpc stats:
| calls badcalls badauth badclnt xdrcall
| 433459545 0 0 0 0
| [snip unused NFSv2]
| Server nfs v3:
| null getattr setattr lookup access readlink
| 846 0% 6556617 1% 24798 0% 332257 0% 1258235 0% 855 0%
| read write create mkdir symlink mknod
| 424885404 98% 148929 0% 11172 0% 30 0% 137 0% 10 0%
| remove rmdir rename link readdir readdirplus
| 8882 0% 11 0% 6539 0% 2418 0% 889 0% 33264 0%
| fsstat fsinfo pathconf commit
| 2919 0% 1209 0% 0 0% 19881 0%
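Since the counters above are cumulative since boot, sampling them twice under load and diffing the snapshots shows which operations are actually growing right now (interval and temp paths are arbitrary):

```shell
# Two snapshots of the NFS server counters ten seconds apart;
# diff prints only the lines whose counts changed in the interval.
nfsstat -s -o nfs > /tmp/nfsstat.1
sleep 10
nfsstat -s -o nfs > /tmp/nfsstat.2
diff /tmp/nfsstat.1 /tmp/nfsstat.2
```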
On the client, however, I get some retransmissions and read/write counts that look very strange compared to the server's. I thought of a 32-bit overflow, but the counter is obviously wider than that---I can drive it beyond 2^32 on the server. Here's the client:
| # nfsstat -c
| Client rpc stats:
| calls retrans authrefrsh
| 2691313 1493 0
| [snip unused NFSv2]
| Client nfs v3:
| null getattr setattr lookup access readlink
| 0 0% 2292223 85% 848 0% 76917 2% 260072 9% 96 0%
| read write create mkdir symlink mknod
| 3595 0% 48255 1% 930 0% 4 0% 5 0% 0 0%
| remove rmdir rename link readdir readdirplus
| 1076 0% 0 0% 1265 0% 292 0% 0 0% 3055 0%
| fsstat fsinfo pathconf commit
| 2067 0% 141 0% 0 0% 331 0%
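Given the retransmissions, it may be worth checking what the clients actually negotiated (rsize/wsize, UDP vs. TCP) and experimenting with the transport. The options below are suggestions to try, not a known fix, and server:/export and /mnt are placeholders for the real paths:

```shell
# Show the effective NFS mount options for all NFS mounts:
grep nfs /proc/mounts
# Remount over TCP with a longer timeout to rule out UDP retransmit
# storms on the GBit link (adjust export and mount point):
umount /mnt && mount -t nfs -o tcp,timeo=600 server:/export /mnt
```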
I noticed a few things about the setup: the SA 642 adapter still has a stone-age firmware, V1.30, but we never saw a need to upgrade as it worked nicely with the kernel 2.4.21 cciss driver. Are there any known issues with this one under 2.6 kernels? I've just flashed the latest version and will try rebooting tonight.
Another thing: the kernel is still set to 250 Hz ticks, which was fine on the P4/HT test system where I built it. Would that really hurt so badly on a real SMP machine? In any case, 100 Hz should be fine for a server.
And one parameter I haven't tried to tweak is the I/O scheduler. I seem to remember a recommendation to use noop on RAID5, since the cylinder numbers are completely virtual anyway and the actual head scheduling should be left to the controller. Any opinions on this?
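For reference, on 2.6 the elevator can be inspected and switched per device at runtime through sysfs, so this is easy to try without a reboot; the cciss device name below is an example, adjust it for the actual array:

```shell
# List the available schedulers; the active one is shown in brackets.
cat '/sys/block/cciss!c0d0/queue/scheduler'
# Switch to noop and leave request ordering to the SmartArray:
echo noop > '/sys/block/cciss!c0d0/queue/scheduler'
```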
cheers & TIA
Matthias
--
I prefer encrypted and signed messages. KeyID: FAC37665
Fingerprint: 8C16 3F0A A6FC DF0D 19B0 8DEF 48D9 1700 FAC3 7665