Greetings,

We're seeing some mysterious (to me) NFS timeouts between Solaris 10 clients and a Solaris 10 server. The server is a T2000 running S10U3 patched to the S10U4-equivalent kernel patch level (120011-14), and the clients are a mix of V20z's and X4200's running S10U1 or S10U3. There are also a number of Solaris 9 clients mounting the same file server over NFSv3, but none of the NFSv3 clients are logging any timeouts.
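For reference, I've been using "nfsstat -m" on each client to confirm which protocol version a given mount negotiated. The snippet below is a trimmed sketch, not verbatim output from our boxes (the mount point and the flags shown are illustrative):

  client# nfsstat -m /export/home
  /export/home from filer:/home
   Flags: vers=4,proto=tcp,sec=sys,hard,intr,rsize=32768,wsize=32768,retrans=5,timeo=600

The Solaris 9 boxes all report vers=3 here, and it is only the vers=4 mounts that are logging the timeouts.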
More about the server: the T2000 is the 8-core/1.0GHz variety with 16GB RAM. Storage is an HDS 9520V with 12x 400GB SATA drives arranged in three 3D+1P hardware RAID-5 groups, striped together into a single ZFS pool. The array and host are dual-connected via 2Gbit Fibre Channel on Sun-branded QLogic HBAs (PCI-Express).

  # zpool iostat -v p1
                                                    capacity     operations    bandwidth
  pool                                             used  avail   read  write   read  write
  ----------------------------------------------  -----  -----  -----  -----  -----  -----
  p1                                                897G  2.33T     48    108  5.59M  9.26M
    c6t4849544143484920443630303133323230303230d0   299G   797G     16     35  1.86M  3.09M
    c6t4849544143484920443630303133323230303231d0   299G   797G     16     36  1.86M  3.09M
    c6t4849544143484920443630303133323230303232d0   299G   797G     16     36  1.86M  3.09M
  ----------------------------------------------  -----  -----  -----  -----  -----  -----
  #

The network seems healthy. Clients and server are all connected to the same gigabit switch (a Cisco 3750 stack). Neither clients nor server show any collisions or errors in "kstat" output, and Cricket logs of the switch involved show the switch thinks it's lightly loaded, also with no collisions.

The timeouts ("NFS server not responding, still trying" / "NFS server ok") occur when there's some activity, though nothing I'd call a heavy load. A typical example is one client doing an RMAN backup to an NFS mount on the server in question; however, it is mostly clients other than the ones generating the load that complain. At the time of such a timeout, Cricket logs show only about 16-20MB/sec of network traffic, and the NFS server itself seems quite responsive during these periods (maybe 90% idle in "vmstat 2" output).

Here's a snippet of output from:

  echo "::nfs4_diag -s" | mdb -k

  *********************************************
  vfs: fffffe86d9eb1840    mi: fffffe80c47d9000
  mount point: /path/zones/AAA/roo...
  mount from: filer:/home/path/AAA
  Messages queued:
  =============================================
  .
  .
  .
  2007 Oct  9 21:33:50: fact RF_SRV_NOT_RESPOND
  2007 Oct  9 21:33:51: fact RF_SRV_OK
  2007 Oct  9 21:33:51: fact RF_SRV_NOT_RESPOND
  2007 Oct  9 21:33:53: fact RF_SRV_OK
  2007 Oct  9 21:53:26: fact RF_DELMAP_CB_ERR
  2007 Oct  9 22:07:19: fact RF_DELMAP_CB_ERR
  2007 Oct 11 02:48:29: fact RF_DELMAP_CB_ERR
  2007 Oct 11 02:52:11: fact RF_ERR
  2007 Oct 11 02:52:11: event RE_LOST_STATE
  2007 Oct 11 02:52:11: event RE_START
  2007 Oct 11 02:52:16: event RE_END
  2007 Oct 11 13:51:06: fact RF_DELMAP_CB_ERR
  2007 Oct 11 21:34:46: fact RF_SRV_NOT_RESPOND
  2007 Oct 11 21:34:46: fact RF_SRV_OK
  2007 Oct 12 02:31:50: fact RF_SRV_NOT_RESPOND
  2007 Oct 12 02:31:52: fact RF_SRV_OK
  2007 Oct 12 02:33:11: fact RF_SRV_NOT_RESPOND
  2007 Oct 12 02:33:12: fact RF_SRV_OK
  2007 Oct 12 02:39:58: fact RF_DELMAP_CB_ERR
  2007 Oct 12 10:40:54: fact RF_DELMAP_CB_ERR
  2007 Oct 12 13:59:37: fact RF_DELMAP_CB_ERR

Anyway, I'm wondering whether others have seen such NFSv4-specific timeouts while older NFSv3 clients sail along untroubled. Maybe there's a patch we're missing, or a tunable we can tweak, but right now it's looking like NFSv4 might not be ready for prime time. Any pointers would be appreciated.

Regards,
Marion
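P.S. Two things we're planning to try next, in case it helps anyone reproduce this. First, leaving a snoop capture running on one of the complaining clients so we have the wire traffic from the next "not responding" window (the capture filename is just an example; "filer" is the server from the mount output above):

  client# snoop -q -o /var/tmp/nfs4.cap host filer and port 2049

Second, as a stopgap, capping the S10 clients at v3 by setting the client's maximum NFS version in /etc/default/nfs and remounting:

  NFS_CLIENT_VERSMAX=3

If the timeouts disappear at vers=3, that would at least confirm the problem is NFSv4-specific rather than a general server or network issue.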