Greetings,

We're seeing some mysterious (to me) NFS timeouts between Solaris-10
clients and a Solaris-10 server.  The server is a T2000 running S10U3
patched to the S10U4-equivalent kernel patch level (120011-14), and
the clients are a mix of V20zs and X4200s running S10U1 or S10U3.
There are a number of Solaris-9 clients as well, mounting NFSv3 from
the same file server, but none of the NFSv3 clients are logging
any timeouts.

More about the server:  The T2000 is the 8-core/1.0GHz variety with
16GB RAM.  Storage is an HDS 9520V, 12x SATA 400GB drives arranged in
three 3D+1P hardware RAID-5 groups, striped together into a single
ZFS pool.  The array and host are dual-connected via 2Gbit Fibre
Channel, Sun QLogic HBAs (PCI-Express).
# zpool iostat -v p1
                                                   capacity     operations    bandwidth
pool                                             used  avail   read  write   read  write
----------------------------------------------  -----  -----  -----  -----  -----  -----
p1                                               897G  2.33T     48    108  5.59M  9.26M
  c6t4849544143484920443630303133323230303230d0   299G   797G     16     35  1.86M  3.09M
  c6t4849544143484920443630303133323230303231d0   299G   797G     16     36  1.86M  3.09M
  c6t4849544143484920443630303133323230303232d0   299G   797G     16     36  1.86M  3.09M
----------------------------------------------  -----  -----  -----  -----  -----  -----

#


The network seems healthy.  Clients and server are all connected to
the same gigabit switch (Cisco 3750 stack).  Neither clients nor
server show any collisions or errors in "kstat" output, and Cricket
graphs for the switch involved show it lightly loaded, with no
collisions either.

The timeouts ("NFS server not responding, still trying / NFS server ok")
occur when there's some activity, though not what I'd call a heavy
load.  A typical example is one client doing an RMAN backup to an NFS
mount on the server in question; however, it is mostly clients other
than the ones generating the load that complain of problems.  At the
time of such a timeout, Cricket logs show only about 16-20MB/sec of
network traffic, and the NFS server itself seems quite responsive
(showing perhaps 90% idle in "vmstat 2" output).

Here's a snippet of output from:
  echo "::nfs4_diag -s" | mdb -k

*********************************************
vfs: fffffe86d9eb1840   mi: fffffe80c47d9000
mount point: /path/zones/AAA/roo...
mount from: filer:/home/path/AAA
Messages queued:
=============================================
. . .
2007 Oct  9 21:33:50: fact RF_SRV_NOT_RESPOND
2007 Oct  9 21:33:51: fact RF_SRV_OK
2007 Oct  9 21:33:51: fact RF_SRV_NOT_RESPOND
2007 Oct  9 21:33:53: fact RF_SRV_OK
2007 Oct  9 21:53:26: fact RF_DELMAP_CB_ERR
2007 Oct  9 22:07:19: fact RF_DELMAP_CB_ERR
2007 Oct 11 02:48:29: fact RF_DELMAP_CB_ERR
2007 Oct 11 02:52:11: fact RF_ERR
2007 Oct 11 02:52:11: event RE_LOST_STATE
2007 Oct 11 02:52:11: event RE_START
2007 Oct 11 02:52:16: event RE_END
2007 Oct 11 13:51:06: fact RF_DELMAP_CB_ERR
2007 Oct 11 21:34:46: fact RF_SRV_NOT_RESPOND
2007 Oct 11 21:34:46: fact RF_SRV_OK
2007 Oct 12 02:31:50: fact RF_SRV_NOT_RESPOND
2007 Oct 12 02:31:52: fact RF_SRV_OK
2007 Oct 12 02:33:11: fact RF_SRV_NOT_RESPOND
2007 Oct 12 02:33:12: fact RF_SRV_OK
2007 Oct 12 02:39:58: fact RF_DELMAP_CB_ERR
2007 Oct 12 10:40:54: fact RF_DELMAP_CB_ERR
2007 Oct 12 13:59:37: fact RF_DELMAP_CB_ERR
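
For what it's worth, pairing each RF_SRV_NOT_RESPOND with the following
RF_SRV_OK shows the outages last only a second or two.  Here's a quick
sketch that computes the gaps from output in the format above (the
helper name and the embedded sample are just for illustration):

```python
from datetime import datetime

# Sample lines in the nfs4_diag format shown above (illustrative only).
LOG = """\
2007 Oct  9 21:33:50: fact RF_SRV_NOT_RESPOND
2007 Oct  9 21:33:51: fact RF_SRV_OK
2007 Oct  9 21:33:51: fact RF_SRV_NOT_RESPOND
2007 Oct  9 21:33:53: fact RF_SRV_OK
"""

def outage_seconds(log):
    """Pair each RF_SRV_NOT_RESPOND with the next RF_SRV_OK and
    return the gap lengths in seconds."""
    start = None
    gaps = []
    for line in log.splitlines():
        stamp, _, fact = line.partition(": fact ")
        if not fact:
            continue  # skip "event" lines and anything else
        t = datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
        if fact == "RF_SRV_NOT_RESPOND":
            start = t
        elif fact == "RF_SRV_OK" and start is not None:
            gaps.append((t - start).total_seconds())
            start = None
    return gaps

print(outage_seconds(LOG))  # → [1.0, 2.0]
```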


Anyway, I'm wondering if others have seen such NFSv4-specific timeouts
while older NFSv3 clients sail along untroubled.  Maybe there's a
patch we're missing, or a tunable we can tweak, but right now it's
looking like NFSv4 might not be ready for prime time.  Any pointers
would be appreciated.

Regards,

Marion


