On 7/11/06, Gregory Baker <[EMAIL PROTECTED]> wrote:
We have thousands of linux clients hitting netapp file servers (many
3500 series, clustered) on a local gigabit LAN.  From time to time,
applications return "file not found" when attempting to automount a
directory and access a file.  An example of this is a long running
process, which reads in data, processes it for hours (in which time the
filesystem is unmounted) then tries to read more data from that mount
point (which causes a "file not found" error in the application).  This
occurs about 1/100th of the time.

Researching at Netapp turns up this bit by Chuck Lever (Linux NFS
contributor)

"Using the Linux NFS Client with Network Appliance Filers"
http://www.netapp.com/library/tr/3183.pdf  (February 2006)

page 10 says...

"Due to a bug in the mount command, the default retransmission timeout
value on Linux for NFS over TCP is quite small...To obtain standard
behavior, we strongly recommend using "timeo=600, retrans=2" explicitly
when mounting via TCP."

Our defaults (assuming man pages are correct, RedHat Enterprise Linux 3)
would be timeo=7, retrans=3, which translates to 7+14+28+56 = 105 tenths
of a second (about 10.5 seconds).  It appears NetApp is suggesting waiting
600+600 = 1200 tenths (120 seconds) before giving up on the mount command...
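
The 7+14+28+56 arithmetic above can be reproduced with a short sketch.
It only models the assumption made in that calculation (one initial
attempt plus `retrans` retries, with the timeout doubling each time);
it is not a statement of what any particular kernel or mount command
actually does, and the function name is made up for illustration.

```python
def retransmit_schedule(timeo, retrans, factor=2):
    """Return the per-attempt timeouts, in tenths of a second.

    Assumption (taken from the arithmetic in the message, not from any
    kernel source): one initial attempt plus `retrans` retries, with the
    timeout multiplied by `factor` after every attempt.
    """
    return [timeo * factor ** attempt for attempt in range(retrans + 1)]


schedule = retransmit_schedule(timeo=7, retrans=3)
total_tenths = sum(schedule)

# Prints: [7, 14, 28, 56] 105 10.5
print(schedule, total_tenths, total_tenths / 10)
```

Passing factor=1 models a constant timeout instead of a doubling one.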

It's important to distinguish two different types of timeouts.

1.  The mount operation has timed out.

2.  After the mount operation succeeds, an NFS RPC operation has timed out.

TR-3183 discusses the proper settings for 2, but you are experiencing 1.

The automounter attempts to mount one of the filer's exports, but the
mount request times out, leaving the mounted-on directory exposed.  The
application then sees the underlying local directory instead of the
export, hence the "file not found" error.  Your filer is heavily loaded,
and the filer's mountd is single-threaded.  The filer may also be
experiencing delays when requesting information from external servers
(like DNS or NIS), in which case the mount request is held up at the
filer.

Both sides are at fault:  the Linux mount command should retry (and I
believe later releases of RHEL 3 were fixed to do this) and the filer
configuration should be reviewed to make sure there are no avoidable
delays while processing mount requests.

* What "bug" in the mount command do you believe NetApp is talking about?

The bug is that the mount command overrides the proper default RPC
timeout value with a timeout value of 0.7 seconds.  This is *not* the
timeout for mount operations; it is the timeout the in-kernel NFS
client uses before retransmitting RPC requests.

* What do you think proper options for NFS auto/mounts would be for
extremely busy centralized NFS filers?

If you are using NFS over TCP, the proper timeout value is 60 seconds
(timeo=600, since timeo is given in tenths of a second).
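
As a rough illustration of how that 60 second value maps onto the
option string quoted from TR-3183, here is a minimal sketch.  The
helper name is hypothetical; `tcp`, `timeo` and `retrans` are the NFS
mount options discussed in this thread, and retrans=2 is simply the
value from the TR-3183 quote.

```python
def nfs_tcp_options(timeout_seconds=60, retrans=2):
    """Build an NFS-over-TCP mount option string.

    timeo is expressed in tenths of a second, so 60 seconds becomes
    timeo=600.  The default retrans=2 is the value quoted from TR-3183.
    """
    timeo = round(timeout_seconds * 10)
    return f"tcp,timeo={timeo},retrans={retrans}"


# Prints: tcp,timeo=600,retrans=2
print(nfs_tcp_options())
```

The resulting string is what would go in the options field of an
automount map entry or after -o on a mount command line.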

* What is the reference standard behavior?

Solaris, which is the NFSv3 reference implementation, uses effectively
a 60 second timeout on TCP mounts.

--
"We who cut mere stones must always be envisioning cathedrals"
  -- Quarry worker's creed

_______________________________________________
autofs mailing list
http://linux.kernel.org/mailman/listinfo/autofs
