Greetings!  This is a brief description of a long-standing race
condition in autofs and a working fix.  I realize the fix might not be
the most desirable, but I think it is a correct solution, and it appears
to be working without problems here.  If I can find some more time, I
may be able to provide more detailed descriptions and diagnostics
later.  Versions: autofs 3.1.4, kernel 2.2.14.

This problem has been previously described (with sample log output)
in:

http://cgi.debian.org/cgi-bin/bugreport.cgi?bug=52132
http://www.uwsg.indiana.edu/hypermail/linux/kernel/9910.3/1105.html
http://www.uwsg.indiana.edu/hypermail/linux/kernel/9910.3/1112.html
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0002.2/1706.html
http://www.uwsg.indiana.edu/hypermail/linux/kernel/9911.2/0167.html
  (in the latter, I incorrectly assumed this had something to do with an
  RPC error.)

In brief, with autofs submounts, after an expire run in a subprocess,
the daemon enters maybe_umount_autofs_and_exit() (via sig_child(), I
think) and, finding nothing left below it, enters do_umount_autofs().
Something is going wrong in the determination that the submount can be
unmounted.  I've seen instances in which nfs mounts were in progress
when do_umount_autofs() was entered, as well as instances in which no
live mounts were present under the submount but the kernel
nevertheless reported "device busy" when the umount command for the
autofs submount was actually issued.  Of course, in both cases
do_umount_autofs() fails to unmount the submount but closes the pipe
to the kernel anyway, which leads to a read error in get_pkt and the
exit of the daemon.  In the case where a live nfs mount existed under
the submount, the situation can only be resolved by manually
unraveling the autofs tree and restarting.  In the case where some
other mechanism led to the "device busy" error, autofs can
occasionally recover by removing the autofs submount later via an
expiry process from the parent autofs mount daemon.  Even in the
latter case, there is a period in which user references to the
submount fail.
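To make the failure concrete, here is a small user-space sketch (not
autofs code) of what happens when the daemon's end of the pipe is
closed while its main loop still expects packets: the next read on
that descriptor fails with EBADF, which a packet reader like get_pkt
can only treat as fatal.  read_full() and read_after_close() are
hypothetical names chosen for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for a packet reader: read exactly len bytes
 * from fd.  Returns 0 on success, -1 on error or EOF. */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;              /* interrupted by a signal: retry */
        if (n <= 0)
            return -1;             /* error (e.g. EBADF) or EOF */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Close the reading end first (as the old do_umount_autofs() did before
 * knowing whether umount worked), then try to read the next packet.
 * Returns the errno of the failed read, 0 if the read succeeded, or -1
 * if the pipe could not be created. */
static int read_after_close(void)
{
    int fds[2];
    char pkt[16];
    int rv, err;

    if (pipe(fds) != 0)
        return -1;
    close(fds[0]);                 /* daemon's end of the pipe is gone */

    errno = 0;
    rv = read_full(fds[0], pkt, sizeof pkt);
    err = errno;                   /* save before close() can change it */

    close(fds[1]);
    return rv == -1 ? err : 0;
}
```

The daemon cannot recover from this read error, so it exits, even
though the submount it tried to remove is still mounted.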

In any case, one could try to find out why do_umount_autofs() is being
attempted before it can succeed, and close off all such possibilities.
I took the simpler, though less elegant, route of assuming that this
will happen and rewriting do_umount_autofs() to recover when the
umount fails.  This was done by 1) not issuing the CATATONIC ioctl
(which renders the kernel end inoperable) and 2) waiting to close the
pipe until after the umount command has succeeded.  If umount fails,
the ioctl fd is reopened and operations resume with the same daemon.
The only possible difficulty with this procedure is that, when the
umount succeeds, the kernel might require the CATATONIC ioctl to clean
up properly.  This does not seem to be the case, as the kernel end
calls autofs_catatonic_mode() itself if it cannot write to the pipe
(waitq.c:99), which will happen once the daemon closes its end after a
successful umount.
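The argument that the kernel end cleans up on its own rests on
ordinary pipe semantics: once the only reader has closed its end,
further writes fail with EPIPE.  The sketch below demonstrates that
from user space only; it does not show the kernel's autofs write path,
and write_after_reader_closed() is a hypothetical name.  SIGPIPE is
ignored so the failed write reports EPIPE instead of terminating the
process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Returns the errno seen by a writer after the only reader has closed
 * its end of the pipe, 0 if the write unexpectedly succeeded, or -1 if
 * setup failed. */
static int write_after_reader_closed(void)
{
    struct sigaction sa, old;
    int fds[2];
    char pkt = 0;
    ssize_t n;
    int err;

    /* Ignore SIGPIPE so write() reports EPIPE instead of killing us. */
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGPIPE, &sa, &old) != 0)
        return -1;

    if (pipe(fds) != 0) {
        sigaction(SIGPIPE, &old, NULL);
        return -1;
    }
    assert(fds[0] >= 0 && fds[1] >= 0);
    close(fds[0]);                   /* the "daemon" closes its end */

    errno = 0;
    n = write(fds[1], &pkt, 1);      /* the "kernel" tries to send a packet */
    err = (n < 0) ? errno : 0;       /* save before close() can change it */

    close(fds[1]);
    sigaction(SIGPIPE, &old, NULL);  /* restore the previous disposition */
    return err;
}
```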

In sum, this has eliminated all the problems we were experiencing, but
I've only tested it over a few days so far.  If there are no
objections to this patch, may I request that it or something similar
be included quickly, as the Debian distribution is now at the point of
removing autofs due to this 'release-critical' bug.  Please also note
that much of the reawaken code called on a failed umount is probably
unnecessary, and was just copied from the daemon startup routines
elsewhere.
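For reference, the timeout part of the reawaken code in the patch
reduces to two integer computations: the expire check interval is
exp_timeout / CHECK_RATIO rounded up, and the first alarm() is set to
exp_timeout plus my_pid modulo that interval, so automounters started
together do not all expire at once.  A sketch with that arithmetic
pulled into hypothetical helpers (the ratio is passed in by the
caller; no particular CHECK_RATIO value is assumed):

```c
#include <assert.h>

/* Expire check interval: exp_timeout / check_ratio, rounded up, as in
 * (ap.exp_timeout + CHECK_RATIO - 1) / CHECK_RATIO. */
static unsigned long expire_runfreq(unsigned long exp_timeout,
                                    unsigned long check_ratio)
{
    assert(check_ratio > 0);
    return (exp_timeout + check_ratio - 1) / check_ratio;
}

/* Delay for the first alarm(): exp_timeout plus a per-process offset in
 * [0, runfreq).  Returns 0 (no alarm) when timeouts are disabled. */
static unsigned long first_alarm_delay(unsigned long exp_timeout,
                                       unsigned long runfreq,
                                       unsigned long pid)
{
    if (exp_timeout == 0)
        return 0;
    assert(runfreq > 0);   /* holds if runfreq came from expire_runfreq() */
    return exp_timeout + pid % runfreq;
}
```

With exp_timeout = 300 and a ratio of 4, for example, runfreq is 75,
and a daemon with pid 1234 first wakes after 300 + 34 = 334 seconds.
The patch only calls alarm() when exp_timeout is nonzero, which is
what keeps the modulo away from a zero divisor.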

Thanks for your work on this!

Camm Maguire
==========================================================================
"The earth is but one country, and mankind its citizens."  --  Baha'u'llah



--- autofs-3.1.4/daemon/automount.c     Fri Jan 21 14:08:09 2000
+++ automount.c Mon Feb 28 14:07:12 2000
@@ -141,12 +141,48 @@
   int rv;
   
   if (ap.ioctlfd >= 0) {
-    ioctl(ap.ioctlfd, AUTOFS_IOC_CATATONIC, 0);
+/*      ioctl(ap.ioctlfd, AUTOFS_IOC_CATATONIC, 0); */
     close(ap.ioctlfd);
   }
+
+  rv = spawnl(LOG_ERR, PATH_UMOUNT, PATH_UMOUNT, ap.path, NULL);
+  if (rv) {
+    chdir(ap.path);
+    ap.ioctlfd = open(".", O_RDONLY); /* Root directory for ioctl()'s */
+    chdir("/");
+    if ( ap.ioctlfd < 0 ) 
+      syslog(LOG_INFO, "can't reopen ioctlfd\n");
+    if ( ioctl(ap.ioctlfd, AUTOFS_IOC_PROTOVER, &kproto_version) ) {
+      syslog(LOG_DEBUG, "kproto on reawaken: %m");
+      kproto_version = 2;
+    }
+  
+    syslog(LOG_INFO, "using kernel protocol version %d on reawaken", kproto_version);
+  
+    if ( kproto_version < 3 ) {
+      ap.exp_timeout = ap.exp_runfreq = 0;
+      syslog(LOG_INFO, "kernel does not support timeouts");
+    } else {
+      unsigned long timeout;
+      
+      ap.exp_runfreq = (ap.exp_timeout+CHECK_RATIO-1) / CHECK_RATIO;
+      
+      timeout = ap.exp_timeout;
+      ioctl(ap.ioctlfd, AUTOFS_IOC_SETTIMEOUT, &timeout);
+      
+      /* We often start several automounters at the same time.  Add some
+        randomness so we don't all expire at the same time. */
+      if ( ap.exp_timeout )
+       alarm(ap.exp_timeout + my_pid % ap.exp_runfreq);
+      
+    }
+
+    return rv;
+      
+  }
+
   if (ap.pipefd >= 0)
     close(ap.pipefd);
-  rv = spawnl(LOG_ERR, PATH_UMOUNT, PATH_UMOUNT, ap.path, NULL);
   if (rv == 0 && submount)
     rmdir(ap.path);
 
