On Sat, 21 Jun 2008, Ian Kent wrote:

> Ooops, I didn't pay enough attention when I read the pthread barrier man
> page. That isn't actually an error return but now I'm wondering why I
> haven't seen it in my test, very odd.
>
> Let me fix it and we'll try again.
>
> There are other problems but I need to know if this is a viable approach
> before going further with it.
>
> Try this instead.

OK!!! The test program has been running for 28 hours continuously, 32
hours total, and is still going, having done 37300 mount-unmount cycles
so far. There are normally 244 filesystems mounted from 125 different
machines.

There have been no hung processes, i.e. automount either mounts the
filesystem or returns ENOENT in response to readdir(), within 120 secs.

There have been no omitted unmounts, i.e. every mounted filesystem (that
was unused) was unmounted within 1800 secs (the default timeout of 300
secs is used).

There was one error reported. I was running the test program when someone
powered off a workstation whose filesystem I had mounted. The resulting
NFS timeout(s) caused the program to think the test thread was hung, so
it tried to produce a backtrace, but there was a bug and the trace was
spoiled (you've seen these spoiled traces before in files I've sent in).
I improved the trace procedure and attempted to restart. I did a forced
umount by "kill -USR1 $PID", but automount said in syslog:

    Jun 21 15:58:47 bustamove automount[2880]: master.c:957: assertion failed: ap->state == ST_READY

And it didn't unmount anything. So I rebooted and started the test on a
clean machine.

There is a pattern of failure that may not be automount's fault. On
almost exactly 0.1% of the attempted mounts, the readdir eventually fails
with ENOENT. The test program leaves these filesystems alone for 1800
secs, then tries again to mount and test them, which invariably succeeds.
I don't see any pattern in the type of machine: workstation, server,
compute node, heavily loaded, totally idle, etc. But if multiple
filesystems from one machine (submount) are unmounted and remounted at
the "same" time (0.2 secs apart), and any one fails, there is a tendency
for several others to fail as well.

So I think we're closing in on the problem.

James F. Carter          Voice 310 825 2897    FAX 310 206 6673
UCLA-Mathnet; 6115 MSA; 405 Hilgard Ave.; Los Angeles, CA, USA 90095-1555
Email: [EMAIL PROTECTED]  http://www.math.ucla.edu/~jimc (q.v. for PGP key)
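
For readers who want to reproduce the pass/fail criteria described in the message, here is a rough sketch of a single probe: read the directory to trigger the automount, and classify the outcome as mounted, ENOENT, or hung after a deadline. This is not the actual test program, which was not posted; the names `probe`, `PROBE_TIMEOUT`, `RETRY_DELAY`, and the `lister` parameter are hypothetical, and only the 120 sec and 1800 sec figures come from the message.

```python
import errno
import os
import tempfile
import threading
import time

PROBE_TIMEOUT = 120   # secs: mount or ENOENT is expected within this (from the message)
RETRY_DELAY = 1800    # secs: wait before retrying after ENOENT (from the message)


def probe(path, timeout=PROBE_TIMEOUT, lister=os.listdir):
    """Classify one mount attempt as 'mounted', 'enoent', 'error', or 'hung'.

    Hypothetical helper: `lister` is injectable so the hung case can be
    simulated without a real NFS timeout.
    """
    result = {}

    def worker():
        try:
            lister(path)  # reading the directory is what triggers the automount
            result["status"] = "mounted"
        except OSError as exc:
            result["status"] = "enoent" if exc.errno == errno.ENOENT else "error"

    # Daemon thread: a readdir blocked on a dead server must not keep us alive.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    # No status recorded by the deadline means the readdir is still blocked.
    return result.get("status", "hung")


if __name__ == "__main__":
    scratch = tempfile.mkdtemp()
    print(probe(scratch, timeout=5))                          # an existing directory
    print(probe(os.path.join(scratch, "missing"), timeout=5))  # a path that is absent
    print(probe(scratch, timeout=0.1, lister=lambda p: time.sleep(1)))  # simulated hang
```

A driver loop built on this would presumably skip any path that returned ENOENT for `RETRY_DELAY` seconds before probing it again, matching the retry behaviour described above. Note that a hung probe only stops being waited on; the blocked thread itself cannot be cancelled from Python.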
