Re: [autofs] clients suddenly start hanging (was: (no subject))

Jim Carter Sun, 22 Jun 2008 20:52:17 -0700

On Sat, 21 Jun 2008, Ian Kent wrote:

> Ooops, I didn't pay enough attention when I read the pthread barrier man
> page. That isn't actually an error return but now I'm wondering why I
> haven't seen it in my test, very odd.
> 
> Let me fix it and we'll try again.
> 
> There are other problems but I need to know if this is a viable approach
> before going further with it.
> 
> Try this instead.


OK!!!  The test program has been running for 28 hours continuously, 32 
hours total, and is still going, having done 37300 mount-unmount cycles so 
far.  There are normally 244 filesystems mounted from 125 different 
machines.

There have been no hung processes, i.e. automount either mounts the 
filesystem or returns ENOENT in response to readdir(), within 120 secs.  
There have been no omitted unmounts, i.e. every mounted filesystem (that 
was unused) was unmounted within 1800 secs (the default timeout of 300 secs 
is used).
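
For anyone who wants to reproduce the check, the pass/fail criterion above 
could be sketched roughly as follows.  This is not the actual test program: 
the name probe_mount, the result labels, and the use of a worker thread are 
assumptions made for illustration.  Only the 120 sec limit and the ENOENT 
outcome come from the description above.

```python
import os
import threading

HANG_LIMIT_SECS = 120  # from the text: automount must answer within 120 secs


def probe_mount(path, limit=HANG_LIMIT_SECS):
    """Trigger an automount by listing `path` and classify the outcome.

    Returns "mounted", "enoent", "hung", or "error:<errno>".
    These labels are hypothetical, chosen for this sketch.
    """
    result = {}

    def worker():
        try:
            os.listdir(path)           # reading the directory triggers the mount
            result["status"] = "mounted"
        except FileNotFoundError:      # ENOENT: automount declined the mount
            result["status"] = "enoent"
        except OSError as exc:         # any other failure is reported as-is
            result["status"] = "error:%s" % exc.errno

    # A daemon thread is used so that a worker stuck in the kernel
    # cannot keep the test program from exiting.
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(limit)
    if t.is_alive():
        return "hung"                  # no answer within the limit
    return result["status"]
```

A worker blocked on a dead NFS server cannot be cancelled from Python, so a 
"hung" result here only means the caller stopped waiting, not that the 
underlying request was abandoned.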

There was one error reported.  While the test program was running, someone 
powered off a workstation whose filesystem I had mounted.  The resulting NFS 
timeout(s) caused the program to think the test thread was hung, so it 
tried to produce a backtrace, but there was a bug and the trace was spoiled 
(you've seen these spoiled traces before in files I've sent in).  I 
improved the trace procedure and attempted to restart.  I did a forced 
umount by "kill -USR1 $PID", but automount logged this to syslog:

Jun 21 15:58:47 bustamove automount[2880]: master.c:957: assertion failed: ap->state == ST_READY

And it didn't unmount anything.  So I rebooted and started the test on a 
clean machine.  

There is a pattern of failure that may not be automount's fault.  On 
almost exactly 0.1% of the attempted mounts, the readdir eventually fails 
with ENOENT.  The test program leaves these filesystems alone for 1800 
secs, then tries again to mount and test them, which invariably succeeds. I 
don't see any pattern to the type of the machine: workstation, server, 
compute node, heavily loaded, totally idle, etc.  But if multiple 
filesystems from one machine (submount) are unmounted and remounted at the 
"same" time (0.2 secs apart), if any one fails, there is a tendency for 
several others to also fail.
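
The "leave it alone for 1800 secs, then try again" handling described above 
might look something like the sketch below.  The class name RetryQueue, its 
methods, and the injectable clock are assumptions for illustration, not the 
test program's real structure; only the 1800 sec delay is taken from the 
text.

```python
import time

RETRY_DELAY_SECS = 1800  # from the text: failed filesystems rest this long


class RetryQueue:
    """Track mounts that failed with ENOENT and when they may be retried."""

    def __init__(self, delay=RETRY_DELAY_SECS, clock=time.monotonic):
        self.delay = delay
        self.clock = clock       # injectable so the logic can be tested
        self.not_before = {}     # path -> earliest time a retry is allowed

    def record_failure(self, path):
        """Note that `path` failed now; it becomes due after `delay` secs."""
        self.not_before[path] = self.clock() + self.delay

    def due(self):
        """Return (and forget) the paths whose quiet period has elapsed."""
        now = self.clock()
        ready = sorted(p for p, t in self.not_before.items() if t <= now)
        for p in ready:
            del self.not_before[p]
        return ready
```

Recording the failure times per path also makes it easy to see afterwards 
whether failures from one machine cluster together in time.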

So I think we're closing in on the problem.

James F. Carter          Voice 310 825 2897    FAX 310 206 6673
UCLA-Mathnet;  6115 MSA; 405 Hilgard Ave.; Los Angeles, CA, USA 90095-1555
Email: [EMAIL PROTECTED]  http://www.math.ucla.edu/~jimc (q.v. for PGP key)

_______________________________________________
autofs mailing list
[email protected]
http://linux.kernel.org/mailman/listinfo/autofs
