Eygene Ryabinkin wrote:
Craig, good day.
Mon, Jun 23, 2008 at 01:30:52PM +0100, Craig Macdonald wrote:
I have experienced these pauses before.
The 15-minute ones, where Maui blocked on read()?
Yes, absolutely. See
http://www.clusterresources.com/pipermail/torquedev/2007-February/000495.html
IIRC, Maui says it's doing a non-blocking read, but that's not the case in
pbs_disconnect.
This was resolved by using nscd on the master node.
In my case I clearly see from the strace of pbs_server that it just
receives many descriptors that have something to read from via the
select() call. But it then fails to contact two cluster nodes,
each with a 5-second timeout, and Maui times out 1 second before
its request gets handled. So my problem seems to be unrelated
to the NSCD (and LDAP; I assume you mean that you use LDAP
authentication and NSS). I had very bad luck with NSCD and LDAP
in the past (with RHEL 3.x), so I am not very eager
to test it again: back then nscd would get stuck at some point
during operation, leaving the nodes almost completely unresponsive to
external logins.
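
The timing described above can be sketched as simple arithmetic. This is only an illustration, not code from pbs_server or Maui: the function names are hypothetical, and the 9-second Maui timeout is an assumption inferred from "two nodes, 5 seconds each" and "1 second before". It shows why a client request queued behind serially handled, blocking node contacts is abandoned before the server reaches it.

```python
def time_until_handled(blocking_timeouts):
    """Seconds a queued request waits while earlier contacts block one after another."""
    return sum(blocking_timeouts)


def client_gives_up(client_timeout, blocking_timeouts):
    """True if the client times out before the server gets to its request."""
    return client_timeout < time_until_handled(blocking_timeouts)


# Two unreachable nodes, each with a 5-second timeout (from the strace description).
node_timeouts = [5, 5]

# Assumption: Maui's timeout is 9 seconds, i.e. 1 second short of the 10-second wait.
maui_timeout = 9

print("server busy for", time_until_handled(node_timeouts), "seconds")
print("Maui gives up first:", client_gives_up(maui_timeout, node_timeouts))
```

Under these assumptions, either a longer client timeout or shorter (or parallel) node contacts would let the request through.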
We use NIS for authentication. I didn't manage to strace pbs_server; I
just presumed pbs_server was doing some lookup. I could easily trigger
it by submitting about 50 jobs at once.
Maybe my case is not related to yours. Will you be able to test
the patches?
I'm sorry, I'm unable to test such a patch, as I don't have root access
on our cluster machines.
C
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers