Hello Eygene,

I have experienced these pauses before. Running nscd on the master node resolved them. However, a workaround in the code is probably desirable.

C

Eygene Ryabinkin wrote:
Good day.

Sorry for cross-posting, but this concerns both Torque and Maui.
At least, I was not able to find a workaround that touches
only one product.

I noticed that Maui on our production system would freeze for
15 minutes at a time.  During this time no requests were processed,
and strace showed that Maui was blocked inside the read() system call.

Investigation showed that the Torque server does not respond to
Maui within the 9-second interval, so Maui tries to close the
connection via pbs_disconnect().  But the latter posts another
request (Disconnect), and Torque reads these two requests in one
read() call: they are effectively coalesced.  Maui times out
because Torque is busy processing other requests (it times out
twice while connecting to the worker nodes, which is enough to
exceed the 9-second timeout).  So Maui's first request is not
lost: Torque processes it, but only after Maui's call to
pbs_disconnect(), and the Disconnect request is effectively
lost.
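
The coalescing described above can be reproduced in miniature.  What
follows is only a sketch under stated assumptions: it uses a local
AF_UNIX socketpair() instead of the TCP connection between Maui and
Torque, and the request strings and the helper name
coalesced_read_demo() are made up for illustration.  It shows that
two separate write() calls on a stream socket can be delivered by a
single read() on the other end, with no boundary between them.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, not part of Torque or Maui.
 * Sends two separate "requests" over a stream socket, then performs
 * ONE read() on the other end.  Returns the number of bytes that the
 * single read() delivered, or -1 on error. */
ssize_t coalesced_read_demo(char *buf, size_t buflen)
{
    int sv[2];
    const char *first = "StatusRequest;";  /* stands in for Maui's pending request */
    const char *second = "Disconnect;";    /* stands in for the Disconnect request */
    ssize_t n;

    assert(buf != NULL && buflen > 1);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    /* Two distinct writes: the sender thinks of them as two messages. */
    if (write(sv[0], first, strlen(first)) < 0 ||
        write(sv[0], second, strlen(second)) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* One read: a stream socket keeps no message boundaries, so both
     * pending writes can come back together. */
    n = read(sv[1], buf, buflen - 1);
    if (n >= 0)
        buf[n] = '\0';

    close(sv[0]);
    close(sv[1]);
    return n;
}
```

A receiver that expects exactly one request per read() will treat
the second request as part of the first or ignore it, which matches
the lost Disconnect described above.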

But pbs_disconnect() tries to read all outstanding data from the
Torque server, and this leads to the blocking read(): once all
outstanding data from Torque has been read, the final read() should
return end-of-file, but it will not do so until Torque's side of
the channel is closed.  And this happens only after the 15-minute
timeout: remember that the Disconnect request is lost.
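
The difference between draining to end-of-file and simply dropping
the connection can be sketched as follows.  This is not the code from
the patches: drain_until_eof(), abort_connection_sketch() and
drain_demo() are hypothetical names, a socketpair() stands in for the
Torque connection, and poll() with a short timeout is used only so
that the demonstration reports "would still be blocked" instead of
actually hanging.

```c
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Drain the socket until end-of-file, giving up after timeout_ms of
 * silence.  Returns 1 if EOF was seen, 0 if the wait timed out (a
 * plain blocking read() would still be stuck), -1 on error. */
int drain_until_eof(int fd, int timeout_ms)
{
    char buf[256];
    struct pollfd pfd;

    assert(timeout_ms >= 0);
    pfd.fd = fd;
    pfd.events = POLLIN;

    for (;;) {
        int rc;
        ssize_t n;

        pfd.revents = 0;
        rc = poll(&pfd, 1, timeout_ms);
        if (rc < 0)
            return -1;
        if (rc == 0)
            return 0;           /* peer is silent but has not closed */
        n = read(fd, buf, sizeof buf);
        if (n < 0)
            return -1;
        if (n == 0)
            return 1;           /* peer closed its side: end-of-file */
    }
}

/* Hypothetical stand-in for the idea of dropping a connection:
 * close our side without sending anything and without draining. */
int abort_connection_sketch(int fd)
{
    return close(fd);
}

/* The "server" end sends a late reply.  If peer_closes is nonzero it
 * then closes its side; otherwise it keeps the channel open, as a
 * server that never saw the Disconnect request would.
 * Returns what drain_until_eof() reported. */
int drain_demo(int peer_closes)
{
    int sv[2];
    int seen_eof;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (write(sv[1], "late reply", 10) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (peer_closes)
        close(sv[1]);

    seen_eof = drain_until_eof(sv[0], 100);

    abort_connection_sketch(sv[0]);   /* returns at once in both cases */
    if (!peer_closes)
        close(sv[1]);
    return seen_eof;
}
```

With the peer's side left open, the drain never reaches end-of-file,
which corresponds to the blocked read() seen in strace; closing the
local descriptor does not depend on the peer at all.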

The two attached patches cure the problem: Maui drops the connection
with a new function, pbs_abort_connection().  I am also attaching
my internal notes about this problem: they contain strace output
and my thoughts about the problem; they may be of some help to
developers.

Since this new function changes the Torque API, I have added a
configure check for it, and the use of the new function in Maui
is conditional at compile time.

I have been evaluating this patch for two days: it has shown no
problems so far.  Moreover, it cures the original freeze ;))
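
A compile-time guard of this kind might look like the sketch below.
Everything in it is an assumption rather than the contents of the
patches: the macro name HAVE_PBS_ABORT_CONNECTION follows the usual
autoconf convention for function checks, the argument list of
pbs_abort_connection() is guessed to mirror pbs_disconnect(int), the
two fake_* functions are stubs so that the example compiles without
the Torque library, and drop_connection() is a hypothetical wrapper.

```c
#include <assert.h>

/* Assumption: configure would normally write this into config.h when
 * the function is found; it is defined by hand here for the sketch. */
#define HAVE_PBS_ABORT_CONNECTION 1

/* Stub standing in for pbs_disconnect(); returns 1 to mark the path. */
static int fake_pbs_disconnect(int connect)
{
    (void)connect;
    return 1;
}

#ifdef HAVE_PBS_ABORT_CONNECTION
/* Stub standing in for pbs_abort_connection(); the signature is an
 * assumption.  Returns 2 to mark the path. */
static int fake_pbs_abort_connection(int connect)
{
    (void)connect;
    return 2;
}
#endif

/* Hypothetical wrapper: after a timeout, drop the connection without
 * talking to the server if the library offers that; otherwise fall
 * back to the ordinary disconnect. */
int drop_connection(int connect, int timed_out)
{
    assert(connect >= 0);
#ifdef HAVE_PBS_ABORT_CONNECTION
    if (timed_out)
        return fake_pbs_abort_connection(connect);
#else
    (void)timed_out;
#endif
    return fake_pbs_disconnect(connect);
}
```

Built against an older library without the new function, the same
source would still compile and simply keep the old behaviour.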


A side note: I have also changed configure.ac at the line where the
Makefiles for the various batch systems are included into the main
file.  It is not related to the current problem, but autoconf 2.61
fails to properly substitute multiple variables on one line if these
variables are substituted with the contents of a file
(AC_SUBST_FILE).  So, whether or not these patches are accepted,
it would be good to take a look at line 21 of Makefile.in:
-----
@ll_definitions@@sdr_definitions@@pbs_definitions@@sge_definitions@@lsf_definitions@@mx_definitions@@pcre_definitions@
-----
These variables would be better placed on separate lines.


And the diff chunks for configure (the longest part of the Maui
patch) can be dropped, since you will likely generate your own
configure if the patch is accepted.

Sorry for such a long letter ;))  And thanks for your patience!
------------------------------------------------------------------------

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
