(Apologies for the cross-post, this is a cross-discipline problem)
I've managed to trace this problem a bit further:

Essentially, Maui sets my PBS API timeout to 9 seconds, which the pbs_statnode() call honours.
However, once the timeout has been detected (around line 1270 of MPBSI.c),
Maui tries to disconnect from the pbs_server using pbs_disconnect().
pbs_disconnect() sets an alarm for 9 seconds, then tries to read the socket. read() is defined as read_nonblocking_socket() in nonblock.c, and this is the call that blocks.
The gdb trace is below.


So there are two or three issues here:
1. pbs_disconnect() shouldn't block for 15 minutes, as it has an alarm() round it? Maui sets the alarm() value to 9 seconds via the PBSAPITIMEOUT env var, yet the actual timeout in a recent case was 916 seconds (about 15 mins). NB: I haven't recompiled torque to see what value of PBSAPITIMEOUT it sees, but I have checked that Maui sets PBSAPITIMEOUT correctly.

2. Why isn't read_nonblocking_socket() doing what it says on the tin?

3. What is MUThread() for in Maui, if it doesn't provide timeouts? I know from extra debug statements that it is the pbs_disconnect() call that times out after 15 mins.
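
For comparison, a read can be bounded without relying on signals by waiting on the descriptor with poll() first. The sketch below is illustrative only: timeout_from_env() and read_with_poll() are hypothetical names, the 3600-second upper bound is arbitrary, and how Torque itself parses PBSAPITIMEOUT has not been verified.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: parse a timeout in seconds from an environment
 * variable. Falls back to dflt when the variable is unset, empty,
 * non-numeric, or outside 1..3600 (an arbitrary sanity bound).
 */
int timeout_from_env(const char *name, int dflt)
{
    const char *s = getenv(name);
    char *end;
    long v;

    if (s == NULL || *s == '\0')
        return dflt;
    errno = 0;
    v = strtol(s, &end, 10);
    if (errno != 0 || *end != '\0' || v <= 0 || v > 3600)
        return dflt;
    return (int)v;
}

/*
 * Hypothetical helper: wait up to secs seconds for fd to become
 * readable, then read once. Returns bytes read, 0 on EOF, or -1 with
 * errno == ETIMEDOUT if nothing arrived in time.
 */
ssize_t read_with_poll(int fd, void *buf, size_t count, int secs)
{
    struct pollfd pfd;
    int rc;

    assert(buf != NULL);

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    rc = poll(&pfd, 1, secs * 1000);   /* poll() takes milliseconds */
    if (rc == 0) {
        errno = ETIMEDOUT;             /* nothing readable before the deadline */
        return -1;
    }
    if (rc < 0)
        return -1;                     /* poll() failed; errno is already set */
    return read(fd, buf, count);
}
```

A call such as read_with_poll(fd, buf, sizeof buf, timeout_from_env("PBSAPITIMEOUT", 9)) would give up after the configured number of seconds, or after 9 seconds if the variable is unset or invalid.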

Many thanks for any pearls of wisdom anyone may have

Craig


Program received signal SIGINT, Interrupt.
0x0029c7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) bt
#0  0x0029c7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00aaead3 in __read_nocancel () from /lib/tls/libc.so.6
#2  0x0011e5b3 in read_nonblocking_socket (fd=10, buf=0x131720, count=16384) at ../Libifl/nonblock.c:116
#3  0x0011f391 in pbs_disconnect (connect=1) at ../Libifl/pbsD_connect.c:597
#4  0x080d7b6a in MPBSClusterQuery (R=0x8d91c40, RCount=0xfef923a8, EMsg=0x0, SC=0x0) at MPBSI.c:1288
#5  0x080a015a in __MUTFunc (V=0xfef92320) at MUtil.c:4717
#6  0x080a01d8 in MUThread (F=0x80d7a84 <MPBSClusterQuery>, TimeOut=9, RC=0xfef923a4, ACount=4, Lock=0x0) at MUtil.c:4690
#7  0x080d0c51 in MRMClusterQuery (RCount=0xfef923dc, SC=0x0) at MRM.c:493
#8  0x080d0ddf in MRMGetInfo () at MRM.c:352
#9  0x0807341d in MSchedProcessJobs (OldDay=0xfefa3850 "Tue", GlobalSQ=0xfef9f850, GlobalHQ=0xfef9b850) at MSched.c:6870
#10 0x0804caff in main (ArgC=2, ArgV=0xfefa3934) at Server.c:192

--- Begin Message ---
Hi all,

Didn't get any replies last time - does anyone know why Maui pauses for 15 mins 
after:
ERROR:    cannot get node info: Premature end of message
or just as good, does anyone know where this timeout can be configured?

Craig

-----Original Message-----
From: Craig Macdonald
Sent: Fri 1/26/2007 6:36 PM
To: [email protected]; [EMAIL PROTECTED]
Subject: maui pausing on Torque multiple qsubs
 
Hi there,

I have a problem extremely similar to
http://www.clusterresources.com/pipermail/torqueusers/2006-April/003576.html

When doing multiple qsubs, Maui starts scheduling, then times out 
getting node info.
However, it doesn't start scheduling again for a significant amount of 
time (15+ mins).
The most recent time, this happened to me while qdel-ing 3 jobs:

[EMAIL PROTECTED] terrier]$qdel 545 546 547
No Permission.
qdel: cannot connect to server trmaster (errno=15007)
[EMAIL PROTECTED] terrier]$qstat
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
547.trmaster        index_wt2g_2     craigm                 0 H verylong
[EMAIL PROTECTED] terrier]$qdel 547


At this point, Maui will do its pause.

This happens with torque-2.1.6 and maui 3.2.6p17

My issue is that Maui seems to receive a timeout from libpbs, but 
doesn't seem to know what to do with it
for a significant amount of time (till something else times out?). Are 
there any timeouts we can configure in Maui to reduce this?

The alternative route is to trace why the timeout occurs in pbs_server.

Configurations and logs below. Logs are from a previous occurrence of 
this event.

Many thanks

Craig

My Torque config is:
set server node_check_rate = 150
set server tcp_timeout = 6
set server poll_jobs = True
set server scheduler_iteration = 600

and the relevant bit for Maui:
RMPOLLINTERVAL        00:00:30

Maui Log
=======
01/23 22:58:05 MResUpdateStats()
01/23 22:58:05 INFO:     current util[2097]:  7/8 (87.50%)  PH: 28.43%  active jobs: 2 of 2 (completed: 413)
01/23 22:58:05 MQueueCheckStatus()
01/23 22:58:05 MNodeCheckStatus()
01/23 22:58:05 ALERT:    node 'trnode03' sync from expected state 'Idle' to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 ALERT:    node 'trnode04' sync from expected state 'Idle' to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 ALERT:    node 'trnode05' sync from expected state 'Idle' to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 ALERT:    node 'trnode06' sync from expected state 'Idle' to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 ALERT:    node 'trnode08' sync from expected state 'Idle' to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 MUClearChild(PID)
01/23 22:58:05 INFO:     scheduling complete.  sleeping 30 seconds
01/23 22:58:14 INFO:     connect request from 130.209.249.20
01/23 22:58:14 INFO:     received service request from host 'trmaster'
01/23 22:58:14 MSURecvPacket(9,BufP,4,NULL,100000,SC)
01/23 22:58:14 INFO:     connect request from 130.209.249.20
01/23 22:58:14 INFO:     received service request from host 'trmaster'
01/23 22:58:14 MSURecvPacket(9,BufP,4,NULL,100000,SC)
01/23 22:58:14 ServerProcessRequests()
01/23 22:58:14 INFO:     not rolling logs (8941183 < 10000000)
01/23 22:58:14 MResAdjust(NULL,0,0)
01/23 22:58:14 MStatInitializeActiveSysUsage()
01/23 22:58:14 MStatClearUsage([NONE],Active)
01/23 22:58:14 ServerUpdate()
01/23 22:58:14 MSysUpdateTime()
01/23 22:58:14 INFO:     starting iteration 2098
01/23 22:58:14 MRMGetInfo()
01/23 22:58:14 MClusterClearUsage()
01/23 22:58:14 MRMClusterQuery()
01/23 22:58:14 MPBSClusterQuery(base,RCount,SC)
01/23 22:58:23 ERROR:    cannot get node info: Premature end of message
<PAUSE HERE>
01/23 23:13:44 ALERT:    cannot load cluster resources on RM (RM 'base' failed in function 'clusterquery')
01/23 23:13:44 WARNING:  no resources detected
01/23 23:13:44 MRMWorkloadQuery()
01/23 23:13:44 MPBSWorkloadQuery(base,JCount,SC)
01/23 23:13:44 MPBSInitialize(base,SC)
01/23 23:13:45 MSUListen(S)
01/23 23:13:45 INFO:     opened service socket on port 15004
01/23 23:13:45 __MPBSSystemQuery(base,RCount,SC)
01/23 23:13:45 INFO:     connected to PBS server :0 on sd 1
01/23 23:13:45 MPBSJobUpdate(422,422.trmaster,TaskList,0)
01/23 23:13:45 MStatUpdateActiveJobUsage(422)

Torque pbs_server log
================
01/23/2007 22:58:14;0040;PBS_Server;Svr;trmaster;Scheduler sent command new
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type AuthenticateUser request received from [EMAIL PROTECTED], sock=13
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type QueueJob request received from [EMAIL PROTECTED], sock=11
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type JobScript request received from [EMAIL PROTECTED], sock=11
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type ReadyToCommit request received from [EMAIL PROTECTED], sock=11
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type Commit request received from [EMAIL PROTECTED], sock=11
01/23/2007 22:58:14;0100;PBS_Server;Job;490.trmaster;enqueuing into feed, state 1 hop 1
01/23/2007 22:58:14;0100;PBS_Server;Job;490.trmaster;dequeuing from feed, state QUEUED
01/23/2007 22:58:14;0100;PBS_Server;Job;490.trmaster;enqueuing into verylong, state 1 hop 1
01/23/2007 22:58:14;0008;PBS_Server;Job;490.trmaster;Job Queued at request of [EMAIL PROTECTED], owner = [EMAIL PROTECTED], job name = tagDisk454, queue = verylong
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type AuthenticateUser request received from [EMAIL PROTECTED], sock=12
01/23/2007 22:58:39;0100;PBS_Server;Req;;Type QueueJob request received from [EMAIL PROTECTED], sock=11
01/23/2007 22:58:39;0040;PBS_Server;Svr;trmaster;Scheduler sent command time
01/23/2007 22:58:39;0100;PBS_Server;Req;;Type StatusNode request received from [EMAIL PROTECTED], sock=9



_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers

--- End Message ---