Hi there,
I have a problem extremely similar to
http://www.clusterresources.com/pipermail/torqueusers/2006-April/003576.html
When I run several qsubs in quick succession, Maui starts a scheduling
iteration, then times out getting node info.
After that, it does not start scheduling again for a significant amount
of time (15+ minutes).
The most recent time, this happened to me while qdel-ing 3 jobs:
[EMAIL PROTECTED] terrier]$qdel 545 546 547
No Permission.
qdel: cannot connect to server trmaster (errno=15007)
[EMAIL PROTECTED] terrier]$qstat
Job id Name User Time Use S Queue
------------------- ---------------- --------------- -------- - -----
547.trmaster index_wt2g_2 craigm 0 H verylong
[EMAIL PROTECTED] terrier]$qdel 547
At this point, Maui pauses.
This happens with torque-2.1.6 and maui 3.2.6p17.
My issue is that Maui seems to receive a timeout from libpbs, but
doesn't seem to know what to do with it
for a significant amount of time (until something else times out?). Are
there any timeouts we can configure in Maui to reduce this?
The alternative route is to trace why the timeout occurs in pbs_server.
Configurations and logs below. Logs are from a previous occurrence of
this event.
Many thanks
Craig
My Torque server config is:
set server node_check_rate = 150
set server tcp_timeout = 6
set server poll_jobs = True
set server scheduler_iteration = 600
and the relevant bit for Maui:
RMPOLLINTERVAL 00:00:30
Maui Log
=======
01/23 22:58:05 MResUpdateStats()
01/23 22:58:05 INFO: current util[2097]: 7/8 (87.50%) PH: 28.43%
active jobs: 2 of 2 (completed: 413)
01/23 22:58:05 MQueueCheckStatus()
01/23 22:58:05 MNodeCheckStatus()
01/23 22:58:05 ALERT: node 'trnode03' sync from expected state 'Idle'
to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 ALERT: node 'trnode04' sync from expected state 'Idle'
to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 ALERT: node 'trnode05' sync from expected state 'Idle'
to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 ALERT: node 'trnode06' sync from expected state 'Idle'
to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 ALERT: node 'trnode08' sync from expected state 'Idle'
to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 MUClearChild(PID)
01/23 22:58:05 INFO: scheduling complete. sleeping 30 seconds
01/23 22:58:14 INFO: connect request from 130.209.249.20
01/23 22:58:14 INFO: received service request from host 'trmaster'
01/23 22:58:14 MSURecvPacket(9,BufP,4,NULL,100000,SC)
01/23 22:58:14 INFO: connect request from 130.209.249.20
01/23 22:58:14 INFO: received service request from host 'trmaster'
01/23 22:58:14 MSURecvPacket(9,BufP,4,NULL,100000,SC)
01/23 22:58:14 ServerProcessRequests()
01/23 22:58:14 INFO: not rolling logs (8941183 < 10000000)
01/23 22:58:14 MResAdjust(NULL,0,0)
01/23 22:58:14 MStatInitializeActiveSysUsage()
01/23 22:58:14 MStatClearUsage([NONE],Active)
01/23 22:58:14 ServerUpdate()
01/23 22:58:14 MSysUpdateTime()
01/23 22:58:14 INFO: starting iteration 2098
01/23 22:58:14 MRMGetInfo()
01/23 22:58:14 MClusterClearUsage()
01/23 22:58:14 MRMClusterQuery()
01/23 22:58:14 MPBSClusterQuery(base,RCount,SC)
01/23 22:58:23 ERROR: cannot get node info: Premature end of message
<PAUSE HERE>
01/23 23:13:44 ALERT: cannot load cluster resources on RM (RM 'base'
failed in function 'clusterquery')
01/23 23:13:44 WARNING: no resources detected
01/23 23:13:44 MRMWorkloadQuery()
01/23 23:13:44 MPBSWorkloadQuery(base,JCount,SC)
01/23 23:13:44 MPBSInitialize(base,SC)
01/23 23:13:45 MSUListen(S)
01/23 23:13:45 INFO: opened service socket on port 15004
01/23 23:13:45 __MPBSSystemQuery(base,RCount,SC)
01/23 23:13:45 INFO: connected to PBS server :0 on sd 1
01/23 23:13:45 MPBSJobUpdate(422,422.trmaster,TaskList,0)
01/23 23:13:45 MStatUpdateActiveJobUsage(422)
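To locate pauses like the one marked above without reading the whole
log by eye, something along these lines could work. It is only a rough
Python sketch: it assumes every Maui log record begins with an
"MM/DD HH:MM:SS" timestamp as in the excerpt, and the function name
find_pauses, the 5-minute threshold, and the year argument are my own
choices (the Maui log does not record the year, so one has to be
supplied, and a log that spans New Year would need extra handling).

```python
import re
from datetime import datetime, timedelta

# Maui log records start with "MM/DD HH:MM:SS "; the year is not logged.
STAMP = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")


def find_pauses(lines, threshold=timedelta(minutes=5), year=2007):
    """Yield (gap, line_before, line_after) for gaps longer than threshold.

    'year' is an assumption supplied by the caller, since the log omits it.
    """
    prev_time = prev_line = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # wrapped continuation lines carry no timestamp
        now = datetime.strptime(f"{year}/{match.group(1)}", "%Y/%m/%d %H:%M:%S")
        if prev_time is not None and now - prev_time > threshold:
            yield now - prev_time, prev_line.rstrip(), line.rstrip()
        prev_time, prev_line = now, line


# A few lines copied from the Maui log excerpt above.
sample = [
    "01/23 22:58:14 MPBSClusterQuery(base,RCount,SC)",
    "01/23 22:58:23 ERROR: cannot get node info: Premature end of message",
    "01/23 23:13:44 ALERT: cannot load cluster resources on RM (RM 'base'",
    "failed in function 'clusterquery')",
]

pauses = list(find_pauses(sample))
for gap, before, after in pauses:
    print(gap, "|", before, "->", after)
```

On the sample lines this reports a single gap of 15 minutes 21 seconds,
between the "cannot get node info" error and the "cannot load cluster
resources" alert. For a real log, pass an open file object instead of
the sample list.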
Torque pbs_server log
================
01/23/2007 22:58:14;0040;PBS_Server;Svr;trmaster;Scheduler sent command new
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type AuthenticateUser request
received from [EMAIL PROTECTED], sock=13
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type QueueJob request received
from [EMAIL PROTECTED], sock=11
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type JobScript request received
from [EMAIL PROTECTED], sock=11
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type ReadyToCommit request
received from [EMAIL PROTECTED], sock=11
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type Commit request received
from [EMAIL PROTECTED], sock=11
01/23/2007 22:58:14;0100;PBS_Server;Job;490.trmaster;enqueuing into
feed, state 1 hop 1
01/23/2007 22:58:14;0100;PBS_Server;Job;490.trmaster;dequeuing from
feed, state QUEUED
01/23/2007 22:58:14;0100;PBS_Server;Job;490.trmaster;enqueuing into
verylong, state 1 hop 1
01/23/2007 22:58:14;0008;PBS_Server;Job;490.trmaster;Job Queued at
request of [EMAIL PROTECTED], owner = [EMAIL PROTECTED], job name =
tagDisk454, queue = verylong
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type AuthenticateUser request
received from [EMAIL PROTECTED], sock=12
01/23/2007 22:58:39;0100;PBS_Server;Req;;Type QueueJob request received
from [EMAIL PROTECTED], sock=11
01/23/2007 22:58:39;0040;PBS_Server;Svr;trmaster;Scheduler sent command time
01/23/2007 22:58:39;0100;PBS_Server;Req;;Type StatusNode request
received from [EMAIL PROTECTED], sock=9
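To line the pbs_server records up against the Maui timestamps, the
server log can be split into fields. The following is a hedged Python
sketch, assuming each record has the semicolon-separated layout seen in
the excerpt (timestamp, code, daemon, class, object, message) and that
lines without a leading timestamp are wrapped continuations of the
previous record. The function name parse_server_log and the dictionary
keys are my own labels, not official Torque field names.

```python
from datetime import datetime


def parse_server_log(lines):
    """Return one dict per record, joining wrapped continuation lines."""
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        parts = line.split(";", 5)
        try:
            when = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
        except ValueError:
            # No leading timestamp: treat as the tail of the previous record.
            if records:
                records[-1]["message"] += " " + line.strip()
            continue
        if len(parts) < 6:
            continue  # timestamped but malformed; skip it
        records.append({
            "time": when,
            "code": parts[1],
            "daemon": parts[2],
            "class": parts[3],
            "object": parts[4],
            "message": parts[5],
        })
    return records


# The last three lines of the pbs_server excerpt above.
sample = [
    "01/23/2007 22:58:39;0040;PBS_Server;Svr;trmaster;Scheduler sent command time",
    "01/23/2007 22:58:39;0100;PBS_Server;Req;;Type StatusNode request",
    "received from [EMAIL PROTECTED], sock=9",
]

records = parse_server_log(sample)
status_requests = [
    r for r in records
    if r["class"] == "Req" and "StatusNode" in r["message"]
]
for r in status_requests:
    print(r["time"], r["message"])
```

In the excerpt, the only StatusNode request is logged at 22:58:39,
whereas Maui logs MPBSClusterQuery at 22:58:14 and the "cannot get node
info" error at 22:58:23.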