Hi all,

I didn't get any replies last time. Does anyone know why Maui pauses for 15 minutes after:

ERROR: cannot get node info: Premature end of message

Or, just as good, does anyone know where this timeout can be configured?

Craig

-----Original Message-----
From: Craig Macdonald
Sent: Fri 1/26/2007 6:36 PM
To: [email protected]; [EMAIL PROTECTED]
Subject: maui pausing on Torque multiple qsubs

Hi there,

I have a problem extremely similar to http://www.clusterresources.com/pipermail/torqueusers/2006-April/003576.html

When doing multiple qsubs, Maui starts scheduling, then times out getting node info. However, it doesn't start scheduling again for a significant amount of time (15+ mins).

The most recent time, this happened to me while qdel-ing 3 jobs:

[EMAIL PROTECTED] terrier]$qdel 545 546 547
No Permission.
qdel: cannot connect to server trmaster (errno=15007)
[EMAIL PROTECTED] terrier]$qstat
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
547.trmaster        index_wt2g_2     craigm                 0 H verylong
[EMAIL PROTECTED] terrier]$qdel 547

At this point, Maui will do its pause. This happens with torque-2.1.6 and maui 3.2.6p17.

My issue is that Maui seems to receive a timeout from libpbs, but doesn't seem to know what to do with it for a significant amount of time (until something else times out?). Are there any timeouts we can configure in Maui to reduce this? The alternative route is to trace why the timeout occurs in pbs_server.

Configurations and logs below. Logs are from a previous occurrence of this event.

Many thanks
Craig

I have Torque config

set server node_check_rate = 150
set server tcp_timeout = 6
set server poll_jobs = True
set server scheduler_iteration = 600

and relevant bit for Maui:

RMPOLLINTERVAL 00:00:30

Maui Log
=======
01/23 22:58:05 MResUpdateStats()
01/23 22:58:05 INFO: current util[2097]: 7/8 (87.50%) PH: 28.43% active jobs: 2 of 2 (completed: 413)
01/23 22:58:05 MQueueCheckStatus()
01/23 22:58:05 MNodeCheckStatus()
01/23 22:58:05 ALERT: node 'trnode03' sync from expected state 'Idle' to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 ALERT: node 'trnode04' sync from expected state 'Idle' to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 ALERT: node 'trnode05' sync from expected state 'Idle' to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 ALERT: node 'trnode06' sync from expected state 'Idle' to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 ALERT: node 'trnode08' sync from expected state 'Idle' to state 'Running' at Tue Jan 23 22:58:05
01/23 22:58:05 MUClearChild(PID)
01/23 22:58:05 INFO: scheduling complete. sleeping 30 seconds
01/23 22:58:14 INFO: connect request from 130.209.249.20
01/23 22:58:14 INFO: received service request from host 'trmaster'
01/23 22:58:14 MSURecvPacket(9,BufP,4,NULL,100000,SC)
01/23 22:58:14 INFO: connect request from 130.209.249.20
01/23 22:58:14 INFO: received service request from host 'trmaster'
01/23 22:58:14 MSURecvPacket(9,BufP,4,NULL,100000,SC)
01/23 22:58:14 ServerProcessRequests()
01/23 22:58:14 INFO: not rolling logs (8941183 < 10000000)
01/23 22:58:14 MResAdjust(NULL,0,0)
01/23 22:58:14 MStatInitializeActiveSysUsage()
01/23 22:58:14 MStatClearUsage([NONE],Active)
01/23 22:58:14 ServerUpdate()
01/23 22:58:14 MSysUpdateTime()
01/23 22:58:14 INFO: starting iteration 2098
01/23 22:58:14 MRMGetInfo()
01/23 22:58:14 MClusterClearUsage()
01/23 22:58:14 MRMClusterQuery()
01/23 22:58:14 MPBSClusterQuery(base,RCount,SC)
01/23 22:58:23 ERROR: cannot get node info: Premature end of message
<PAUSE HERE>
01/23 23:13:44 ALERT: cannot load cluster resources on RM (RM 'base' failed in function 'clusterquery')
01/23 23:13:44 WARNING: no resources detected
01/23 23:13:44 MRMWorkloadQuery()
01/23 23:13:44 MPBSWorkloadQuery(base,JCount,SC)
01/23 23:13:44 MPBSInitialize(base,SC)
01/23 23:13:45 MSUListen(S)
01/23 23:13:45 INFO: opened service socket on port 15004
01/23 23:13:45 __MPBSSystemQuery(base,RCount,SC)
01/23 23:13:45 INFO: connected to PBS server :0 on sd 1
01/23 23:13:45 MPBSJobUpdate(422,422.trmaster,TaskList,0)
01/23 23:13:45 MStatUpdateActiveJobUsage(422)

Torque pbs_server log
================
01/23/2007 22:58:14;0040;PBS_Server;Svr;trmaster;Scheduler sent command new
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type AuthenticateUser request received from [EMAIL PROTECTED], sock=13
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type QueueJob request received from [EMAIL PROTECTED], sock=11
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type JobScript request received from [EMAIL PROTECTED], sock=11
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type ReadyToCommit request received from [EMAIL PROTECTED], sock=11
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type Commit request received from [EMAIL PROTECTED], sock=11
01/23/2007 22:58:14;0100;PBS_Server;Job;490.trmaster;enqueuing into feed, state 1 hop 1
01/23/2007 22:58:14;0100;PBS_Server;Job;490.trmaster;dequeuing from feed, state QUEUED
01/23/2007 22:58:14;0100;PBS_Server;Job;490.trmaster;enqueuing into verylong, state 1 hop 1
01/23/2007 22:58:14;0008;PBS_Server;Job;490.trmaster;Job Queued at request of [EMAIL PROTECTED], owner = [EMAIL PROTECTED], job name = tagDisk454, queue = verylong
01/23/2007 22:58:14;0100;PBS_Server;Req;;Type AuthenticateUser request received from [EMAIL PROTECTED], sock=12
01/23/2007 22:58:39;0100;PBS_Server;Req;;Type QueueJob request received from [EMAIL PROTECTED], sock=11
01/23/2007 22:58:39;0040;PBS_Server;Svr;trmaster;Scheduler sent command time
01/23/2007 22:58:39;0100;PBS_Server;Req;;Type StatusNode request received from [EMAIL PROTECTED], sock=9
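
For anyone trying to confirm the same stall in their own logs, here is a minimal Python sketch that scans a Maui log for long silences between consecutive timestamped lines. It is not part of Maui or Torque; the function name `find_pauses`, the 5-minute `threshold`, and the `year` argument are all hypothetical choices for illustration. It assumes every log entry starts with an "MM/DD HH:MM:SS" stamp, as in the excerpt above, and it does not handle a log that crosses a year boundary.

```python
from datetime import datetime, timedelta


def find_pauses(lines, threshold=timedelta(minutes=5), year=2007):
    """Return (gap, previous_line, next_line) for each gap longer than threshold.

    Assumes Maui-style lines beginning with "MM/DD HH:MM:SS". The log does not
    record the year, so the caller supplies one (hypothetical parameter).
    """
    pauses = []
    prev_time = prev_line = None
    for line in lines:
        line = line.strip()
        try:
            # The first 14 characters hold the "MM/DD HH:MM:SS" stamp.
            stamp = datetime.strptime(f"{year}/{line[:14]}", "%Y/%m/%d %H:%M:%S")
        except ValueError:
            continue  # blank lines, wrapped text, or markers such as <PAUSE HERE>
        if prev_time is not None and stamp - prev_time > threshold:
            pauses.append((stamp - prev_time, prev_line, line))
        prev_time, prev_line = stamp, line
    return pauses


# A few lines copied from the Maui log above.
sample = """\
01/23 22:58:14 MPBSClusterQuery(base,RCount,SC)
01/23 22:58:23 ERROR: cannot get node info: Premature end of message
<PAUSE HERE>
01/23 23:13:44 ALERT: cannot load cluster resources on RM (RM 'base' failed in function 'clusterquery')
01/23 23:13:44 WARNING: no resources detected
""".splitlines()

for gap, before, after in find_pauses(sample):
    print(f"{gap} pause after: {before}")
# prints: 0:15:21 pause after: 01/23 22:58:23 ERROR: cannot get node info: Premature end of message
```

On the excerpt above this reports a single gap of 15 minutes 21 seconds, starting at the "Premature end of message" error. Running it over a full log would show whether every stall follows that same error or whether other events also precede one.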
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
