Is it possible that torque or maui have gotten junked up by crashed MPI jobs or something of that sort?  I have had to clean up torque and maui by hand a couple of times after bad crashes while running MPI code caused one or the other to act strangely.

What distro are you running OSCAR on?  Have you patched the nodes (or the login node) recently?  I am trying to think of reasons, besides cosmic rays, that it would have worked yesterday but not today.

You might also want to try dropping this on the maui or torque lists.  They might have a better idea of where to start looking.
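
If you do post there, it may help to boil the logs down first.  Here is a rough Python sketch that pulls out the two things that stand out in the excerpts quoted below: addresses that pbs_server rejected as "not trusted", and jobs that maui started and then reported as removed from the queue.  The patterns are copied from the log lines in this thread only, so treat them as assumptions; other TORQUE or Maui versions may word these messages differently.  The function names are my own and are not part of either package.

```python
import re

# Pattern for the pbs_server "address not trusted" message (assumed from
# the single server log line quoted in this thread).
UNTRUSTED_RE = re.compile(
    r"bad attempt to connect from (\d+\.\d+\.\d+\.\d+):\d+ \(address not trusted\)"
)
# Patterns for the two maui messages that bracket a job's short life
# (also assumed from the quoted maui log).
STARTED_RE = re.compile(r"job '(\d+)' successfully started")
REMOVED_RE = re.compile(r"active PBS job (\d+) has been removed from the queue")


def untrusted_addresses(server_log_lines):
    """Return the set of IP addresses pbs_server refused as untrusted."""
    found = set()
    for line in server_log_lines:
        match = UNTRUSTED_RE.search(line)
        if match:
            found.add(match.group(1))
    return found


def started_then_removed(maui_log_lines):
    """Return job ids that maui started and later reported as removed."""
    started = set()
    vanished = []
    for line in maui_log_lines:
        match = STARTED_RE.search(line)
        if match:
            started.add(match.group(1))
            continue
        match = REMOVED_RE.search(line)
        if match and match.group(1) in started:
            vanished.append(match.group(1))
    return vanished


if __name__ == "__main__":
    # Sample lines copied from the logs quoted in this thread.
    server_sample = [
        "11/08/2006 17:12:21;0001;PBS_Server;Svr;PBS_Server;is_request, "
        "bad attempt to connect from 172.24.254.1:1023 (address not trusted)",
    ]
    maui_sample = [
        "11/08 17:14:17 INFO:     job '1107' successfully started",
        "11/08 17:14:28 INFO:     active PBS job 1107 has been removed "
        "from the queue.  assuming successful completion",
    ]
    print(sorted(untrusted_addresses(server_sample)))
    print(started_then_removed(maui_sample))
```

To run it against the real files, pass each function the lines of the corresponding log (for example `open(path).readlines()`).  This only summarizes what the logs already say; it does not tell you whether the rejected address belongs to the login node or why the job went away.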

Since we are planning on releasing 5.0 in a week or two, you could of course upgrade to that and see whether subsequent maui or torque patches (or a clean install) fix the problem, but that's a fairly invasive solution.

On 11/8/06, Andrew Preece <[EMAIL PROTECTED]> wrote:

Hello all!

 

I've got an OSCAR 4.2.1 cluster running on 4 quad opteron compute nodes.

 

The head node has the same hardware configuration. Interactive jobs were running fine until earlier today. There were no changes made to the configuration of TORQUE or MAUI.

 

Suddenly I'm getting the following from qsub when I try to submit an interactive job from a login node:

 

qsub: waiting for job 1106.brahe000.cluster to start

qsub: job 1106.brahe000.cluster apparently deleted

 

The login node *is* in the hosts.equiv file on the head node.

Interactive jobs do work from the head node itself.

 

The following is in the pbs server logs, but I've been seeing this all along:

 

11/08/2006 17:12:21;0001;PBS_Server;Svr;PBS_Server;is_request, bad attempt to connect from 172.24.254.1:1023 (address not trusted)

 

And from maui's logs:

 

11/08 17:14:17 MReqCreate(1107,SrcRQ,DstRQ,DoCreate)

11/08 17:14:17 INFO:     processing node request line '1:ppn=1'

11/08 17:14:17 INFO:     job '1107' loaded:   1  apreece    nweng  14400       Idle   0 1163031254   [NONE] [NONE] [NONE] >=      0 >=      0 [NONE] 1163031257

11/08 17:14:17 INFO:     21 PBS jobs detected on RM base

11/08 17:14:17 INFO:     jobs detected: 21

11/08 17:14:17 INFO:     total jobs selected (ALL): 6/21 [State: 15]

11/08 17:14:17 INFO:     total jobs selected (ALL): 1/21 [State: 15][Policy: 5]

11/08 17:14:17 INFO:     total jobs selected in partition ALL: 1/6 [Policy: 5]

11/08 17:14:17 MQueueScheduleRJobs(Q)

11/08 17:14:17 INFO:     total jobs selected in partition ALL: 1/1

11/08 17:14:17 INFO:     total jobs selected in partition ALL: 0/1 [PartitionAccess: 1]

11/08 17:14:17 INFO:     total jobs selected in partition opteron: 1/1

11/08 17:14:17 MQueueScheduleIJobs(Q,opteron)

11/08 17:14:17 INFO:     16 feasible tasks found for job 1107:0 in partition opteron (1 Needed)

11/08 17:14:17 INFO:     tasks located for job 1107:  1 of 1 required (1 feasible)

11/08 17:14:17 MJobStart(1107)

11/08 17:14:17 MRMJobStart(1107,Msg,SC)

11/08 17:14:17 MPBSJobStart(1107,base,Msg,SC)

11/08 17:14:17 MPBSJobModify(1107,Resource_List,Resource,brahe003.cluster)

11/08 17:14:17 MPBSJobModify(1107,Resource_List,Resource,1:ppn=1)

11/08 17:14:17 INFO:     job '1107' successfully started

11/08 17:14:17 INFO:     starting job '1107'

11/08 17:14:17 INFO:     1 jobs started on iteration 293

Active Jobs------

------------------

11/08 17:14:17 INFO:     resources available after scheduling: N: 0  P: 3

11/08 17:14:17 INFO:     total jobs selected in partition ALL: 0/6 [State: 1][Policy: 5]

11/08 17:14:17 INFO:     total jobs selected in partition ALL: 0/6 [State: 1][Policy: 5]

11/08 17:14:17 MSchedUpdateStats()

11/08 17:14:17 INFO:     iteration:  293   scheduling time:  0.117 seconds

11/08 17:14:17 INFO:     current util[293]:  4/4 (100.00%)  PH: 67.69%  active jobs: 16 of 22 (completed: 9)

11/08 17:14:17 INFO:     scheduling complete.  sleeping 10 seconds

11/08 17:14:17 INFO:     received service request from host 'brahe000.cluster'

11/08 17:14:17 MSURecvData(S,5000000,TRUE,SC,EMsg)

11/08 17:14:17 UIQueueShow(RBuffer,Buffer,1,root,BufSize)

11/08 17:14:17 UIQueueShowAllJobs(SBuffer,SBufSize,ALL)

11/08 17:14:17 INFO:     UIQueueShowAllJobs buffer size: 1686 bytes

11/08 17:14:17 MSUSendData(S,5000000,TRUE,TRUE)

11/08 17:14:17 INFO:     packet sent (1772 bytes of 1772)

11/08 17:14:17 MSUDisconnect(S)

11/08 17:14:22 INFO:     received service request from host 'brahe000.cluster'

11/08 17:14:22 MSURecvData(S,5000000,TRUE,SC,EMsg)

11/08 17:14:22 UIQueueShow(RBuffer,Buffer,1,root,BufSize)

11/08 17:14:22 UIQueueShowAllJobs(SBuffer,SBufSize,ALL)

11/08 17:14:22 INFO:     UIQueueShowAllJobs buffer size: 1686 bytes

11/08 17:14:22 MSUSendData(S,5000000,TRUE,TRUE)

11/08 17:14:22 INFO:     packet sent (1772 bytes of 1772)

11/08 17:14:22 MSUDisconnect(S)

11/08 17:14:28 ServerProcessRequests()

11/08 17:14:28 MResAdjust(NULL,0,0)

11/08 17:14:28 MStatInitializeActiveSysUsage()

11/08 17:14:28 INFO:     starting iteration 294

11/08 17:14:28 MRMGetInfo()

11/08 17:14:28 MRMClusterQuery()

11/08 17:14:28 MPBSClusterQuery(base,RCount,SC)

11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)

11/08 17:14:28 INFO:     PBS node brahe001.cluster set to state Busy (job-exclusive)

11/08 17:14:28 MPBSNodeUpdate(brahe001.cluster,brahe001.cluster,Busy,base)

11/08 17:14:28 MPBSLoadQueueInfo(base,brahe001.cluster,SC)

11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)

11/08 17:14:28 INFO:     PBS node brahe002.cluster set to state Busy (job-exclusive)

11/08 17:14:28 MPBSNodeUpdate(brahe002.cluster,brahe002.cluster,Busy,base)

11/08 17:14:28 MPBSLoadQueueInfo(base,brahe002.cluster,SC)

11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)

11/08 17:14:28 INFO:     PBS node brahe003.cluster set to state Idle (free)

11/08 17:14:28 INFO:     node 'brahe003.cluster' changed states from Running to Idle

11/08 17:14:28 MPBSNodeUpdate(brahe003.cluster,brahe003.cluster,Idle,base)

11/08 17:14:28 MPBSLoadQueueInfo(base,brahe003.cluster,SC)

11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)

11/08 17:14:28 INFO:     PBS node brahe004.cluster set to state Busy (job-exclusive)

11/08 17:14:28 MPBSNodeUpdate(brahe004.cluster,brahe004.cluster,Busy,base)

11/08 17:14:28 MPBSLoadQueueInfo(base,brahe004.cluster,SC)

11/08 17:14:28 __MPBSGetNodeState(Name,State,PNode)

11/08 17:14:28 INFO:     6 PBS resources detected on RM base

11/08 17:14:28 INFO:     resources detected: 6

11/08 17:14:28 MRMWorkloadQuery()

11/08 17:14:28 MPBSWorkloadQuery(base,JCount,SC)

11/08 17:14:28 MPBSJobUpdate(1084,1084.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1085,1085.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1086,1086.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1087,1087.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1088,1088.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1089,1089.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1090,1090.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1091,1091.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1092,1092.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1093,1093.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1094,1094.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1095,1095.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1097,1097.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1098,1098.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1099,1099.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1100,1100.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1101,1101.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1102,1102.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1103,1103.brahe000.cluster,TaskList,0)

11/08 17:14:28 MPBSJobUpdate(1104,1104.brahe000.cluster,TaskList,0)

11/08 17:14:28 INFO:     active PBS job 1107 has been removed from the queue.  assuming successful completion

 

Any help would be greatly appreciated!

 

Thanks,

Andrew.


-------------------------------------------------------------------------
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642

_______________________________________________
Oscar-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-users


