Hello all: I'm running torque 2.3.6 (packaged with Ubuntu 10.10) and maui 3.3.1. I'm having an issue where submitted jobs sit in the queue indefinitely. This was occurring with pbs_sched, so I installed maui hoping it would fix the problem. With maui, I have more information about the problem, but no resolution. I've spent several hours searching the torqueusers and mauiusers mailing lists, and reading the manuals, to no avail. I hope you can help...
As far as I can tell, maui is complaining that there are not sufficient "feasible procs" for jobs to run because of a lack of "features". My nodes have no features enabled, and I'm not requesting any with my jobs. Yet, the jobs show up with "[1][ppn=1]" in the feature list. I don't know where these features are coming from or how to unset them, or if that's really the source of the problem (it's simply my best guess). Any ideas?

Here's more information on my setup and how I reproduce the problem:

I have one node (currently online). It has 48 processors:

> caleb@torqueserver:~$ qnodes
> fu48core.esl
> state = free
> np = 48
> ntype = cluster
> status = opsys=linux,uname=Linux 48core 2.6.32-25-server #45-Ubuntu SMP
> Sat Oct 16 20:06:58 UTC 2010 x86_64,sessions=2834 5874 12296 13555 19465
> 17575,nsessions=6,nusers=3,idletime=2308,totmem=82007668kb,availmem=73380372kb,physmem=82007668kb,ncpus=48,loadave=2.19,netload=24944834533,state=free,jobs=,varattr=,rectime=1311202191

It's free and presumably happy:

> caleb@torqueserver:/usr/local/maui$ checknode fu48core
>
> checking node fu48core.esl
>
> State: Idle (in current state for 5:15:40)
> Configured Resources: PROCS: 48 MEM: 78G SWAP: 78G DISK: 1M
> Utilized Resources: SWAP: 8426M
> Dedicated Resources: [NONE]
> Opsys: linux Arch: [NONE]
> Speed: 1.00 Load: 2.240
> Network: [DEFAULT]
> Features: [NONE]
> Attributes: [Batch]
> Classes: [batch 48:48][amplhack 48:48][qualnet 48:48][lightweight 48:48]
>
> Total Time: 6:19:49 Up: 6:19:49 (100.00%) Active: 00:00:00 (0.00%)
>
> Reservations:
> NOTE: no reservations on node

The batch queue is empty.
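The mismatch described above (a node reporting no features while a job carries "[1][ppn=1]") can be checked by pulling the Features field out of checknode or checkjob output. The following is only a minimal sketch: extract_features is a hypothetical helper name, not part of torque or maui, and it assumes the field appears as "Features: <value>" with no spaces inside the value, as it does in the output shown in this post.

```shell
# Hypothetical helper (not part of torque or maui): read checknode or
# checkjob output on stdin and print the value of each "Features:" field.
# Assumes the value contains no spaces, e.g. [NONE] or [1][ppn=1].
# Lines such as "Rejection Reasons: [Features : 1]" are skipped because
# they do not contain the literal text "Features:".
extract_features() {
    sed -n 's/.*Features:[[:space:]]*\([^[:space:]]*\).*/\1/p'
}
```

It would be used as "checknode fu48core | extract_features" for the node and "checkjob -v 25 | extract_features" for the job. If the two printed values differ, that matches the "rejected : Features" symptom.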
If I submit a very basic job (I've tried more complicated jobs too, with specific resource requests), it gets deferred immediately:

> caleb@torqueserver:/usr/local/maui$ echo "sleep 30" | qsub
> 25.torqueserver.esl
> caleb@torqueserver:/usr/local/maui$ checkjob 25
> checking job 25
>
> State: Idle EState: Deferred
> Creds: user:caleb group:abelian class:batch qos:DEFAULT
> WallTime: 00:00:00 of 1:00:00:00
> SubmitTime: Wed Jul 20 16:52:37
> (Time Queued Total: 00:00:31 Eligible: 00:00:00)
>
> Total Tasks: 1
>
> Req[0] TaskCount: 1 Partition: ALL
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1]
> NodeCount: 1
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 0
> PartitionMask: [ALL]
> Flags: RESTARTABLE
>
> job is deferred. Reason: NoResources (cannot create reservation for job
> '25' (intital reservation attempt)
> )
> Holds: Defer (hold reason: NoResources)
> PE: 1.00 StartPriority: 1
> cannot select job 25 for partition DEFAULT (job hold active)

If I release the job, I can see that maui's complaining about a lack of feasible procs due to unavailable features:

> caleb@torqueserver:/usr/local/maui$ releasehold 25
>
> job holds adjusted
> caleb@torqueserver:/usr/local/maui$ checkjob -v 25
>
>
> checking job 25 (RM job '25.torqueserver.esl')
>
> State: Idle
> Creds: user:caleb group:abelian class:batch qos:DEFAULT
> WallTime: 00:00:00 of 1:00:00:00
> SubmitTime: Wed Jul 20 16:52:37
> (Time Queued Total: 00:04:39 Eligible: 00:02:35)
>
> Total Tasks: 1
>
> Req[0] TaskCount: 1 Partition: ALL
> Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
> Opsys: [NONE] Arch: [NONE] Features: [1][ppn=1]
> Exec: '' ExecSize: 0 ImageSize: 0
> Dedicated Resources Per Task: PROCS: 1
> NodeAccess: SHARED
> NodeCount: 1
>
>
> IWD: [NONE] Executable: [NONE]
> Bypass: 0 StartCount: 0
> PartitionMask: [ALL]
> Flags: RESTARTABLE
>
> Messages: cannot create reservation for job '25' (intital reservation
> attempt)
>
> PE: 1.00 StartPriority: 2
> job cannot run in partition DEFAULT (idle procs do not meet requirements : 0
> of 1 procs found)
> idle procs: 48 feasible procs: 0
>
> Rejection Reasons: [Features : 1]
>
> Detailed Node Availability Information:
>
> fu48core.esl rejected : Features

There are no error messages in the torque server_log, maui's log file, or the node's mom_log. In fact, my node never even sees the job since maui never decides to run it.

Any help you can provide would be extremely helpful. Thanks!

--
Caleb Phillips, Ph.D. Candidate
Computer Science Department
University of Colorado, Boulder

_______________________________________________
mauiusers mailing list
http://www.supercluster.org/mailman/listinfo/mauiusers
