Steve, thanks for the tip. I've been able to temporarily resolve this by creating a new queue and submitting jobs there instead of the default batch queue. The default batch queue is still strangely nonfunctional, though...
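In case it helps anyone else who lands on this thread: the workaround amounts to a handful of qmgr commands along the lines of the following. This is a sketch, not my exact settings — the queue name "foo" and the attribute values are illustrative, and the final `set server default_queue` line is optional (you can instead submit with `qsub -q foo`):

```shell
# Sketch of the workaround: create a bare-bones execution queue and
# (optionally) make it the server default. Run on the pbs_server host
# as a user with manager privileges. Queue name/values are examples.
qmgr -c "create queue foo queue_type=execution"
qmgr -c "set queue foo resources_default.nodes = 1"
qmgr -c "set queue foo resources_default.walltime = 24:00:00"
qmgr -c "set queue foo enabled = true"
qmgr -c "set queue foo started = true"

# Optional: route plain "qsub" submissions to the new queue.
qmgr -c "set server default_queue = foo"
```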
> qsub -I -l nodes=fu48core.esl ?

Same result. The job is held indefinitely and complains about features.

> We define features for every node. I think the reason you might be having
> trouble is because
>
> from: pbs/server_priv/nodes
>
> bh001 np=4 compute
>
> Then set a queue attribute of: resources.default_neednodes = compute

It turns out I had a resources_default.neednodes attribute on the (default)
batch queue that was defining these mystery attributes "1:ppn=1". I tried your
suggestion and changed this attribute to "compute" like so:

$ qmgr -c "set queue batch resources_default.neednodes = compute"

I also added the compute feature to the nodes file. However, this didn't fix
the problem. New jobs are still being created with the "1:ppn" features
requested by default, even though the attribute has been removed from the
configuration and the server has been restarted. I have no idea where these
features are coming from!

I created a new, very basic queue following the example at [1]. Jobs submitted
to the new queue run without problem. I'm still curious what's up with the
batch queue and why it marks all jobs with the features "1:ppn=1", but for now
I'm able to run jobs, so I'm happy.

Incidentally, here's the qmgr output for the queue that is not working:

> caleb@torqueserver:~$ qmgr -c "print queue batch"
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch max_running = 126
> set queue batch resources_max.ncpus = 8
> set queue batch resources_max.nodes = 1
> set queue batch resources_max.walltime = 99:00:00
> set queue batch resources_min.ncpus = 1
> set queue batch resources_default.ncpus = 1
> set queue batch resources_default.nodect = 1
> set queue batch resources_default.nodes = 1
> set queue batch resources_default.walltime = 24:00:00
> set queue batch resources_available.nodect = 1
> set queue batch max_user_run = 100
> set queue batch enabled = True
> set queue batch started = True

Here's the qmgr output for the queue that is working:

> caleb@torqueserver:~$ qmgr -c "print queue foo"
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue foo
> #
> create queue foo
> set queue foo queue_type = Execution
> set queue foo resources_default.nodes = 1
> set queue foo resources_default.walltime = 24:00:00
> set queue foo enabled = True
> set queue foo started = True

[1] http://www.clusterresources.com/torquedocs21/4.1queueconfig.shtml#example

> for the particular queue.
>
> - From there, Maui will query torque, and know that the node bh001 has a
> compute feature, so when you submit a job to a queue, it should be mapped to
> bh001 via the node features.
>
> I'm actually not sure if you can submit jobs and have them run on nodes w/o
> defining node features.
>
> On Jul 20, 2011, at 6:59 PM, Caleb Phillips wrote:
>
>> Hello all:
>>
>> I'm running torque 2.3.6 (packaged with Ubuntu 10.10) and maui 3.3.1.
>> I'm having an issue where submitted jobs sit in the queue indefinitely.
>> This was occurring with pbs_sched, so I installed maui hoping it would
>> fix the problem. With maui, I have more information about the problem,
>> but no resolution. I've spent several hours searching the torqueusers
>> and mauiusers mailing lists, and reading the manuals, to no avail. I
>> hope you can help...
>>
>> As far as I can tell, maui is complaining that there are not sufficient
>> "feasible procs" for jobs to run because of a lack of "features". My
>> nodes have no features enabled, and I'm not requesting any with my jobs.
>> Yet, the jobs show up with "[1][ppn=1]" in the feature list. I don't
>> know where these features are coming from or how to unset them, or if
>> that's really the source of the problem (it's simply my best guess). Any
>> ideas?
>>
>> Here's more information on my setup and how I reproduce the problem:
>>
>> I have one node (currently online). It has 48 processors:
>>
>>> caleb@torqueserver:~$ qnodes
>>> fu48core.esl
>>>     state = free
>>>     np = 48
>>>     ntype = cluster
>>>     status = opsys=linux,uname=Linux 48core 2.6.32-25-server #45-Ubuntu
>>> SMP Sat Oct 16 20:06:58 UTC 2010 x86_64,sessions=2834 5874 12296 13555
>>> 19465
>>> 17575,nsessions=6,nusers=3,idletime=2308,totmem=82007668kb,availmem=73380372kb,physmem=82007668kb,ncpus=48,loadave=2.19,netload=24944834533,state=free,jobs=,varattr=,rectime=1311202191
>>
>> It's free and presumably happy:
>>
>>> caleb@torqueserver:/usr/local/maui$ checknode fu48core
>>>
>>> checking node fu48core.esl
>>>
>>> State: Idle  (in current state for 5:15:40)
>>> Configured Resources: PROCS: 48  MEM: 78G  SWAP: 78G  DISK: 1M
>>> Utilized Resources: SWAP: 8426M
>>> Dedicated Resources: [NONE]
>>> Opsys: linux  Arch: [NONE]
>>> Speed: 1.00  Load: 2.240
>>> Network: [DEFAULT]
>>> Features: [NONE]
>>> Attributes: [Batch]
>>> Classes: [batch 48:48][amplhack 48:48][qualnet 48:48][lightweight 48:48]
>>>
>>> Total Time: 6:19:49  Up: 6:19:49 (100.00%)  Active: 00:00:00 (0.00%)
>>>
>>> Reservations:
>>>   NOTE: no reservations on node
>>
>> The batch queue is empty.
>> If I submit a very basic job (I've tried more complicated jobs too, with
>> specific resource requests), it gets deferred immediately:
>>
>>> caleb@torqueserver:/usr/local/maui$ echo "sleep 30" | qsub
>>> 25.torqueserver.esl
>>> caleb@torqueserver:/usr/local/maui$ checkjob 25
>>> checking job 25
>>>
>>> State: Idle  EState: Deferred
>>> Creds: user:caleb  group:abelian  class:batch  qos:DEFAULT
>>> WallTime: 00:00:00 of 1:00:00:00
>>> SubmitTime: Wed Jul 20 16:52:37
>>>   (Time Queued  Total: 00:00:31  Eligible: 00:00:00)
>>>
>>> Total Tasks: 1
>>>
>>> Req[0]  TaskCount: 1  Partition: ALL
>>> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>>> Opsys: [NONE]  Arch: [NONE]  Features: [1][ppn=1]
>>> NodeCount: 1
>>>
>>> IWD: [NONE]  Executable: [NONE]
>>> Bypass: 0  StartCount: 0
>>> PartitionMask: [ALL]
>>> Flags: RESTARTABLE
>>>
>>> job is deferred.  Reason: NoResources  (cannot create reservation for job
>>> '25' (intital reservation attempt)
>>> )
>>> Holds: Defer  (hold reason: NoResources)
>>> PE: 1.00  StartPriority: 1
>>> cannot select job 25 for partition DEFAULT (job hold active)
>>
>> If I release the job, I can see that maui is complaining about a lack of
>> feasible procs due to unavailable features:
>>
>>> caleb@torqueserver:/usr/local/maui$ releasehold 25
>>>
>>> job holds adjusted
>>> caleb@torqueserver:/usr/local/maui$ checkjob -v 25
>>>
>>> checking job 25 (RM job '25.torqueserver.esl')
>>>
>>> State: Idle
>>> Creds: user:caleb  group:abelian  class:batch  qos:DEFAULT
>>> WallTime: 00:00:00 of 1:00:00:00
>>> SubmitTime: Wed Jul 20 16:52:37
>>>   (Time Queued  Total: 00:04:39  Eligible: 00:02:35)
>>>
>>> Total Tasks: 1
>>>
>>> Req[0]  TaskCount: 1  Partition: ALL
>>> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
>>> Opsys: [NONE]  Arch: [NONE]  Features: [1][ppn=1]
>>> Exec: ''  ExecSize: 0  ImageSize: 0
>>> Dedicated Resources Per Task: PROCS: 1
>>> NodeAccess: SHARED
>>> NodeCount: 1
>>>
>>> IWD: [NONE]  Executable: [NONE]
>>> Bypass: 0  StartCount: 0
>>> PartitionMask: [ALL]
>>> Flags: RESTARTABLE
>>>
>>> Messages: cannot create reservation for job '25' (intital reservation
>>> attempt)
>>>
>>> PE: 1.00  StartPriority: 2
>>> job cannot run in partition DEFAULT (idle procs do not meet requirements :
>>> 0 of 1 procs found)
>>> idle procs: 48  feasible procs: 0
>>>
>>> Rejection Reasons: [Features : 1]
>>>
>>> Detailed Node Availability Information:
>>>
>>> fu48core.esl  rejected : Features
>>
>> There are no error messages in the torque server_log, maui's log file,
>> or the node's mom_log. In fact, my node never even sees the job since
>> maui never decides to run it.
>>
>> Any help you can provide would be extremely helpful. Thanks!
>>
>> --
>> Caleb Phillips, Ph.D. Candidate
>> Computer Science Department
>> University of Colorado, Boulder
>> _______________________________________________
>> mauiusers mailing list
>> [email protected]
>> http://www.supercluster.org/mailman/listinfo/mauiusers
>
> ----------------------
> Steve Crusan
> System Administrator
> Center for Research Computing
> University of Rochester
> https://www.crc.rochester.edu/
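P.S. For anyone debugging a similar feature mismatch: the server_priv/nodes format Steve describes is simply "<hostname> np=<count> <feature> <feature> ...", so you can list which nodes actually carry a feature before pointing resources_default.neednodes at it. A quick sketch — the sample file, its path, and the "compute" feature name are made up for illustration; on a real install the file lives under the pbs_server spool directory (location varies by packaging):

```shell
# Build a sample nodes file mirroring the format from this thread.
cat > /tmp/nodes.sample <<'EOF'
bh001 np=4 compute
fu48core.esl np=48 compute
EOF

# Tokens after "<name> np=<N>" are node features; print every node
# whose feature list contains "compute".
awk -v feat=compute '{
  for (i = 3; i <= NF; i++) if ($i == feat) { print $1; break }
}' /tmp/nodes.sample
# -> prints "bh001" then "fu48core.esl"
```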
