With the line "#PBS -l mem=8gb,nodes=1:ppn=4,walltime=01:00:00", the user is saying, "I need one node with four processors and 8 GB of RAM for one hour." If no nodes in your cluster have that configuration (four cores && 8 GB RAM), that's why it's blocked.

This job can never be scheduled as "Worst case is that 3 processes run on one node and the 4th on another," because the user requested only one node.

--Joe
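The arithmetic behind that request can be sketched in a few lines of Python. This is only an illustration of the reasoning, not Maui's actual scheduling logic: the helper names (parse_resource_list, mem_per_task_mb, node_can_host) are made up for this example, the parser handles only the mem, nodes, ppn and walltime fields seen in this job script, and the free-memory figures passed in at the end are hypothetical.

```python
import re


def parse_resource_list(spec):
    """Parse a '-l' string such as 'mem=8gb,nodes=1:ppn=4,walltime=01:00:00'."""
    fields = dict(item.split("=", 1) for item in spec.split(","))

    # nodes=1:ppn=4 -> 1 node, 4 processors per node
    node_parts = fields["nodes"].split(":")
    nodes = int(node_parts[0])
    ppn = 1
    for part in node_parts[1:]:
        if part.startswith("ppn="):
            ppn = int(part[len("ppn="):])

    # mem=8gb -> megabytes (only mb and gb suffixes are handled in this sketch)
    match = re.fullmatch(r"(\d+)(mb|gb)", fields["mem"].lower())
    if match is None:
        raise ValueError("unsupported mem value: " + fields["mem"])
    mem_mb = int(match.group(1)) * {"mb": 1, "gb": 1024}[match.group(2)]

    return {"nodes": nodes, "ppn": ppn, "mem_mb": mem_mb,
            "walltime": fields["walltime"]}


def mem_per_task_mb(req):
    """Split the job-wide mem request evenly across all tasks."""
    return req["mem_mb"] // (req["nodes"] * req["ppn"])


def node_can_host(req, free_cores, free_mem_mb):
    """Simplified model: one node must fit all ppn tasks and their memory."""
    needed_mem = req["ppn"] * mem_per_task_mb(req)
    return free_cores >= req["ppn"] and free_mem_mb >= needed_mem


req = parse_resource_list("mem=8gb,nodes=1:ppn=4,walltime=01:00:00")
print("tasks:", req["nodes"] * req["ppn"])
print("memory per task (MB):", mem_per_task_mb(req))
# Hypothetical nodes: one with all 8192 MB free, one with slightly less.
print("fits with 8192 MB free:", node_can_host(req, 8, 8192))
print("fits with 7900 MB free:", node_can_host(req, 8, 7900))
```

The per-task figure this computes matches the "MEM: 2048M" line in the checkjob output quoted below. Under this model, a node reporting even slightly less than 8192 MB available would not qualify.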
________________________________
From: [EMAIL PROTECTED] on behalf of Jim Kusznir
Sent: Fri 10/3/2008 12:17 PM
To: Discussion of Rocks Clusters; [email protected]
Subject: [Mauiusers] Why a rorque/maui job won't start?

Hello:

As I looked through the job queue on my cluster, I'm finding myself mystified. I have one job that just won't start, and I can't figure out why:

    [EMAIL PROTECTED] changhun]# qstat
    Job id              Name             User            Time Use S Queue
    ------------------- ---------------- --------------- -------- - -----
    4428.aeolus         CMAQ.aug.benz    ramos           21:00:00 R default
    4429.aeolus         CMAQ.dec.benz    ramos           32:31:14 R default
    4437.aeolus         hsa_xml.sh       changhun               0 Q default
    4442.aeolus         for.chem.ga2     sledburg        2095:20: R default
    4483.aeolus         mem2Rjob2        wdavis          258:09:4 R default

Job 4437 caught my attention, as it appears it should have started before 4442 and 4483, both of which want way more resources than it does. In addition, at this moment I have 1 node available, and each of my nodes has 8 cores and 8 GB of RAM.

The user's job script reads:

    [EMAIL PROTECTED] hsa_xml]# more hsa_xml.sh
    #PBS -l mem=8gb,nodes=1:ppn=4,walltime=01:00:00
    #PBS -m abe
    #PBS -M <deleted>
    # copy qsub's env to the job
    #PBS -V

    cd $PBS_O_WORKDIR
    mpirun mpi_subdue -limit 100 hsa_xml.g

I'm still not entirely sure what the mem= flag is supposed to set, but in any case, here's what checkjob says:

    [EMAIL PROTECTED] hsa_xml]# checkjob 4437

    checking job 4437

    State: Idle
    Creds:  user:changhun  group:changhun  class:default  qos:DEFAULT
    WallTime: 00:00:00 of 1:00:00
    SubmitTime: Wed Oct 1 10:21:03
      (Time Queued  Total: 1:22:49:15  Eligible: 00:00:00)

    Total Tasks: 4

    Req[0]  TaskCount: 4  Partition: ALL
    Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
    Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
    Dedicated Resources Per Task: PROCS: 1  MEM: 2048M

    IWD: [NONE]  Executable: [NONE]
    Bypass: 0  StartCount: 0
    PartitionMask: [ALL]
    Flags: RESTARTABLE
    Holds: Batch  (hold reason: NoResources)
    Messages: cannot create reservation for job '4437' (intital reservation attempt)
    PE: 8.97  StartPriority: 540
    cannot select job 4437 for partition DEFAULT (job hold active)

From this, it appears it's trying to schedule 4 processes, with each process having 2 GB of RAM. Worst case is that 3 processes run on one node and the 4th on another. This has been available several times since it's been queued.

Why won't this job run? I suspect that if the user removes the mem= limit, it will run, but this still leaves the question as to "why".

--Jim

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
