I apologize for the cross-post, but I am in need of help.
I am using Moab server version 5.1.0p9 (snap NA) (rev. 8933) and
Torque 2.3.0.
Here is what has happened. We have a group using the cluster that needs
16GB RAM. We added RAM to a compute node, and brought the node back up.
Torque sees the node, and pbsnodes reports the RAM at 16GB. This group
was happy to submit a job and know that in about 24 hours the job would
run on the node. To facilitate that I created a rolling reservation with
a lead of 24 hours. From reading the documentation, I added the
following lines to moab.cfg and recycled the scheduler:
#Rolling reservation for rokaslabs
SRCFG[rokaslab] HOSTLIST=vmp001 ACCOUNTLIST=rokaslab GROUPLIST=*rokaslab
SRCFG[rokaslab] FLAGS=BYNAME
SRCFG[rokaslab] ROLLBACKOFFSET=24:00:00
SRCFG[rokaslab] PERIOD=INFINITY
Moab reports that the reservation was created; showres shows
rokaslab.4732.
As a test I had the user add the following lines to his PBS script:
#PBS -l mem=10000mb
#PBS -l advres:rokaslab.4732
checkjob reported that reservation rokaslab.4732 could not be found.
Sure enough, showres now showed rokaslab.4733.
So I had the user resubmit with
#PBS -l advres:rokaslab.4733
and checkjob reported that rokaslab.4733 could not be found; showres
now showed rokaslab.4734.
I had the user resubmit this way:
#PBS -l advres:rokaslab
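For reference, the complete submission script looks roughly like the sketch below. Only the mem and advres lines are quoted from the real script; the shebang, the walltime line (chosen to match the 4:00:00 limit that checkjob reports for this job), the working-directory handling, and the placeholder command are assumptions for illustration.

```shell
#!/bin/bash
# Sketch of the submission script. Only the mem and advres directives
# are taken from the real job; everything else is an assumption.
#PBS -l mem=10000mb
#PBS -l advres:rokaslab
# Assumption: walltime matches the 4:00:00 limit shown by checkjob.
#PBS -l walltime=4:00:00

# Torque sets PBS_O_WORKDIR to the directory the job was submitted from;
# fall back to the current directory when run outside the batch system.
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir" || exit 1

# Placeholder for the actual workload (hypothetical).
echo "placeholder: the real job command goes here"
```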
Now the job in question appears in the idle queue; it shows that the job
has acquired a reservation for vmp001; however, checkjob gives the
following output:
checkjob 552072
job 552072
AName: pbs_velvet_21_2runs_Agambiae_10gigs_2
State: Idle
Creds: user:gibbonjg group:rokaslab account:rokaslab class:all
WallTime: 00:00:00 of 4:00:00
SubmitTime: Mon Oct 27 10:44:10
(Time Queued Total: 1:00:08:26 Eligible: 1:00:08:09)
Total Requested Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Memory >= 5000M Disk >= 0 Swap >= 0
Opsys: --- Arch: --- Features: x86
Dedicated Resources Per Task: PROCS: 1 MEM: 10000M
Reserved Nodes: (23:59:43 -> 1:03:59:43 Duration: 4:00:00)
[vmp001:1]
BypassCount: 293
Partition Mask: [base]
Flags: ADVRES:rokaslab,RESTARTABLE
Attr: checkpoint
StartPriority: 2680
EstimatedStart: H: -86786 R: -518 B: 54214
PE: 11.01
Reservation '552072' (23:59:43 -> 1:03:59:43 Duration: 4:00:00)
rejected for Features -
rejected for CPU -
rejected for Memory -
rejected for State -
NOTE: job cannot run in partition base (idle procs do not meet requirements : 0 of 1 procs found)
idle procs: 420 feasible procs: 0
Node Rejection Summary: [Features: 214][CPU: 1][Memory: 121][State: 288]
So, vmp001 is acquired, but I get the "NOTE: job cannot run in partition
base (idle procs do not meet requirements : 0 of 1 procs found)".
Moreover, showstart reports that the job will start in 24 hours. Waiting
an hour, showstart will again report that the job will start in 24 hours.
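To make the showstart observation concrete, here is a small sketch of the arithmetic. It assumes (this is only my reading, not something I have confirmed) that the reservation window always begins ROLLBACKOFFSET after the current time. The helper name and the sample times are hypothetical.

```shell
#!/bin/bash
# Assumption: the reservation always opens ROLLBACKOFFSET seconds after "now".
rollback_offset=$((24 * 3600))   # ROLLBACKOFFSET=24:00:00 from moab.cfg

# Hypothetical helper: seconds left until the reservation opens,
# given the current time $1 (in seconds since submission).
seconds_until_start() {
  local now=$1
  local start=$((now + rollback_offset))
  echo $((start - now))
}

# Check at submission time, then one and two hours later.
for now in 0 3600 7200; do
  echo "t=${now}s  wait=$(seconds_until_start "$now")s"
done
```

Under that assumption the wait never shrinks, which would match what showstart keeps reporting.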
Clearly, I am missing something, or the reservation is mis-configured,
or I haven't given Torque sufficient info to report the available RAM,
or all of these.
I would really appreciate some help; the users are beginning to get
impatient.
TIA
Charles
---
Charles Johnson
Advanced Computing Center for Research and Education
Vanderbilt University
[EMAIL PROTECTED]
Office: 615-343-2776
Cell: 615-478-8799