Hello,
I think we at the UNM Center for Advanced Research Computing have turned up
a bug in how Maui calculates the number of processors included in a job
when these two settings are both present in maui.cfg:
JOBNODEMATCHPOLICY EXACTNODE
NODEACCESSPOLICY SINGLEJOB
Specifically, if a limit on the number of processors is set (when the
compute nodes are multi-processor), that limit can be circumvented by
specifying a ppn less than the processor count per node.
For example, on a system with 4 processors/node, suppose the processors per
user are limited to 32 with:
USERCFG[DEFAULT] MAXPROC=32
One could circumvent the limit by using the following 'qsub':
qsub -lnodes=32:ppn=1
This would result in a job that is scheduled on 32 nodes, each dedicated
to that job, for a total of 128 processors (OUCH !!!).
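To make the arithmetic in this example concrete, here is a minimal sketch in C. It is not Maui code: the function names procs_counted and procs_dedicated are made up for illustration, and the only assumption is the one stated above, namely that the limit check multiplies nodes by ppn while the job actually occupies whole nodes.

```c
#include <assert.h>

/* Hypothetical helper (not part of Maui): processors the limit check
 * sees for "qsub -lnodes=N:ppn=P" when node packing is assumed. */
int procs_counted(int nodes, int ppn)
{
    assert(nodes >= 0 && ppn >= 0);
    return nodes * ppn;
}

/* Hypothetical helper (not part of Maui): processors actually tied up
 * when every node is dedicated to a single job. */
int procs_dedicated(int nodes, int procs_per_node)
{
    assert(nodes >= 0 && procs_per_node >= 0);
    return nodes * procs_per_node;
}
```

With nodes=32, ppn=1 and 4 processors/node, procs_counted(32, 1) is 32, which passes a MAXPROC=32 check, while procs_dedicated(32, 4) is 128.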
Please note that once a job is running, the correct number of processors
is counted against the job. So, if one were to first submit a job (on a
machine with plenty of free nodes, but with the above Maui limits set) that
ends up taking the maximum procs/user (8 nodes x 4 processors = 32), like this:
qsub -lnodes=8:ppn=1
Then, once that job is *running*, submit another job:
qsub -lnodes=1:ppn=1
This second job will wait to start until the first has exited. Thus, once
the job has started running, the processors it has been allocated are
properly counted.
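The two behaviours described above can be put side by side in a toy model. This is only a sketch under stated assumptions: it assumes the limit check compares "procs already charged to the user" plus "estimate for the new job" against MAXPROC, and all names (pending_estimate, running_usage, may_start) are hypothetical, not Maui functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Values taken from the example in this message. */
enum { PROCS_PER_NODE = 4, MAXPROC = 32 };

/* Estimate used before the job starts: node packing is assumed,
 * so only nodes * ppn is counted. */
int pending_estimate(int nodes, int ppn)
{
    return nodes * ppn;
}

/* Usage charged once the job runs on dedicated nodes. */
int running_usage(int nodes)
{
    return nodes * PROCS_PER_NODE;
}

/* Assumed form of the per-user limit check (an illustration only). */
bool may_start(int user_running_procs, int nodes, int ppn)
{
    assert(user_running_procs >= 0);
    return user_running_procs + pending_estimate(nodes, ppn) <= MAXPROC;
}
```

In this model may_start(0, 8, 1) is true, running_usage(8) is 32, and may_start(32, 1, 1) is false, which matches the second job waiting. may_start(0, 32, 1) is also true even though running_usage(32) is 128, which is the circumvention.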
Attached is a patch for src/moab/MJob.c and src/server/UserI.c that I
believe resolves this problem. (The modifications to UserI.c simply change
how the processors required for the job are reported via the user
commands.) Essentially, this change causes Maui to /not/ assume node
packing when JOBNODEMATCHPOLICY is set to EXACTNODE; instead it uses the
number of (configured) processors/node * total number of nodes. N.B. This
only applies once Maui has determined that the node access policy is
either SingleJob or SingleTask.
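As a rough illustration of the counting rule the patch introduces, here is a self-contained sketch in C. It mirrors the shape of the MJob.c change (sum the configured processors of the nodes in a host list if there is one, otherwise multiply the requested node count by the processors per node), but node_t and dedicated_proc_count are hypothetical stand-ins, not Maui's types or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a node record; only the field needed here. */
typedef struct {
    int configured_procs;
} node_t;

/* host_list is either NULL or a NULL-terminated array of node pointers.
 * default_procs_per_node plays the role of the first node's configured
 * processor count in the patch. */
int dedicated_proc_count(const node_t *const *host_list,
                         int nodes_requested,
                         int default_procs_per_node)
{
    int total = 0;

    assert(nodes_requested >= 0 && default_procs_per_node > 0);

    if (host_list != NULL && host_list[0] != NULL) {
        /* There is a host list: charge every processor on each listed node. */
        for (size_t i = 0; host_list[i] != NULL; i++)
            total += host_list[i]->configured_procs;
        return total;
    }

    /* No host list: every requested node is charged in full. */
    return nodes_requested * default_procs_per_node;
}
```

For the example above, dedicated_proc_count(NULL, 32, 4) gives 128, so the job would be checked against MAXPROC with the number of processors it really occupies.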
Please let me know if you think I'm misunderstanding something here.
Otherwise, please look over the attached patch file and see if it can be
included in future releases.
Thanks,
Jim
James E. Prewett [email protected] [email protected]
Systems Team Leader LoGS: http://www.hpc.unm.edu/~download/LoGS/
Designated Security Officer OpenPGP key: pub 1024D/31816D93
HPC Systems Engineer III UNM HPC 505.277.8210
Only in maui-3.2.6p21-modified/src/moab: .MJob.c.swp
diff -u -r maui-3.2.6p21/src/moab/MJob.c maui-3.2.6p21-modified/src/moab/MJob.c
--- maui-3.2.6p21/src/moab/MJob.c 2009-08-18 09:13:58.000000000 -0600
+++ maui-3.2.6p21-modified/src/moab/MJob.c 2010-02-05 09:06:00.000000000 -0700
@@ -1790,6 +1790,24 @@
proccount += TC * ((MNode[0] != NULL) ? MNode[0]->CRes.Procs : 1);
}
+ else if(RQ->NAccessPolicy == mnacSingleJob)
+ {
+ if ((J->SpecFlags & (1 << mjfHostList)) &&
+ (J->ReqHList != NULL) &&
+ (J->ReqHList[0].N != NULL))
+ {
+ /* there is a host list */
+ int index;
+ for (index = 0;J->ReqHList[index].N != NULL;index++)
+ {
+ proccount += J->ReqHList[index].N->CRes.Procs;
+ }
+ }
+ else
+ {
+ proccount = J->NodesRequested * ((MNode[0] != NULL) ? MNode[0]->CRes.Procs : 1);
+ }
+ }
else
{
/* assume packing */
diff -u -r maui-3.2.6p21/src/server/UserI.c maui-3.2.6p21-modified/src/server/UserI.c
--- maui-3.2.6p21/src/server/UserI.c 2009-08-18 09:13:59.000000000 -0600
+++ maui-3.2.6p21-modified/src/server/UserI.c 2010-02-03 08:53:47.000000000 -0700
@@ -1939,16 +1939,16 @@
MUSNPrintF(&BPtr,&BSpace,"job cannot run in partition %s (insufficient idle procs available: %d < %d)\n\n",
P->Name,
P->ARes.Procs,
- J->Request.TC * RQ->DRes.Procs);
+ J->C.TotalProcCount);
}
- else if (pcount < RQ->DRes.Procs * J->Request.TC)
+ else if (pcount < J->C.TotalProcCount)
{
int rcount;
MUSNPrintF(&BPtr,&BSpace,"job cannot run in partition %s (idle procs do not meet requirements : %d of %d procs found)\n",
P->Name,
pcount,
- RQ->DRes.Procs * J->Request.TC);
+ J->C.TotalProcCount);
rcount = 0;
@@ -1994,7 +1994,7 @@
MUSNPrintF(&BPtr,&BSpace,"job can run in partition %s (%d procs available. %d procs required)\n",
P->Name,
pcount,
- J->Request.TC * RQ->DRes.Procs);
+ J->C.TotalProcCount);
Fail = FALSE;
}
@@ -4966,7 +4966,7 @@
{
sprintf(Buffer,"%sjob cannot run (insufficient idle procs: %d needed %d available)\n",
Buffer,
- J->Request.TC * RQ->DRes.Procs,
+ J->C.TotalProcCount,
MPar[0].ARes.Procs);
return(SUCCESS);
@@ -5151,11 +5151,11 @@
ptr = MUStrTok(NULL,": \t\n",&TokPtr);
} /* END while (ptr != NULL) */
- if (nodeindex < (int)(RQ->DRes.Procs * J->Request.TC))
+ if (nodeindex < (int)J->C.TotalProcCount)
{
sprintf(Buffer,"%sERROR: incorrect number of procs in hostlist. (%d requested %d specified)\n",
Buffer,
- J->Request.TC * RQ->DRes.Procs,
+ J->C.TotalProcCount,
nodeindex);
return(FAILURE);
@@ -7597,7 +7597,7 @@
(J->Cred.U != NULL) ? J->Cred.U->Name : "-",
J->StartTime,
J->SubmitTime,
- J->Request.TC * RQ->DRes.Procs,
+ J->C.TotalProcCount,
J->WCLimit,
((J->Cred.Q != NULL) && (J->Cred.Q->Index != 0)) ? J->Cred.Q->Name : "-",
tmpState,
@@ -7685,7 +7685,8 @@
J->Cred.U->Name,
J->StartTime,
J->SubmitTime,
- J->Request.TC * J->Req[0]->DRes.Procs,
+ /* J->Request.TC * J->Req[0]->DRes.Procs, */
+ J->C.TotalProcCount,
J->WCLimit,
(J->Cred.Q->Index != 0) ? J->Cred.Q->Name : "-",
tmpState,
@@ -9866,7 +9867,7 @@
J->Cred.U->Name,
(J->Cred.G != NULL) ? J->Cred.G->Name : NONE,
J->SystemQueueTime,
- J->Request.TC * J->Req[0]->DRes.Procs,
+ J->C.TotalProcCount,
J->Request.TC * J->Req[0]->DRes.Mem,
J->SpecWCLimit[0],
J->StartPriority,
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers