Hi,
I am having trouble with MAUI since adding new nodes, which increased the
number of running jobs. Each node is configured in Torque with
number_of_procs = 2 * num_of_CPUs, and half of these procs are placed in a
standing reservation (SR) attached to the node. A typical SR configuration
(there is in fact one per node) is:
SRCFG[sdj_0] HOSTLIST=grid33.lal.in2p3.fr
SRCFG[sdj_0] PERIOD=INFINITY
SRCFG[sdj_0] ACCESS=DEDICATED
SRCFG[sdj_0] PRIORITY=10
SRCFG[sdj_0] TASKCOUNT=1
SRCFG[sdj_0] RESOURCES=PROCS:4
SRCFG[sdj_0] CLASSLIST=dteam,ops,sdj
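
For illustration only, the per-node pattern above could be generated with a small script. This is a sketch, not the tooling actually used at the site: the function name `srcfg_block`, the `sdj_<index>` naming scheme and the 4-CPU node are assumptions inferred from the example (PROCS:4 being half of a Torque proc count of 8).

```python
# Hypothetical helper: builds one Maui standing-reservation block per node.
# Assumption: Torque np = 2 * num_cpus, and half of np goes into the SR.
def srcfg_block(index, host, num_cpus, classes=("dteam", "ops", "sdj")):
    name = f"sdj_{index}"
    torque_np = 2 * num_cpus      # procs declared to Torque for this node
    reserved = torque_np // 2     # procs placed in the standing reservation
    lines = [
        f"SRCFG[{name}] HOSTLIST={host}",
        f"SRCFG[{name}] PERIOD=INFINITY",
        f"SRCFG[{name}] ACCESS=DEDICATED",
        f"SRCFG[{name}] PRIORITY=10",
        f"SRCFG[{name}] TASKCOUNT=1",
        f"SRCFG[{name}] RESOURCES=PROCS:{reserved}",
        f"SRCFG[{name}] CLASSLIST={','.join(classes)}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Assumed 4-CPU node, matching RESOURCES=PROCS:4 in the example above.
    print(srcfg_block(0, "grid33.lal.in2p3.fr", 4))
```

With one such block per node, the number of standing reservations grows with the node count, which is why the MMAX_SRES limit matters in this setup.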
The current total number of nodes and jobs, as reported by 'diagnose -n'
and 'showq -r', is:
Total Nodes: 170 (Active: 167 Idle: 1 Down: 2)
1267 Jobs 1267 of 2552 Processors Active (49.65%)
In this configuration, 'diagnose -r' lists some of the reservations
made for active jobs, then the output is truncated with this error:
NOTE: list truncated
Active Reserved Processors: 1247
WARNING: reservation table is corrupt: active procs reserved does not equal active procs detected (1247 != 1267)
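
To make the size of the mismatch explicit, the two figures can be compared directly. The following is only a sketch: it parses the two summary lines exactly as quoted in this message, and the regular expressions are assumptions about that text, not a general parser for Maui output.

```python
import re

# Lines copied from the 'showq -r' and 'diagnose -r' output quoted above.
showq_summary = "1267 Jobs 1267 of 2552 Processors Active (49.65%)"
diagnose_tail = "Active Reserved Processors: 1247"

# Active processors as seen by showq.
active = int(re.search(r"(\d+) of \d+ Processors Active", showq_summary).group(1))
# Processors covered by job reservations as seen by diagnose -r.
reserved = int(re.search(r"Active Reserved Processors:\s*(\d+)", diagnose_tail).group(1))

missing = active - reserved
print(f"active={active} reserved={reserved} missing={missing}")
# prints "active=1267 reserved=1247 missing=20"
```

So 20 active processors have no matching reservation in the truncated listing.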
In maui.log (LOGLEVEL 2), I cannot find any error related to this using
grep -E 'WARN|ERROR|ALERT' /var/log/maui.log. The only thing I see, and it
is not clear that it is related, is that some jobs have entries like:
04/19 14:36:07 INFO: active PBS job 33633 has been removed from the queue. assuming successful completion
04/19 14:36:07 ALERT: job ' 33633' has invalid system queue time (SQ: 1208607298 > ST: 1208567570)
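
The two values in the ALERT line are easier to read as dates. This sketch assumes that SQ and ST are Unix epoch seconds, and that ST is the job start time (the log only names SQ as the system queue time); neither assumption is confirmed by the log itself.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the ALERT line above; assumed to be Unix epoch seconds.
sq = 1208607298  # system queue time
st = 1208567570  # assumed to be the job start time


def to_utc(ts):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


gap = timedelta(seconds=sq - st)
print("SQ :", to_utc(sq).isoformat())
print("ST :", to_utc(st).isoformat())
print("gap:", gap)  # SQ is later than ST by 11:02:08
```

Under these assumptions the queue time is about 11 hours later than the start time, which is what triggers the ALERT.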
This is a major problem for us, as we rely on the output of 'diagnose -r'
to compute and publish used and available job slots and CPUs.
We are running:
torque-devel-2.3.0-snap.200801151629.2
maui-client-3.2.6p20-snap.1182974819.9
rebuilt with some non-default hard limits:
Parameter      : Default Setting : Current Setting
MAX_MCLASS     : 16              : 64
MMAX_JOB       : 4096            : 32768
MAX_MJOB_TRACE : 4096            : 32768
MAX_MRES       : 1024            : 8192
MMAX_SRES      : 128             : 1024
MMAX_NODE      : 5129            : 5120
Thanks in advance for any help.
Michel
*************************************************************
* Michel Jouvin Email : [EMAIL PROTECTED] *
* LAL / CNRS Tel : +33 1 64468932 *
* B.P. 34 Fax : +33 1 69079404 *
* 91898 Orsay Cedex *
* France *
*************************************************************