Hi,
I sent a message to the list a month ago about my problems using standing
reservations on a cluster with 200 machines / 1500 cores, but it remained
unanswered.
We are running :
torque-devel-2.3.0-snap.200801151629.2
maui-client-3.2.6p20-snap.1182974819.9
rebuilt with some non-default hard limits, in particular for reservations
and standing reservations:
Parameter        Default Setting   Current Setting
MAX_MCLASS       16                64
MMAX_JOB         4096              32768
MAX_MJOB_TRACE   4096              32768
MAX_MRES         1024              8192
MMAX_SRES        128               1024
MMAX_NODE        5129              5120
Our configuration included 1 standing reservation per machine (around
250 in total). As long as only a small number of jobs are running,
everything is OK: diagnose -r displays all the active reservations and
standing reservations.
Once the number of running jobs passes a threshold that we have not been
able to determine precisely (somewhere between 900 and 1200, and it doesn't
seem to be 1024; almost all our jobs use 1 core), diagnose -r is no longer
able to list all reservations and its output ends with a message like:
------
NOTE: list truncated
Active Reserved Processors: 1235
WARNING: reservation table is corrupt: active procs reserved does not
equal active procs detected (1235 != 2097)
-----
where 2097 is slightly less than the number of running jobs (whatever the
actual number is, it is always close to the number of running jobs minus
20). When this problem begins, the standing reservations are no longer
there. As a consequence, the Torque PROCS normally reserved by standing
reservations appear free and jobs are scheduled on these PROCS, leading to
an unexpected load on the worker nodes. Unfortunately there is no message in
the MAUI log file, but the impact on scheduling shows that this is not just
a diagnose display problem.
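
For anyone who wants to catch this condition automatically, here is a
minimal sketch in Python that scans captured diagnose -r output for the two
symptoms quoted above. The function name and the returned keys are
illustrative only, and the patterns are copied from the message text in this
mail, so other MAUI versions may word the warning differently.

```python
import re

# Pattern taken from the warning quoted in this mail. \s+ tolerates the
# line wrap between "not" and "equal" introduced by the mail client.
MISMATCH_RE = re.compile(
    r"active procs reserved does not\s+equal active procs detected "
    r"\((\d+) != (\d+)\)"
)


def check_diagnose_output(text):
    """Summarize the problems visible in a captured `diagnose -r` output."""
    match = MISMATCH_RE.search(text)
    return {
        "truncated": "NOTE: list truncated" in text,
        "corrupt": "reservation table is corrupt" in text,
        "reserved": int(match.group(1)) if match else None,
        "detected": int(match.group(2)) if match else None,
    }


# Sample copied from the output shown above.
SAMPLE = """NOTE: list truncated
Active Reserved Processors: 1235
WARNING: reservation table is corrupt: active procs reserved does not
equal active procs detected (1235 != 2097)
"""

if __name__ == "__main__":
    print(check_diagnose_output(SAMPLE))
```

Run against saved output at regular intervals, a check like this could help
narrow down the job count at which the truncation first appears.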
Although it is not clear whether the two problems are related, under these
high load conditions MAUI crashes very frequently (we have a cron job that
checks every 5 minutes and restarts it if it is no longer there).
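
A restart check of this kind could look roughly like the following Python
sketch, meant to be called from cron. It is only an illustration under
stated assumptions: the process name "maui" and the start command path are
hypothetical and must be adapted to the local installation, and the check
relies on the standard pgrep utility being available.

```python
import subprocess


def is_running(process_name, runner=subprocess.run):
    """Return True if pgrep finds a process with exactly this name."""
    # pgrep exits with status 0 when at least one process matches.
    result = runner(["pgrep", "-x", process_name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def ensure_running(process_name, start_cmd, runner=subprocess.run):
    """Start the daemon if it is absent.

    Returns True if a restart was attempted, False if it was already up.
    """
    if is_running(process_name, runner):
        return False
    runner(start_cmd)
    return True


if __name__ == "__main__":
    # Hypothetical install path: adjust to where the maui daemon lives.
    if ensure_running("maui", ["/usr/local/maui/sbin/maui"]):
        print("maui was not running; restart attempted")
```

The runner argument exists only so that the logic can be exercised without
touching real processes.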
Thanks in advance for any help, hint or troubleshooting advice. BTW, is
there a more recent version of MAUI than the one we are running? I have
not found anything on the clusterressources.com web site, but there may be a
CVS or SVN repository from which to download more recent snapshots.
Cheers,
Michel
*************************************************************
* Michel Jouvin Email : [EMAIL PROTECTED] *
* LAL / CNRS Tel : +33 1 64468932 *
* B.P. 34 Fax : +33 1 69079404 *
* 91898 Orsay Cedex *
* France *
*************************************************************
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers