Hi,

I sent a message to the list a month ago about problems using standing reservations on a cluster with 200 machines / 1500 cores, but it remained unanswered.

We are running :

torque-devel-2.3.0-snap.200801151629.2
maui-client-3.2.6p20-snap.1182974819.9

rebuilt with some non-default hard limits, in particular for reservations and standing reservations :

Parameter      : Default Setting : Current Setting
MAX_MCLASS     : 16              : 64
MMAX_JOB       : 4096            : 32768
MAX_MJOB_TRACE : 4096            : 32768
MAX_MRES       : 1024            : 8192
MMAX_SRES      : 128             : 1024
MMAX_NODE      : 5129            : 5120

Our configuration included 1 standing reservation per machine (around 250 in total). As long as only a small number of jobs are running, everything is ok : diagnose -r displays all the active reservations and standing reservations.

Once the number of running jobs (almost all our jobs use 1 core) passes a certain threshold, which we have not been able to determine precisely (somewhere between 900 and 1200, and it doesn't seem to be 1024!), diagnose -r is no longer able to list all reservations and its output ends with a message like :

------
NOTE:  list truncated

Active Reserved Processors: 1235
WARNING: reservation table is corrupt: active procs reserved does not equal active procs detected (1235 != 2097)
-----

where 2097 is slightly less than the number of running jobs (whatever the actual number is, it is always close to the number of running jobs minus 20). When this problem begins, the standing reservations are no longer there. As a consequence, the Torque procs normally reserved by standing reservations appear free and jobs are scheduled on them, leading to an unexpected load on the worker nodes. Unfortunately there is no message in the MAUI log file... but the impact on scheduling shows that this is not just a diagnose display problem.
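
In case it helps anybody check for the same symptom, here is a minimal sketch of how the two messages quoted above could be detected automatically. It only relies on the wording of the output shown in this message; the function names are hypothetical, and run_diagnose() assumes that the diagnose command is in the PATH of the user running it.

```python
import re
import subprocess

# Matches the warning quoted above, e.g.
# "... active procs reserved does not equal active procs detected (1235 != 2097)"
WARNING_RE = re.compile(
    r"active procs reserved does not equal active procs detected "
    r"\((\d+) != (\d+)\)"
)


def find_reservation_mismatch(output):
    """Return (reserved, detected) if the corruption warning is present, else None."""
    match = WARNING_RE.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_truncated(output):
    """Return True if diagnose reported that the reservation list was truncated."""
    return "list truncated" in output


def run_diagnose():
    """Capture the text printed by 'diagnose -r' (assumes diagnose is in the PATH)."""
    result = subprocess.run(["diagnose", "-r"], capture_output=True, text=True)
    return result.stdout + result.stderr
```

Calling find_reservation_mismatch(run_diagnose()) at regular intervals, together with the number of running jobs, might help to narrow down the threshold at which the reservations disappear.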

It is not clear whether the problem is related, but under these high load conditions MAUI is crashing very frequently (we have a cron job that restarts it every 5 minutes if it is no longer there).
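
A watchdog of this kind can be sketched as follows. This is only an illustration, not the script we actually run: the process name "maui", the use of pgrep and the start command are all assumptions to adapt to the local installation.

```python
import subprocess


def maui_is_running(process_name="maui"):
    """Return True if a process with exactly this name exists.

    Assumption: the daemon is called 'maui' and pgrep is available.
    'pgrep -x' exits with status 0 only when an exact name match is found.
    """
    result = subprocess.run(
        ["pgrep", "-x", process_name], stdout=subprocess.DEVNULL
    )
    return result.returncode == 0


def start_maui():
    """Start the scheduler (hypothetical command, site dependent)."""
    subprocess.run(["service", "maui", "start"], check=False)


def ensure_running(is_running=maui_is_running, start=start_maui):
    """Restart the daemon if it is gone; return True if a restart was attempted."""
    if is_running():
        return False
    start()
    return True
```

Run from cron every 5 minutes, ensure_running() does nothing while the daemon is alive and attempts a restart otherwise. Logging each restart with a timestamp would make it possible to compare the crashes with the number of running jobs.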

Thanks in advance for any help, hint or troubleshooting advice. BTW, is there a more recent version of MAUI than the one we are running ? I have not found anything on the clusterressources.com web site, but maybe there is a CVS or SVN repository from which more recent snapshots can be downloaded.

Cheers,

Michel

    *************************************************************
    * Michel Jouvin                 Email : [EMAIL PROTECTED] *
    * LAL / CNRS                    Tel : +33 1 64468932        *
    * B.P. 34                       Fax : +33 1 69079404        *
    * 91898 Orsay Cedex                                         *
    * France                                                    *
    *************************************************************


_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
