Hmm, actually I did:

cset max_job_delay 600

Even after changing the timeout, srun and salloc don't respond and many
tests fail:

FAILURE: srun not responding
FAILURE: salloc not responding

Is there any other configuration I can try on the SLURM and/or Maui side?
Would increasing the timeout to 20 minutes help?

Thanks,
Rafael

On Tue, 2009-06-02 at 07:42 -0700, [email protected] wrote:
> Maui and Moab dramatically slow down SLURM job scheduling.
> You'll need to set a much higher timeout in SLURM's
> testsuite for it to run successfully with Maui or Moab.
> Do this by adding a file called "globals.local" in the
> testsuite directory and add a line like this:
>
> set max_job_delay 600
>
> or edit the value in the "globals" file if you prefer.
> The default value is 120 seconds to run a small job,
> which isn't sufficient in your configuration. 10 minutes
> to launch a job should be sufficient, but the test
> suite will take a very long time to complete.
>
>
> At 11:08 AM -0300 6/2/09, Rafael Folco wrote:
> >Hi,
> >
> >First of all, sorry for sending it to crossed mailing lists.
> >
> >I am running SLURM testsuite with Maui configured, but I see many
> >FAILURES and srun doesn't respond for most options. Is it expected or
> >should I fix something to run SLURM testsuite smoothly ?
> >
> >Here is one example (I see many other failures like this):
> >
> >TEST: 1.23
> >spawn /usr/bin/srun -N1 -l --mincpus=999999 -t1 hostname
> >srun: Job is in held state, pending scheduler release
> >srun: job 38 queued and waiting for resources
> >
> >FAILURE: srun not responding
> >
> >When removing option --mincpus, "srun -N1 -l -t1 hostname" works fine.
> >
> ># cat slurm-maui-sles11.log |grep SUCCESS| wc -l
> >137
> >
> ># cat slurm-maui-sles11.log |grep FAIL| wc -l
> >102
> >
> >
> >I couldn't finish the testsuite, it was running for more than 10 hours
> >and just getting errors...
> >
> >FAILURE: srun not responding
> >FAILURE: salloc not responding
> >
> >
> >In spite of Maui/SLURM seem to be working, I see this error on maui.log:
> >
> >06/02 08:07:36 MRMCheckEvents()
> >06/02 08:07:36 ALERT: cannot query events on RM (RM 'cluster-ib-5'
> >does not support function 'rmeventquery')
> >06/02 08:07:36 MSUAcceptClient(5,ClientSD,HostName,TCP)
> >06/02 08:07:36 INFO: accept call failed, errno: 11 (Resource
> >temporarily unavailable)
> >06/02 08:07:36 INFO: all clients connected. servicing requests
> >
> >
> ># showq
> >ACTIVE JOBS--------------------
> >JOBNAME USERNAME STATE PROC REMAINING
> >STARTTIME
> >
> >
> > 0 Active Jobs 0 of 4 Processors Active (0.00%)
> >
> >
> >I appreciate if somebody point me to the root of the problem and clarify
> >what is going on.
> >
> >Thanks in advance.
> >
> >Rafael
> >
> >
> >--
> >Rafael Folco
> >Linux on Power
> >IBM Linux Technology Center
> >E-Mail: [email protected]
> >
> >Attachment converted: Macintosh HD:slurm-maui-sles11.log (TEXT/ttxt)
> >(012EBAB2)
>
>
--
Rafael Folco
Linux on Power
IBM Linux Technology Center
E-Mail: [email protected]
