Eva Hocks wrote:

> maui crashed with
>
> Apr 20 00:00:27 dev-005 kernel: maui[27159]: segfault at 28 ip 000000345b87af1c sp 00007fffe7b5e7e8 error 4 in libc-2.12.so[345b800000+189000]
>
>
> Anyone know about this problem and how to fix it?
>   

You did not give enough information. Can you reproduce the crash at 
will? How big is your cluster? How many nodes and how many cores? How 
many jobs were in the queue? Do certain jobs trigger the crash?

At our site, Maui initially had problems with jobs that took the whole 
cluster. For such a job, the exec_host job attribute becomes too long 
for the buffer that Maui uses to hold it. We worked out the largest size 
the buffer needed to be and increased MAX_MBUFFER and MMAX_BUFFER 
accordingly in msched-common.h. Another thing that made Maui crash was 
having too many node requests in a job, for example:

-l nodes=1:e1:ppn=12+1:e2:ppn=12+1:e3:ppn=12+1:e4:ppn=12+1:e5:ppn=12

We increased MAX_MREQ_PER_JOB and MMAX_REQ_PER_JOB in msched.h to solve 
this problem. It is a pain, though, because I think that nothing stops 
a user from submitting a job with thousands of node requests, even if it 
will never run. That may be a job for the submit filter, though.
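
To illustrate the submit filter idea, here is a rough, untested-in-production 
sketch in Python. It assumes the usual qsub filter convention (job script on 
stdin, filtered script on stdout, non-zero exit status aborts the 
submission) and assumes that qsub passes its command-line arguments to the 
filter. MAX_NODE_REQUESTS and all function names are hypothetical; pick a 
limit no larger than the one compiled into your Maui.

```python
import re
import sys

# Hypothetical site limit: keep it at or below the value compiled into Maui.
MAX_NODE_REQUESTS = 4

# Matches the value of a nodes= resource up to the next comma or whitespace.
NODES_RE = re.compile(r"nodes=([^\s,]+)")


def count_node_requests(spec):
    """Count the '+'-separated requests in a nodes spec,
    e.g. '1:e1:ppn=12+1:e2:ppn=12' contains two requests."""
    return len([part for part in spec.split("+") if part])


def find_nodes_specs(script_lines, argv=()):
    """Yield every nodes= value found in #PBS directive lines and,
    assuming qsub forwards them, in the command-line arguments."""
    candidates = [line for line in script_lines if line.startswith("#PBS")]
    candidates.append(" ".join(argv))
    for text in candidates:
        for match in NODES_RE.finditer(text):
            yield match.group(1)


def main(stdin, stdout, stderr, argv):
    """Pass the script through unchanged, or reject it with status 1."""
    lines = stdin.readlines()
    for spec in find_nodes_specs(lines, argv):
        n = count_node_requests(spec)
        if n > MAX_NODE_REQUESTS:
            stderr.write(
                "job rejected: %d node requests, limit is %d\n"
                % (n, MAX_NODE_REQUESTS)
            )
            return 1
    stdout.writelines(lines)
    return 0

# To use this as a filter, end the script with:
#     sys.exit(main(sys.stdin, sys.stdout, sys.stderr, sys.argv[1:]))
```

This only looks at the number of '+'-separated requests; it does not try to 
validate the rest of the resource request.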

We also increased MAX_MPAR, MAX_MTASK and MAX_MCLASS. Those limits did 
not cause crashes, but they did leave jobs stuck in the queue. Needless 
to say, we had to recompile Maui after changing these values.
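
For the exec_host buffer, a rough way to estimate the size needed for a 
whole-cluster job is to build the longest string you expect and measure it. 
The sketch below assumes the one-entry-per-core format (host/index entries 
joined with '+'); check what your server actually reports before relying on 
it. The function names and the host names in the comment are made up for 
illustration.

```python
def exec_host_string(hosts, ppn):
    """Build an exec_host-style string with one 'host/index' entry per core,
    joined with '+'. The exact format is an assumption."""
    return "+".join(
        "%s/%d" % (host, index) for host in hosts for index in range(ppn)
    )


def required_buffer_size(hosts, ppn):
    """Length of the string plus one byte for the terminating NUL in C."""
    return len(exec_host_string(hosts, ppn)) + 1

# Example with made-up host names for a 500-node, 12-core-per-node cluster:
#     hosts = ["node%03d" % i for i in range(1, 501)]
#     print(required_buffer_size(hosts, 12))
```

Compare the result with the buffer limits in msched-common.h and leave some 
margin for future growth of the cluster.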

Hope this helps,


-- 
Michel Béland, analyste en calcul scientifique
michel.bel...@calculquebec.ca
bureau S-250, pavillon Roger-Gaudry (principal), Université de Montréal
téléphone : 514 343-6111 poste 3892     télécopieur : 514 343-2155
Calcul Québec (www.calculquebec.ca)
Calcul Canada (calculcanada.ca)

_______________________________________________
mauiusers mailing list
mauiusers@supercluster.org
http://www.supercluster.org/mailman/listinfo/mauiusers
