Eva Hocks wrote:
> maui crashed with
>
> Apr 20 00:00:27 dev-005 kernel: maui[27159]: segfault at 28 ip
> 000000345b87af1c sp 00007fffe7b5e7e8 error 4 in
> libc-2.12.so[345b800000+189000]
>
> Anyone know about this problem and how to fix it?

You did not give enough information. Can you reproduce the crash at will? How big is your cluster: how many nodes and how many cores? How many jobs were in the queue? Do certain jobs trigger the crash?

At our site, Maui initially had problems with jobs that took the whole cluster. When a job takes the whole cluster, its exec_host attribute becomes too long for the buffer that Maui uses to hold it. We worked out the largest size the buffer would need and increased MAX_MBUFFER and MMAX_BUFFER accordingly in msched-common.h.

Another thing that made Maui crash was a job with too many node requests, for example:

    -lnode=1:e1:ppn=12+1:e2:ppn=12+1:e3:ppn=12+1:e4:ppn=12+1:e5:ppn=12

We increased MAX_MREQ_PER_JOB and MMAX_REQ_PER_JOB in msched.h to solve this problem. It is a pain, though, because I think nothing stops a user from submitting a job with thousands of node requests, even if it will never run. That may be a job for the submit filter, though.

We also increased MAX_MPAR, MAX_MTASK and MAX_MCLASS. Those limits did not cause a crash, but they made jobs stay in the queue. Needless to say, we had to recompile Maui after changing these values.

Hope this helps,

-- 
Michel Béland, scientific computing analyst
michel.bel...@calculquebec.ca
office S-250, Pavillon Roger-Gaudry (main building), Université de Montréal
phone: 514 343-6111 ext. 3892
fax: 514 343-2155
Calcul Québec (www.calculquebec.ca)
Calcul Canada (calculcanada.ca)
_______________________________________________
mauiusers mailing list
mauiusers@supercluster.org
http://www.supercluster.org/mailman/listinfo/mauiusers
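
For what it is worth, the submit filter idea mentioned above could look something like the sketch below. It is only an illustration, not something we run in production. It assumes the usual TORQUE convention that qsub pipes the job script to the filter on stdin, passes its own command-line arguments to the filter, expects the (possibly modified) script back on stdout, and rejects the job when the filter exits with a non-zero status; please check the qsub documentation for your version. The limit MAX_NODE_REQUESTS and the helper names are hypothetical, and the limit should be kept at or below whatever MAX_MREQ_PER_JOB value Maui was compiled with.

```python
#!/usr/bin/env python3
"""Sketch of a qsub submit filter that rejects jobs with too many node requests."""
import re
import sys

# Hypothetical limit; keep it at or below the value compiled into Maui.
MAX_NODE_REQUESTS = 5

# Captures the node specification in "-l nodes=..." (also "-lnode=..." and
# "-l walltime=...,nodes=..."), stopping at the next comma or whitespace.
NODES_RE = re.compile(r"(?:^|\s)-l\s*(?:[^\s,]+,)*nodes?=([^,\s]+)")


def count_node_requests(text):
    """Return the largest number of '+'-separated node requests found in text."""
    counts = [len(spec.split("+")) for spec in NODES_RE.findall(text)]
    return max(counts, default=0)


def pbs_directives(script):
    """Keep only the '#PBS' directive lines of a job script."""
    lines = [ln for ln in script.splitlines() if ln.lstrip().startswith("#PBS")]
    return "\n".join(lines)


def main():
    script = sys.stdin.read()
    # Look at both the directives in the script and the qsub arguments.
    found = max(
        count_node_requests(pbs_directives(script)),
        count_node_requests(" ".join(sys.argv[1:])),
    )
    if found > MAX_NODE_REQUESTS:
        sys.stderr.write(
            "Job rejected: %d node requests (limit is %d)\n"
            % (found, MAX_NODE_REQUESTS)
        )
        return 1  # non-zero status: qsub should refuse the job
    sys.stdout.write(script)  # pass the script through unchanged
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

With the limit set to 5, the five-request example from the message would still be accepted, and a sixth "+1:e6:ppn=12" would be refused at submission time instead of reaching Maui.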