Thanks for the suggestions. I bumped up the following:

In mcom.h
  Changed MMAX_BUFFER from 65536 to 131072

In msched-common.h
  Changed MAX_MBUFFER from 65536 to 131072
  Changed MMAX_BUFFER from 65536 to 131072
  Changed MAX_MTASK from 4096 to 8192

In msched.h
  Changed MAX_MREQ_PER_JOB from 4 to 8

Once I did that, the job request no longer caused maui to segfault.

BTW, isn't there some way that this could just kick out some sort of error into the logs instead of silently causing a segfault? I spent quite a bit of time looking around in the logs and running maui through strace to see what might be wrong. If it logged something about these constants being too low, that would be a real big help in tracking down what needs to be changed. Also, it would be great if this were a run-time setting in a configuration file instead of something that requires editing include files and recompiling.
--
Steven DuChene

-----Original Message-----
From: DuChene, StevenX A
Sent: Tuesday, November 29, 2011 9:41 AM
To: 'Michel Béland'
Cc: [email protected]
Subject: RE: [Mauiusers] maui segfaults trying to schedule a job

OK, I see that MMAX_BUFFER is set to 65536 in mcom.h and MAX_MBUFFER is set to 65536 in msched-common.h.

Our node names are 8 characters long, and this job would be requesting 172 nodes specifically, so that would be 1376 characters.
--
Steven DuChene

-----Original Message-----
From: Michel Béland [mailto:[email protected]]
Sent: Tuesday, November 29, 2011 7:22 AM
To: DuChene, StevenX A
Cc: [email protected]
Subject: Re: [Mauiusers] maui segfaults trying to schedule a job

DuChene, StevenX A wrote:
>
> This morning I discovered that the maui scheduler process was not
> running on one of our clusters like it should. When I try to start the
> maui process as the maui user I get a segmentation fault. In checking
> the log files the last few entries look like this:
>
> (...)
>
> There is only this one job in the queue on a 256 node cluster running
> torque 2.5.7 and maui 3.2.6p21
>
> I have tried starting the maui process within strace but I do not see
> any smoking gun in that strace output.
>
> I can probably get maui to start if I qdel the job but I was sort of
> hoping to see what was causing the problem in case any additional
> debugging output was needed.
>

I guess that you have more than 16 cores per node, so that your job requests more than 4096 cores. In that case, you have to increase MAX_MTASK in include/msched-common.h and recompile. It has to be equal to or greater than the number of cores in the cluster.

You also have to watch the size of MAX_MBUFFER and MMAX_BUFFER in include/mcom.h and include/msched-common.h. These define the size of the buffer that contains the exec_host string. For large clusters, the default is too small, and large jobs will kill Maui after they have started execution. It is good to have short node names for that reason.

Other parameters to check are MAX_MNODE, MAX_MCLASS, MAX_MREQ_PER_JOB (or MMAX_REQ_PER_JOB), MAX_MRES, MAX_MNODE_PER_JOB, MAX_MNODE_PER_FRAG, MMAX_JOB.

--
Michel Béland, scientific computing analyst
[email protected]
office S-250, pavillon Roger-Gaudry (main building), Université de Montréal
phone: 514 343-6111 ext. 3892
fax: 514 343-2155
Calcul Québec (www.calculquebec.ca)
Calcul Canada (calculcanada.org)
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
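
To make the buffer-size discussion above concrete, here is a minimal C sketch. It is not taken from the Maui source. It assumes the exec_host string holds one "name/index+" entry per core (an assumption about the format), and the helper names estimate_exec_host_len and checked_copy are hypothetical. Under that assumption, 172 nodes with 8-character names and a hypothetical 24 cores each would need roughly 49,500 bytes, far more than the 1376 bytes obtained by counting each node name once. The second helper illustrates the kind of length check that reports which limit is too low instead of overrunning the buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Upper-bound estimate of the bytes needed for an exec_host-style string.
 * ASSUMPTION: one "name/index+" entry per core; this format is assumed
 * for illustration and not taken from the Maui source. */
static size_t estimate_exec_host_len(size_t nodes, size_t cores_per_node,
                                     size_t name_len)
{
    if (cores_per_node == 0)
        return 1; /* only the terminating NUL */

    /* Decimal digits in the largest core index (cores_per_node - 1). */
    size_t max_index = cores_per_node - 1;
    size_t digits = 1;
    while (max_index >= 10) {
        digits++;
        max_index /= 10;
    }

    /* name + '/' + index digits + '+' separator */
    size_t per_entry = name_len + 1 + digits + 1;
    return nodes * cores_per_node * per_entry + 1; /* +1 for the NUL */
}

/* Hypothetical helper: copy src into dst only if it fits. On overflow,
 * report which compile-time limit is too small and return -1 instead of
 * writing past the end of the buffer. */
static int checked_copy(char *dst, size_t dst_size, const char *src,
                        const char *limit_name)
{
    assert(dst != NULL && src != NULL && limit_name != NULL);

    size_t need = strlen(src) + 1;
    if (need > dst_size) {
        fprintf(stderr,
                "ERROR: string needs %zu bytes but %s is %zu; "
                "increase %s and recompile\n",
                need, limit_name, dst_size, limit_name);
        return -1;
    }
    memcpy(dst, src, need);
    return 0;
}
```

A caller would compare estimate_exec_host_len(nodes, cores, name_len) against the configured buffer size before submitting very large jobs, and would treat a -1 from checked_copy as a reason to log and skip the job.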
