Paul,

My maui.cfg file has the line:

    USERCFG[DEFAULT] MAXPROC=780,2600

which limits all users to 780 cpus when others are contending for cpus,
and 2600 cpus in an uncontested environment.  Just specify one number for
MAXPROC to get a hard limit regardless of other users.
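For example, the hard-cap form would just be (a sketch only, I have not
run it this way myself):

    # one number only: hard cap of 780 procs regardless of other users
    USERCFG[DEFAULT] MAXPROC=780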
Regards,
Chris

On 07/18/2012 08:07 AM, Paul Raines wrote:
>
> I tried putting a watch on MSched.statfp to see if I could catch it
> getting corrupted, but I just ended up with a segfault in a different
> location, this time in the fprintf right before the fflush it segfaulted
> in last time, as you can see in the backtrace below.
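> In case anyone wants to try the same thing, the watchpoint session went
> roughly like this (typed from memory, so treat it as a sketch):
>
>   $ gdb /usr/sbin/maui
>   (gdb) run -d
>   [... let it get through startup, then hit Ctrl-C ...]
>   (gdb) watch MSched.statfp
>   (gdb) continue
>   [... runs until something writes to MSched.statfp ...]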
>
> So I went in and commented out all the CLASSCFG lines in my maui.cfg and
> restarted.  So far maui has been running longer than it ever has before
> without hanging or crashing.  However, the whole reason for the CLASSCFG
> lines was that maui seemed in the past to be ignoring the max_user_run
> set for each of my queues.  I will need to monitor things to see if that
> is still the case.
>
> One related question.  What I really want to limit on a per-queue basis
> is not the number of jobs but the number of CPUs a user has running.  Is
> there any way to do that?
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
> On Tue, 17 Jul 2012 4:03pm, Steve Johnson wrote:
>
>> On 07/17/2012 02:05 PM, Paul Raines wrote:
>>> No, I know nothing about that.  I think I can remove most of those
>>> CLASSCFG lines, as I was having problems in a previous torque getting
>>> max_user_run to actually work.  Or will just the fact that I have more
>>> than 16 queues defined in torque still be a problem?
>>>
>>> Seems like maui should give an error at startup saying there are too
>>> many CLASSCFG lines in the config if MAX_CLASS is exceeded.
>>
>> IIRC, maui will ignore any classes > 16, so it probably isn't clobbering
>> memory elsewhere.  But if you notice queues not getting scheduled, that
>> limit will be the problem unless you have a CLASSCFG[DEFAULT] defined.
>>
>>> Where is this documented?  What is the difference between MAX_MCLASS
>>> (default 64) and MAX_CLASS (default 16)?
>>
>> Documented?  Heh...good one. ;)
>>
>> It looks like MMAX_CLASS is used in src/moab/Mutil.c and src/mcom/MS3I.c,
>> whereas MAX_MCLASS is more widely used throughout the code.  Not sure if
>> they're directly related.
>>
>> You might check if there's a particular job that's triggering the
>> segfault/hang and see if there's anything abnormal in its characteristics
>> in Torque (uid, gid, super long or "strange" strings/paths, etc).  Try
>> setting a break in MJobWriteStats and examining variables.  If you find
>> a bogus address, work backward to see where it got clobbered.
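>> Something along these lines, off the top of my head:
>>
>>   (gdb) break MJobWriteStats
>>   (gdb) continue
>>   [... wait for the breakpoint to hit ...]
>>   (gdb) p MSched.statfp
>>   (gdb) p *MSched.statfp
>>   (gdb) p J->Name
>>
>> A clobbered FILE * usually stands out immediately when you print the
>> struct it points to.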
>> Sorry I can't offer more help.
>>
>> I had a crashing problem a couple weeks ago, but it appears to be
>> unrelated.  I followed the same path as you with gdb and also inserted
>> some conditional printf's in the source to finally track it down to
>> MMAX_JOBRA set too low.  Sadly, the process took several hours.  Why
>> such limits are hardcoded is beyond me.
>>
>> // Steve
>>
>>> Thanks
>>>
>>> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>>
>>> On Tue, 17 Jul 2012 2:56pm, Steve Johnson wrote:
>>>
>>>> It looks like you have 17 CLASSCFG lines.  Have you increased
>>>> MAX_MCLASS and MMAX_CLASS in include/msched-common.h?
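>>>> That is, something like this before rebuilding (macro names and
>>>> values from memory -- grep the header first, since the tree uses
>>>> MAX_CLASS, MAX_MCLASS and MMAX_CLASS inconsistently):
>>>>
>>>>   /* include/msched-common.h: raise the hardcoded class limits */
>>>>   #define MMAX_CLASS  32    /* default 16; 17 CLASSCFG lines exceed it */
>>>>   #define MAX_MCLASS  128   /* default 64; keep above MMAX_CLASS */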
>>>>
>>>> // Steve
>>>>
>>>> On 07/17/2012 12:42 PM, Paul Raines wrote:
>>>>>
>>>>> We have two separate clusters.  One is an ancient cluster with nodes
>>>>> that are dual Opterons and 4G RAM.  The other is newer, with dual
>>>>> quad-core Xeon E5472's and 32G RAM.  Recently we updated both
>>>>> clusters to CentOS6, torque-2.5.11 and maui 3.3.1, so
>>>>> OS/software/config-wise they are identical.  I built the torque/maui
>>>>> RPMs myself on an old Opteron node to install on both clusters.
>>>>>
>>>>> The older cluster has been running without any problems.  On the new
>>>>> one, though, maui keeps hanging or segfaulting within 1-8 hours of
>>>>> starting.  I installed the debuginfo RPMs and ran maui in the
>>>>> debugger.
>>>>>
>>>>> When it just hangs (doesn't crash, but doesn't respond to any tools
>>>>> such as showq), this is what I see:
>>>>>
>>>>> ==================================================================
>>>>> (gdb) run -d
>>>>> Starting program: /usr/sbin/maui -d
>>>>> *** glibc detected *** /usr/sbin/maui: corrupted double-linked list:
>>>>> 0x0000000007f106a0 ***
>>>>>
>>>>> ^C
>>>>> Program received signal SIGINT, Interrupt.
>>>>> 0x00000036cd2f542e in __lll_lock_wait_private () from /lib64/libc.so.6
>>>>> (gdb) bt
>>>>> #0  0x00000036cd2f542e in __lll_lock_wait_private () from /lib64/libc.so.6
>>>>> #1  0x00000036cd27bed5 in _L_lock_9323 () from /lib64/libc.so.6
>>>>> #2  0x00000036cd2797c6 in malloc () from /lib64/libc.so.6
>>>>> #3  0x00000036cca04c72 in local_strdup () from /lib64/ld-linux-x86-64.so.2
>>>>> #4  0x00000036cca08636 in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
>>>>> #5  0x00000036cca12994 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
>>>>> #6  0x00000036cca0e176 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
>>>>> #7  0x00000036cca1244a in _dl_open () from /lib64/ld-linux-x86-64.so.2
>>>>> #8  0x00000036cd323520 in do_dlopen () from /lib64/libc.so.6
>>>>> #9  0x00000036cca0e176 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
>>>>> #10 0x00000036cd323677 in __libc_dlopen_mode () from /lib64/libc.so.6
>>>>> #11 0x00000036cd2fbd51 in backtrace () from /lib64/libc.so.6
>>>>> #12 0x00000036cd26f98b in __libc_message () from /lib64/libc.so.6
>>>>> #13 0x00000036cd275296 in malloc_printerr () from /lib64/libc.so.6
>>>>> #14 0x00000036cd277efa in _int_free () from /lib64/libc.so.6
>>>>> #15 0x0000000000466136 in MUFree (Ptr=0x46bfbd0) at MUtil.c:460
>>>>> #16 0x00000000004499a5 in MUserDestroy (UP=0x46bfbd0) at MUser.c:682
>>>>> #17 0x00000000004499de in MUserFreeTable () at MUser.c:700
>>>>> #18 0x00000000004ac48f in MSysShutdown (Signo=0) at MSys.c:2540
>>>>> #19 0x0000000000418361 in UIProcessClients (SS=0x774d270,
>>>>>     TimeLimit=<value optimized out>) at UserI.c:527
>>>>> #20 0x0000000000405bb8 in main (ArgC=2, ArgV=<value optimized out>)
>>>>>     at Server.c:240
>>>>> (gdb) quit
>>>>> ==================================================================
>>>>>
>>>>> When it crashes, this is what I see:
>>>>>
>>>>> ==================================================================
>>>>> (gdb) run -d
>>>>> Starting program: /usr/sbin/maui -d
>>>>>
>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>> 0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
>>>>> 43        result = _IO_SYNC (fp) ? EOF : 0;
>>>>> (gdb) bt
>>>>> #0  0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
>>>>> #1  0x000000000047c07b in MJobWriteStats (J=0x9b61080) at MJob.c:7815
>>>>> #2  0x000000000048643e in MJobProcessCompleted (J=0x9b61080) at MJob.c:9562
>>>>> #3  0x00000000004a6eb8 in MPBSWorkloadQuery (R=0x6a4b2e0,
>>>>>     JCount=0x7ffffff7b938, SC=<value optimized out>) at MPBSI.c:871
>>>>> #4  0x000000000045f926 in __MUTFunc (V=0x7ffffff7b830) at MUtil.c:4718
>>>>> #5  0x0000000000462387 in MUThread (F=<value optimized out>,
>>>>>     TimeOut=<value optimized out>, RC=<value optimized out>,
>>>>>     ACount=<value optimized out>, Lock=<value optimized out>) at MUtil.c:4691
>>>>> #6  0x0000000000498ed4 in MRMWorkloadQuery (WCount=0x7ffffff7b98c, SC=0x0)
>>>>>     at MRM.c:595
>>>>> #7  0x000000000049cb19 in MRMGetInfo () at MRM.c:364
>>>>> #8  0x000000000042dc42 in MSchedProcessJobs (OldDay=0x7fffffffde40 "Tue",
>>>>>     GlobalSQ=0x7ffffffdbe30, GlobalHQ=0x7ffffffbbe30) at MSched.c:6930
>>>>> #9  0x0000000000405c46 in main (ArgC=2, ArgV=<value optimized out>)
>>>>>     at Server.c:192
>>>>> (gdb) frame
>>>>> #0  0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
>>>>> 43        result = _IO_SYNC (fp) ? EOF : 0;
>>>>> (gdb) frame 1
>>>>> #1  0x000000000047c07b in MJobWriteStats (J=0x9b61080) at MJob.c:7815
>>>>> 7815        fflush(MSched.statfp);
>>>>> (gdb) list MJob.c:7815
>>>>> 7810
>>>>> 7811      if (MJobToTString(J,DEFAULT_WORKLOAD_TRACE_VERSION,Buf,sizeof(Buf))
>>>>>     == SUCCESS)
>>>>> 7812        {
>>>>> 7813        fprintf(MSched.statfp,"%s",Buf);
>>>>> 7814
>>>>> 7815        fflush(MSched.statfp);
>>>>> 7816
>>>>> 7817        DBG(4,fSTAT) DPrint("INFO: job stats written for '%s'\n",
>>>>> 7818          J->Name);
>>>>> 7819        }
>>>>> (gdb) p Buf
>>>>> $3 = "16828", ' ' <repeats 18 times>, "0 1 coutu coutu 345600
>>>>> Completed [max100:1] 1342534818 1342534819 1342534819 1342535999 [NONE]
>>>>> [NONE] [NONE] >= 0M >= 0M [nonGPU] 1342534818 1 1 [NONE]:DEFA"...
>>>>> (gdb)
>>>>> ==================================================================
>>>>>
>>>>> My guess is some memory corruption has overwritten MSched.statfp,
>>>>> which is just a file handle, and thus fflush crashes when it actually
>>>>> tries to write to it.  Where that overwrite is occurring, though, is
>>>>> anyone's guess.
>>>>>
>>>>> I am hoping someone on this list might have a clue.  It is really a
>>>>> mystery to me why I only see this on one cluster.  They have exactly
>>>>> the same config except for the host name.  Here is my maui.cfg:
>>>>>
>>>>> ==================================================================
>>>>> ADMIN1                maui root
>>>>> ADMIN3                ALL
>>>>> ADMINHOST             launchpad.nmr.mgh.harvard.edu
>>>>> BACKFILLPOLICY        FIRSTFIT
>>>>> CLASSCFG[default]     MAXPROCPERUSER=150
>>>>> CLASSCFG[extended]    MAXPROCPERUSER=50 MAXPROC=250
>>>>> CLASSCFG[GPU]         MAXPROCPERUSER=5000
>>>>> CLASSCFG[matlab]      MAXPROCPERUSER=60
>>>>> CLASSCFG[max100]      MAXPROCPERUSER=100
>>>>> CLASSCFG[max10]       MAXPROCPERUSER=10
>>>>> CLASSCFG[max200]      MAXPROCPERUSER=200
>>>>> CLASSCFG[max20]       MAXPROCPERUSER=20
>>>>> CLASSCFG[max50]       MAXPROCPERUSER=50
>>>>> CLASSCFG[max75]       MAXPROCPERUSER=75
>>>>> CLASSCFG[p10]         MAXPROCPERUSER=5000
>>>>> CLASSCFG[p20]         MAXPROCPERUSER=5000
>>>>> CLASSCFG[p30]         MAXPROCPERUSER=5000
>>>>> CLASSCFG[p40]         MAXPROCPERUSER=5000
>>>>> CLASSCFG[p50]         MAXPROCPERUSER=30
>>>>> CLASSCFG[p5]          MAXPROCPERUSER=5000
>>>>> CLASSCFG[p60]         MAXPROCPERUSER=20
>>>>> CLASSWEIGHT           10
>>>>> ENABLEMULTIREQJOBS    TRUE
>>>>> ENFORCERESOURCELIMITS OFF
>>>>> LOGFILEMAXSIZE        1000000000
>>>>> LOGFILE               /var/spool/maui/log/maui.log
>>>>> LOGLEVEL              2
>>>>> NODEALLOCATIONPOLICY  PRIORITY
>>>>> NODECFG[DEFAULT]      PRIORITY=1000 PRIORITYF='PRIORITY + 3 * JOBCOUNT'
>>>>> QUEUETIMEWEIGHT       1
>>>>> RESERVATIONPOLICY     CURRENTHIGHEST
>>>>> RMCFG[base]           TYPE=PBS
>>>>> RMPOLLINTERVAL        00:00:30
>>>>> SERVERHOST            launchpad.nmr.mgh.harvard.edu
>>>>> SERVERMODE            NORMAL
>>>>> SERVERPORT            40559
>>>>> USERCFG[DEFAULT]      MAXIPROC=8
>>>>> USERCFG[jonghwan]     MAXPROC=300
>>>>> USERCFG[shafee]       MAXPROC=300
>>>>> ==================================================================
>>>>>
>>>>> I actually changed the LOGLEVEL from 3 to 2 at one point, thinking
>>>>> the error was happening when writing to the log and that lowering the
>>>>> amount it writes might affect things, but it didn't help.
>>>>>
>>>>> ---------------------------------------------------------------
>>>>> Paul Raines                  http://help.nmr.mgh.harvard.edu
>>>>> MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
>>>>> 149 (2301) 13th Street     Charlestown, MA 02129      USA

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
