On 07/17/2012 02:05 PM, Paul Raines wrote:
> No, I know nothing about that. I think I can remove most of those CLASSCFG
> lines, as I was having problems with a previous torque getting max_user_run
> to actually work. Or will just the fact that I have more than 16 queues
> defined in torque still be a problem?
>
> Seems like maui should then give an error at startup saying there are too
> many CLASSCFG lines in the config if MAX_CLASS is exceeded.
IIRC, maui will ignore any classes > 16, so it probably isn't clobbering
memory elsewhere. But if you notice queues not getting scheduled, that limit
will be the problem unless you have a CLASSCFG[DEFAULT] defined.

> Where is this documented? What is the difference between MAX_MCLASS (default
> 64) and MAX_CLASS (default 16)?

Documented? Heh...good one. ;)

It looks like MMAX_CLASS is used in src/moab/MUtil.c and src/mcom/MS3I.c,
whereas MAX_MCLASS is more widely used throughout the code. Not sure if
they're directly related. (There's a rough sketch of bumping those limits at
the very bottom of this message, below the quoted thread.)

You might check whether there's a particular job that's triggering the
segfault/hang and see if there's anything abnormal in its characteristics in
Torque (uid, gid, super long or "strange" strings/paths, etc). Try setting a
breakpoint in MJobWriteStats and examining variables. If you find a bogus
address, work backward to see where it got clobbered; a watchpoint is the
easiest way to do that, see the second sketch at the bottom.

Sorry I can't offer more help. I had a crashing problem a couple of weeks
ago, but it appears to be unrelated. I followed the same path as you with
gdb and also inserted some conditional printf's in the source (the third
sketch at the bottom shows the idea) to finally track it down to MMAX_JOBRA
being set too low. Sadly, the process took several hours. Why such limits
are hardcoded is beyond me.

// Steve

> Thanks
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
>
> On Tue, 17 Jul 2012 2:56pm, Steve Johnson wrote:
>
>> It looks like you have 17 CLASSCFG lines. Have you increased MAX_MCLASS
>> and MMAX_CLASS in include/msched-common.h?
>>
>> // Steve
>>
>>
>> On 07/17/2012 12:42 PM, Paul Raines wrote:
>>>
>>> We have two separate clusters. One is an ancient cluster with nodes
>>> that are dual Opterons with 4G RAM. The other is newer, with dual
>>> quad-core Xeon E5472s and 32G RAM. Recently we updated both clusters
>>> to CentOS 6, torque-2.5.11 and maui 3.3.1, so OS/software/config-wise
>>> they are identical. I built the torque/maui RPMs myself on an old
>>> Opteron node to install on both clusters.
>>>
>>> The older cluster has been running without any problems. On the new
>>> one, though, maui keeps hanging or segfaulting within 1-8 hours of
>>> starting. I installed the debuginfo RPMs and ran maui in the debugger.
>>>
>>> When it just hangs (doesn't crash, but doesn't respond to any tools
>>> such as showq), this is what I see:
>>>
>>> =========================================================================
>>> (gdb) run -d
>>> Starting program: /usr/sbin/maui -d
>>> *** glibc detected *** /usr/sbin/maui: corrupted double-linked list:
>>> 0x0000000007f106a0 ***
>>>
>>> ^C
>>> Program received signal SIGINT, Interrupt.
>>> 0x00000036cd2f542e in __lll_lock_wait_private () from /lib64/libc.so.6
>>> (gdb) bt
>>> #0  0x00000036cd2f542e in __lll_lock_wait_private () from /lib64/libc.so.6
>>> #1  0x00000036cd27bed5 in _L_lock_9323 () from /lib64/libc.so.6
>>> #2  0x00000036cd2797c6 in malloc () from /lib64/libc.so.6
>>> #3  0x00000036cca04c72 in local_strdup () from /lib64/ld-linux-x86-64.so.2
>>> #4  0x00000036cca08636 in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
>>> #5  0x00000036cca12994 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
>>> #6  0x00000036cca0e176 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
>>> #7  0x00000036cca1244a in _dl_open () from /lib64/ld-linux-x86-64.so.2
>>> #8  0x00000036cd323520 in do_dlopen () from /lib64/libc.so.6
>>> #9  0x00000036cca0e176 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
>>> #10 0x00000036cd323677 in __libc_dlopen_mode () from /lib64/libc.so.6
>>> #11 0x00000036cd2fbd51 in backtrace () from /lib64/libc.so.6
>>> #12 0x00000036cd26f98b in __libc_message () from /lib64/libc.so.6
>>> #13 0x00000036cd275296 in malloc_printerr () from /lib64/libc.so.6
>>> #14 0x00000036cd277efa in _int_free () from /lib64/libc.so.6
>>> #15 0x0000000000466136 in MUFree (Ptr=0x46bfbd0) at MUtil.c:460
>>> #16 0x00000000004499a5 in MUserDestroy (UP=0x46bfbd0) at MUser.c:682
>>> #17 0x00000000004499de in MUserFreeTable () at MUser.c:700
>>> #18 0x00000000004ac48f in MSysShutdown (Signo=0) at MSys.c:2540
>>> #19 0x0000000000418361 in UIProcessClients (SS=0x774d270,
>>>     TimeLimit=<value optimized out>) at UserI.c:527
>>> #20 0x0000000000405bb8 in main (ArgC=2, ArgV=<value optimized out>)
>>>     at Server.c:240
>>> (gdb) quit
>>> =========================================================================
>>>
>>> When it crashes, this is what I see:
>>>
>>> =========================================================================
>>> (gdb) run -d
>>> Starting program: /usr/sbin/maui -d
>>>
>>> Program received signal SIGSEGV, Segmentation fault.
>>> 0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
>>> 43        result = _IO_SYNC (fp) ? EOF : 0;
>>> (gdb)
>>> (gdb) bt
>>> #0  0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
>>> #1  0x000000000047c07b in MJobWriteStats (J=0x9b61080) at MJob.c:7815
>>> #2  0x000000000048643e in MJobProcessCompleted (J=0x9b61080) at MJob.c:9562
>>> #3  0x00000000004a6eb8 in MPBSWorkloadQuery (R=0x6a4b2e0,
>>>     JCount=0x7ffffff7b938, SC=<value optimized out>) at MPBSI.c:871
>>> #4  0x000000000045f926 in __MUTFunc (V=0x7ffffff7b830) at MUtil.c:4718
>>> #5  0x0000000000462387 in MUThread (F=<value optimized out>,
>>>     TimeOut=<value optimized out>, RC=<value optimized out>,
>>>     ACount=<value optimized out>, Lock=<value optimized out>)
>>>     at MUtil.c:4691
>>> #6  0x0000000000498ed4 in MRMWorkloadQuery (WCount=0x7ffffff7b98c, SC=0x0)
>>>     at MRM.c:595
>>> #7  0x000000000049cb19 in MRMGetInfo () at MRM.c:364
>>> #8  0x000000000042dc42 in MSchedProcessJobs (OldDay=0x7fffffffde40 "Tue",
>>>     GlobalSQ=0x7ffffffdbe30, GlobalHQ=0x7ffffffbbe30) at MSched.c:6930
>>> #9  0x0000000000405c46 in main (ArgC=2, ArgV=<value optimized out>)
>>>     at Server.c:192
>>> (gdb) frame
>>> #0  0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
>>> 43        result = _IO_SYNC (fp) ? EOF : 0;
>>> (gdb) frame 1
>>> #1  0x000000000047c07b in MJobWriteStats (J=0x9b61080) at MJob.c:7815
>>> 7815          fflush(MSched.statfp);
>>> (gdb) list MJob.c:7815
>>> 7810
>>> 7811      if (MJobToTString(J,DEFAULT_WORKLOAD_TRACE_VERSION,Buf,sizeof(Buf))
>>>             == SUCCESS)
>>> 7812        {
>>> 7813        fprintf(MSched.statfp,"%s",Buf);
>>> 7814
>>> 7815        fflush(MSched.statfp);
>>> 7816
>>> 7817        DBG(4,fSTAT) DPrint("INFO:     job stats written for '%s'\n",
>>> 7818          J->Name);
>>> 7819        }
>>> (gdb) p Buf
>>> $3 = "16828", ' ' <repeats 18 times>, "0 1 coutu coutu 345600
>>> Completed [max100:1] 1342534818 1342534819 1342534819 1342535999 [NONE]
>>> [NONE] [NONE] >= 0M >= 0M [nonGPU] 1342534818 1 1
>>> [NONE]:DEFA"...
>>> (gdb)
>>> =========================================================================
>>>
>>> My guess is that some memory corruption has overwritten MSched.statfp,
>>> which is just a file handle, so fflush crashes when it actually tries
>>> to write through it. Where that overwrite is occurring, though, is
>>> anyone's guess.
>>>
>>> I am hoping someone on this list might have a clue. It is really a
>>> mystery to me why I only see this on one cluster. They have exactly
>>> the same config except for host name. Here is my maui.cfg:
>>>
>>> =========================================================================
>>> ADMIN1                maui root
>>> ADMIN3                ALL
>>> ADMINHOST             launchpad.nmr.mgh.harvard.edu
>>> BACKFILLPOLICY        FIRSTFIT
>>> CLASSCFG[default]     MAXPROCPERUSER=150
>>> CLASSCFG[extended]    MAXPROCPERUSER=50 MAXPROC=250
>>> CLASSCFG[GPU]         MAXPROCPERUSER=5000
>>> CLASSCFG[matlab]      MAXPROCPERUSER=60
>>> CLASSCFG[max100]      MAXPROCPERUSER=100
>>> CLASSCFG[max10]       MAXPROCPERUSER=10
>>> CLASSCFG[max200]      MAXPROCPERUSER=200
>>> CLASSCFG[max20]       MAXPROCPERUSER=20
>>> CLASSCFG[max50]       MAXPROCPERUSER=50
>>> CLASSCFG[max75]       MAXPROCPERUSER=75
>>> CLASSCFG[p10]         MAXPROCPERUSER=5000
>>> CLASSCFG[p20]         MAXPROCPERUSER=5000
>>> CLASSCFG[p30]         MAXPROCPERUSER=5000
>>> CLASSCFG[p40]         MAXPROCPERUSER=5000
>>> CLASSCFG[p50]         MAXPROCPERUSER=30
>>> CLASSCFG[p5]          MAXPROCPERUSER=5000
>>> CLASSCFG[p60]         MAXPROCPERUSER=20
>>> CLASSWEIGHT           10
>>> ENABLEMULTIREQJOBS    TRUE
>>> ENFORCERESOURCELIMITS OFF
>>> LOGFILEMAXSIZE        1000000000
>>> LOGFILE               /var/spool/maui/log/maui.log
>>> LOGLEVEL              2
>>> NODEALLOCATIONPOLICY  PRIORITY
>>> NODECFG[DEFAULT]      PRIORITY=1000 PRIORITYF='PRIORITY + 3 * JOBCOUNT'
>>> QUEUETIMEWEIGHT       1
>>> RESERVATIONPOLICY     CURRENTHIGHEST
>>> RMCFG[base]           TYPE=PBS
>>> RMPOLLINTERVAL        00:00:30
>>> SERVERHOST            launchpad.nmr.mgh.harvard.edu
>>> SERVERMODE            NORMAL
>>> SERVERPORT            40559
>>> USERCFG[DEFAULT]      MAXIPROC=8
>>> USERCFG[jonghwan]     MAXPROC=300
>>> USERCFG[shafee]       MAXPROC=300
>>> =========================================================================
>>>
>>> I actually changed the LOGLEVEL from 3 to 2 at one point, thinking the
>>> error was happening while writing to the log and that lowering the
>>> amount it writes might affect things, but it didn't help.
>>>
>>> ---------------------------------------------------------------
>>> Paul Raines                     http://help.nmr.mgh.harvard.edu
>>> MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
>>> 149 (2301) 13th Street          Charlestown, MA 02129  USA
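
P.S. Since "where is this documented?" came up, here is roughly what the
limit bump looks like. This is a sketch from memory, not a patch: the thread
itself spells the macros two different ways (MAX_CLASS vs. MMAX_CLASS), so
grep include/msched-common.h in your own tree for the real names and
defaults before touching anything. These are compile-time limits, so you
have to rebuild your maui RPM and restart maui afterwards.

=========================================================================
/* include/msched-common.h -- names and defaults assumed, verify locally */

#define MAX_MCLASS  64   /* reportedly defaults to 64 */
#define MMAX_CLASS  32   /* reportedly defaults to 16; you have 17
                            CLASSCFG lines, so give yourself headroom */
=========================================================================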
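
And here is the watchpoint recipe I mentioned, for catching whatever is
scribbling over MSched.statfp in the act. MSched is a global, so gdb can put
a hardware watchpoint on the member directly; the commands below are
standard gdb, but the session is a sketch, not a capture:

=========================================================================
$ gdb /usr/sbin/maui
(gdb) watch MSched.statfp
Hardware watchpoint 1: MSched.statfp
(gdb) run -d
=========================================================================

The first stop should be the legitimate assignment when maui opens the
stats file; note the FILE * value and continue. The next time the
watchpoint fires, if the new value is not a plausible FILE pointer and the
backtrace has nothing to do with opening the stats file, that backtrace is
your corrupter. Hardware watchpoints live in the x86 debug registers, so
unlike software watchpoints they add essentially no slowdown, which matters
when the problem takes 1-8 hours to show up.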

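Lastly, if running under the debugger that long is impractical, the
conditional-printf route I used on my MMAX_JOBRA problem looks something
like this. The helper name MStatFPCheck is mine, not maui's; call it as
MStatFPCheck(MSched.statfp) at the top of MJobWriteStats() and rebuild:

=========================================================================
/* Debugging aid (sketch): squawk and dump core the first time the stats
 * file handle differs from the one seen on the first call. */

#include <stdio.h>
#include <stdlib.h>

static void MStatFPCheck(FILE *fp)
  {
  static FILE *first = NULL;

  if (first == NULL)
    {
    first = fp;   /* remember the known-good handle */
    }
  else if (fp != first)
    {
    fprintf(stderr,"ALERT: statfp changed from %p to %p\n",
      (void *)first,(void *)fp);

    abort();      /* core dump captures the state at detection */
    }
  }
=========================================================================

Note this only detects the clobber at the next stats write, not the instant
it happens, but the core from abort() plus the known-good pointer value
narrows things down a lot.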