No, I know nothing about that. I think I can remove most of those CLASSCFG lines, as I was having problems in a previous torque version getting max_user_run to actually work. Or will just the fact that I have more than 16 queues defined in torque still be a problem?
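A quick way to check how many distinct classes a maui.cfg defines is sketched below. This is only an illustration: the limit of 16 is the MAX_CLASS default quoted in this thread and should be checked against include/msched-common.h in your own source tree, and `count_classes`, `ASSUMED_CLASS_LIMIT` and the sample config are hypothetical names, not part of maui. It also assumes `#` starts a comment in maui.cfg.

```python
import re

# Assumed limit: the MAX_CLASS default of 16 quoted in this thread.
# Verify it against include/msched-common.h before relying on it.
ASSUMED_CLASS_LIMIT = 16

# Matches lines such as: CLASSCFG[max100] MAXPROCPERUSER=100
CLASSCFG_RE = re.compile(r"^\s*CLASSCFG\[([^\]]+)\]")


def count_classes(cfg_text):
    """Return the sorted distinct class names found in CLASSCFG[...] lines."""
    names = set()
    for line in cfg_text.splitlines():
        line = line.split("#", 1)[0]  # assumption: '#' starts a comment
        match = CLASSCFG_RE.match(line)
        if match:
            names.add(match.group(1))
    return sorted(names)


# Hypothetical sample; in practice read the text of your own maui.cfg.
sample = """
CLASSCFG[default]  MAXPROCPERUSER=150
CLASSCFG[extended] MAXPROCPERUSER=50 MAXPROC=250
# CLASSCFG[old]    MAXPROCPERUSER=1
CLASSWEIGHT 10
"""

classes = count_classes(sample)
print(len(classes), classes)
if len(classes) > ASSUMED_CLASS_LIMIT:
    print("more classes than the assumed limit of", ASSUMED_CLASS_LIMIT)
```

This counts only classes named in maui.cfg. Queues defined in torque but never mentioned in CLASSCFG would have to be counted separately.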
Seems like maui should then give an error at startup saying too many CLASSCFG in the config if MAX_CLASS is exceeded. Where is this documented? What is the difference between MAX_MCLASS (default 64) and MAX_CLASS (default 16)?

Thanks

-- Paul Raines (http://help.nmr.mgh.harvard.edu)

On Tue, 17 Jul 2012 2:56pm, Steve Johnson wrote:

> It looks like you have 17 CLASSCFG lines. Have you increased MAX_MCLASS and
> MMAX_CLASS in include/msched-common.h?
>
> // Steve
>
>
> On 07/17/2012 12:42 PM, Paul Raines wrote:
>>
>> We have two separate clusters. One is an ancient cluster with nodes that are
>> dual Opterons and 4G RAM. The other is newer with dual quad Xeon E5472's and
>> 32G RAM. Recently we updated both clusters to CentOS6, torque-2.5.11 and
>> maui 3.3.1. So OS/software/config-wise they are identical. I built
>> torque/maui RPMs myself on an old Opteron node to install on both clusters.
>>
>> The older cluster has been running without any problems. On the new one
>> though maui keeps hanging or segfaulting within 1-8 hours of starting maui.
>> I installed the debuginfo RPMs and ran maui in the debugger.
>>
>> When it just hangs (doesn't crash but doesn't respond to any tools such
>> as showq), this is what I see:
>>
>> =========================================================================
>> (gdb) run -d
>> Starting program: /usr/sbin/maui -d
>> *** glibc detected *** /usr/sbin/maui: corrupted double-linked list: 0x0000000007f106a0 ***
>>
>>
>> ^C
>> Program received signal SIGINT, Interrupt.
>> 0x00000036cd2f542e in __lll_lock_wait_private () from /lib64/libc.so.6
>> (gdb) bt
>> #0  0x00000036cd2f542e in __lll_lock_wait_private () from /lib64/libc.so.6
>> #1  0x00000036cd27bed5 in _L_lock_9323 () from /lib64/libc.so.6
>> #2  0x00000036cd2797c6 in malloc () from /lib64/libc.so.6
>> #3  0x00000036cca04c72 in local_strdup () from /lib64/ld-linux-x86-64.so.2
>> #4  0x00000036cca08636 in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
>> #5  0x00000036cca12994 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
>> #6  0x00000036cca0e176 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
>> #7  0x00000036cca1244a in _dl_open () from /lib64/ld-linux-x86-64.so.2
>> #8  0x00000036cd323520 in do_dlopen () from /lib64/libc.so.6
>> #9  0x00000036cca0e176 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
>> #10 0x00000036cd323677 in __libc_dlopen_mode () from /lib64/libc.so.6
>> #11 0x00000036cd2fbd51 in backtrace () from /lib64/libc.so.6
>> #12 0x00000036cd26f98b in __libc_message () from /lib64/libc.so.6
>> #13 0x00000036cd275296 in malloc_printerr () from /lib64/libc.so.6
>> #14 0x00000036cd277efa in _int_free () from /lib64/libc.so.6
>> #15 0x0000000000466136 in MUFree (Ptr=0x46bfbd0) at MUtil.c:460
>> #16 0x00000000004499a5 in MUserDestroy (UP=0x46bfbd0) at MUser.c:682
>> #17 0x00000000004499de in MUserFreeTable () at MUser.c:700
>> #18 0x00000000004ac48f in MSysShutdown (Signo=0) at MSys.c:2540
>> #19 0x0000000000418361 in UIProcessClients (SS=0x774d270, TimeLimit=<value optimized out>) at UserI.c:527
>> #20 0x0000000000405bb8 in main (ArgC=2, ArgV=<value optimized out>) at Server.c:240
>> (gdb) quit
>> =========================================================================
>>
>>
>> When it crashes this is what I see
>>
>> =========================================================================
>> (gdb) run -d
>> Starting program: /usr/sbin/maui -d
>>
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
>> 43        result = _IO_SYNC (fp) ? EOF : 0;
>> (gdb)
>> (gdb) bt
>> #0  0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
>> #1  0x000000000047c07b in MJobWriteStats (J=0x9b61080) at MJob.c:7815
>> #2  0x000000000048643e in MJobProcessCompleted (J=0x9b61080) at MJob.c:9562
>> #3  0x00000000004a6eb8 in MPBSWorkloadQuery (R=0x6a4b2e0, JCount=0x7ffffff7b938, SC=<value optimized out>) at MPBSI.c:871
>> #4  0x000000000045f926 in __MUTFunc (V=0x7ffffff7b830) at MUtil.c:4718
>> #5  0x0000000000462387 in MUThread (F=<value optimized out>, TimeOut=<value optimized out>, RC=<value optimized out>, ACount=<value optimized out>, Lock=<value optimized out>) at MUtil.c:4691
>> #6  0x0000000000498ed4 in MRMWorkloadQuery (WCount=0x7ffffff7b98c, SC=0x0) at MRM.c:595
>> #7  0x000000000049cb19 in MRMGetInfo () at MRM.c:364
>> #8  0x000000000042dc42 in MSchedProcessJobs (OldDay=0x7fffffffde40 "Tue", GlobalSQ=0x7ffffffdbe30, GlobalHQ=0x7ffffffbbe30) at MSched.c:6930
>> #9  0x0000000000405c46 in main (ArgC=2, ArgV=<value optimized out>) at Server.c:192
>> (gdb) frame
>> #0  0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
>> 43        result = _IO_SYNC (fp) ? EOF : 0;
>> (gdb) frame 1
>> #1  0x000000000047c07b in MJobWriteStats (J=0x9b61080) at MJob.c:7815
>> 7815        fflush(MSched.statfp);
>> (gdb) list MJob.c:7815
>> 7810
>> 7811      if (MJobToTString(J,DEFAULT_WORKLOAD_TRACE_VERSION,Buf,sizeof(Buf)) == SUCCESS)
>> 7812        {
>> 7813        fprintf(MSched.statfp,"%s",Buf);
>> 7814
>> 7815        fflush(MSched.statfp);
>> 7816
>> 7817        DBG(4,fSTAT) DPrint("INFO: job stats written for '%s'\n",
>> 7818          J->Name);
>> 7819        }
>> (gdb) p Buf
>> $3 = "16828", ' ' <repeats 18 times>, "0 1 coutu coutu 345600
>> Completed [max100:1] 1342534818 1342534819 1342534819 1342535999 [NONE]
>> [NONE] [NONE] >= 0M >= 0M [nonGPU] 1342534818 1 1
>> [NONE]:DEFA"...
>> (gdb)
>> =========================================================================
>>
>> My guess is some memory corruption has overwritten MSched.statfp which is
>> just a file handle and thus fflush crashes when it actually tries to
>> write to it. Where that overwrite is occurring though is anyone's guess.
>>
>> I am hoping someone on this list might have a clue. It is really a mystery
>> to me why I only see this on one cluster. They are exactly the same config
>> except for host name. Here is my maui.cfg
>>
>> =========================================================================
>> ADMIN1 maui root
>> ADMIN3 ALL
>> ADMINHOST launchpad.nmr.mgh.harvard.edu
>> BACKFILLPOLICY FIRSTFIT
>> CLASSCFG[default] MAXPROCPERUSER=150
>> CLASSCFG[extended] MAXPROCPERUSER=50 MAXPROC=250
>> CLASSCFG[GPU] MAXPROCPERUSER=5000
>> CLASSCFG[matlab] MAXPROCPERUSER=60
>> CLASSCFG[max100] MAXPROCPERUSER=100
>> CLASSCFG[max10] MAXPROCPERUSER=10
>> CLASSCFG[max200] MAXPROCPERUSER=200
>> CLASSCFG[max20] MAXPROCPERUSER=20
>> CLASSCFG[max50] MAXPROCPERUSER=50
>> CLASSCFG[max75] MAXPROCPERUSER=75
>> CLASSCFG[p10] MAXPROCPERUSER=5000
>> CLASSCFG[p20] MAXPROCPERUSER=5000
>> CLASSCFG[p30] MAXPROCPERUSER=5000
>> CLASSCFG[p40] MAXPROCPERUSER=5000
>> CLASSCFG[p50] MAXPROCPERUSER=30
>> CLASSCFG[p5] MAXPROCPERUSER=5000
>> CLASSCFG[p60] MAXPROCPERUSER=20
>> CLASSWEIGHT 10
>> ENABLEMULTIREQJOBS TRUE
>> ENFORCERESOURCELIMITS OFF
>> LOGFILEMAXSIZE 1000000000
>> LOGFILE /var/spool/maui/log/maui.log
>> LOGLEVEL 2
>> NODEALLOCATIONPOLICY PRIORITY
>> NODECFG[DEFAULT] PRIORITY=1000 PRIORITYF='PRIORITY + 3 * JOBCOUNT'
>> QUEUETIMEWEIGHT 1
>> RESERVATIONPOLICY CURRENTHIGHEST
>> RMCFG[base] TYPE=PBS
>> RMPOLLINTERVAL 00:00:30
>> SERVERHOST launchpad.nmr.mgh.harvard.edu
>> SERVERMODE NORMAL
>> SERVERPORT 40559
>> USERCFG[DEFAULT] MAXIPROC=8
>> USERCFG[jonghwan] MAXPROC=300
>> USERCFG[shafee] MAXPROC=300
>>
>> I actually changed the LOGLEVEL from 3 to 2 at one point thinking the
>> error is happening when
>> writing to the log and lowering the amount it
>> writes might affect things, but it didn't help.
>>
>> ---------------------------------------------------------------
>> Paul Raines http://help.nmr.mgh.harvard.edu
>> MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
>> 149 (2301) 13th Street Charlestown, MA 02129 USA
>>
>> _______________________________________________
>> mauiusers mailing list
>> [email protected]
>> http://www.supercluster.org/mailman/listinfo/mauiusers
