Paul,

My maui.cfg file has the line:

    USERCFG[DEFAULT] MAXPROC=780,2600

which limits all users to 780 cpus when others are contending for cpus,
and 2600 cpus in an uncontested environment.  Just specify one number for
MAXPROC to get a hard limit regardless of other users.
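For example, the hard-cap form would just be (a sketch only, I have not
run it this way myself):

    # one number only: hard cap of 780 procs regardless of other users
    USERCFG[DEFAULT] MAXPROC=780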
Regards,
Chris

On 07/18/2012 08:07 AM, Paul Raines wrote:
>
> I tried putting a watch on MSched.statfp to see if I could catch it
> getting corrupted, but I just ended up with a segfault in a different
> location, this time in the fprintf right before the fflush it segfaulted
> in last time, as you can see in the backtrace below.
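> In case anyone wants to try the same thing, the watchpoint session went
> roughly like this (typed from memory, so treat it as a sketch):
>
>   $ gdb /usr/sbin/maui
>   (gdb) run -d
>   [... let it get through startup, then hit Ctrl-C ...]
>   (gdb) watch MSched.statfp
>   (gdb) continue
>   [... runs until something writes to MSched.statfp ...]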
>
> So I went in and commented out all the CLASSCFG lines in my maui.cfg and
> restarted.  So far maui has been running longer than it ever has before
> without hanging or crashing.  However, the whole reason for the CLASSCFG
> lines was that maui seemed in the past to be ignoring the max_user_run
> set for each of my queues.  I will need to monitor things to see if that
> is still the case.
>
> One related question.  What I really want to limit on a per-queue basis
> is not the number of jobs but the number of CPUs a user has running.  Is
> there any way to do that?
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
> On Tue, 17 Jul 2012 4:03pm, Steve Johnson wrote:
>
>> On 07/17/2012 02:05 PM, Paul Raines wrote:
>>> No, I know nothing about that.  I think I can remove most of those
>>> CLASSCFG lines, as I was having problems in a previous torque getting
>>> max_user_run to actually work.  Or will just the fact that I have more
>>> than 16 queues defined in torque still be a problem?
>>>
>>> Seems like maui should give an error at startup saying there are too
>>> many CLASSCFG lines in the config if MAX_CLASS is exceeded.
>>
>> IIRC, maui will ignore any classes > 16, so it probably isn't clobbering
>> memory elsewhere.  But if you notice queues not getting scheduled, that
>> limit will be the problem unless you have a CLASSCFG[DEFAULT] defined.
>>
>>> Where is this documented?  What is the difference between MAX_MCLASS
>>> (default 64) and MAX_CLASS (default 16)?
>>
>> Documented?  Heh...good one. ;)
>>
>> It looks like MMAX_CLASS is used in src/moab/Mutil.c and src/mcom/MS3I.c,
>> whereas MAX_MCLASS is more widely used throughout the code.  Not sure if
>> they're directly related.
>>
>> You might check if there's a particular job that's triggering the
>> segfault/hang and see if there's anything abnormal in its characteristics
>> in Torque (uid, gid, super long or "strange" strings/paths, etc).  Try
>> setting a break in MJobWriteStats and examining variables.  If you find
>> a bogus address, work backward to see where it got clobbered.
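>> Something along these lines, off the top of my head:
>>
>>   (gdb) break MJobWriteStats
>>   (gdb) continue
>>   [... wait for the breakpoint to hit ...]
>>   (gdb) p MSched.statfp
>>   (gdb) p *MSched.statfp
>>   (gdb) p J->Name
>>
>> A clobbered FILE * usually stands out immediately when you print the
>> struct it points to.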
>> Sorry I can't offer more help.
>>
>> I had a crashing problem a couple weeks ago, but it appears to be
>> unrelated.  I followed the same path as you with gdb and also inserted
>> some conditional printf's in the source to finally track it down to
>> MMAX_JOBRA set too low.  Sadly, the process took several hours.  Why
>> such limits are hardcoded is beyond me.
>>
>> // Steve
>>
>>> Thanks
>>>
>>> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>>>
>>> On Tue, 17 Jul 2012 2:56pm, Steve Johnson wrote:
>>>
>>>> It looks like you have 17 CLASSCFG lines.  Have you increased
>>>> MAX_MCLASS and MMAX_CLASS in include/msched-common.h?
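>>>> That is, something like this before rebuilding (macro names and
>>>> values from memory -- grep the header first, since the tree uses
>>>> MAX_CLASS, MAX_MCLASS and MMAX_CLASS inconsistently):
>>>>
>>>>   /* include/msched-common.h: raise the hardcoded class limits */
>>>>   #define MMAX_CLASS  32    /* default 16; 17 CLASSCFG lines exceed it */
>>>>   #define MAX_MCLASS  128   /* default 64; keep above MMAX_CLASS */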
>>>>
>>>> // Steve
>>>>
>>>> On 07/17/2012 12:42 PM, Paul Raines wrote:
>>>>>
>>>>> We have two separate clusters.  One is an ancient cluster with nodes
>>>>> that are dual Opterons and 4G RAM.  The other is newer, with dual
>>>>> quad-core Xeon E5472's and 32G RAM.  Recently we updated both
>>>>> clusters to CentOS6, torque-2.5.11 and maui 3.3.1, so
>>>>> OS/software/config-wise they are identical.  I built the torque/maui
>>>>> RPMs myself on an old Opteron node to install on both clusters.
>>>>>
>>>>> The older cluster has been running without any problems.  On the new
>>>>> one, though, maui keeps hanging or segfaulting within 1-8 hours of
>>>>> starting.  I installed the debuginfo RPMs and ran maui in the
>>>>> debugger.
>>>>>
>>>>> When it just hangs (doesn't crash, but doesn't respond to any tools
>>>>> such as showq), this is what I see:
>>>>>
>>>>> ==================================================================
>>>>> (gdb) run -d
>>>>> Starting program: /usr/sbin/maui -d
>>>>> *** glibc detected *** /usr/sbin/maui: corrupted double-linked list:
>>>>> 0x0000000007f106a0 ***
>>>>>
>>>>> ^C
>>>>> Program received signal SIGINT, Interrupt.
>>>>> 0x00000036cd2f542e in __lll_lock_wait_private () from /lib64/libc.so.6
>>>>> (gdb) bt
>>>>> #0  0x00000036cd2f542e in __lll_lock_wait_private () from /lib64/libc.so.6
>>>>> #1  0x00000036cd27bed5 in _L_lock_9323 () from /lib64/libc.so.6
>>>>> #2  0x00000036cd2797c6 in malloc () from /lib64/libc.so.6
>>>>> #3  0x00000036cca04c72 in local_strdup () from /lib64/ld-linux-x86-64.so.2
>>>>> #4  0x00000036cca08636 in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
>>>>> #5  0x00000036cca12994 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
>>>>> #6  0x00000036cca0e176 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
>>>>> #7  0x00000036cca1244a in _dl_open () from /lib64/ld-linux-x86-64.so.2
>>>>> #8  0x00000036cd323520 in do_dlopen () from /lib64/libc.so.6
>>>>> #9  0x00000036cca0e176 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
>>>>> #10 0x00000036cd323677 in __libc_dlopen_mode () from /lib64/libc.so.6
>>>>> #11 0x00000036cd2fbd51 in backtrace () from /lib64/libc.so.6
>>>>> #12 0x00000036cd26f98b in __libc_message () from /lib64/libc.so.6
>>>>> #13 0x00000036cd275296 in malloc_printerr () from /lib64/libc.so.6
>>>>> #14 0x00000036cd277efa in _int_free () from /lib64/libc.so.6
>>>>> #15 0x0000000000466136 in MUFree (Ptr=0x46bfbd0) at MUtil.c:460
>>>>> #16 0x00000000004499a5 in MUserDestroy (UP=0x46bfbd0) at MUser.c:682
>>>>> #17 0x00000000004499de in MUserFreeTable () at MUser.c:700
>>>>> #18 0x00000000004ac48f in MSysShutdown (Signo=0) at MSys.c:2540
>>>>> #19 0x0000000000418361 in UIProcessClients (SS=0x774d270,
>>>>>     TimeLimit=<value optimized out>) at UserI.c:527
>>>>> #20 0x0000000000405bb8 in main (ArgC=2, ArgV=<value optimized out>)
>>>>>     at Server.c:240
>>>>> (gdb) quit
>>>>> ==================================================================
>>>>>
>>>>> When it crashes, this is what I see:
>>>>>
>>>>> ==================================================================
>>>>> (gdb) run -d
>>>>> Starting program: /usr/sbin/maui -d
>>>>>
>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>> 0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
>>>>> 43        result = _IO_SYNC (fp) ? EOF : 0;
>>>>> (gdb) bt
>>>>> #0  0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
>>>>> #1  0x000000000047c07b in MJobWriteStats (J=0x9b61080) at MJob.c:7815
>>>>> #2  0x000000000048643e in MJobProcessCompleted (J=0x9b61080) at MJob.c:9562
>>>>> #3  0x00000000004a6eb8 in MPBSWorkloadQuery (R=0x6a4b2e0,
>>>>>     JCount=0x7ffffff7b938, SC=<value optimized out>) at MPBSI.c:871
>>>>> #4  0x000000000045f926 in __MUTFunc (V=0x7ffffff7b830) at MUtil.c:4718
>>>>> #5  0x0000000000462387 in MUThread (F=<value optimized out>,
>>>>>     TimeOut=<value optimized out>, RC=<value optimized out>,
>>>>>     ACount=<value optimized out>, Lock=<value optimized out>) at MUtil.c:4691
>>>>> #6  0x0000000000498ed4 in MRMWorkloadQuery (WCount=0x7ffffff7b98c, SC=0x0)
>>>>>     at MRM.c:595
>>>>> #7  0x000000000049cb19 in MRMGetInfo () at MRM.c:364
>>>>> #8  0x000000000042dc42 in MSchedProcessJobs (OldDay=0x7fffffffde40 "Tue",
>>>>>     GlobalSQ=0x7ffffffdbe30, GlobalHQ=0x7ffffffbbe30) at MSched.c:6930
>>>>> #9  0x0000000000405c46 in main (ArgC=2, ArgV=<value optimized out>)
>>>>>     at Server.c:192
>>>>> (gdb) frame
>>>>> #0  0x00000036cd265ee7 in _IO_fflush (fp=0x7f0d010) at iofflush.c:43
>>>>> 43        result = _IO_SYNC (fp) ? EOF : 0;
>>>>> (gdb) frame 1
>>>>> #1  0x000000000047c07b in MJobWriteStats (J=0x9b61080) at MJob.c:7815
>>>>> 7815        fflush(MSched.statfp);
>>>>> (gdb) list MJob.c:7815
>>>>> 7810
>>>>> 7811      if (MJobToTString(J,DEFAULT_WORKLOAD_TRACE_VERSION,Buf,sizeof(Buf))
>>>>>     == SUCCESS)
>>>>> 7812        {
>>>>> 7813        fprintf(MSched.statfp,"%s",Buf);
>>>>> 7814
>>>>> 7815        fflush(MSched.statfp);
>>>>> 7816
>>>>> 7817        DBG(4,fSTAT) DPrint("INFO: job stats written for '%s'\n",
>>>>> 7818          J->Name);
>>>>> 7819        }
>>>>> (gdb) p Buf
>>>>> $3 = "16828", ' ' <repeats 18 times>, "0 1 coutu coutu 345600
>>>>> Completed [max100:1] 1342534818 1342534819 1342534819 1342535999 [NONE]
>>>>> [NONE] [NONE] >= 0M >= 0M [nonGPU] 1342534818 1 1 [NONE]:DEFA"...
>>>>> (gdb)
>>>>> ==================================================================
>>>>>
>>>>> My guess is some memory corruption has overwritten MSched.statfp,
>>>>> which is just a file handle, and thus fflush crashes when it actually
>>>>> tries to write to it.  Where that overwrite is occurring, though, is
>>>>> anyone's guess.
>>>>>
>>>>> I am hoping someone on this list might have a clue.  It is really a
>>>>> mystery to me why I only see this on one cluster.  They have exactly
>>>>> the same config except for the host name.  Here is my maui.cfg:
>>>>>
>>>>> ==================================================================
>>>>> ADMIN1                maui root
>>>>> ADMIN3                ALL
>>>>> ADMINHOST             launchpad.nmr.mgh.harvard.edu
>>>>> BACKFILLPOLICY        FIRSTFIT
>>>>> CLASSCFG[default]     MAXPROCPERUSER=150
>>>>> CLASSCFG[extended]    MAXPROCPERUSER=50 MAXPROC=250
>>>>> CLASSCFG[GPU]         MAXPROCPERUSER=5000
>>>>> CLASSCFG[matlab]      MAXPROCPERUSER=60
>>>>> CLASSCFG[max100]      MAXPROCPERUSER=100
>>>>> CLASSCFG[max10]       MAXPROCPERUSER=10
>>>>> CLASSCFG[max200]      MAXPROCPERUSER=200
>>>>> CLASSCFG[max20]       MAXPROCPERUSER=20
>>>>> CLASSCFG[max50]       MAXPROCPERUSER=50
>>>>> CLASSCFG[max75]       MAXPROCPERUSER=75
>>>>> CLASSCFG[p10]         MAXPROCPERUSER=5000
>>>>> CLASSCFG[p20]         MAXPROCPERUSER=5000
>>>>> CLASSCFG[p30]         MAXPROCPERUSER=5000
>>>>> CLASSCFG[p40]         MAXPROCPERUSER=5000
>>>>> CLASSCFG[p50]         MAXPROCPERUSER=30
>>>>> CLASSCFG[p5]          MAXPROCPERUSER=5000
>>>>> CLASSCFG[p60]         MAXPROCPERUSER=20
>>>>> CLASSWEIGHT           10
>>>>> ENABLEMULTIREQJOBS    TRUE
>>>>> ENFORCERESOURCELIMITS OFF
>>>>> LOGFILEMAXSIZE        1000000000
>>>>> LOGFILE               /var/spool/maui/log/maui.log
>>>>> LOGLEVEL              2
>>>>> NODEALLOCATIONPOLICY  PRIORITY
>>>>> NODECFG[DEFAULT]      PRIORITY=1000 PRIORITYF='PRIORITY + 3 * JOBCOUNT'
>>>>> QUEUETIMEWEIGHT       1
>>>>> RESERVATIONPOLICY     CURRENTHIGHEST
>>>>> RMCFG[base]           TYPE=PBS
>>>>> RMPOLLINTERVAL        00:00:30
>>>>> SERVERHOST            launchpad.nmr.mgh.harvard.edu
>>>>> SERVERMODE            NORMAL
>>>>> SERVERPORT            40559
>>>>> USERCFG[DEFAULT]      MAXIPROC=8
>>>>> USERCFG[jonghwan]     MAXPROC=300
>>>>> USERCFG[shafee]       MAXPROC=300
>>>>> ==================================================================
>>>>>
>>>>> I actually changed the LOGLEVEL from 3 to 2 at one point, thinking
>>>>> the error was happening when writing to the log and that lowering the
>>>>> amount it writes might affect things, but it didn't help.
>>>>>
>>>>> ---------------------------------------------------------------
>>>>> Paul Raines                  http://help.nmr.mgh.harvard.edu
>>>>> MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
>>>>> 149 (2301) 13th Street     Charlestown, MA 02129      USA

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
