Dr. Stephan Raub,

Maui does have some very odd "memory management" that tends to cause these types of crashes in high-volume situations unless some tweaks and/or concessions are made. I've tracked down, and I think fixed, one such bug in the latest svn trunk, but 3.3.1 should already have that fix in it.
Have you tried running maui from the command line with the -d option and capturing the memory-corruption message and backtrace that it prints? Your original email has the strace, but it cuts off some of the backtrace. If I can get the full backtrace, I might be able to see where in the code it's having problems.

--
Jason Williams
Systems Engineer
Homewood High Performance Cluster
Johns Hopkins University

On 11/8/2011 12:09 PM, Dr. Stephan Raub wrote:
> Dear Mr. van der Vlies
>
> Currently we have 6095 jobs queued and 93 jobs running. Among these, we
> have some large job arrays (1000 and 4000 items per array).
>
> Best regards.
> --
> Dr. rer. nat. Stephan Raub
> Dipl. Chem.
> High-Performance-Computing
> Zentrum für Informations- und Medientechnologie
> Heinrich-Heine-Universität Düsseldorf
> Universitätsstr. 1 / Raum 25.41.O2.25-2
> 40225 Düsseldorf / Germany
>
> Tel: +49-211-811-3911
> Fax: +49-211-811-2539
>
> Important Note: This e-mail may contain trade secrets or privileged,
> undisclosed or otherwise confidential information. If you have received this
> e-mail in error, you are hereby notified that any review, copying or
> distribution of it is strictly prohibited. Please inform us immediately and
> destroy the original transmittal. Thank you for your cooperation.
>
>> -----Original Message-----
>> From: Bas van der Vlies [mailto:[email protected]]
>> Sent: Tuesday, November 8, 2011 17:10
>> To: Dr. Stephan Raub
>> Subject: Re: [Mauiusers] Possible Memory Corruption in maui
>>
>> On 08-11-11 16:40, Dr. Stephan Raub wrote:
>>> Dear fellow maui users,
>>>
>>> we are running Maui 3.3.1 with torque 2.3.7 under RHEL5.5
>>> (2.6.8-194.26.1.el1) on a 600-somewhat core cluster.
>>>
>>> We experienced a sudden death of the maui scheduler with no message in the
>>> logs. We could not figure out a reason so we attached an "strace" to the
>>> maui process (as long as it was "still alive") and we got:
>>>
>> Dear Dr. Stephan Raub,
>>
>> just a question: How many jobs are in the queue?
>>
>> regards
>>
>> --
>> ********************************************************************
>> * Bas van der Vlies e-mail: [email protected] *
>> * SARA - Academic Computing Services Amsterdam, The Netherlands *
>> ********************************************************************

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
