Re: [OMPI devel] opal_util_register_stackhandlers()
Thanks for the bug report! I've just changed the behavior to emit a warning and *not* intercept a signal if the old signal action is neither SIG_DFL nor SIG_IGN. The opal_signal MCA parameter can be set to determine which signals you want to intercept; it defaults to the integer values of SIGABRT, SIGBUS, SIGFPE, and SIGSEGV on your system. We can probably get this into OMPI v1.3.2.

On Mar 19, 2009, at 11:13 AM, Kees Verstoep wrote:

Hi,

Currently, opal_util_register_stackhandlers() in opal/util/stacktrace.c calls sigaction() with a NULL third argument, meaning it never looks at previously installed signal handlers and always overrides them with print_stackframe(). But there are realistic scenarios where an application actively uses these signals and also wants to use MPI. As an example, the default opal "signal" parameter settings are such that SIGSEGV is redirected. Typically, SIGSEGV does indeed indicate a bug somewhere, and the stack trace from Open MPI is a nice bonus. However, the Sun Java JDK uses SIGSEGV to detect when stacks should be automatically extended, and it stops working rather ungracefully when that handler gets replaced.

(BTW, we stumbled on this recently when we added an MPI backend for our Ibis grid programming environment. It took a bit of time to figure out what was happening, since we got no usable stack trace for the thread that got bitten. We suspected a bug in our native code mapping at first, but MPICH did not have this problem.)

In most cases you can of course work around it by manually changing the opal "signal" list, but it would be nicer if Open MPI would detect the situation and, e.g., only install the stack printer when there is no handler yet, or at least warn about the possible clash.

Thanks!
Kees Verstoep

___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- Jeff Squyres Cisco Systems
[OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r20759
There was a glitch in the SVN server this evening; you can tell that this r number is far lower than it should be. IU is fixing it right now. This commit will occur again with a new, higher SVN r number shortly...

Begin forwarded message:

From:
Date: March 19, 2009 8:41:21 PM EDT
To:
Subject: [OMPI svn-full] svn:open-mpi r20759
Reply-To:

Author: jsquyres
Date: 2009-03-19 20:41:21 EDT (Thu, 19 Mar 2009)
New Revision: 20759
URL: https://svn.open-mpi.org/trac/ompi/changeset/20759

Log:
Per a comment on the users list, don't try to install our own signal handlers if there are already non-default handlers installed. Print a warning if that situation arises.

'''NOTE:''' This is a definite target for OPAL_SOS conversion -- as it is right now, this message will be displayed for ''every'' MPI process. We want this to be OPAL_SOS'ed when that becomes available so that the error message can be aggregated nicely.

Added:
   trunk/opal/util/help-opal-util.txt
Text files modified:
   trunk/opal/util/Makefile.am  |  4 +++-
   trunk/opal/util/stacktrace.c | 22 ++++++++++++++++++++--
2 files changed, 23 insertions(+), 3 deletions(-)

Modified: trunk/opal/util/Makefile.am
==============================================================================
--- trunk/opal/util/Makefile.am (original)
+++ trunk/opal/util/Makefile.am 2009-03-19 20:41:21 EDT (Thu, 19 Mar 2009)
@@ -9,7 +9,7 @@
 # University of Stuttgart.  All rights reserved.
 # Copyright (c) 2004-2005 The Regents of the University of California.
 #                         All rights reserved.
-# Copyright (c) 2007      Cisco Systems, Inc.  All rights reserved.
+# Copyright (c) 2007-2009 Cisco Systems, Inc.  All rights reserved.
 # $COPYRIGHT$
 #
 # Additional copyrights may follow
@@ -19,6 +19,8 @@
 
 SUBDIRS = keyval
 
+dist_pkgdata_DATA = help-opal-util.txt
+
 AM_LFLAGS = -Popal_show_help_yy
 LEX_OUTPUT_ROOT = lex.opal_show_help_yy

Added: trunk/opal/util/help-opal-util.txt
==============================================================================
--- (empty file)
+++ trunk/opal/util/help-opal-util.txt 2009-03-19 20:41:21 EDT (Thu, 19 Mar 2009)
@@ -0,0 +1,25 @@
+# -*- text -*-
+#
+# Copyright (c) 2009 Cisco Systems, Inc.  All rights reserved.
+# $COPYRIGHT$
+#
+# Additional copyrights may follow
+#
+# $HEADER$
+#
+# This is the US/English general help file for Open MPI.
+#
+[stacktrace signal override]
+Open MPI was inserting a signal handler for signal %d but noticed
+that there is already a non-default handler installed.  Open MPI's
+handler was therefore not installed; your job will continue.  This
+warning message will only be displayed once, even if Open MPI
+encounters this situation again.
+
+To avoid displaying this warning message, you can either not install
+the error handler for signal %d or you can have Open MPI not try to
+install its own signal handler for this signal by setting the
+"opal_signals" MCA parameter.
+
+  Signal: %d
+  Current opal_signals value: %s

Modified: trunk/opal/util/stacktrace.c
==============================================================================
--- trunk/opal/util/stacktrace.c (original)
+++ trunk/opal/util/stacktrace.c 2009-03-19 20:41:21 EDT (Thu, 19 Mar 2009)
@@ -38,6 +38,7 @@
 #include "opal/mca/backtrace/backtrace.h"
 #include "opal/constants.h"
 #include "opal/util/output.h"
+#include "opal/util/show_help.h"
 
 #ifndef _NSIG
 #define _NSIG 32
@@ -410,11 +411,12 @@
 int opal_util_register_stackhandlers (void)
 {
 #if OMPI_WANT_PRETTY_PRINT_STACKTRACE && !defined(__WINDOWS__)
-    struct sigaction act;
+    struct sigaction act, old;
     char * string_value;
     char * tmp;
     char * next;
    int param, i;
+    bool showed_help = false;
 
     gethostname(stacktrace_hostname, sizeof(stacktrace_hostname));
     stacktrace_hostname[sizeof(stacktrace_hostname) - 1] = '\0';
@@ -459,10 +461,26 @@
             return OPAL_ERR_BAD_PARAM;
         }
 
-        ret = sigaction (sig, &act, NULL);
+        ret = sigaction (sig, &act, &old);
         if (ret != 0) {
             return OPAL_ERR_IN_ERRNO;
         }
+        if (SIG_IGN != old.sa_handler && SIG_DFL != old.sa_handler) {
+            if (!showed_help) {
+                /* JMS This is icky; there is no error message
+                   aggregation here so this message may be repeated for
+                   every single MPI process...  This should be replaced
+                   with OPAL_SOS when that is done so that it can be
+                   properly aggregated. */
+                opal_show_help("help-opal-util.txt",
+                               "stacktrace signal override",
+                               true, sig, sig, sig, string_value);
+                showed_help = true;
+            }
+            if (0 != sigaction(sig, &old, NULL)) {
+                return OPAL_ERR_IN_ERRNO;
+            }
+        }
Re: [OMPI devel] RFC: Final cleanup of included headers
Hi Ralph,

On Wednesday 18 March 2009 09:00:36 am Ralph Castain wrote:
> Could we hold off on this until after 1.3.2 is out the door and has a
> couple of days to stabilize? All these header file changes are making
> it more difficult to cleanly apply patches to the 1.3 branch.

Hmm, sure, we can hold off on the big patch. With the current plan, 1.3.2 should be out on 4/3. Some intermediate (small!) steps, however, I'd still like to be able to apply?

> When we get past the next couple of weeks, the 1.3 branch should clear
> out the backlog of CMRs, and we should have the usual immediate "oops"
> fixes in to 1.3.2. Then this won't be such a problem.

However, it would be nice if you could test the patch on your systems prior to my moving it into the trunk. I want to limit the "down-time" of the trunk (there may be a few places where additional headers are required, as unnecessary headers were removed in lower-level headers).

Thanks,
Rainer

--
Rainer Keller, PhD        Tel: +1 (865) 241-6293
Oak Ridge National Lab    Fax: +1 (865) 241-4811
PO Box 2008 MS 6164       Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008  AIM/Skype: rusraink
Re: [MTT devel] [MTT svn] svn:mtt-svn r1273 (Analyze/Performance plug-ins)
Hello Ethan,

Thanks for the info; I will refactor it.

From http://www.netlib.org/benchmark/hpl/:

"*HPL* is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark. The HPL package provides a testing and timing program to quantify the *accuracy* of the obtained solution as well as the time it took to compute it."

Where do you think is a good place to keep parsers for MPI benchmarks other than latency/bandwidth-based ones? I think we can have a collection of such parsers in MTT, and at some point we can enhance the MTT reports with other metrics. What do you think?

On Thu, Mar 19, 2009 at 8:22 PM, Ethan Mallove wrote:
> Hi Mike,
>
> Is HPL a latency and/or bandwidth performance test? All the Analyze
> plug-ins in lib/MTT/Test/Analyze/Performance are for latency/bandwidth
> tests, which means they can then be rendered as graphs in the MTT
> Reporter. All of these plug-ins are required to output at least one
> of the following:
>
> latency_avg
> latency_min
> latency_max
> bandwidth_avg
> bandwidth_min
> bandwidth_max
>
> They all contain this:
>
> $report->{test_type} = 'latency_bandwidth';
>
> HPL.pm should have a line like this somewhere:
>
> $report->{test_type} = 'tv_gflops';
>
> Maybe HPL.pm could go into a different directory or have a comment
> somewhere to clear up this confusion.
> Regards,
> Ethan
>
> On Thu, Mar/19/2009 02:11:05AM, mi...@osl.iu.edu wrote:
> > Author: miked
> > Date: 2009-03-19 02:11:04 EDT (Thu, 19 Mar 2009)
> > New Revision: 1273
> > URL: https://svn.open-mpi.org/trac/mtt/changeset/1273
> >
> > Log:
> > HPL analyzer added
> >
> > Added: trunk/lib/MTT/Test/Analyze/Performance/HPL.pm
> > ==============================================================================
> > --- (empty file)
> > +++ trunk/lib/MTT/Test/Analyze/Performance/HPL.pm 2009-03-19 02:11:04 EDT (Thu, 19 Mar 2009)
> > @@ -0,0 +1,63 @@
> > +#!/usr/bin/env perl
> > +#
> > +# Copyright (c) 2006-2007 Sun Microsystems, Inc.  All rights reserved.
> > +# Copyright (c) 2007      Voltaire  All rights reserved.
> > +# $COPYRIGHT$
> > +#
> > +# Additional copyrights may follow
> > +#
> > +# $HEADER$
> > +#
> > +
> > +package MTT::Test::Analyze::Performance::HPL;
> > +use strict;
> > +use Data::Dumper;
> > +#use MTT::Messages;
> > +
> > +# Process the result_stdout emitted from one of the hpl tests
> > +sub Analyze {
> > +
> > +    my ($result_stdout) = @_;
> > +    my $report;
> > +    my (@t_v,
> > +        @time,
> > +        @gflops);
> > +
> > +    $report->{test_name} = "HPL";
> > +    my @lines = split(/\n|\r/, $result_stdout);
> > +    # Sample result_stdout:
> > +    # - The matrix A is randomly generated for each test.
> > +    # - The following scaled residual check will be computed:
> > +    #   ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
> > +    # - The relative machine precision (eps) is taken to be 1.110223e-16
> > +    # - Computational tests pass if scaled residuals are less than 16.0
> > +    #
> > +    # T/V         N     NB  P  Q      Time     Gflops
> > +    # WR00L2L2    29184 128 2  4      15596.86 1.063e+00
> > +    #
> > +    # ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N) = 0.0008986 ... PASSED
> > +    #
> > +    # T/V         N     NB  P  Q      Time     Gflops
> > +    # WR00L2L4    29184 128 2  4      15251.81 1.087e+00
> > +    my $line;
> > +    while (defined($line = shift(@lines))) {
> > +        # WR00L2L2    29184 128 2  4      15596.86 1.063e+00
> > +        if ($line =~ m/^(\S+)\s+\d+\s+\d+\s+\d+\s+\d+\s+(\d+[\.\d]+)\s+(\S+)/) {
> > +            push(@t_v, $1);
> > +            push(@time, $2);
> > +            push(@gflops, $3);
> > +        }
> > +    }
> > +
> > +    # Postgres uses brackets for array insertion
> > +    # (see postgresql.org/docs/7.4/interactive/arrays.html)
> > +    $report->{tv} = "{" . join(",", @t_v) . "}";
> > +    $report->{time} = "{" . join(",", @time) . "}";
> > +    $report->{gflops} = "{" . join(",", @gflops) . "}";
> > +    return $report;
> > +}
> > +
> > +1;
> >
> > ___ mtt-svn mailing list mtt-...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/mtt-svn
[OMPI devel] opal_util_register_stackhandlers()
Hi,

Currently, opal_util_register_stackhandlers() in opal/util/stacktrace.c calls sigaction() with a NULL third argument, meaning it never looks at previously installed signal handlers and always overrides them with print_stackframe(). But there are realistic scenarios where an application actively uses these signals and also wants to use MPI.

As an example, the default opal "signal" parameter settings are such that SIGSEGV is redirected. Typically, SIGSEGV does indeed indicate a bug somewhere, and the stack trace from Open MPI is a nice bonus. However, the Sun Java JDK uses SIGSEGV to detect when stacks should be automatically extended, and it stops working rather ungracefully when that handler gets replaced.

(BTW, we stumbled on this recently when we added an MPI backend for our Ibis grid programming environment. It took a bit of time to figure out what was happening, since we got no usable stack trace for the thread that got bitten. We suspected a bug in our native code mapping at first, but MPICH did not have this problem.)

In most cases you can of course work around it by manually changing the opal "signal" list, but it would be nicer if Open MPI would detect the situation and, e.g., only install the stack printer when there is no handler yet, or at least warn about the possible clash.

Thanks!
Kees Verstoep
Re: [OMPI devel] 1.3.1rc5
Things look good from the IBM side as well. So, RM-approved for release.

--brad
ompi 1.3 co-release manager

On Thu, Mar 19, 2009 at 7:31 AM, Jeff Squyres wrote:
> Looks good to cisco. Ship it.
>
> I'm still seeing a very low incidence of the sm segv during startup (.01%
> -- 23 tests out of ~160k), so let's ship 1.3.1 and roll in Eugene's new sm
> code for 1.3.2.
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] 1.3.1rc5
Looks good to cisco. Ship it.

I'm still seeing a very low incidence of the sm segv during startup (.01% -- 23 tests out of ~160k), so let's ship 1.3.1 and roll in Eugene's new sm code for 1.3.2.

-- Jeff Squyres Cisco Systems
Re: [MTT devel] GSOC application
Looks like there were 400 applications this year; they selected 150 -- 38%. We were in the unlucky 62%. Bummer. On Mar 18, 2009, at 4:05 PM, Ethan Mallove wrote: On Wed, Mar/18/2009 03:28:48PM, Josh Hursey wrote: > So they posted the list of accepted projects and we are -not- on it > for this year: > > http://socghop.appspot.com/program/accepted_orgs/google/gsoc2009 > > Maybe next year. I don't know if they will be sending around a note > regarding why we were not selected to participate. If they do I will > forward it along. Thanks, Josh. I'm reading that in 2008, they only accepted 174 out of the 7100 applications. -Ethan > > Cheers, > Josh > > On Mar 13, 2009, at 3:19 PM, Jeff Squyres wrote: > >> Awesome; many thanks for carrying the baton over the finish line, Josh! >> >> On Mar 13, 2009, at 2:56 PM, Josh Hursey wrote: >> >>> The application has been submitted. We find out on March 18 (3 pm) if >>> we have been accepted. Link to timeline below: >>>http://socghop.appspot.com/document/show/program/google/gsoc2009/ >>> timeline >>> >>> Cheers, >>> Josh >>> >>> On Mar 13, 2009, at 2:19 PM, Josh Hursey wrote: >>> >>> > I just pushed a final draft to the repository. I'll probably plan >>> > on submitting at 2:30/2:45. Let me know if you have any edits >>> > before then either through email or IM. >>> > >>> > Cheers, >>> > Josh >>> > >>> > On Mar 13, 2009, at 12:11 PM, Josh Hursey wrote: >>> > >>> >> I finished a first pass at cleaning up the Ideas page on the Wiki. >>> >> All of the ideas were preserved, just some rewording and formatting. >>> >> https://svn.open-mpi.org/trac/mtt/wiki/MttNewFeaturesIdeas >>> >> >>> >> If you get a chance, read through this and make sure the text >>> >> sounds ok (feel free to clean the text up as necessary). >>> >> >>> >> The application is due by 3 pm EST. So I hope to have the >>> >> application ready by 2ish. I'll move onto the application itself now. 
>>> >> >>> >> -- Josh >>> >> >>> >> On Mar 12, 2009, at 4:43 PM, Josh Hursey wrote: >>> >> >>> >>> Jeff is going to take the first pass at the application. >>> >>> >>> >>> I am going to go through the Idea page on the wiki and polish a bit: >>> >>> https://svn.open-mpi.org/trac/mtt/wiki/MttNewFeaturesIdeas >>> >>> >>> >>> I'll let folks know when I'm done, and we can start iterating on >>> >>> drafts. >>> >>> >>> >>> Cheers, >>> >>> Josh >>> >>> >>> >>> On Mar 12, 2009, at 4:08 PM, Jeff Squyres wrote: >>> >>> >>> I've created a quick-n-dirty hg to collaborate on the GSOC >>> application. There's a web form to fill out to apply, so let's >>> work on a .txt file in the hg to get it right. >>> >>> We have until 3pm US Eastern time tomorrow to submit. Here's >>> the HG: >>> >>> ssh://www.open-mpi.org/~jsquyres/hg/gsoc/ >>> >>> I've put the PDF there for now; I'll kruft up a quick .txt >>> shortly and push it there as well. >>> >>> -- >>> Jeff Squyres >>> Cisco Systems >>> >>> ___ >>> mtt-devel mailing list >>> mtt-de...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel >>> >>> >>> >>> ___ >>> >>> mtt-devel mailing list >>> >>> mtt-de...@open-mpi.org >>> >>> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel >>> >> >>> >> ___ >>> >> mtt-devel mailing list >>> >> mtt-de...@open-mpi.org >>> >> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel >>> > >>> > ___ >>> > mtt-devel mailing list >>> > mtt-de...@open-mpi.org >>> > http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel >>> >>> ___ >>> mtt-devel mailing list >>> mtt-de...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel >> >> >> -- >> Jeff Squyres >> Cisco Systems >> >> ___ >> mtt-devel mailing list >> mtt-de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel > > ___ > mtt-devel mailing list > mtt-de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel ___ mtt-devel mailing list mtt-de...@open-mpi.org 
http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel -- Jeff Squyres Cisco Systems
Re: [MTT devel] mtt text report oddity
On Mar 19, 2009, at 3:19 AM, Mike Dubman wrote:

> because the results are rendered in chunks during the reporting phase
> (100 pieces every flush). This caused the same benchmark line to
> appear more than once in the final report.

Ahhh... right.

> You can configure the reporter to issue results not by number, but
> for the same benchmark at once; put this in the ini file:
>
> [MTT]
> submit_group_results=1

Perfect. Thanks!

> Also, the html report is nicer and allows you easy navigation to the errors

True, but my HTML files kept getting overwritten by successive submits in this case.

-- Jeff Squyres Cisco Systems
Re: [MTT devel] mtt text report oddity
because the results are rendered in chunks during reporting phase. (100 pieces every flush) This caused same benchmark line to appear more then once in the final report. You can configure the reporter to issue results not by number, but for same benchmark at once: put this in the ini file: [MTT] submit_group_results=1 Also, html report is nicer and allows you easy navigation to the errors regards Mike 2009/3/19 Jeff Squyres> I got a fairly odd mtt text report (it's super wide, sorry): > > | Test Run| intel | 1.3.1rc5| 00:12| 5| | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 02:59| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 03:08| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 02:51| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 02:59| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 03:48| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 03:10| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 03:05| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 03:09| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 03:25| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 02:46| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 02:59| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 03:23| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 02:50| 100 | 1| > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 02:56| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 02:53| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 
1.3.1rc5| 03:22| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 04:21| 100 | | > 1| | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 04:12| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 03:36| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 02:48| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 02:47| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 03:08| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 02:57| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 02:43| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > | Test Run| intel | 1.3.1rc5| 02:48| 101 | | > | | Test_Run-intel-developer-1.3.1rc5.html| > > Notice that there are *many* "intel" lines, each with 101 passes. The only > difference between them is the times that they ran -- but there's even > repeats of that. > > Do we know why there is so many different lines for the intel test suite? > > Did this get changed in the text reporter changes from Voltaire (somewhat) > recently? > > -- > Jeff Squyres > Cisco Systems > > ___ > mtt-devel mailing list > mtt-de...@open-mpi.org >