I was able to reproduce it (with the correct version of OMPI, i.e. the v2.x
branch). The problem seems to be that we are missing a part of commit
fe68f230991, which removes a free() on a statically allocated array.
Here is the corresponding patch:
diff --git a/ompi/errhandler/errhandler_predefined.c b/ompi/errhandler/errhandler_predefined.c
index 4d50611c12..54ac63553c 100644
--- a/ompi/errhandler/errhandler_predefined.c
+++ b/ompi/errhandler/errhandler_predefined.c
@@ -15,6 +15,7 @@
* Copyright (c) 2010-2011 Oak Ridge National Labs. All rights reserved.
* Copyright (c) 2012 Los Alamos National Security, LLC.
* All rights reserved.
+ * Copyright (c) 2016 Intel, Inc. All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -181,6 +182,7 @@ static void backend_fatal_aggregate(char *type,
const char* const unknown_error_code = "Error code: %d (no associated error message)";
const char* const unknown_error = "Unknown error";
const char* const unknown_prefix = "[?:?]";
+ bool generated = false;
// these do not own what they point to; they're here to avoid repeating expressions such as
@@ -211,6 +213,8 @@ static void backend_fatal_aggregate(char *type,
err_msg = NULL;
opal_output(0, "%s", "Could not write to err_msg");
opal_output(0, unknown_error_code, *error_code);
+ } else {
+ generated = true;
}
}
}
@@ -256,7 +260,9 @@ static void backend_fatal_aggregate(char *type,
}
free(prefix);
- free(err_msg);
+ if (generated) {
+ free(err_msg);
+ }
}
/*
George.
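
PS: for anyone less familiar with that code path, the pattern the patch restores can be summarized by the following minimal, self-contained sketch (my own illustration, not the OMPI code itself): err_msg starts out pointing at a static string, may be replaced by an asprintf()-allocated buffer, and must only be freed in the latter case.

#define _GNU_SOURCE   /* for asprintf() on glibc; it is a glibc/BSD extension */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

static void report_error(int error_code, const char *detail)
{
    const char *const unknown_error = "Unknown error";
    char *err_msg = (char *) unknown_error;    /* points at a static string */
    bool generated = false;

    if (NULL != detail) {
        if (0 > asprintf(&err_msg, "Error %d: %s", error_code, detail)) {
            err_msg = (char *) unknown_error;  /* allocation failed, fall back */
        } else {
            generated = true;                  /* err_msg is now heap-allocated */
        }
    }

    fprintf(stderr, "%s\n", err_msg);

    /* Freeing the static fallback would corrupt the heap (the original bug);
       only free what was actually allocated. */
    if (generated) {
        free(err_msg);
    }
}

int main(void)
{
    report_error(15, "invalid count argument");
    report_error(15, NULL);
    return 0;
}

The generated flag in the patch plays the same role: free(err_msg) only runs when err_msg was actually allocated, rather than still pointing at one of the static fallback strings.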
On Thu, May 4, 2017 at 10:03 PM, Jeff Squyres (jsquyres) <[email protected]> wrote:
> Can you get a stack trace?
>
> > On May 4, 2017, at 6:44 PM, Dahai Guo <[email protected]> wrote:
> >
> > Hi, George:
> >
> > Attached is the ompi_info output. I built it on a Power8 architecture. The configure is also simple.
> >
> > ../configure --prefix=${installdir} \
> > --enable-orterun-prefix-by-default
> >
> > Dahai
> >
> > On Thu, May 4, 2017 at 4:45 PM, George Bosilca <[email protected]> wrote:
> > Dahai,
> >
> > You are right, the segfault is unexpected. I can't replicate this on my Mac. What architecture are you seeing this issue on? How was your OMPI compiled?
> >
> > Please post the output of ompi_info.
> >
> > Thanks,
> > George.
> >
> >
> >
> > On Thu, May 4, 2017 at 5:42 PM, Dahai Guo <[email protected]> wrote:
> > Those messages are what I would like to see. But there are some other error messages and a core dump that I don't like, as attached in my previous email. I think something might be wrong with the errhandler in Open MPI. A similar thing happened for Bcast, etc.
> >
> > Dahai
> >
> > On Thu, May 4, 2017 at 4:32 PM, Nathan Hjelm <[email protected]> wrote:
> > By default MPI errors are fatal and abort. The error message says it all:
> >
> > *** An error occurred in MPI_Reduce
> > *** reported by process [3645440001,0]
> > *** on communicator MPI_COMM_WORLD
> > *** MPI_ERR_COUNT: invalid count argument
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> >
> > If you want different behavior, you have to change the default error handler on the communicator using MPI_Comm_set_errhandler. You can set it to MPI_ERRORS_RETURN and check the error code, or you can create your own error handler function. See MPI 3.1, Chapter 8.
> >
> > -Nathan
> >
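
As a side note, here is a minimal sketch of what Nathan suggests, applied to the reproducer quoted below (the variable names and messages are arbitrary): set MPI_ERRORS_RETURN on MPI_COMM_WORLD, check the return code, and turn it into a readable message with MPI_Error_string.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int s = 1, r = -1, rc, len;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);

    /* Report errors back to the caller instead of aborting the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Reduce(&s, &r, -1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (MPI_SUCCESS != rc) {
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "MPI_Reduce returned an error: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}
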
> > On May 04, 2017, at 02:58 PM, Dahai Guo <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> Using Open MPI 2.1, the following code resulted in a core dump, although only a simple error message was expected. Any idea what is wrong? It seemed related to the errhandler somewhere.
> >>
> >>
> >> D.G.
> >>
> >>
> >> *** An error occurred in MPI_Reduce
> >> *** reported by process [3645440001,0]
> >> *** on communicator MPI_COMM_WORLD
> >> *** MPI_ERR_COUNT: invalid count argument
> >> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> >> *** and potentially your MPI job)
> >> ......
> >>
> >> [1,1]<stderr>:1000151c0000-1000151e0000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:1000151e0000-100015250000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015250000-100015270000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015270000-1000152e0000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:1000152e0000-100015300000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015300000-100015510000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015510000-100015530000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015530000-100015740000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015740000-100015760000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015760000-100015970000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015970000-100015990000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015990000-100015ba0000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015ba0000-100015bc0000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015bc0000-100015dd0000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015dd0000-100015df0000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100015df0000-100016000000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100016000000-100016020000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100016020000-100016230000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100016230000-100016250000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100016250000-100016460000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:100016460000-100016470000 rw-p 00000000 00:00 0
> >> [1,1]<stderr>:3fffd4630000-3fffd46c0000 rw-p 00000000 00:00 0 [stack]
> >> --------------------------------------------------------------------------
> >>
> >> #include <stdlib.h>
> >> #include <stdio.h>
> >> #include <mpi.h>
> >> int main(int argc, char** argv)
> >> {
> >>
> >> int r[1], s[1];
> >> MPI_Init(&argc,&argv);
> >>
> >> s[0] = 1;
> >> r[0] = -1;
> >> MPI_Reduce(s, r, -1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); /* invalid count (-1) deliberately triggers MPI_ERR_COUNT */
> >> printf("%d\n",r[0]);
> >> MPI_Finalize();
> >> }
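
(Assuming the standard Open MPI wrappers, something along these lines reproduces it; the source file name and rank count here are just placeholders.)

mpicc reduce_count.c -o reduce_count
mpirun -np 2 ./reduce_count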
> >>
> > <opmi_info.txt>
>
>
> --
> Jeff Squyres
> [email protected]
>
>
_______________________________________________
devel mailing list
[email protected]
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel