[hwloc-devel] Create success (hwloc git 1.9-17-g36da2ff)

2014-08-20 Thread MPI Team
Creating nightly hwloc snapshot git tarball was a success.

Snapshot:   hwloc 1.9-17-g36da2ff
Start time: Wed Aug 20 21:02:37 EDT 2014
End time:   Wed Aug 20 21:03:58 EDT 2014

Your friendly daemon,
Cyrador


[hwloc-devel] Create success (hwloc git dev-185-ge668fe2)

2014-08-20 Thread MPI Team
Creating nightly hwloc snapshot git tarball was a success.

Snapshot:   hwloc dev-185-ge668fe2
Start time: Wed Aug 20 21:01:01 EDT 2014
End time:   Wed Aug 20 21:02:25 EDT 2014

Your friendly daemon,
Cyrador


Re: [OMPI devel] MPI_Abort does not make mpirun return with the right exit code

2014-08-20 Thread Ralph Castain
I'm aware of the problem, but it will be fixed when the PMIx branch is merged 
later this week.

On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet wrote:

> Folks,
> 
> let's look at the following trivial test program :
> 
> #include <stdio.h>
> #include <mpi.h>
> 
> int main (int argc, char * argv[]) {
>     int rank, size;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     printf ("I am %d/%d and i abort\n", rank, size);
>     MPI_Abort(MPI_COMM_WORLD, 2);
>     printf ("%d/%d aborted !\n", rank, size);
>     return 3;
> }
> 
> and let's run mpirun (trunk) on node0, asking the MPI tasks to run on
> node1.
> With two tasks or more :
> 
> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
> --
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 2.
> 
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --
> I am 1/2 and i abort
> I am 0/2 and i abort
> [node0:00740] 1 more process has sent help message help-mpi-api.txt /
> mpi-abort
> [node0:00740] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> 
> node0 $ echo $?
> 0
> 
> The exit status of mpirun is zero.
> /* this is why the MPI_Errhandler_fatal_c test fails in mtt */
> 
> Now, if we run only one task :
> 
> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
> I am 0/1 and i abort
> --
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 2.
> 
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --
> --
> mpirun has exited due to process rank 0 with PID 15884 on
> node node1 exiting improperly. There are three reasons this could occur:
> 
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
> 
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
> 
> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
> orte_create_session_dirs is set to false. In this case, the run-time cannot
> detect that the abort call was an abnormal termination. Hence, the only
> error message you will receive is this one.
> 
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> 
> You can avoid this message by specifying -quiet on the mpirun command line.
> 
> --
> node0 $ echo $?
> 1
> 
> The program displayed a misleading error message, and mpirun exited with
> error code 1.
> /* I would have expected 2, or 3 in the worst case scenario */
> 
> 
> I dug into it a bit and found a kind of race condition in orted (running
> on node1). Basically, when the process dies, it writes stuff in the
> Open MPI session directory and exits. Exiting sends a SIGCHLD to orted
> and closes the socket/pipe connected to orted. In orted, the loss of
> connection is generally processed before the SIGCHLD by libevent, and
> as a consequence the exit code is not correctly set (i.e. it is left
> at zero).
> I did not see any kind of communication between the MPI task and orted
> (except writing a file in the Open MPI session directory) as I would
> have expected
> /* but this was just my initial guess; the truth is I do not know what
> is supposed to happen */
> 
> I wrote the attached abort.patch to basically get things working.
> I highly suspect this is not the right thing to do, so I did not commit it.
> 
> It works fine with two tasks or more.
> With only one task, mpirun displays a misleading error message, but the
> exit status is correct.
> 
> Could someone (Ralph ?) have a look at this ?
> 
> Cheers,
> 
> Gilles
> 
> 
> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
> I am 1/2 and i abort
> I am 0/2 and i abort
> --
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 2.
> 
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, 

Re: [OMPI devel] [OMPI svn] svn:open-mpi r32556 - trunk/orte/mca/oob/tcp

2014-08-20 Thread Dave Goodell (dgoodell)
On Aug 20, 2014, at 11:55 AM, svn-commit-mai...@open-mpi.org wrote:

> Author: rhc (Ralph Castain)
> Date: 2014-08-20 12:55:36 EDT (Wed, 20 Aug 2014)
> New Revision: 32556
> URL: https://svn.open-mpi.org/trac/ompi/changeset/32556
> 
> Log:
> Track down the last piece of the connection problem. It appears that
> providing a netmask of 0 to opal_net_samenetwork results in everything
> looking like it is on the same network. Hence, we were not retaining any
> of the alternative addresses, so we had no other way to check them.
> 
> Refs #4870
> 
> Text files modified:
>   trunk/orte/mca/oob/tcp/oob_tcp.c| 8 +++-
>   trunk/orte/mca/oob/tcp/oob_tcp_connection.c | 1 +
>   2 files changed, 8 insertions(+), 1 deletions(-)
> 
> Modified: trunk/orte/mca/oob/tcp/oob_tcp.c
> ==============================================================================
> --- trunk/orte/mca/oob/tcp/oob_tcp.c  Tue Aug 19 22:48:47 2014  (r32555)
> +++ trunk/orte/mca/oob/tcp/oob_tcp.c  2014-08-20 12:55:36 EDT (Wed, 20 Aug 2014)  (r32556)
> @@ -282,6 +282,8 @@
> ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
> 
>      if (AF_INET != pop->af_family) {
> +        opal_output_verbose(20, orte_oob_base_framework.framework_output,
> +                            "%s NOT AF_INET", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME));
>          goto cleanup;
>      }
> 
> @@ -306,8 +308,12 @@
> 
>      /* do we already have this address? */
>      OPAL_LIST_FOREACH(maddr, &peer->addrs, mca_oob_tcp_addr_t) {
> -        if (opal_net_samenetwork(, (struct sockaddr*)&maddr->addr, 0)) {
> +        /* require only that the subnet be the same */
> +        if (opal_net_samenetwork(, (struct sockaddr*)&maddr->addr, 24)) {

So... what if I have my hosts on a 10.123.0.0/16 network or some other network 
with a non-24-bit netmask?
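
For illustration, a minimal sketch of a prefix-length comparison
(same_network() is a hypothetical helper, not the actual
opal_net_samenetwork implementation): with a hard-coded 24-bit prefix,
two hosts that live on the same 10.123.0.0/16 network compare as being
on different networks.

#include <stdio.h>
#include <stdint.h>
#include <sys/socket.h>
#include <arpa/inet.h>

/* Hypothetical helper: do two IPv4 addresses share their leading
 * 'plen' bits?  (Sketch only; the real opal_net_samenetwork operates
 * on struct sockaddr arguments.) */
static int same_network(const char *a, const char *b, int plen)
{
    uint32_t ia, ib, mask;
    inet_pton(AF_INET, a, &ia);
    inet_pton(AF_INET, b, &ib);
    mask = plen ? htonl(0xffffffffu << (32 - plen)) : 0;
    return (ia & mask) == (ib & mask);
}

int main(void)
{
    /* Both hosts live on 10.123.0.0/16: */
    printf("/24: %d\n", same_network("10.123.1.5", "10.123.2.7", 24)); /* 0: "different" */
    printf("/16: %d\n", same_network("10.123.1.5", "10.123.2.7", 16)); /* 1: same network */
    return 0;
}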

-Dave



Re: [OMPI devel] [OMPI svn] svn:open-mpi r32555 - trunk/opal/mca/btl/scif

2014-08-20 Thread Paul Hargrove
Can somebody confirm that configure is adding "-c9x" or "-c99" to CFLAGS
with this compiler?
If not then r32555 could possibly be reverted in favor of adding the proper
compiler flag.

Also, I am suspicious of this failure because even without a language-level
option pgcc 12.9 and 13.4 compile the following:

struct S { int i; double d; };
struct S x = {1,0};
int main (void)
{
  struct S y = { .i = x.i };
  return y.i;
}
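
One possible reason pgcc accepts the snippet above while rejecting the
scif line (assuming port_id is itself a struct, which I have not
verified): the failing initializer's value is an aggregate rather than a
scalar. A variant closer to that case would be:

struct P { int node; int port; };
struct T { struct P port_id; };
struct P g = { 1, 2 };
int main (void)
{
  struct T y = { .port_id = g };  /* aggregate-valued designated initializer */
  return y.port_id.port;
}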


-Paul


On Wed, Aug 20, 2014 at 7:20 AM, Nathan Hjelm  wrote:

> Really? That means PGI 2013 is NOT C99 compliant! Figures.
>
> -Nathan
>
> On Tue, Aug 19, 2014 at 10:48:48PM -0400, svn-commit-mai...@open-mpi.org
> wrote:
> > Author: ggouaillardet (Gilles Gouaillardet)
> > Date: 2014-08-19 22:48:47 EDT (Tue, 19 Aug 2014)
> > New Revision: 32555
> > URL: https://svn.open-mpi.org/trac/ompi/changeset/32555
> >
> > Log:
> > btl/scif: use safe syntax
> >
> > PGI compilers 2013 and older do not support the following syntax :
> > mca_btl_scif_modex_t modex = {.port_id = mca_btl_scif_module.port_id};
> > so split it on two lines
> >
> > cmr=v1.8.2:reviewer=hjelmn
> >
> > Text files modified:
> >trunk/opal/mca/btl/scif/btl_scif_component.c | 3 ++-
> >1 files changed, 2 insertions(+), 1 deletions(-)
> >
> > Modified: trunk/opal/mca/btl/scif/btl_scif_component.c
> > ==============================================================================
> > --- trunk/opal/mca/btl/scif/btl_scif_component.c  Tue Aug 19 18:34:49 2014  (r32554)
> > +++ trunk/opal/mca/btl/scif/btl_scif_component.c  2014-08-19 22:48:47 EDT (Tue, 19 Aug 2014)  (r32555)
> > @@ -208,7 +208,8 @@
> >
> >  static int mca_btl_scif_modex_send (void)
> >  {
> > -    mca_btl_scif_modex_t modex = {.port_id = mca_btl_scif_module.port_id};
> > +    mca_btl_scif_modex_t modex;
> > +    modex.port_id = mca_btl_scif_module.port_id;
> > 
> >      return opal_modex_send (&mca_btl_scif_component.super.btl_version, &modex, sizeof (modex));
> >  }



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] [OMPI svn] svn:open-mpi r32555 - trunk/opal/mca/btl/scif

2014-08-20 Thread Ralph Castain
If that's the case, then I wonder why it doesn't complain in other areas of the 
code where we also use C99 syntax? Or is it perhaps "mostly" C99 compliant, but 
doesn't like that specific use-case?


On Aug 20, 2014, at 7:20 AM, Nathan Hjelm  wrote:

> Really? That means PGI 2013 is NOT C99 compliant! Figures.
> 
> -Nathan
> 
> On Tue, Aug 19, 2014 at 10:48:48PM -0400, svn-commit-mai...@open-mpi.org 
> wrote:
>> Author: ggouaillardet (Gilles Gouaillardet)
>> Date: 2014-08-19 22:48:47 EDT (Tue, 19 Aug 2014)
>> New Revision: 32555
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/32555
>> 
>> Log:
>> btl/scif: use safe syntax
>> 
>> PGI compilers 2013 and older do not support the following syntax :
>> mca_btl_scif_modex_t modex = {.port_id = mca_btl_scif_module.port_id};
>> so split it on two lines
>> 
>> cmr=v1.8.2:reviewer=hjelmn
>> 
>> Text files modified:
>>   trunk/opal/mca/btl/scif/btl_scif_component.c | 3 ++-
>>   1 files changed, 2 insertions(+), 1 deletions(-)
>> 
>> Modified: trunk/opal/mca/btl/scif/btl_scif_component.c
>> ==============================================================================
>> --- trunk/opal/mca/btl/scif/btl_scif_component.c  Tue Aug 19 18:34:49 2014  (r32554)
>> +++ trunk/opal/mca/btl/scif/btl_scif_component.c  2014-08-19 22:48:47 EDT (Tue, 19 Aug 2014)  (r32555)
>> @@ -208,7 +208,8 @@
>> 
>>  static int mca_btl_scif_modex_send (void)
>>  {
>> -    mca_btl_scif_modex_t modex = {.port_id = mca_btl_scif_module.port_id};
>> +    mca_btl_scif_modex_t modex;
>> +    modex.port_id = mca_btl_scif_module.port_id;
>> 
>>      return opal_modex_send (&mca_btl_scif_component.super.btl_version, &modex, sizeof (modex));
>>  }



Re: [OMPI devel] [OMPI svn] svn:open-mpi r32555 - trunk/opal/mca/btl/scif

2014-08-20 Thread Nathan Hjelm
Really? That means PGI 2013 is NOT C99 compliant! Figures.

-Nathan

On Tue, Aug 19, 2014 at 10:48:48PM -0400, svn-commit-mai...@open-mpi.org wrote:
> Author: ggouaillardet (Gilles Gouaillardet)
> Date: 2014-08-19 22:48:47 EDT (Tue, 19 Aug 2014)
> New Revision: 32555
> URL: https://svn.open-mpi.org/trac/ompi/changeset/32555
> 
> Log:
> btl/scif: use safe syntax
> 
> PGI compilers 2013 and older do not support the following syntax :
> mca_btl_scif_modex_t modex = {.port_id = mca_btl_scif_module.port_id};
> so split it on two lines
> 
> cmr=v1.8.2:reviewer=hjelmn
> 
> Text files modified:
>    trunk/opal/mca/btl/scif/btl_scif_component.c | 3 ++-
>    1 files changed, 2 insertions(+), 1 deletions(-)
> 
> Modified: trunk/opal/mca/btl/scif/btl_scif_component.c
> ==============================================================================
> --- trunk/opal/mca/btl/scif/btl_scif_component.c  Tue Aug 19 18:34:49 2014  (r32554)
> +++ trunk/opal/mca/btl/scif/btl_scif_component.c  2014-08-19 22:48:47 EDT (Tue, 19 Aug 2014)  (r32555)
> @@ -208,7 +208,8 @@
>  
>  static int mca_btl_scif_modex_send (void)
>  {
> -    mca_btl_scif_modex_t modex = {.port_id = mca_btl_scif_module.port_id};
> +    mca_btl_scif_modex_t modex;
> +    modex.port_id = mca_btl_scif_module.port_id;
> 
>      return opal_modex_send (&mca_btl_scif_component.super.btl_version, &modex, sizeof (modex));
>  }




[OMPI devel] MPI_Abort does not make mpirun return with the right exit code

2014-08-20 Thread Gilles Gouaillardet
Folks,

let's look at the following trivial test program :

#include <stdio.h>
#include <mpi.h>

int main (int argc, char * argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf ("I am %d/%d and i abort\n", rank, size);
    MPI_Abort(MPI_COMM_WORLD, 2);
    printf ("%d/%d aborted !\n", rank, size);
    return 3;
}

and let's run mpirun (trunk) on node0, asking the MPI tasks to run on
node1.
With two tasks or more :

node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
--
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 2.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--
I am 1/2 and i abort
I am 0/2 and i abort
[node0:00740] 1 more process has sent help message help-mpi-api.txt /
mpi-abort
[node0:00740] Set MCA parameter "orte_base_help_aggregate" to 0 to see
all help / error messages

node0 $ echo $?
0

The exit status of mpirun is zero.
/* this is why the MPI_Errhandler_fatal_c test fails in mtt */

Now, if we run only one task :

node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
I am 0/1 and i abort
--
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 2.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--
--
mpirun has exited due to process rank 0 with PID 15884 on
node node1 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.

--
node0 $ echo $?
1

The program displayed a misleading error message, and mpirun exited with
error code 1.
/* I would have expected 2, or 3 in the worst case scenario */


I dug into it a bit and found a kind of race condition in orted (running
on node1). Basically, when the process dies, it writes stuff in the
Open MPI session directory and exits. Exiting sends a SIGCHLD to orted
and closes the socket/pipe connected to orted. In orted, the loss of
connection is generally processed before the SIGCHLD by libevent, and
as a consequence the exit code is not correctly set (i.e. it is left
at zero).
I did not see any kind of communication between the MPI task and orted
(except writing a file in the Open MPI session directory) as I would
have expected
/* but this was just my initial guess; the truth is I do not know what
is supposed to happen */
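
To make the ordering hazard concrete, here is a generic libevent sketch
(only an illustration of the pattern, not the orted code; compile with
-levent). The child closes its pipe before exiting, so the parent's
pipe-EOF callback fires while the exit status is still unknown; only the
later SIGCHLD callback can observe the real code via waitpid() :

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <sys/wait.h>
#include <event2/event.h>

static struct event_base *base;
static int exit_status = -1;               /* -1 = child not reaped yet */

static void on_sigchld(evutil_socket_t sig, short what, void *arg)
{
    int status;
    if (waitpid(-1, &status, WNOHANG) > 0 && WIFEXITED(status))
        exit_status = WEXITSTATUS(status); /* 2, the MPI_Abort errcode */
    event_base_loopbreak(base);
}

static void on_pipe_eof(evutil_socket_t fd, short what, void *arg)
{
    char buf[64];
    if (read(fd, buf, sizeof(buf)) == 0)
        /* Hazard: concluding "normal exit, code 0" here is exactly the
         * "exit code left to zero" symptom described above. */
        printf("pipe EOF seen, exit status so far: %d\n", exit_status);
}

int main(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return 1;

    if (fork() == 0) {                     /* child */
        close(fds[0]);
        close(fds[1]);                     /* EOF reaches the parent now */
        usleep(50000);                     /* ... SIGCHLD only later */
        _exit(2);
    }
    close(fds[1]);

    base = event_base_new();
    struct event *sig = evsignal_new(base, SIGCHLD, on_sigchld, NULL);
    struct event *pev = event_new(base, fds[0], EV_READ, on_pipe_eof, NULL);
    event_add(sig, NULL);
    event_add(pev, NULL);
    event_base_dispatch(base);

    printf("exit status after waitpid: %d\n", exit_status);
    return 0;
}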

I wrote the attached abort.patch to basically get things working.
I highly suspect this is not the right thing to do, so I did not commit it.

It works fine with two tasks or more.
With only one task, mpirun displays a misleading error message, but the
exit status is correct.

Could someone (Ralph ?) have a look at this ?

Cheers,

Gilles


node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
I am 1/2 and i abort
I am 0/2 and i abort
--
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 2.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--
[node0:00920] 1 more process has sent help message help-mpi-api.txt /
mpi-abort
[node0:00920] Set MCA parameter "orte_base_help_aggregate" to 0 to see
all help / error messages
node0 $ echo $?
2



node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
I am 0/1 and i abort