Re: [OMPI devel] Cuda build break

2017-10-04 Thread Sylvain Jeaugey

See my last comment on #4257 :

https://github.com/open-mpi/ompi/pull/4257#issuecomment-332900393

We should completely disable CUDA in hwloc. It is breaking the build, 
but more importantly, it creates an extra dependency on the CUDA runtime 
that Open MPI doesn't have, even when compiled with --with-cuda (we load 
symbols dynamically).
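For illustration, here is a minimal sketch of the kind of dynamic loading 
involved -- only dlopen/dlsym are assumed, and the names are illustrative, 
not the actual opal/cuda code:

#include <dlfcn.h>
#include <stdio.h>

/* Illustrative sketch: resolve a CUDA driver symbol at run time instead of
 * linking against libcuda/libcudart at build time. Build with -ldl; note
 * that no CUDA library appears on the link line. */
typedef int (*cuInit_fn_t)(unsigned int flags);

int main(void)
{
    void *handle = dlopen("libcuda.so.1", RTLD_LAZY | RTLD_LOCAL);
    if (NULL == handle) {
        printf("no CUDA driver found, continuing without CUDA\n");
        return 0;
    }
    cuInit_fn_t my_cuInit = (cuInit_fn_t) dlsym(handle, "cuInit");
    if (NULL != my_cuInit) {
        printf("cuInit returned %d\n", my_cuInit(0));
    }
    dlclose(handle);
    return 0;
}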


On 10/04/2017 10:42 AM, Barrett, Brian via devel wrote:

All -

It looks like nVidia’s MTT started failing on 9/26, due to not finding Cuda.  
There’s a suspicious commit given the error message in the hwloc cuda changes.  
Jeff and Brice, it’s your patch, can you dig into the build failures?

Brian

Re: [OMPI devel] CUDA kernels in OpenMPI

2017-01-27 Thread Sylvain Jeaugey

Hi Chris,

First, you will need to have some configure stuff to detect nvcc and use 
it inside your Makefile. UTK may have some examples to show here.


For the C/C++ API, you need to add 'extern "C"' statements around the 
interfaces you want to export in C so that you can use them inside Open MPI.


You can look at the NCCL code for an example :
https://github.com/NVIDIA/nccl/blob/master/src/nccl.h#L19-L21
Note the ifdefs in case this .h is included from C code.

In the .cu, the 'extern "C"' is buried into defines :
https://github.com/NVIDIA/nccl/blob/master/src/core.h#L149-L150

So an example would be :
myapi.h :
#ifdef __cplusplus
extern "C" {
#endif
void myfunc(...);
#ifdef __cplusplus
}
#endif
lib.cu :
extern "C" __attribute__ ((visibility("default"))) void myfunc(...) { ... }

Sylvain

On 01/27/2017 09:00 AM, Chris Ward wrote:
It looks like the mailing system deleted the attachment, so here it is 
inline

#
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation.  All rights reserved.
# Copyright (c) 2004-2005 The University of Tennessee and The University
# of Tennessee Research Foundation.  All rights
# reserved.
# Copyright (c) 2004-2009 High Performance Computing Center Stuttgart,
# University of Stuttgart.  All rights reserved.
# Copyright (c) 2004-2005 The Regents of the University of California.
# All rights reserved.
# Copyright (c) 2010  Cisco Systems, Inc.  All rights reserved.
# Copyright (c) 2012  Sandia National Laboratories. All rights 
reserved.

# Copyright (c) 2013  Los Alamos National Security, LLC. All rights
# reserved.
# Copyright (c) 2016  IBM Corporation.  All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

AM_CPPFLAGS = $(coll_ibm_CPPFLAGS)

sources = \
   coll_ibm.h \
   coll_ibm_component.c \
   coll_ibm_module.c \
   coll_ibm_allgather.c \
   coll_ibm_allgatherv.c \
   coll_ibm_allreduce.c \
   coll_ibm_alltoall.c \
   coll_ibm_alltoallv.c \
   coll_ibm_barrier.c \
   coll_ibm_bcast.c \
   coll_ibm_exscan.c \
   coll_ibm_gather.c \
   coll_ibm_gatherv.c \
   coll_ibm_reduce.c \
   coll_ibm_reduce_scatter.c \
   coll_ibm_reduce_scatter_block.c \
   coll_ibm_scan.c \
   coll_ibm_scatter.c \
   coll_ibm_scatterv.c \
   comm_gpu.cu \
   allreduce_overlap.cc

SUFFIXES = .cu
#
#comm_gpu.lo: comm_gpu.cu
#	/usr/local/cuda/bin/nvcc -gencode arch=compute_60,code=sm_60 -lcuda -O3 --compiler-options "-O2 -fopenmp -mcpu=power8 -fPIC" -c comm_gpu.cu
#	mv comm_gpu.o comm_gpu.lo

%.lo : %.cu
	/usr/local/cuda/bin/nvcc -gencode arch=compute_60,code=sm_60 -lcuda -O3 --compiler-options "-O2 -fopenmp -mcpu=power8 -fPIC" -c $<
	mv $*.o .libs/
	touch $*.lo


# Make the output library in this directory, and name it either
# mca__.la (for DSO builds) or libmca__.la
# (for static builds).

if MCA_BUILD_ompi_coll_ibm_DSO
component_noinst =
component_install = mca_coll_ibm.la
else
component_noinst = libmca_coll_ibm.la
component_install =
endif

mcacomponentdir = $(ompilibdir)
mcacomponent_LTLIBRARIES = $(component_install)
mca_coll_ibm_la_SOURCES = $(sources)
if WANT_COLL_IBM_WITH_PAMI
mca_coll_ibm_la_LIBADD = $(coll_ibm_LIBS) \
 $(OMPI_TOP_BUILDDIR)/ompi/mca/common/pami/libmca_common_pami.la
else
mca_coll_ibm_la_LIBADD = $(coll_ibm_LIBS)
endif
mca_coll_ibm_la_LDFLAGS = -module -avoid-version $(coll_ibm_LDFLAGS)

noinst_LTLIBRARIES = $(component_noinst)
libmca_coll_ibm_la_SOURCES =$(sources)
libmca_coll_ibm_la_LIBADD = $(coll_ibm_LIBS)
libmca_coll_ibm_la_LDFLAGS = -module -avoid-version $(coll_ibm_LDFLAGS)

T J (Chris) Ward, IBM Research.
Scalable Data-Centric Computing - Active Storage Fabrics - IBM System BlueGene

IBM United Kingdom Ltd., Hursley Park, Winchester, Hants, SO21 2JN
011-44-1962-818679
LinkedIn: https://www.linkedin.com/profile/view?id=60628729
ResearchGate: https://www.researchgate.net/profile/T_Ward2






Re: [OMPI devel] Process affinity detection

2016-04-26 Thread Sylvain Jeaugey

Oh, I see. No, we don't want to add a full modex if there isn't one already.

Now, if we restrict this to intra-node (we don't care which socket/core a 
distant process is on), is there any simple way to do an 
intra-node-only modex ?


On 04/26/2016 04:28 PM, Ralph Castain wrote:

On Apr 26, 2016, at 3:35 PM, Sylvain Jeaugey <sjeau...@nvidia.com> wrote:

Indeed, I implied that affinity was set before MPI_Init (usually even before 
the process is launched).

And yes, that would require a modex ... but I thought there was one already and 
maybe we could pack the affinity information inside the existing one.

If the BTLs et al don’t require the modex, then we don’t perform it (e.g., when 
launched by mpirun or via a PMIx-enabled RM). So when someone does as you 
describe, then we would have to force the modex to exchange the info. Doable, 
but results in a scaling penalty, and so definitely not something we want to do 
by default.



On 04/26/2016 02:56 PM, Ralph Castain wrote:

Hmmm…you mean for procs on the same node? I’m not sure how you can do it 
without introducing another data exchange, and that would require the app to 
execute it since otherwise we have no idea when they set the affinity.

If we assume they set the affinity prior to calling MPI_Init, then we could do 
it - but at the cost of forcing a modex. You can only detect your own affinity, 
so to get the relative placement, you have to do an exchange if we can’t pass 
it to you. Perhaps we could offer it as an option?



On Apr 26, 2016, at 2:27 PM, Sylvain Jeaugey <sjeau...@nvidia.com> wrote:

Within the BTL code (and surely elsewhere), we can use those convenient 
OPAL_PROC_ON_LOCAL_{NODE,SOCKET, ...} macros to figure out where another 
endpoint is located compared to us.

The problem is that it only works when ORTE defines it. The NODE works almost 
always since ORTE is always doing it. But for the NUMA, SOCKET, or CORE to 
work, we need to use Open MPI binding/mapping capabilities. If the process 
affinity was set with something else (custom scripts using taskset, cpusets, 
...), it doesn't work.

How hard do you think it would be to detect the affinity and set those flags 
using hwloc to figure out if we're on the same {SOCKET, CORE, ...} ? Where 
would it be simpler to do this ?

Thanks.
Sylvain





Re: [OMPI devel] Process affinity detection

2016-04-26 Thread Sylvain Jeaugey
Indeed, I implied that affinity was set before MPI_Init (usually even 
before the process is launched).


And yes, that would require a modex ... but I thought there was one 
already and maybe we could pack the affinity information inside the 
existing one.


On 04/26/2016 02:56 PM, Ralph Castain wrote:

Hmmm…you mean for procs on the same node? I’m not sure how you can do it 
without introducing another data exchange, and that would require the app to 
execute it since otherwise we have no idea when they set the affinity.

If we assume they set the affinity prior to calling MPI_Init, then we could do 
it - but at the cost of forcing a modex. You can only detect your own affinity, 
so to get the relative placement, you have to do an exchange if we can’t pass 
it to you. Perhaps we could offer it as an option?



On Apr 26, 2016, at 2:27 PM, Sylvain Jeaugey <sjeau...@nvidia.com> wrote:

Within the BTL code (and surely elsewhere), we can use those convenient 
OPAL_PROC_ON_LOCAL_{NODE,SOCKET, ...} macros to figure out where another 
endpoint is located compared to us.

The problem is that it only works when ORTE defines it. The NODE works almost 
always since ORTE is always doing it. But for the NUMA, SOCKET, or CORE to 
work, we need to use Open MPI binding/mapping capabilities. If the process 
affinity was set with something else (custom scripts using taskset, cpusets, 
...), it doesn't work.

How hard do you think it would be to detect the affinity and set those flags 
using hwloc to figure out if we're on the same {SOCKET, CORE, ...} ? Where 
would it be simpler to do this ?

Thanks.
Sylvain





[OMPI devel] Process affinity detection

2016-04-26 Thread Sylvain Jeaugey
Within the BTL code (and surely elsewhere), we can use those convenient 
OPAL_PROC_ON_LOCAL_{NODE,SOCKET, ...} macros to figure out where another 
endpoint is located compared to us.


The problem is that it only works when ORTE defines it. The NODE works 
almost always since ORTE is always doing it. But for the NUMA, SOCKET, 
or CORE to work, we need to use Open MPI binding/mapping capabilities. 
If the process affinity was set with something else (custom scripts 
using taskset, cpusets, ...), it doesn't work.


How hard do you think it would be to detect the affinity and set 
those flags using hwloc to figure out if we're on the same {SOCKET, 
CORE, ...} ? Where would it be simpler to do this ?
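For illustration, here is roughly what the hwloc side of such a detection 
could look like (just a sketch of the idea, not OPAL/ORTE code; each process 
would still have to exchange the result with its node-local peers in order 
to compare):

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Where is *this* process currently bound? */
    hwloc_cpuset_t set = hwloc_bitmap_alloc();
    hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS);

    /* Walk up from the smallest object covering the binding to the socket.
     * Two local processes reporting the same logical index share a socket. */
    hwloc_obj_t obj = hwloc_get_obj_covering_cpuset(topo, set);
    while (obj && obj->type != HWLOC_OBJ_SOCKET) {
        obj = obj->parent;
    }
    if (obj) {
        printf("bound inside socket %u\n", obj->logical_index);
    } else {
        printf("binding spans several sockets (or is not set)\n");
    }

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    return 0;
}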


Thanks.
Sylvain



Re: [OMPI devel] Crash in orte_iof_hnp_read_local_handler

2016-02-26 Thread Sylvain Jeaugey
No, the child processes are only calling MPIX_Query_cuda_support which 
is just "return OPAL_CUDA_SUPPORT". I can reproduce the problem with 
"ls" (see above).


I don't have the line numbers, but from the calling stack, the only way 
it could segfault is that ">stdinev->daemon" is wrong in 
orte_iof_hnp_read_local_handler (orte/mca/iof/hnp/iof_hnp_read.c:145).


Which means that the cbdata passed from libevent to 
orte_iof_hnp_read_local_handler() is wrong or was destroyed/freed. 
The crash seems to happen after some ranks have already finished (but others 
haven't started yet).


Finally, I found how to reproduce it easily. You need to have orted do 3 
things at the same time : process stdout (child processes writing to 
stdout), stdin (I'm hitting enter to produce stdin to mpirun) and TCP 
connections (mpirun spanning multiple nodes). If run within a single node, I 
get no crash; if I don't hit "enter", no crash. If I call "sleep 1" 
instead of "ls /", no crash.


So I run this loop :
  while mpirun -host  -np 6 ls /; do true; done
  

I'm not sure why MTT is reproducing the error ... does it write to 
mpirun stdin ?


On 02/26/2016 11:46 AM, Ralph Castain wrote:

So the child processes are not calling orte_init or anything like that? I can 
check it - any chance you can give me a line number via a debug build?


On Feb 26, 2016, at 11:42 AM, Sylvain Jeaugey <sjeau...@nvidia.com> wrote:

I got this strange crash on master last night running nv/mpix_test :

Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x50
[ 0] /lib64/libpthread.so.0(+0xf710)[0x7f9f19a80710]
[ 1] 
/ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/libopen-rte.so.0(orte_util_compare_name_fields+0x81)[0x7f9f1a88f6d7]
[ 2] 
/ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/openmpi/mca_iof_hnp.so(orte_iof_hnp_read_local_handler+0x247)[0x7f9f1109b4ab]
[ 3] 
/ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0xbf1)[0x7f9f1a5b68f1]
[ 4] mpirun[0x405649][drossetti-ivy4:31651] [ 5] mpirun[0x403a48]
[ 6] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9f196fbd1d]
[ 7] mpirun[0x4038e9]
*** End of error message ***

This test is not even calling MPI_Init/Finalize, only MPIX_Query_cuda_support. 
So it is really an ORTE race condition, and the problem is hard to reproduce. 
It sometimes takes more than 50 runs with random sleep between runs to see the 
problem.

I don't even know if we want to fix that -- what do you think ?

Sylvain







[OMPI devel] Crash in orte_iof_hnp_read_local_handler

2016-02-26 Thread Sylvain Jeaugey

I got this strange crash on master last night running nv/mpix_test :

Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x50
[ 0] /lib64/libpthread.so.0(+0xf710)[0x7f9f19a80710]
[ 1] 
/ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/libopen-rte.so.0(orte_util_compare_name_fields+0x81)[0x7f9f1a88f6d7]
[ 2] 
/ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/openmpi/mca_iof_hnp.so(orte_iof_hnp_read_local_handler+0x247)[0x7f9f1109b4ab]
[ 3] 
/ivylogin/home/sjeaugey/tests/mtt/scratches/mtt-scratch-4/installs/eGXW/install/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0xbf1)[0x7f9f1a5b68f1]

[ 4] mpirun[0x405649][drossetti-ivy4:31651] [ 5] mpirun[0x403a48]
[ 6] /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f9f196fbd1d]
[ 7] mpirun[0x4038e9]
*** End of error message ***

This test is not even calling MPI_Init/Finalize, only 
MPIX_Query_cuda_support. So it is really an ORTE race condition, and the 
problem is hard to reproduce. It sometimes takes more than 50 runs with 
random sleep between runs to see the problem.
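For reference, the test boils down to something like this (a minimal 
reconstruction, not the exact nv/mpix_test source):

/* Sketch of the test: no MPI_Init/MPI_Finalize, only the Open MPI
 * CUDA-awareness extension is queried. Build with mpicc. */
#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h>

int main(void)
{
#if defined(MPIX_CUDA_AWARE_SUPPORT)
    printf("CUDA support: %d\n", MPIX_Query_cuda_support());
#else
    printf("MPIX_CUDA_AWARE_SUPPORT is not defined\n");
#endif
    return 0;
}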


I don't even know if we want to fix that -- what do you think ?

Sylvain





Re: [OMPI devel] [OMPI users] configuring open mpi 10.1.2 with cuda on NVIDIA TK1

2016-01-22 Thread Sylvain Jeaugey

[Moving To Devel]

I tried to look at the configure to understand why the hwloc part failed 
at getting the CUDA path. I guess the --with-cuda information is not 
propagated to the hwloc part of the configure.


If an m4 expert has an idea of how to do this The Right Way, that 
would help.


Thanks,
Sylvain

On 01/22/2016 10:07 AM, Sylvain Jeaugey wrote:
It looks like the errors are produced by the hwloc configure ; this 
one somehow can't find CUDA (I have to check if that's a problem btw). 
Anyway, later in the configure, the VT configure finds cuda correctly, 
so it seems specific to the hwloc configure.


On 01/22/2016 10:01 AM, Kuhl, Spencer J wrote:


Hi Sylvain,


The configure does not stop, 'make all install' completes.  After 
remaking and recompiling (ignoring the configure errors) and 
confirming both a functional cuda install and a functional openmpi 
install, I went to the /usr/local/cuda/samples directory, ran 
'make', and successfully ran 'simpleMPI' provided by NVIDIA.  The 
output suggested that everything works perfectly fine between openMPI 
and cuda on my Jetson TK1 install.  Because of this, I think it is as 
you suspected; it was just ./configure output noise.



What a frustrating exercise.  Thanks for the suggestion.  I think I 
can say 'case closed'



Spencer





*From:* users <users-boun...@open-mpi.org> on behalf of Sylvain 
Jeaugey <sjeau...@nvidia.com>

*Sent:* Friday, January 22, 2016 11:34 AM
*To:* us...@open-mpi.org
*Subject:* Re: [OMPI users] configuring open mpi 10.1.2 with cuda on 
NVIDIA TK1

Hi Spencer,

Could you be more specific about what fails ? Did the configure stop 
at some point ? Or is it a compile error during the build ?


I'm not sure the errors you are seeing in config.log are actually the 
real problem (I'm seeing the same error traces on a perfectly working 
machine). Not pretty, but maybe just noise.


Thanks,
Sylvain

On 01/22/2016 06:48 AM, Kuhl, Spencer J wrote:


Thanks for the suggestion Ryan, I will remove the symlinks and 
try again.  I checked config.log, and it appears that the configure 
finds cuda support (result: yes), but once configure checks for 
cuda.h usability, conftest.c reports that a fatal error occurred: 
'cuda.h: no such file or directory.'



I have copied here some grep'ed output of config.log


$ ./configure --prefix=/usr/local --with-cuda=/usr/local/cuda-6.5 
--enable-mpi-java

configure:9829: checking if --with-cuda is set
configure:9883: result: found (/usr/local/cuda-6.5/include/cuda.h)
| #include <cuda.h>
configure:10055: checking if have cuda support
configure:10058: result: yes (-I/usr/local/cuda-6.5)
configure:66435: result: '--prefix=/usr/local' '--with-cuda=/usr/local/cuda-6.5' '--enable-mpi-java'

configure:74182: checking cuda.h usability
conftest.c:643:18: fatal error: cuda.h: No such file or directory
 #include <cuda.h>
| #include <cuda.h>
configure:74182: checking cuda.h presence
conftest.c:610:18: fatal error: cuda.h: No such file or directory
 #include <cuda.h>
| #include <cuda.h>
configure:74182: checking for cuda.h
configure:74265: checking cuda_runtime_api.h usability
conftest.c:643:30: fatal error: cuda_runtime_api.h: No such file or directory
 #include <cuda_runtime_api.h>
| #include <cuda_runtime_api.h>
configure:74265: checking cuda_runtime_api.h presence
conftest.c:610:30: fatal error: cuda_runtime_api.h: No such file or directory
 #include <cuda_runtime_api.h>
| #include <cuda_runtime_api.h>
configure:74265: checking for cuda_runtime_api.h
configure:97946: running /bin/bash './configure' --disable-dns --disable-http --disable-rpc --disable-openssl --enable-thread-support --disable-evport '--prefix=/usr/local' '--with-cuda=/usr/local/cuda-6.5' '--enable-mpi-java' --cache-file=/dev/null --srcdir=. --disable-option-checking

configure:187066: result: verbs_usnic, ugni, sm, verbs, cuda
configure:193532: checking for MCA component common:cuda compile mode
configure:193585: checking if MCA component common:cuda can compile




*From:* users <users-boun...@open-mpi.org> on behalf of Novosielski, 
Ryan <novos...@ca.rutgers.edu>

*Sent:* Friday, January 22, 2016 1:20 AM
*To:* Open MPI Users
*Subject:* Re: [OMPI users] configuring open mpi 10.1.2 with cuda on 
NVIDIA TK1
I would check config.log carefully to see what specifically failed 
or wasn't found where. I would never mess around with the contents 
of /usr/include. That is sloppy stuff and likely to get you into 
trouble someday.


*Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
Ryan Novosielski - Senior Technologist
Rutgers Biomedical and Health Sciences
novos...@rutgers.edu - 973/972.0922 (2x0922)
OIRT/High Perf & Res Comp - MSB C630, Newark

On Jan 21, 2016, at 17:45, Kuhl, Spencer J <spencer-k...@uiowa.edu> 
wrote:




Openm

Re: [OMPI devel] FOSS for scientists devroom at FOSDEM 2013

2012-11-20 Thread Sylvain Jeaugey

Hi Jeff,

Do you mean "attend" or "do a talk" ?

Sylvain

On 20/11/2012 16:16, Jeff Squyres wrote:

Cool!  Thanks for the invite.

Do we have any European friends who would be able to attend this conference?


On Nov 20, 2012, at 10:02 AM, Sylwester Arabas wrote:


Dear Open MPI Team,

A day-long session ("devroom") on Free/Libre and Open Source Software (FLOSS) 
for scientists will be held during the next FOSDEM conference, Brussels, 2-3 February 
2013 (http://fosdem.org/2013).

We aim at having a dozen or two short talks introducing projects, advertising 
brand new features of established tools, discussing issues relevant to the 
development of software for scientific computing, and touching on the 
interdependence of FLOSS and open science.

You can find more info on the call for talks at:
http://slayoo.github.com/fosdem2013/

The deadline for sending talk proposals is December 16th 2012.

Please send your submissions or comments to:
foss4scientists-devr...@lists.fosdem.org

Please do forward this message to anyone potentially interested.  Please also 
let us know if you have any suggestions for what would you like to hear about 
in the devroom.

Looking forward to meeting you in Brussels.
Thanks in advance.

The conveners,
Sylwester Arabas, Juan Antonio Añel, Christos Siopis

P.S. There are open calls for main-track talks, lightning talks, and stands at 
FOSDEM as well, see: https://www.fosdem.org/2013/

--
http://www.igf.fuw.edu.pl/~slayoo/







Re: [OMPI devel] poor btl sm latency

2012-02-13 Thread sylvain . jeaugey
Hi Matthias,

You might want to play with process binding to see if your problem is 
related to bad memory affinity.

Try to launch pingpong on two CPUs of the same socket, then on different 
sockets (i.e. bind each process to a core, and try different 
configurations).

Sylvain



From:  Matthias Jurenz 
To:    Open MPI Developers 
Date:  13/02/2012 12:12
Subject:  [OMPI devel] poor btl sm latency
Sent by:  devel-boun...@open-mpi.org



Hello all,

on our new AMD cluster (AMD Opteron 6274, 2.2GHz) we get very bad latencies 
(~1.5us) when performing 0-byte p2p communication on one single node using 
the Open MPI sm BTL. When using Platform MPI we get ~0.5us latencies, which 
is pretty good. The bandwidth results are similar for both MPI 
implementations (~3.3GB/s) - this is okay.

One node has 64 cores and 64GB RAM, and it doesn't matter how many ranks 
are allocated by the application; we get similar results with different 
numbers of ranks.

We are using Open MPI 1.5.4, built with gcc 4.3.4 without any special 
configure options except the installation prefix and the location of the 
LSF stuff.

As mentioned at http://www.open-mpi.org/faq/?category=sm we tried to use 
/dev/shm instead of /tmp for the session directory, but it had no effect. 
Furthermore, we tried the current release candidate 1.5.5rc1 of Open MPI, 
which provides an option to use SysV shared memory (-mca shmem sysv) - this 
also results in similarly poor latencies.

Do you have any idea? Please help!

Thanks,
Matthias



Re: [OMPI devel] Bull Vendor ID disappeared from IB ini file

2011-09-07 Thread sylvain . jeaugey
True. I'm very sorry. I could have sworn it was this patch. And I 
double-checked in SVN _and_ HG that it was this one. But now I confirm it's 
Ralph's (very explicit) patch, and the patch I was referring to really does 
what it claims.

Weird.

Btw, commit done.

Sylvain

devel-boun...@open-mpi.org wrote on 07/09/2011 16:00:18 :

> From : Rolf vandeVaart <rvandeva...@nvidia.com>
> To : Open MPI Developers <de...@open-mpi.org>
> Date : 07/09/2011 16:00
> Subject : Re: [OMPI devel] Bull Vendor ID disappeared from IB ini file
> Sent by : devel-boun...@open-mpi.org
> 
> 
> Actually, I think you are off by which commit undid the change.  It 
> was this one.  And the message does suggest it might have caused 
problems.
> 
> https://svn.open-mpi.org/trac/ompi/changeset/23764
> Timestamp:
> 09/17/10 19:04:06 (12 months ago) 
> Author:
> rhc
> Message:
> WARNING: Work on the temp branch being merged here encountered 
> problems with bugs in subversion. Considerable effort has gone into 
> validating the branch. However, not all conditions can be checked, 
> so users are cautioned that it may be advisable to not update from 
> the trunk for a few days to allow MTT to identify platform-specific 
issues.
>This merges the branch containing the revamped build system based
> around converting autogen from a bash script to a Perl program. Jeff
> has provided emails explaining the features contained in the change.
> Please note that configure requirements on components HAVE 
> CHANGED. For example. a configure.params file is no longer required 
> in each component directory. See Jeff's emails for an explanation.
> 
> 
> 
> ________
> From: devel-boun...@open-mpi.org [devel-boun...@open-mpi.org] On 
> Behalf Of Sylvain Jeaugey [sylvain.jeau...@bull.net]
> Sent: Wednesday, September 07, 2011 8:56 AM
> To: Open MPI Developers
> Subject: [OMPI devel] Bull Vendor ID disappeared from IB ini file
> 
> Hi All,
> 
> I just realized that Bull Vendor IDs for Infiniband cards disappeared 
from
> the trunk. Actually, they were removed shortly after we included them in
> last September.
> 
> The original commit was :
> r23715 | derbeyn | 2010-09-03 16:13:19 +0200 (Fri, 03 Sep 2010) | 1 line
> Added Bull vendor id for ConnectX card
> 
> An here is the commit that undid Nadia's patch :
> r23791 | swise | 2010-09-22 20:16:53 +0200 (Wed, 22 Sep 2010) | 2 lines
> Add T4 device IDs to openib btl params ini file.
> 
> It does indeed add some T4 device IDs and removes our vendor ID. The 
other
> thing that bugs me is that unlike the commit message suggests, this 
patch
> does a lot more than adding T4 device ids. So, It looks like something
> went wrong on this commit (something like : I forgot to update and 
forced
> the commit) and it may be worth checking nothing else were reverted with
> this commit ...
> 
> Sylvain
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
---
> This email message is for the sole use of the intended recipient(s) 
> and may contain
> confidential information.  Any unauthorized review, use, disclosure 
> or distribution
> is prohibited.  If you are not the intended recipient, please 
> contact the sender by
> reply email and destroy all copies of the original message.
> 
---
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] Bull Vendor ID disappeared from IB ini file

2011-09-07 Thread sylvain . jeaugey
v1.4 and v1.5 seem fine. So, it's only missing in the trunk.

I'll commit this asap.

Thanks for your explanations,
Sylvain



From:  Jeff Squyres <jsquy...@cisco.com>
To:    Open MPI Developers <de...@open-mpi.org>
Date:  07/09/2011 17:21
Subject:  Re: [OMPI devel] Bull Vendor ID disappeared from IB ini file
Sent by:  devel-boun...@open-mpi.org



+1.  Sorry about that, Sylvain -- please re-commit.

Is the right stuff on v1.4 / v1.5?


On Sep 7, 2011, at 10:04 AM, Ralph Castain wrote:

> Quite possible - subversion was having its typical convulsions over the 
configure system change as there were lots of conflicting commits during 
that time. I'd suggest just re-committing your change.
> 
> 
> On Sep 7, 2011, at 8:00 AM, Rolf vandeVaart wrote:
> 
>> 
>> Actually, I think you are off by which commit undid the change.  It was 
this one.  And the message does suggest it might have caused problems.
>> 
>> https://svn.open-mpi.org/trac/ompi/changeset/23764
>> Timestamp:
>>   09/17/10 19:04:06 (12 months ago) 
>> Author:
>>   rhc
>> Message:
>>   WARNING: Work on the temp branch being merged here encountered 
problems with bugs in subversion. Considerable effort has gone into 
validating the branch. However, not all conditions can be checked, so 
users are cautioned that it may be advisable to not update from the trunk 
for a few days to allow MTT to identify platform-specific issues.
>>  This merges the branch containing the revamped build system based 
around converting autogen from a bash script to a Perl program. Jeff has 
provided emails explaining the features contained in the change.
>>   Please note that configure requirements on components HAVE CHANGED. 
For example. a configure.params file is no longer required in each 
component directory. See Jeff's emails for an explanation.
>> 
>> 
>> 
>> ____
>> From: devel-boun...@open-mpi.org [devel-boun...@open-mpi.org] On Behalf 
Of Sylvain Jeaugey [sylvain.jeau...@bull.net]
>> Sent: Wednesday, September 07, 2011 8:56 AM
>> To: Open MPI Developers
>> Subject: [OMPI devel] Bull Vendor ID disappeared from IB ini file
>> 
>> Hi All,
>> 
>> I just realized that Bull Vendor IDs for Infiniband cards disappeared 
from
>> the trunk. Actually, they were removed shortly after we included them 
in
>> last September.
>> 
>> The original commit was :
>> r23715 | derbeyn | 2010-09-03 16:13:19 +0200 (Fri, 03 Sep 2010) | 1 
line
>> Added Bull vendor id for ConnectX card
>> 
>> An here is the commit that undid Nadia's patch :
>> r23791 | swise | 2010-09-22 20:16:53 +0200 (Wed, 22 Sep 2010) | 2 lines
>> Add T4 device IDs to openib btl params ini file.
>> 
>> It does indeed add some T4 device IDs and removes our vendor ID. The 
other
>> thing that bugs me is that unlike the commit message suggests, this 
patch
>> does a lot more than adding T4 device ids. So, It looks like something
>> went wrong on this commit (something like : I forgot to update and 
forced
>> the commit) and it may be worth checking nothing else were reverted 
with
>> this commit ...
>> 
>> Sylvain
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
---
>> This email message is for the sole use of the intended recipient(s) and 
may contain
>> confidential information.  Any unauthorized review, use, disclosure or 
distribution
>> is prohibited.  If you are not the intended recipient, please contact 
the sender by
>> reply email and destroy all copies of the original message.
>> 
---
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/





[OMPI devel] Bull Vendor ID disappeared from IB ini file

2011-09-07 Thread Sylvain Jeaugey

Hi All,

I just realized that Bull Vendor IDs for Infiniband cards disappeared from 
the trunk. Actually, they were removed shortly after we included them last 
September.


The original commit was :
r23715 | derbeyn | 2010-09-03 16:13:19 +0200 (Fri, 03 Sep 2010) | 1 line
Added Bull vendor id for ConnectX card

And here is the commit that undid Nadia's patch :
r23791 | swise | 2010-09-22 20:16:53 +0200 (Wed, 22 Sep 2010) | 2 lines
Add T4 device IDs to openib btl params ini file.

It does indeed add some T4 device IDs and removes our vendor ID. The other 
thing that bugs me is that, unlike the commit message suggests, this patch 
does a lot more than adding T4 device IDs. So, it looks like something 
went wrong on this commit (something like : I forgot to update and forced 
the commit) and it may be worth checking that nothing else was reverted with 
this commit ...


Sylvain


Re: [OMPI devel] "Open MPI"-based MPI library used by K computer

2011-06-29 Thread sylvain . jeaugey
Kawashima-san,

Congratulations on your machine, this is a stunning achievement !

> Kawashima  wrote :
> Also, we modified tuned COLL to implement interconnect-and-topology-
> specific bcast/allgather/alltoall/allreduce algorithm. These algorithm
> implementations also bypass PML/BML/BTL to eliminate protocol and 
software
> overhead.
This seems perfectly valid to me. The current coll components use normal 
MPI_Send/Recv semantics, hence the PML/BML/BTL chain, but I always saw the 
coll framework as a way to smoothly integrate "custom" collective components 
for a specific interconnect. I think Mellanox also did a specific collective 
component using their ConnectX HCA capabilities directly.

However, modifying the "tuned" component may not be the best way to 
integrate your collective work. You may consider creating a "tofu" coll 
component which would only provide the collectives you optimized (and the 
coll framework will fall back on tuned for the ones you didn't optimize).
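To illustrate the idea (just a sketch, not Fujitsu's code, and simplified 
with respect to the real coll framework -- module construction and the 
enable callback are omitted, and the names are from memory): the component's 
comm_query would return a module that only fills in the collectives it 
optimizes, leaving the rest NULL so that the framework completes them from 
lower-priority components such as tuned.

/* Hypothetical "tofu" coll component query function (sketch only). */
#include "ompi/mca/coll/coll.h"

/* Tofu-optimized implementations (hypothetical), defined elsewhere with the
 * mca_coll_base_module_*_fn_t signatures from coll.h. */
extern mca_coll_base_module_bcast_fn_t     mca_coll_tofu_bcast;
extern mca_coll_base_module_allreduce_fn_t mca_coll_tofu_allreduce;

static mca_coll_base_module_t tofu_module;

mca_coll_base_module_t *mca_coll_tofu_comm_query(struct ompi_communicator_t *comm,
                                                 int *priority)
{
    *priority = 80;   /* above tuned's default priority */

    /* Publish only what we optimize; entries left NULL are provided by
     * lower-priority components (e.g. tuned) during comm selection. */
    tofu_module.coll_bcast     = mca_coll_tofu_bcast;
    tofu_module.coll_allreduce = mca_coll_tofu_allreduce;
    /* coll_alltoall, coll_gather, ... stay NULL -> fall back to tuned. */
    return &tofu_module;
}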

> To achieve above, we created 'tofu COMMON', like sm 
(ompi/mca/common/sm/).
> 
> Is there interesting one?
It may be interesting, yes. I don't know the tofu model, but if it is not 
secret, contributing it is usually a good thing.

Your communication model may be similar to others and portions of code may 
be shared with other technologies (I'm thinking of IB, MX, PSM,...). 
People writing new code would also consider your model and let you take 
advantage of it. Knowing how tofu is integrated into Open MPI may also 
impact major decisions the open-source community is taking.

Sylvain

Re: [OMPI devel] BTL preferred_protocol , large message

2011-03-10 Thread Sylvain Jeaugey

On Wed, 9 Mar 2011, George Bosilca wrote:

One gets multiple non-overlapping BTL (in terms of peers), each with its 
own set of parameters and eventually accepted protocols. Mainly there 
will be one BTL per memory hierarchy.

Pretty cool :-)


I'll cleanup the code and send you a patch.

We'd be happy to review/test/discuss it.

Sylvain


Re: [OMPI devel] BTL preferred_protocol , large message

2011-03-09 Thread Sylvain Jeaugey

Hi George,

This certainly looks like our motivations are close. However, I don't see 
in the presentation how you implement it (maybe I misread it), especially 
how you manage to not modify the BTL interface.


Do you have any code / SVN commit references for us to better understand 
what it's about ?


Thanks,
Sylvain

On Tue, 8 Mar 2011, George Bosilca wrote:



On Mar 8, 2011, at 12:12 , Damien Guinier wrote:


Hi Jeff


Sorry, your email went on the devel mailing list of Open MPI.


I'm working on large message exchange optimization. My optimization consists in 
"choosing the best protocol for each large message".
In fact,
- for each device, the way to choose the best protocol is different.
- the fastest protocol for a given device depends on that device hardware and on 
the message specifications.

So the device/BTL itself is the best place to dynamically select the fastest 
protocol.

Presently, for large messages, the protocol selection is only based on device 
capabilities.
My optimization consists in asking the device/BTL for a "preferred protocol" and
then make a choice based on :
   - the device capabilities and the BTL's recommendation.


As a BTL will not randomly change its preferred protocol, one can assume 
it will depend on the peer. Here is a similar approach to one you 
describe in your email, but without modification of the BTL interface.


https://fs.hlrs.de/projects/eurompi2010/TALKS/WEDNESDAY_AFTERNOON/george_bosilca_locality_and_topology_aware.pdf

 george.





Technical view:
The optimization is located in mca_pml_ob1_send_request_start_btl(), after the 
device/btl selection.
In the large message section, I call a new function :
   mca_pml_ob1_preferred_protocol() => mca_bml_base_preferred_protocol()
This one will try to launch
   btl->btl_preferred_protocol()
So, selecting a protocol before a large message is not in the critical path.
It is the BTL's responsibility to define this function to select a preferred 
protocol.

If this function is not defined, nothing changes in the code path.
To do this optimization, I had to add an interface to the btl module structure in 
"btl.h", this is the drawback.



I have already used this feature to optimize the "shared memory" device/BTL. I use the 
"preferred_protocol" feature to enable/disable
KNEM according to intra/inter socket communication. This optimization increases a 
"IMB pingping benchmark" bandwidth by ~36%.



The next step is now to use the "preferred protocol" feature with openib ( with 
many IB cards)



Attached 2 patches:
1) BTL_preferred.patch:
  introduces the new preferred protocol interface
2) SM_KNEM_intra_socket.patch:
  defines the preferred protocol for the sm btl
  Note: Since the "ess" framework can't give us the "socket locality
information", I used hitopo that has been proposed in an RFC
some times ago:
http://www.open-mpi.org/community/lists/devel/2010/11/8677.php





"I disapprove of what you say, but I will defend to the death your right to say 
it"
 -- Evelyn Beatrice Hall





Re: [OMPI devel] [RFC] Hierarchical Topology

2010-11-16 Thread Sylvain Jeaugey

On Mon, 15 Nov 2010, Ralph Castain wrote:


Guess I am a little confused. Every MPI process already has full knowledge
of what node all other processes are located on - this has been true for
quite a long time.

Ok, I didn't see that.


Once my work is complete, mpirun will have full knowledge of each node's
hardware resources. Terry will then use that in mpirun's mappers. The
resulting launch message will contain a full mapping of procs to cores -
i.e., every daemon will know the core placement of every process in the job.
That info will be passed down to each MPI proc. Thus, upon launch, every MPI
process will know not only the node for each process, but also the hardware
resources of that node, and the bindings of every process in the job to that
hardware.

Allright.

Some things bug me however :
 1. What if the placement has been done by a wrapper script or by the 
resource manager ? I.e. how do you know where MPI procs are located ?
 2. How scalable is it ? I would think there is an allgather with 1 process 
per node ; am I right ?

 3. How is that information represented ? As a graph ?


So the only thing missing is the switch topology of the cluster (the
inter-node topology). We modified carto a while back to support input of
switch topology information, though I'm not sure how many people ever used
that capability - not much value in it so far. We just set it up so that
people could describe the topology, and then let carto compute hop distance.

Ok. I didn't know we also had some work on switches in carto.


HTH

This helps !

So, I'm now wondering if these two efforts, which seem similar, are really 
redundant. We thought about this before starting hitopo, and since a graph 
didn't fit our needs, we started working towards computing an address. 
Perhaps hitopo addresses could be computed using hwloc's graph.


I understand that for sm optimization, hwloc is richer. The only thing 
that bugs me is how much time it takes to figure out what capability I 
have between process A and B. The great thing in hitopo is that a single 
comparison can give you a property of two processes (e.g. they are on the 
same socket).
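To give an idea of what I mean by a single comparison (a purely hypothetical 
illustration, not hitopo's actual data structure):

#include <stdbool.h>

/* Hypothetical hitopo-style hierarchical address: one integer per level,
 * renumbered from 0. Two processes share a level iff their addresses match
 * on that level and on every level above it. */
enum { HT_ISLAND, HT_SWITCH, HT_NODE, HT_SOCKET, HT_CORE, HT_NLEVELS };

typedef struct { int level[HT_NLEVELS]; } ht_addr_t;

static bool ht_same(const ht_addr_t *a, const ht_addr_t *b, int upto)
{
    for (int l = 0; l <= upto; l++) {
        if (a->level[l] != b->level[l]) {
            return false;
        }
    }
    return true;
}

/* e.g. ht_same(&me, &peer, HT_SOCKET) -> "peer is on my socket" */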


Anyway, I just wanted to present hitopo in case someone would need it. And 
I think hitopo's preferred domain remains collectives, where you do not 
really need distances, but groups which share a certain locality.


Sylvain


On Mon, Nov 15, 2010 at 9:00 AM, Sylvain Jeaugey
<sylvain.jeau...@bull.net> wrote:


I already mentionned it answering Terry's e-mail, but to be sure I'm clear
: don't confuse node full topology with MPI job topology. It _is_ different.

And every process does not get the whole topology in hitopo, only its own,
which should not cause storms.


On Mon, 15 Nov 2010, Ralph Castain wrote:

 I think the two efforts (the paffinity and this one) do overlap somewhat.

I've been writing the local topology discovery code for Jeff, Terry, and
Josh - uses hwloc (or any other method - it's a framework) to discover
what
hardware resources are available on each node in the job so that the info
can be used in mapping the procs.

As part of that work, we are passing down to the mpi processes the local
hardware topology. This is done because of prior complaints when we had
each
mpi process discover that info for itself - it creates a bit of a "storm"
on
the node of large smp's.

Note that what I've written (still to be completed before coming over)
doesn't tell the proc what cores/HT's it is bound to - that's the part
Terry
et al are adding. Nor were we discovering the switch topology of the
cluster.

So a little overlap that could be resolved. And a concern on my part: we
have previously introduced capabilities that had every mpi process read
local system files to get node topology, and gotten user complaints about
it. We probably shouldn't go back to that practice.

Ralph


On Mon, Nov 15, 2010 at 8:15 AM, Terry Dontje <terry.don...@oracle.com> wrote:


  A few comments:


1.  Have you guys considered using hwloc for level 4-7 detection?
2.  Is L2 related to L2 cache?  If no then is there some other term you
could use?
3.  What do you see if the process is bound to multiple
cores/hyperthreads?
4.  What do you see if the process is not bound to any level 4-7 items?
5.  What about L1 and L2 cache locality as some levels? (hwloc exposes
these but these are also at different depths depending on the platform).

Note I am working with Jeff Squyres and Josh Hursey on some new paffinity
code that uses hwloc.  Though the paffinity code may not have direct
relationship to hitopo the use of hwloc and standardization of what you
call
level 4-7 might help avoid some user confusions.

--td


On 11/15/2010 06:56 AM, Sylvain Jeaugey wrote:

As a follow-up to Stuttgart's developer's meeting, here is an RFC for our
topology detection framework.

WHAT: Add a framework for hardware topology detection to be used by any
other part of Open MPI to help optimization.

WHY: Collective operations or shared memory algorithms among others may
have optimizations depending on the hardware relationship between two MPI
processes. ...

Re: [OMPI devel] [RFC] Hierarchical Topology

2010-11-15 Thread Sylvain Jeaugey

On Mon, 15 Nov 2010, Terry Dontje wrote:


A few comments:

1.  Have you guys considered using hwloc for level 4-7 detection?
Yes, and I agree there may be something to improve on level 4-7 detection. 
But note that hitopo differs from hwloc because it is not discovering the 
whole machine, only where MPI processes have been spawned. More on this 
after.


2.  Is L2 related to L2 cache?  If no then is there some other term you could 
use?
It is not L2 cache. However, claiming that L2 is always related to L2 
cache is a bit exaggerated in my opinion. The term in hitopo is "L2NUMA", 
which seems clear to me. And there are L2 Infiniband switches, L2 
support, ... :-)



3.  What do you see if the process is bound to multiple cores/hyperthreads?
4.  What do you see if the process is not bound to any level 4-7 items?
Currently (and this is not optimal), as soon as the process is not bound 
to 1 core, the cpuid component returns nothing (no socket, no core). We 
could improve this by returning only the socket when we are bound to a 
socket.


When placement is not per-core, socket number will therefore be 0 and core 
number will be renumbered by the "renumber" phase from 0 to N (N being the 
number of MPI processes on the node).


Hyperthreads are only used if two processes are bound to the same core (the 
renumber phase will mark them as 0, 1, ...).


5.  What about L1 and L2 cache locality as some levels? (hwloc exposes these 
but these are also at different depths depending on the platform).
This is something hitopo doesn't [want to] show. But we could imagine 
calling hwloc to know the properties of MPI processes on the same 
core/socket/...


Note I am working with Jeff Squyres and Josh Hursey on some new paffinity 
code that uses hwloc.  Though the paffinity code may not have direct 
relationship to hitopo the use of hwloc and standardization of what you call 
level 4-7 might help avoid some user confusions.
I agree there is a big potential for confusion between hwloc, carto, 
hitopo, ... One could think we should share code, but that is often not 
possible or not what we want.


My (maybe incorrect) vision is that hwloc and carto discover the hardware 
topology, i.e. what exists on the node (not what will be used). This is 
used by placement modules or btls to know what resources to use when 
launching processes.


HiTopo provides where (inside this discovery) MPI processes end up being 
spawned [btw, not only intra-node but also inter-node]. We could get this 
information from the Open MPI components that do the spawning, but since 
that is not enough (the resource manager may do part of the binding), we 
re-do the discovery in the end.


To sum up, here is the complete picture as I see it :

[ 0. Resource manager restricts node/cpu/io/mem sets ]
  1. Hwloc discovers what's available for intra-node
  2. Spawning/placement is done by a combination of RMs, paffinity, ...
  3. HiTopo discovers what is used from intra- to inter- node.

Sylvain


On 11/15/2010 06:56 AM, Sylvain Jeaugey wrote:
As a follow-up to Stuttgart's developer's meeting, here is an RFC for our 
topology detection framework.


WHAT: Add a framework for hardware topology detection to be used by any 
other part of Open MPI to help optimization.


WHY: Collective operations or shared memory algorithms among others may 
have optimizations depending on the hardware relationship between two MPI 
processes. HiTopo is an attempt to provide it in a unified manner.


WHERE: ompi/mca/hitopo/

WHEN: When wanted.

== 
We developed the HiTopo framework for our collective operation component, 
but it may be useful for other parts of Open MPI, so we'd like to 
contribute it.


A wiki page has been setup :
https://svn.open-mpi.org/trac/ompi/wiki/HiTopo

and a bitbucket repository :
http://bitbucket.org/jeaugeys/hitopo/

In a few words, we have 3 steps in HiTopo :

 - Detection : each MPI process detects its topology at various levels :
- core/socket : through the cpuid component
- node : through gethostname
- switch/island : through openib (mad) or slurm
  [ Other topology detection components may be added for other
resource managers, specific hardware or whatever we want ...]

 - Collection : an allgather is performed to have all other processes' 
addresses


 - Renumbering : "string" addresses are converted to numbers starting at 0 
(Example : nodenames "foo" and "bar" are renamed 0 and 1).


Any comment welcome,
Sylvain



--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com






Re: [OMPI devel] Use of OPAL_PREFIX to relocate a lib

2010-11-04 Thread Sylvain Jeaugey

Hi Brian,

I finally found some time to test your patch and it solves my problem.

Thanks a lot !

Sylvain

On Wed, 27 Oct 2010, Barrett, Brian W wrote:


I found the issue - somehow, we let the priorities used in installdirs get lost 
when we rewrote part of the configure system a couple months ago.  I have a 
fix, but it involves changing the configure system, so I won't commit it until 
this evening.

Thanks for pointing out the bug!

Brian

On Oct 26, 2010, at 8:36 AM, Barrett, Brian W wrote:


I'll take a look at fixing this the right way today.

Since I wrote both the original autogen.sh that guaranteed static-components 
was ordered and the PREFIX code, I had considered it to be a documented feature 
that there was strong ordering in the static-components list.  So personally, 
I'd consider it a bug in autogen.pl, but I think we can work around it.

Brian

On Oct 26, 2010, at 6:01 AM, Sylvain Jeaugey wrote:


On Tue, 26 Oct 2010, Jeff Squyres wrote:


I don't think this is the right way to fix it.  Sorry!  :-(

I don't think it is the right way to do it either :-)


I say this because it worked somewhat by luck before, and now it's
broken.  If we put in another "it'll work because of a side effect of an
unintentional characteristic of the build system" hack, it'll just
likely break again someday if/when we change the build system.

I completely agree.


I'd prefer a more robust solution that won't break as a side-effect of
the build system.

I'd prefer that too, but it would require adding much more logic in the
framework, including component sorting by priority. And since no-one except
me seems to care about this functionality, I'm fine with this patch.

More generally, I understand your demand for high-quality patches that do
things The Right Way. However, I feel it's sometimes exaggerated,
especially when talking about parts of the code that don't meet these high
quality standards.

In the end, my feeling is that we don't replace very bad (broken) code
with bad (working) code because we want to wait for perfect (never
happening) code. I don't think it's beneficial to the project.

Sylvain



--
 Brian W. Barrett
 Dept. 1423: Scalable System Software
 Sandia National Laboratories






--
 Brian W. Barrett
 Dept. 1423: Scalable System Software
 Sandia National Laboratories






Re: [OMPI devel] Use of OPAL_PREFIX to relocate a lib

2010-10-26 Thread Sylvain Jeaugey

On Tue, 26 Oct 2010, Jeff Squyres wrote:


I don't think this is the right way to fix it.  Sorry!  :-(

I don't think it is the right way to do it either :-)

I say this because it worked somewhat by luck before, and now it's 
broken.  If we put in another "it'll work because of a side effect of an 
unintentional characteristic of the build system" hack, it'll just 
likely break again someday if/when we change the build system.

I completely agree.

I'd prefer a more robust solution that won't break as a side-effect of 
the build system.
I'd prefer that too, but it would require adding much more logic in the 
framework, including component sorting by priority. And since no-one except 
me seems to care about this functionality, I'm fine with this patch.


More generally, I understand your demand for high-quality patches that do 
things The Right Way. However, I feel it's sometimes exaggerated, 
especially when talking about parts of the code that don't meet these high 
quality standards.


In the end, my feeling is that we don't replace very bad (broken) code 
with bad (working) code because we want to wait for perfect (never 
happening) code. I don't think it's beneficial to the project.


Sylvain


Re: [OMPI devel] Use of OPAL_PREFIX to relocate a lib

2010-10-26 Thread Sylvain Jeaugey

Hi all,

This problem may be a detail, but it bugs me a lot, so I'd like to have it 
fixed. Here is a patch that changes the path setting algorithm to "last 
component wins" instead of "first component wins".


This is as wrong as the original code was, except that it is 
consistent with the way autogen.pl generates static-components.h.


If nobody objects, I'll commit it tomorrow.

Sylvain

diff -r c9746f7a683a opal/mca/installdirs/base/installdirs_base_components.c
--- a/opal/mca/installdirs/base/installdirs_base_components.c   Tue Oct 26 10:56:53 2010 +0200
+++ b/opal/mca/installdirs/base/installdirs_base_components.c   Tue Oct 26 12:48:41 2010 +0200
@@ -25,7 +25,7 @@

 #define CONDITIONAL_COPY(target, origin, field) \
 do {\
-if (origin.field != NULL && target.field == NULL) { \
+if (origin.field != NULL) { \
 target.field = origin.field;\
 }   \
 } while (0)

On Thu, 7 Oct 2010, Ralph Castain wrote:


What you are seeing is just the difference in how the build system (old vs new 
vs RPM script) travels across the directory tree. The new build system and RPM 
do it in alphabetical order, so config comes before env. The old autogen.sh did 
it in reverse alpha order, so env came before config. I don't think anyone 
thought it made a difference, though you correctly point to one place where it 
does.

Modifying the build system to have one place do it differently would be a 
mistake, IMO. The better solution would be to use priorities to order the 
processing, though that means two passes through the components (one to get the 
priorities, and then another to execute) and additional API functions in the 
various modules.


On Oct 7, 2010, at 6:25 AM, Sylvain Jeaugey wrote:


Hi list,

Remember this old bug ? I think I finally found out what was going wrong.

The opal "installdirs" framework has two static components : config and env. 
These components are automatically detected by the MCA system and they are listed in 
opal/mca/installdirs/base/static-components.h.

The problem is that no priority is given, while the order matters : the first 
opened component sets the value.

When I build the v1.5 branch, I get 1.env 2.config :
const mca_base_component_t *mca_installdirs_base_static_components[] = {
  &mca_installdirs_env_component,
  &mca_installdirs_config_component,
 NULL
};

When I build an RPM *or* the new default branch, I get 1.config 2.env :
const mca_base_component_t *mca_installdirs_base_static_components[] = {
  &mca_installdirs_config_component,
  &mca_installdirs_env_component,
 NULL
};

I don't know why the generated file is not consistent. It may be related to the 
order in which directories are created.

Anyway, the first case seems to be what was intended in the first place. Since 
config sets all the values, having it in first position makes env useless. 
Besides, in the first configuration, env only needs to set OPAL_PREFIX and 
since config sets all other paths relative to ${prefix}, it works.

So, how could we solve this ?

1. Make autogen/configure/whatever generate a consistent static-components.h 
with env THEN config ;

2. Add priorities to these components so that env is opened first regardless of 
its position in the static components array ;

3. Any other idea ?

Sylvain

On Fri, 19 Jun 2009, Sylvain Jeaugey wrote:


On Thu, 18 Jun 2009, Jeff Squyres wrote:


On Jun 18, 2009, at 11:25 AM, Sylvain Jeaugey wrote:

My problem seems related to library generation through RPM, not with
1.3.2, nor the patch.

I'm not sure I understand -- is there something we need to fix in our SRPM?


I need to dig a bit, but here is the thing : I generated an RPM from the 
official openmpi-1.3.2-1.src.rpm (with some defines like install-in-opt, ...) 
and the OPAL_PREFIX trick doesn't seem to work with it.

But don't take too much time on this, I'll find out why and maybe this is just 
me building it the wrong way.

Sylvain







Re: [OMPI devel] New Romio for OpenMPI available in bitbucket

2010-10-07 Thread Sylvain Jeaugey

On Wed, 29 Sep 2010, Ashley Pittman wrote:


On 17 Sep 2010, at 11:36, Pascal Deveze wrote:

Hi all,

In charge of ticket 1888 (see at 
https://svn.open-mpi.org/trac/ompi/ticket/1888) ,
I have put the resulting code in bitbucket at:
http://bitbucket.org/devezep/new-romio-for-openmpi/

The work in this repo consisted in refreshing ROMIO to a newer
version: the one from the very last MPICH2 release (mpich2-1.3b1).


Is there any word on when this will be pulled into the mainline?
I would say as soon as someone has time to review the changes and 
acknowledge them.


Maybe we should go the RFC way and set a timeout.

Sylvain


Re: [OMPI devel] Use of OPAL_PREFIX to relocate a lib

2010-10-07 Thread Sylvain Jeaugey

Hi list,

Remember this old bug ? I think I finally found out what was going wrong.

The opal "installdirs" framework has two static components : config and 
env. These components are automatically detected by the MCA system and 
they are listed in opal/mca/installdirs/base/static-components.h.


The problem is that no priority is given, while the order matters : the 
first opened component sets the value.


When I build the v1.5 branch, I get 1.env 2.config :
const mca_base_component_t *mca_installdirs_base_static_components[] = {
  &mca_installdirs_env_component,
  &mca_installdirs_config_component,
  NULL
};

When I build an RPM *or* the new default branch, I get 1.config 2.env :
const mca_base_component_t *mca_installdirs_base_static_components[] = {
  &mca_installdirs_config_component,
  &mca_installdirs_env_component,
  NULL
};

I don't know why the generated file is not consistent. It may be related 
to the order in which directories are created.


Anyway, the first case seems to be what was intended in the first place. 
Since config sets all the values, having it in first position makes env 
useless. Besides, in the first configuration, env only needs to set 
OPAL_PREFIX, and since config sets all other paths relative to 
${prefix}, it works.


So, how could we solve this ?

1. Make autogen/configure/whatever generate a consistent 
static-components.h with env THEN config ;


2. Add priorities to these components so that env is opened first 
regardless of its position in the static components array ;


3. Any other idea ?

Sylvain

On Fri, 19 Jun 2009, Sylvain Jeaugey wrote:


On Thu, 18 Jun 2009, Jeff Squyres wrote:


On Jun 18, 2009, at 11:25 AM, Sylvain Jeaugey wrote:


My problem seems related to library generation through RPM, not with
1.3.2, nor the patch.



I'm not sure I understand -- is there something we need to fix in our SRPM?


I need to dig a bit, but here is the thing : I generated an RPM from the 
official openmpi-1.3.2-1.src.rpm (with some defines like install-in-opt, ...) 
and the OPAL_PREFIX trick doesn't seem to work with it.


But don't take too much time on this, I'll find out why and maybe this is 
just me building it the wrong way.


Sylvain




Re: [OMPI devel] Possible memory leak

2010-09-01 Thread Sylvain Jeaugey

Hi ananda,

I didn't try to run your program, but this seems logical to me.

The problem with calling MPI_Bcast repeatedly is that you may get an 
unbounded desynchronization between the sender and the receiver(s). 
MPI_Bcast is a unidirectional operation. It does not necessarily block 
until the receiver(s) get the message, hence causing a huge number of 
messages to be buffered (and in the case of ft-enable-cr, I guess 
everything is saved until an operation going the other way is done).


To "solve" this issue, the sync collective component has been created to 
perform a barrier every N operations. So, running with -mca coll 
basic,sync should make the problem disappear. I don't think it is really a 
memory leak, the memory used is needed (in case of fault) and should be 
freed at the next operation going the other way (reduce, barrier, 
recv/send).
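
For reference, the effect of the sync component can also be pictured as barriers 
inserted by hand in the application itself; a minimal generic MPI sketch (not the 
attached reproducer, and 100 is an arbitrary period chosen for this example):

#include <mpi.h>

/* Hand-rolled version of what the coll/sync component does automatically:
 * flush the one-way flow of broadcasts with a barrier every N operations. */
int main(int argc, char **argv)
{
    int rank, i, buf = 0;
    const int N = 100;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < 100000; i++) {
        if (0 == rank) buf = i;
        MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (0 == (i + 1) % N) {
            MPI_Barrier(MPI_COMM_WORLD);  /* bounds how far the root can run ahead */
        }
    }
    MPI_Finalize();
    return 0;
}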


This seems to be the classical problem of MPI_Bcast benchmarks. Real 
applications usually don't suffer this kind of problems.


Sylvain

On Tue, 31 Aug 2010, ananda.mu...@wipro.com wrote:


Hi

When I run the attached program with the following arguments, the size of MPI 
processes keeps increasing alarmingly (I saw that the size grew from 18M to 12G 
in under 10 minutes) making me suspect that there is a major memory leak:

mpirun -am ft-enable-cr --mca coll basic -np 2 

If I run this program without checkpoint control ie; remove "-am ft-enable-cr", 
 the size of MPI processes stays constant.

Also if I run this program without setting "--mca coll basic", the size of the 
MPI processes stays constant.

I set the mca parameter to "--mca coll basic" during my debugging attempts of 
the problem related to checkpointing the program that has repetitive MPI_Bcast() calls.

FYI, I am using OpenMPI v1.4.2 with BLCR 0.8.2

Thanks
Ananda

Ananda B Mudar, PMP
Senior Technical Architect
Wipro Technologies
Ph: 972 765 8093




Re: [OMPI devel] v1.5: sigsegv in case of extremely low settings in theSRQs

2010-06-23 Thread Sylvain Jeaugey

On Wed, 23 Jun 2010, Jeff Squyres wrote:


BTW, are you guys waiting for us to commit that, or do we ever give you guys 
SVN commit access?

Nadia is off today. She should commit it tomorrow.

Sylvain


Re: [OMPI devel] v1.5: sigsegv in case of extremely low settings in theSRQs

2010-06-23 Thread Sylvain Jeaugey

Hi Jeff,

Why do we want to set this value so low ? Well, just to see if it crashes 
:-)


More seriously, we're working on lowering the memory usage of the openib 
BTL, which is achieved at most by having only 1 send queue element (at 
very large scale, send queues prevail).


This "extreme" configuration used to work with the 1.3/1.4 branches but 
failed on 1.5.


Note that with recent IB cards having very high issue rates, I don't know if we 
are often waiting for the send queue to be empty. More importantly, it 
often prevents the remote receive queue from being filled too quickly (which 
prevents RNR NACKs, threads refilling the SRQ, ...). We didn't notice 
major performance drops with this configuration.


Sylvain

On Tue, 22 Jun 2010, Jeff Squyres wrote:


I think your fix looks right.

But I'm getting my head warped trying to understand why you'd want 
numbers so low (4, 2, 1) and exactly what our algorithm will re-post for 
numbers that low, etc.  Why do you want them so low?



On Jun 18, 2010, at 11:10 AM, nadia.derbey wrote:


Hi,

Reference is the v1.5 branch

If an SRQ has the following settings: S,,4,2,1

1) setup_qps() sets the following:
mca_btl_openib_component.qp_infos[qp].u.srq_qp.rd_num=4
mca_btl_openib_component.qp_infos[qp].u.srq_qp.rd_init=rd_num/4=1

2) create_srq() sets the following:
openib_btl->qps[qp].u.srq_qp.rd_curr_num = 1 (rd_init value)
openib_btl->qps[qp].u.srq_qp.rd_low_local = rd_curr_num - (rd_curr_num >> 2) = rd_curr_num = 1


3) if mca_btl_openib_post_srr() is called with rd_posted=1:
rd_posted > rd_low_local is false
num_post=rd_curr_num-rd_posted=0
the loop is not executed
wr is never initialized (remains NULL)
wr->next: address not mapped
 ==> SIGSEGV

The attached patch solves the problem by ensuring that we'll actually
enter the loop and leave otherwise.
Can someone have a look please: the patch solves the problem with my
reproducer, but I'm not sure the fix covers all the situations.
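
Since the attachment is not reproduced here, a standalone toy program showing the 
arithmetic above and the kind of early-return guard being described (illustration 
only, not the actual openib BTL code nor the attached patch):

#include <stdio.h>

int main(void)
{
    /* Values from the trace above: rd_num = 4, rd_init = rd_num / 4. */
    int rd_num = 4;
    int rd_curr_num = rd_num / 4;     /* 1 */
    int rd_low_local = rd_curr_num;   /* ends up equal to rd_curr_num, i.e. 1 */
    int rd_posted = 1;
    int num_post;

    if (rd_posted <= rd_low_local) {            /* "rd_posted > rd_low_local is false" */
        num_post = rd_curr_num - rd_posted;     /* 0 */
        if (num_post <= 0) {
            /* Guard: nothing to post, so return before the posting loop; without it
             * the work-request pointer stays NULL and later gets dereferenced. */
            printf("num_post = %d: return early\n", num_post);
            return 0;
        }
        printf("would post %d receive descriptors\n", num_post);
    }
    return 0;
}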

Regards,
Nadia

<001_openib_low_rd_num.patch>



--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/






Re: [OMPI devel] System V Shared Memory for Open MPI: Request forCommunity Input and Testing

2010-06-11 Thread Sylvain Jeaugey

On Fri, 11 Jun 2010, Jeff Squyres wrote:


On Jun 11, 2010, at 5:43 AM, Paul H. Hargrove wrote:


Interesting. Do you think this behavior of the linux kernel would
change if the file was unlink()ed after attach ?
After a little talk with kernel guys, it seems that unlinking wouldn't 
change anything performance-wise (it would just prevent cleaning issues).


Sylvain


Re: [OMPI devel] System V Shared Memory for Open MPI: Request forCommunity Input and Testing

2010-06-10 Thread Sylvain Jeaugey

On Thu, 10 Jun 2010, Jeff Squyres wrote:

Sam -- if the shmat stuff fails because the limits are too low, it'll 
(silently) fall back to the mmap module, right?
From my experience, it completely disabled the sm component. Having a nice 
fallback would indeed be a very good thing.

Sylvain


Re: [OMPI devel] System V Shared Memory for Open MPI: Request forCommunity Input and Testing

2010-06-10 Thread Sylvain Jeaugey

On Thu, 10 Jun 2010, Paul H. Hargrove wrote:

One should not ignore the option of POSIX shared memory: shm_open() and 
shm_unlink().  When present this mechanism usually does not suffer from 
the small (eg 32MB) limits of SysV, and uses a "filename" (in an 
abstract namespace) which can portably be up to 14 characters in length. 
Because shm_unlink() may be called as soon as the final process has done 
its shm_open() one can get approximately the safety of the IPC_RMID 
mechanism, but w/o being restricted to Linux.


I have used POSIX shared memory for another project and found it works 
well on Linux, Solaris (10 and Open), FreeBSD and AIX.  That is probably 
narrower coverage than SysV, but still worth consideration IMHO.
I was just doing research on shm_open() to ensure it had no limitation 
before introducing it in this thread. You saved me some time !
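
For readers unfamiliar with the API, here is a minimal single-process sketch of the 
open/map/unlink sequence Paul describes (arbitrary name and size chosen for this 
example; on older glibc, link with -lrt):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *name = "/sm_example";   /* short, portable name in the abstract namespace */
    size_t size = 1 << 20;              /* 1 MB segment for the example */

    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, (off_t)size) != 0) { perror("ftruncate"); return 1; }

    void *seg = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (MAP_FAILED == seg) { perror("mmap"); return 1; }
    close(fd);

    /* Once every participant has attached, the name can be removed: the mapping
     * stays valid, and nothing is left behind if a process dies afterwards. */
    shm_unlink(name);

    memset(seg, 0, size);
    printf("shared segment mapped at %p\n", seg);
    munmap(seg, size);
    return 0;
}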


With mmap(), SysV and POSIX (plus XPMEM on the SGI Altix) as mechanisms 
for sharing memory between processes, I think we have an argument for a 
full-blown "shared pages" framework as opposed to just a "mpi_common_sm" 
MCA parameter.  That brings all the benefits like possibly "failing 
over" from one component to another (otherwise less desired) one if some 
limit is exceeded.  For instance, SysV could (for a given set of 
priorities) be used by default, but mmap-on-real-fs could be 
automatically selected when the requested/required size exceeds the 
shmmax value.

Would be indeed nice.

As for why mmap is slower.  When the file is on a real (not tmpfs or other 
ramdisk) I am 95% certain that this is an artifact of the Linux swapper/pager 
behavior which is thinking it is being smart by "swapping ahead".  Even when 
there is no memory pressure that requires swapping, Linux starts queuing swap 
I/O for pages to keep the number of "clean" pages up when possible. This 
results in pages of the shared memory file being written out to the actual 
block device.  Both the background I/O and the VM metadata updates contribute 
to the lost time.  I say 95% certain because I have a colleague who looked 
into this phenomenon in another setting and I am recounting what he reported 
as clearly as I can remember, but might have misunderstood or inserted my own 
speculation by accident.  A sufficiently motivated investigator (not me) 
could probably devise an experiment to verify this.
Interesting. Do you think this behavior of the linux kernel would change 
if the file was unlink()ed after attach ?


Sylvain


Re: [OMPI devel] System V Shared Memory for Open MPI: Request forCommunity Input and Testing

2010-06-10 Thread Sylvain Jeaugey

On Wed, 9 Jun 2010, Jeff Squyres wrote:


On Jun 9, 2010, at 3:26 PM, Samuel K. Gutierrez wrote:


System V shared memory cleanup is a concern only if a process dies in
between shmat and shmctl IPC_RMID.  Shared memory segment cleanup
should happen automagically in most cases, including abnormal process
termination.


Umm... right.  Duh.  I knew that.

Really.

So -- we're good!

Let's open the discussion of making sysv the default on systems that support 
the IPC_RMID behavior (which, AFAIK, is only Linux)...

I'm sorry, but I think System V has many disadvantages over mmap.

1. As discussed before, cleaning is not as easy as for a file. It is a 
good thing to remove the shm segment after creation, but since problems 
often happen during shmget/shmat, there's still a high risk of leaving 
things behind.


2. There are limits in the kernel you need to grow (kernel.shmall, 
kernel.shmmax). On most Linux distributions, shmmax is 32MB, which does 
not permit the sysv mechanism to work. Mmapped files are unlimited.


3. Each shm segment is identified by a 32-bit integer. This namespace is 
small (and non-intuitive, as opposed to a file name), and the probability 
of a collision is not zero, especially when you start creating multiple 
shared memory segments (for collectives, one-sided operations, ...).


So, I'm a bit reluctant to work with System V mechanisms again. I don't 
think there is a *real* reason for System V to be faster than mmap, since 
it should just be memory. I'd rather find out why mmap is slower.
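
For completeness, a toy sketch of the SysV sequence the points above refer to, 
showing both the cleanup window of point 1 and where the shmmax limit of point 2 
bites (arbitrary key and size, not OMPI code):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    size_t size = 1 << 20;   /* try a much larger value to see kernel.shmmax kick in */
    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id < 0) {
        fprintf(stderr, "shmget failed: %s (check kernel.shmmax/shmall)\n", strerror(errno));
        return 1;
    }
    void *seg = shmat(id, NULL, 0);
    if ((void *)-1 == seg) { perror("shmat"); shmctl(id, IPC_RMID, NULL); return 1; }

    /* Mark for removal as soon as we are attached: on Linux the segment survives
     * until the last detach, and is reclaimed even if we crash from here on.
     * A crash between shmget and this call is the window that leaves segments behind. */
    shmctl(id, IPC_RMID, NULL);

    memset(seg, 0, size);
    shmdt(seg);
    return 0;
}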


Sylvain


Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Sylvain Jeaugey

On Wed, 2 Jun 2010, Jeff Squyres wrote:


Don't you mean return NULL?  This function is supposed to return a (struct 
ibv_cq *).
Oops. My bad. Yes, it should return NULL. And it seems that if I make 
ibv_create_cq always return NULL, the scenario described by George works 
smoothly : returned OMPI_ERROR => bitmask cleared => connectivity problem 
=> stop or tcp fallback. The problem is more complicated than I thought.


But it helped me make progress on why I'm crashing: in my case, only a subset of 
processes have their create_cq fail. But others work fine, hence they 
request a qp creation, and my process which failed over on tcp starts 
creating a qp ... and crashes.


If you replace :
return NULL;
by :
if (atoi(getenv("OMPI_COMM_WORLD_RANK")) == 26)
return NULL;
(yes, that's ugly, but it's just to debug the problem) and run on -say- 32 
processes, you should be able to reproduce the bug. Well, unless I'm 
mistaken again.


The crash stack should look like this :
#0  0x003d0d605a30 in ibv_cmd_create_qp () from /usr/lib64/libibverbs.so.1
#1  0x7f28b44e049b in ibv_cmd_create_qp () from /usr/lib64/libmlx4-rdmav2.so
#2  0x003d0d609a42 in ibv_create_qp () from /usr/lib64/libibverbs.so.1
#3  0x7f28b6be6e6e in qp_create_one () from 
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_btl_openib.so
#4  0x7f28b6be78a4 in oob_module_start_connect () from 
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_btl_openib.so
#5  0x7f28b6be7fbb in rml_recv_cb () from 
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_btl_openib.so
#6  0x7f28b8c56868 in orte_rml_recv_msg_callback () from 
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_rml_oob.so
#7  0x7f28b8a4cf96 in mca_oob_tcp_msg_recv_complete () from 
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_oob_tcp.so
#8  0x7f28b8a4e2c2 in mca_oob_tcp_peer_recv_handler () from 
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/openmpi/mca_oob_tcp.so
#9  0x7f28b9496898 in opal_event_base_loop () from 
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/libopen-pal.so.0
#10 0x7f28b948ace9 in opal_progress () from 
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/libopen-pal.so.0
#11 0x7f28b9951ed5 in ompi_request_default_wait_all () from 
/home_nfs/jeaugeys/DISTS/openmpi-1.4.2/lib/libmpi.so.0

This new advance may change everything. Of course, stopping at the bml 
level still "solves" the problem, but maybe we can fix this more properly 
within the openib BTL. Unless this is a general 
out-of-band-connection-protocol issue.


Sylvain



Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Sylvain Jeaugey

On Tue, 1 Jun 2010, Jeff Squyres wrote:


On May 31, 2010, at 5:10 AM, Sylvain Jeaugey wrote:


In my case, the error happens in :
   mca_btl_openib_add_procs()
 mca_btl_openib_size_queues()
   adjust_cq()
 ibv_create_cq_compat()
   ibv_create_cq()


Can you nail this down any further?  If I modify adjust_cq() to always 
return OMPI_ERROR, I see the openib BTL fail over properly to the TCP 
BTL.
It must be because create_cq actually creates cqs. Try to apply this 
patch which makes create_cq_compat() *not* create the CQs and return an 
error instead :


diff -r 13df81d1d862 ompi/mca/btl/openib/btl_openib.c
--- a/ompi/mca/btl/openib/btl_openib.c  Fri May 28 14:50:25 2010 +0200
+++ b/ompi/mca/btl/openib/btl_openib.c  Wed Jun 02 10:56:57 2010 +0200
@@ -146,6 +146,7 @@
 int cqe, void *cq_context, struct ibv_comp_channel *channel,
 int comp_vector)
 {
+return OMPI_ERROR;
 #if OMPI_IBV_CREATE_CQ_ARGS == 3
 return ibv_create_cq(context, cqe, channel);
 #else


You should see MPI_Init complete nicely and your application segfault on 
the next MPI operation.


Sylvain


Re: [OMPI devel] BTL add procs errors

2010-06-02 Thread Sylvain Jeaugey

Couldn't explain it better. Thanks Jeff for the summary !

On Tue, 1 Jun 2010, Jeff Squyres wrote:


On May 31, 2010, at 10:27 AM, Ralph Castain wrote:

Just curious - your proposed fix sounds exactly like what was done in 
the OPAL SOS work. Are you therefore proposing to use SOS to provide a 
more informational status return?


No, I think Sylvain's talking about slightly modifying the existing 
mechanism:


1. Return OMPI_SUCCESS: bml then obeys whatever is in the connectivity 
bitmask -- even if the bitmask indicates that this BTL can't talk to 
anyone.


2. Return != OMPI_SUCCESS: treat the problem as a fatal error.

I think Sylvain's point is that OMPI_SUCCESS can be returned for 
non-fatal errors if a BTL just wants to be ignored.  In such cases, the 
BTL can just set its connectivity mask to 0. This will allow OMPI to 
continue the job.


For example, if verbs is borked (e.g., can't create CQ's), it can return 
a connectivity mask of 0 and OMPI_SUCCESS.  The BML is then free to fail 
over to some other BTL.


But if a malloc() fails down in some BTL, then the job is hosed anyway 
-- so why not return != OMPI_SUCCESS and try to abort cleanly?


For sites that want to treat verbs failures as fatal, we can add a new 
MCA param either in the openib BTL that says "treat all init failures as 
fatal to the job" or perhaps a new MCA param in R2 that says "if the 
connectivity map for BTL  is empty, abort the job".  Or something 
like that.


If so, then it would seem the only real dispute here is: is there -any- 
condition whereby a given BTL should have the authority to tell OMPI to 
terminate an application, even if other BTLs could still function?


I think his cited example was if malloc() fails.

I could see some sites wanting to abort if their high-speed network was 
down (e.g., MX or openib BTLs failed to init) -- they wouldn't want OMPI 
to fail over to TCP.  The flip side of this argument is that the 
sysadmin could set "btl = ^tcp" in the system file, and then if 
openib/mx fails, the BML will abort because some peers won't be 
reachable.


I understand that the current code may not yet support that operation, 
but I do believe that was the intent of the design. So only the case 
where -all- BTLs say "I can't do it" would result in termination. 
Rather than change that design, I believe the intent is to work towards 
completing that implementation. In the interim, it would seem most 
sensible to me that we add an MCA param that specifies the termination 
behavior (i.e., attempt to continue or terminate on first fatal BTL 
error).


Agreed.

I think that there are multiple different exit conditions from a BTL 
init:


1. BTL succeeded in initializing, and some peers are reachable
2. BTL succeeded in initializing, and no peers are reachable
3. BTL failed to initialize, but that failure is localized to the BTL (e.g., openib failed to create a CQ)
4. BTL failed to initialize, and the error is global in nature (e.g., malloc() fail)


I think it might be a site-specific decision as to whether to abort the 
job for condition 3 or not.  Today we default to not failing and pair 
that with an indirect method of failing (i.e., setting btl=^tcp).
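
A toy program illustrating the contract summarized above (made-up transport names, 
not the bml/r2 code): an error return is treated as fatal, while success with an 
empty bitmask just means the transport is skipped:

#include <stdio.h>

struct btl { const char *name; int (*add_procs)(unsigned char *reachable, int n); };

static int healthy(unsigned char *r, int n)      { for (int i = 0; i < n; i++) r[i] = 1; return 0; }
static int cannot_reach(unsigned char *r, int n) { for (int i = 0; i < n; i++) r[i] = 0; return 0; }
static int hard_failure(unsigned char *r, int n) { (void)r; (void)n; return -1; }

int main(void)
{
    struct btl btls[] = { {"tcp", healthy}, {"foo", cannot_reach}, {"bar", hard_failure} };
    unsigned char reachable[4];

    for (int b = 0; b < 3; b++) {
        int rc = btls[b].add_procs(reachable, 4);
        if (rc != 0) {
            printf("%s: fatal error, abort the job cleanly\n", btls[b].name);
            continue;   /* a real run would stop here under the proposed semantics */
        }
        int any = 0;
        for (int i = 0; i < 4; i++) any |= reachable[i];
        printf("%s: ok, %s\n", btls[b].name,
               any ? "use its reachability bitmask" : "reaches nobody, just skip it");
    }
    return 0;
}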


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/






Re: [OMPI devel] BTL add procs errors

2010-05-31 Thread Sylvain Jeaugey

In my case, the error happens in :
  mca_btl_openib_add_procs()
mca_btl_openib_size_queues()
  adjust_cq()
ibv_create_cq_compat()
  ibv_create_cq()

ibv_create_cq() returns an error which goes up to 
mca_btl_openib_add_procs(). As George mentioned, the openib btl should be 
completely ignored, since the bitmask is not taken into account when an 
error is returned. However -I don't know why- openib gets called again and 
crashes.


So, yes, there must be a bug in openib.

And I know this is how you guys designed the bml layer. But I was hoping 
we could improve the design to improve error handling.


So, this is my last try to explain my opinion. If you disagree, then we'll 
fix this on the openib side.


Ignoring BTL errors bugs me because the current errors are all serious. 
Trying to continue will therefore always lead to a crash (George, you 
introduced an error return code, not a real error, hence you managed to 
continue). This confuses the user as to why we have a problem, because the 
first serious error will be flooded by further errors or crashes. This is 
true for openib, but also for sm (I would like to stop on the first 
"malloc()" that fails).


We have a two-level system (bitmask + return code) we could use to handle 
non severe errors (bitmask) and severe errors (return code). Currently, we 
just use the return code as a way to ignore the bitmask, but we could use 
the return code as a more serious message and thus improve our error 
management.


To sum up, my proposition is to change the meaning of an error return code 
in add_procs() from "I got a problem, continue without me" which can be 
perfectly handled with the bitmask alone, to "I got a fatal error, please 
stop the application".


I know this can be seen as an attempt to prevent fixing a bug in openib by 
changing the design of the BML, but in this case, I think changing the BML 
design would improve the overall behavior.


Sylvain

On Fri, 28 May 2010, Jeff Squyres wrote:

To that point, where exactly in the openib BTL init / query sequence is 
it returning an error for you, Sylvain?  Is it just a matter of tidying 
something up properly before returning the error?



On May 28, 2010, at 2:21 PM, George Bosilca wrote:


On May 28, 2010, at 10:03 , Sylvain Jeaugey wrote:


On Fri, 28 May 2010, Jeff Squyres wrote:


On May 28, 2010, at 9:32 AM, Jeff Squyres wrote:


Understood, and I agreed that the bug should be fixed.  Patches would be 
welcome.  :-)

I sent a patch on the bml layer in my first e-mail. We will apply it on our 
tree, but as always we're trying to send patches back to open-source (that was 
not my intent to start such a debate).


The only problem with your patch is that it solves something that is not 
supposed to happen. As a proof of concept I did return errors from the tcp and 
sm BTLs, and Open MPI gracefully dealt with them. So, it is not a matter of 
aborting we're looking at; it is a matter of the openib BTL doing something it is 
not supposed to do.

Going through the code it looks like the bitmask doesn't matter: if an error is 
returned by a BTL we zero the bitmask and continue to another BTL.

Example: the SM BTL returns OMPI_ERROR after creating all the internal 
structures.


mpirun -np 4 --host node01 --mca btl sm,self ./ring


--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[22047,1],3]) is on host: node01
  Process 2 ([[22047,1],0]) is on host: node01
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--

Now if I allow TCP on the node:

mpirun -np 4 --host node01 --mca btl sm,self,tcp ./ring


Process 0 sending 10 to 1, tag 201 (4 procs in ring)
Process 0 sent to 1
Process 3 exiting
Process 0 decremented num: 9
Process 0 decremented num: 8

Thus, Open MPI does the right thing when the BTLs are playing the game.

  george.




I should clarify rather than being flip:

1. I agree: the bug should be fixed.  Clearly, we should never crash.

2. After the bug is fixed, there is clearly a choice: some people may want to 
use a different transport if a given BTL is unavailable. Others may want to 
abort.  Once the bug is fixed, this seems like a pretty straightforward thing 
to add.

If you use my patch, you still have no choice. Errors on BTLs lead to an 
immediate stop instead of trying to continue (and crash).
If someone wants to go further on this, then that's great. If nobody does, I 
think you should take my patch. Maybe it's not the best solution, but it's still better than the current state.

Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Sylvain Jeaugey

On Fri, 28 May 2010, Jeff Squyres wrote:


On May 28, 2010, at 9:32 AM, Jeff Squyres wrote:

Understood, and I agreed that the bug should be fixed.  Patches would 
be welcome.  :-)
I sent a patch on the bml layer in my first e-mail. We will apply it on 
our tree, but as always we're trying to send patches back to open-source 
(that was not my intent to start such a debate).



I should clarify rather than being flip:

1. I agree: the bug should be fixed.  Clearly, we should never crash.

2. After the bug is fixed, there is clearly a choice: some people may 
want to use a different transport if a given BTL is unavailable. 
Others may want to abort.  Once the bug is fixed, this seems like a 
pretty straightforward thing to add.
If you use my patch, you still have no choice. Errors on BTLs lead to an 
immediate stop instead of trying to continue (and crash).


If someone wants to go further on this, then that's great. If nobody does, 
I think you should take my patch. Maybe it's not the best solution, but 
it's still better than the current state.


Sylvain


Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Sylvain Jeaugey

On Fri, 28 May 2010, Jeff Squyres wrote:

Herein lies the quandary: we don't/can't know the user or sysadmin 
intent.  They may not care if the IB is borked -- they might just want 
the job to fall over to TCP and continue.  But they may care a lot if IB 
is borked -- they might want the job to abort (because it would be too 
slow over TCP).
There is no intent nor choice : Open MPI today always crashes on such an 
error. The thing is, we crash at the wrong place, which is why I'd like it 
to stop on the real error rather than trying to continue and hide the real 
problem within a ton of error traces.


Frankly, I don't know how to be clearer. The discussion started on a bug 
and you moved it to a nice-feature-we-would-like-to-have.


So please, fix the bug first, then if you want that "automatic failover to 
TCP" feature, develop it. Put a parameter for an eventual notification, or 
abort, or whatever you want. But it doesn't exist today. It just doesn't 
work, with any BTL. Errors reported by BTLs are all fatal.


Sylvain



Re: [OMPI devel] BTL add procs errors

2010-05-28 Thread Sylvain Jeaugey

On Thu, 27 May 2010, Jeff Squyres wrote:


On May 27, 2010, at 10:32 AM, Sylvain Jeaugey wrote:


That's pretty much my first proposition : abort when an error arises,
because if we don't, we'll crash soon afterwards. That's my original
concern and this should really be fixed.

Now, if you want to fix the openib BTL so that an error in IB results in
an elegant fallback on TCP (elegant = notified ;-)), then hooray.


You're specifically referring to the point where the openib btl sets the 
reachable bit, but then later decides "oops, an error occurred, so 
return !=OMPI_SUCCESS" -- and assume that the openib btl is not called 
again.


Right?

Perfectly right.

If so, then yes, that's a bug.  The openib btl should be fixed to unset 
the reachable bit(s) that it just set before returning the error.


Or the BML could assume that !=OMPI_SUCCESS codes means that the 
reachable bits it got back were invalid and should be ignored.


I'd lead towards the former.

Can you file and bug and submit a patch?

I'd like to (though I don't have an svn account), but some things
bother me.

Having errors on add_procs stop the application seems a good thing in all 
cases, so why not do it ? That would solve my original problem which led 
to this discussion.


Yes, the openib BTL may be suboptimal (stopping the application instead of 
nicely returning), but I'm fine with that, so I'm not very inclined to 
spend time on this.


Sylvain


Re: [OMPI devel] BTL add procs errors

2010-05-27 Thread Sylvain Jeaugey
That's pretty much my first proposition : abort when an error arises, 
because if we don't, we'll crash soon afterwards. That's my original 
concern and this should really be fixed.


Now, if you want to fix the openib BTL so that an error in IB results in 
an elegant fallback on TCP (elegant = notified ;-)), then hooray.


Sylvain

On Thu, 27 May 2010, Barrett, Brian W wrote:


Sylvain -

I have to agree with Ralph.  The situation you describe (IB failing) may 
or may not be what the user wants.  And, in fact, we will print a 
warning message to the user that such a situation (falling back to TCP) 
has occurred.  However, it also does not fall under the category of 
"fail the job" bad according to the design goals of Open MPI -- we 
explicitly wanted to allow easy fallback to another interconnect when 
something bad happens.  The bml and pml have the opportunity to 
determine if they like the BTL choices made, and this is the right level 
to make that decision.  Lower layer transports should not be 
implementing high level policy.  So I go back to: if the job can run to 
completion (even if slower), add_procs should not return an error.  If 
add_procs does return an error, the job should abort.


Brian

--
 Brian W. Barrett
 Scalable System Software Group
 Sandia National Laboratories

From: devel-boun...@open-mpi.org [devel-boun...@open-mpi.org] On Behalf Of 
Sylvain Jeaugey [sylvain.jeau...@bull.net]
Sent: Thursday, May 27, 2010 1:47 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] BTL add procs errors

I don't think what the openib BTL is doing is that bad. It is returning an
error because something really went bad in IB. So yes, it could blank the
bitmask and return success, but would you really want IB to fail and
fall back on TCP once in a while without any notice ? I wouldn't.

So, as it seems that all "normal" problems can be handled through the
reachable bitmask, it seems a good idea to me that BTLs returning errors
make the application stop.

Sylvain

On Wed, 26 May 2010, Barrett, Brian W wrote:


George -

I'm not sure I agree - the return code should indicate a failure beyond
"something prohibited me from talking to the remote side" - something
occurred that resulted in it being highly unlikely the app can
successfully run to completion (such as malloc failing).  On the other
hand, I also think that the OpenIB BTL is probably doing the wrong thing
- I can't imagine that the error returned reaches that state of badness,
and it should probably zero out the bitmask and quietly return rather
than try to cause the app to abort.

Just my $0.02.

Brian


On May 25, 2010, at 12:27 PM, George Bosilca wrote:


The BTLs are allowed to fail adding procs without major consequences in
the short term. As you noticed each BTL returns a bit mask array
containing all procs reachable through this particular instance of the
BTL. Later (in the same file line 395) we check for the complete
coverage for all procs, and only complain if one of the peers is
unreachable.

If you replace the continue statement by a return, we will never give a
chance to the other BTLs and we will complain about lack of
connectivity as soon as one BTL fails (for some reason). Not to mention
that all the eager, send and rdma endpoint arrays will not be built.

 george.

On May 25, 2010, at 05:10 , Sylvain Jeaugey wrote:


Hi,

I'm currently trying to have Open MPI exit more gracefully when a BTL returns an error 
during the "add procs" phase.

The current bml/r2 code silently ignores btl->add_procs() error codes with the 
following comment :
 ompi/mca/bml/r2/bml_r2.c:208 
/* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
 * can take care of this task. */
continue;
--

This seems wrong to me: either a proc is reached (the "reachable" bit field is 
therefore updated), or it is not (and nothing is done). Any error code should denote 
a fatal error needing a clean abort.

In the current openib btl code, the "reachable" bit is set but an error is 
returned - then ignored by r2. The next call to the openib BTL results in a segmentation 
fault.

So, maybe this simple fix would do the trick :

diff -r 96e0793d7885 ompi/mca/bml/r2/bml_r2.c
--- a/ompi/mca/bml/r2/bml_r2.c  Wed May 19 14:35:27 2010 +0200
+++ b/ompi/mca/bml/r2/bml_r2.c  Tue May 25 10:54:19 2010 +0200
@@ -210,7 +210,7 @@
   /* This BTL has troubles adding the nodes. Let's continue maybe some 
other BTL
* can take care of this task.
*/
-continue;
+return rc;
   }

   /* for each proc that is reachable */


Does anyone see a case (with a specific btl) where add_procs returns an error but we still want to continue ?

Re: [OMPI devel] BTL add procs errors

2010-05-27 Thread Sylvain Jeaugey
I don't think what the openib BTL is doing is that bad. It is returning an 
error because something really went bad in IB. So yes, it could blank the 
bitmask and return success, but would you really want IB to fail and 
fall back on TCP once in a while without any notice ? I wouldn't.


So, as it seems that all "normal" problems can be handled through the 
reachable bitmask, it seems a good idea to me that BTLs returning errors 
make the application stop.

Sylvain

On Wed, 26 May 2010, Barrett, Brian W wrote:


George -

I'm not sure I agree - the return code should indicate a failure beyond 
"something prohibited me from talking to the remote side" - something 
occurred that resulted in it being highly unlikely the app can 
successfully run to completion (such as malloc failing).  On the other 
hand, I also think that the OpenIB BTL is probably doing the wrong thing 
- I can't imagine that the error returned reaches that state of badness, 
and it should probably zero out the bitmask and quietly return rather 
than try to cause the app to abort.


Just my $0.02.

Brian


On May 25, 2010, at 12:27 PM, George Bosilca wrote:

The BTLs are allowed to fail adding procs without major consequences in 
the short term. As you noticed each BTL returns a bit mask array 
containing all procs reachable through this particular instance of the 
BTL. Later (in the same file line 395) we check for the complete 
coverage for all procs, and only complain if one of the peers is 
unreachable.


If you replace the continue statement by a return, we will never give a 
chance to the other BTLs and we will complain about lack of 
connectivity as soon as one BTL fails (for some reason). Not to mention 
that all the eager, send and rdma endpoint arrays will not be built.


 george.

On May 25, 2010, at 05:10 , Sylvain Jeaugey wrote:


Hi,

I'm currently trying to have Open MPI exit more gracefully when a BTL returns an error 
during the "add procs" phase.

The current bml/r2 code silently ignores btl->add_procs() error codes with the 
following comment :
 ompi/mca/bml/r2/bml_r2.c:208 
/* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
 * can take care of this task. */
continue;
--

This seems wrong to me: either a proc is reached (the "reachable" bit field is 
therefore updated), or it is not (and nothing is done). Any error code should denote 
a fatal error needing a clean abort.

In the current openib btl code, the "reachable" bit is set but an error is 
returned - then ignored by r2. The next call to the openib BTL results in a segmentation 
fault.

So, maybe this simple fix would do the trick :

diff -r 96e0793d7885 ompi/mca/bml/r2/bml_r2.c
--- a/ompi/mca/bml/r2/bml_r2.c  Wed May 19 14:35:27 2010 +0200
+++ b/ompi/mca/bml/r2/bml_r2.c  Tue May 25 10:54:19 2010 +0200
@@ -210,7 +210,7 @@
   /* This BTL has troubles adding the nodes. Let's continue maybe some 
other BTL
* can take care of this task.
*/
-continue;
+return rc;
   }

   /* for each proc that is reachable */


Does anyone see a case (with a specific btl) where add_procs returns an error 
but we still want to continue ?

Sylvain






--
 Brian W. Barrett
 Dept. 1423: Scalable System Software
 Sandia National Laboratories









[OMPI devel] BTL add procs errors

2010-05-25 Thread Sylvain Jeaugey

Hi,

I'm currently trying to have Open MPI exit more gracefully when a BTL 
returns an error during the "add procs" phase.


The current bml/r2 code silently ignores btl->add_procs() error codes with 
the following comment :

 ompi/mca/bml/r2/bml_r2.c:208 
  /* This BTL has troubles adding the nodes. Let's continue maybe some other BTL
   * can take care of this task. */
  continue;
--

This seems wrong to me: either a proc is reached (the "reachable" bit 
field is therefore updated), or it is not (and nothing is done). Any 
error code should denote a fatal error needing a clean abort.


In the current openib btl code, the "reachable" bit is set but an error is 
returned - then ignored by r2. The next call to the openib BTL results in 
a segmentation fault.


So, maybe this simple fix would do the trick :

diff -r 96e0793d7885 ompi/mca/bml/r2/bml_r2.c
--- a/ompi/mca/bml/r2/bml_r2.c  Wed May 19 14:35:27 2010 +0200
+++ b/ompi/mca/bml/r2/bml_r2.c  Tue May 25 10:54:19 2010 +0200
@@ -210,7 +210,7 @@
 /* This BTL has troubles adding the nodes. Let's continue maybe 
some other BTL
  * can take care of this task.
  */
-continue;
+return rc;
 }

 /* for each proc that is reachable */


Does anyone see a case (with a specific btl) where add_procs returns an 
error but we still want to continue ?


Sylvain


Re: [OMPI devel] Infiniband memory usage with XRC

2010-05-19 Thread Sylvain Jeaugey

On Mon, 17 May 2010, Pavel Shamis (Pasha) wrote:


Sylvain Jeaugey wrote:
The XRC protocol seems to create shared receive queues, which is a good 
thing. However, comparing memory used by an "X" queue versus and "S" 
queue, we can see a large difference. Digging a bit into the code, we 
found some

So, do you see that X consumes more than S ? This is really odd.

Yes, but that's what we see. At least after MPI_Init.

What is the difference (in Kb)?
At 32 nodes x 32 cores (1024 MPI processes), I get a difference of ~2300 
KB in favor of "S,65536,16,4,1" versus "X,65536,16,4,1".


The proposed patch doesn't seem to solve the problem, however: there's 
still something that's taking more memory than expected.


Sylvain


Re: [OMPI devel] Infiniband memory usage with XRC

2010-05-17 Thread Sylvain Jeaugey

Thanks Pasha for these details.

On Mon, 17 May 2010, Pavel Shamis (Pasha) wrote:

blocking is the receive queues, because they are created during MPI_Init, 
so in a way, they are the "basic fare" of MPI.
BTW SRQ resources are also allocated on demand. We start with a very small SRQ 
and it is increased on SRQ limit events.

Ok. Understood. So maybe the increased memory is only due to CQs.

The XRC protocol seems to create shared receive queues, which is a good 
thing. However, comparing memory used by an "X" queue versus an "S" queue, 
we can see a large difference. Digging a bit into the code, we found some

So, do you see that X consumes more than S ? This is really odd.

Yes, but that's what we see. At least after MPI_Init.

strange things, like the completion queue size not being the same as "S" 
queues (the patch below would fix it, but the root of the problem may be 
elsewhere).


Is anyone able to comment on this ?

The fix looks ok, please submit it to trunk.
I don't have an account to do this, so I'll let maintainers push it into 
SVN.


BTW do you want to prepare the patch for send queue size factor ? It should 
be quite simple.
Maybe we can do this. However, we are playing a little with parameters and 
code without really knowing the deep consequences of what we do. 
Therefore, I would feel more comfortable if someone who knows the openib 
btl well confirms it's not breaking everything.


Sylvain


Re: [OMPI devel] Thread safety levels

2010-05-10 Thread Sylvain Jeaugey

On Mon, 10 May 2010, N.M. Maclaren wrote:


As explained by Sylvain, current Open MPI implementation always returns
MPI_THREAD_SINGLE as provided thread level if neither --enable-mpi-threads
nor --enable-progress-threads was specified at configure (v1.4).


That is definitely the correct action.  Unless an application or library 
has been built with thread support, or can be guaranteed to be called only 
from a single thread, using threads is catastrophic.
I personally see that as a bug, but I certainly lack some knowledge of 
non-Linux OSes. From my point of view, any normal library should be 
THREAD_SERIALIZED, and a thread-safe library should be THREAD_MULTIPLE. I 
don't see other libraries that claim to be "totally incompatible with 
the use of threads". They may not be thread-safe, in which case the 
programmer must ensure locking and memory coherency to use them in 
conjunction with threads, but that's what THREAD_SERIALIZED is about 
IMO.



And, regrettably,
given modern approaches to building software and the  configure
design, configure is where the test has to go.
configure is where the test is. And configure indeed returns "We have 
threads" (OMPI_HAVE_THREADS = 1). And given this, I don't see why we 
wouldn't be MPI_THREAD_SERIALIZED. At least MPI_THREAD_FUNNELED.



On some systems, there are certain actions that require thread affinity
(sometimes including I/O, and often undocumented).  zOS is one, but I
have seen it under a few Unices, too.

On others, they use a completely different (and seriously incompatible,
at both the syntactic and semantic levels) set of libraries.  E.g. AIX.


If we use OpenMP with MPI, we need at least MPI_THREAD_FUNNELED even
if MPI functions are called only outside of omp parallel region,
like below.

   #pragma omp parallel for
   for (...) {
   /* computation */
   }
   MPI_Allreduce(...);
   #pragma omp parallel for
   for (...) {
   /* computation */
   }


I don't think that's correct.  That would call MPI_Allreduce once for
each thread, in parallel, on the same process - which wouldn't work.
I think the idea is precisely _not_ to call MPI_Allreduce within parallel 
sections, i.e. only have the master thread call MPI.
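
As a compilable counterpart to the pseudo-code quoted above, a minimal hybrid example 
where only the thread that initialized MPI makes MPI calls, which is what 
MPI_THREAD_FUNNELED promises (generic MPI + OpenMP code, nothing Open MPI specific):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, i, n = 1000000;
    double local = 0.0, global = 0.0;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "warning: MPI library only provides level %d\n", provided);
    }

    #pragma omp parallel for reduction(+:local)
    for (i = 0; i < n; i++) {
        local += 1.0 / (i + 1);       /* computation inside the parallel region */
    }

    /* MPI is called outside the parallel region, by the thread that called MPI_Init_thread */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}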



This means Open MPI users must specify --enable-mpi-threads or
--enable-progress-threads to use OpenMP. Is that true?
But these two configure options, i.e. the OMPI_HAVE_THREAD_SUPPORT macro,
lead to a performance penalty from mutex lock/unlock.


That's unavoidable, in general, with one niggle.  If the programmer
guarantees BOTH to call MPI on the global master thread AND to ensure
that all memory is synchronised before it does so, there is no need
for mutexes. The MPI specification lacks some of the necessary
paranoia in this respect.
In my understanding of MPI_THREAD_SERIALIZED, the memory coherency was 
guaranteed. If not, the programmer has to ensure it.



I believe OMPI_HAVE_THREADS (not OMPI_HAVE_THREAD_SUPPORT !) is sufficient
to support MPI_THREAD_FUNNELED and MPI_THREAD_SERIALIZED, and therefore
OMPI_HAVE_THREAD_SUPPORT should be OMPI_HAVE_THREADS at following
part in ompi_mpi_init function, as suggested by Sylvain.


I can't comment on that, though I doubt it's quite that simple.  There's
a big difference between MPI_THREAD_FUNNELED and MPI_THREAD_SERIALIZED
in implementation impact.
I don't see the relationship between THREAD_SERIALIZED/FUNNELED and 
OMPI_HAVE_THREAD_SUPPORT. Actually, OMPI_HAVE_THREAD_SUPPORT seems to have 
no relationship with how the OS supports threads (that's why I think it is 
misleading).


But I don't see a big difference between THREAD_SERIALIZED and 
THREAD_FUNNELED anyway. Do you have more information on systems where the 
caller thread id makes a difference in MPI ?


Just for the record, we (at Bull) patched our MPI library and had no 
problem so far with applications using MPI + Threads or MPI + OpenMP, 
given that they don't call MPI within parallel sections. But of course, we 
only use linux, so your mileage may vary.


Sylvain


[OMPI devel] RDMA with ob1 and openib

2010-04-27 Thread Sylvain Jeaugey

Hi list,

I'm currently working on IB bandwidth improvements and maybe some of you 
can help me understand a few things. I'm trying to align every IB RDMA 
operation to 64 bytes, because having it unaligned can hurt performance 
anywhere from slightly to very badly, depending on your architecture.


So, I'm trying to understand the RDMA protocol (PUT and GET), and here is 
what I understood :


* if we have one btl, RDMA is performed with only one GET operation, 
otherwise, we use multiple PUT operations. I can understand that the GET 
operation improves asynchronous aspects. So, why not always use GET 
operations ?


* if mpi_leave_pinned is 0, things get stranger. We start a 
rendez-vous (not RDMA) with a size equal to the eager limit, then we 
switch to RDMA because the remote peer asks for RDMA PUTs (even if 
btl_openib_flags does not have the PUT operation btw). Why this corner 
case ? Why not start a normal RDMA (especially since we switch back to 
RDMA afterwards) ?


* the openib btl has a "buffer alignment" parameter. Fantastic, just what 
I needed. Unfortunately, I can't see where it is used (and indeed 
performance is bad if my buffers are not aligned to 64 bytes). Am I 
missing something ?


* I did a prototype to split GET operations in openib into two operations 
: a small one to correct buffer alignment and a big aligned one. It would 
certainly be better to perform the first one with a normal send/recv, but 
for the prototype, doing it inside the openib GET was simpler. Performance 
on unaligned buffers is much better (but this is just a prototype). Is 
there anyone working on this right now or should I pursue my effort to 
make it clean and stable ?
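
To make the split concrete, here is a small standalone sketch of the arithmetic 
(a hypothetical helper written for this example, not the prototype code):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Length of the leading chunk to transfer separately so that the rest of the
 * buffer starts on an "align"-byte boundary (align must be a power of two). */
static size_t unaligned_prefix(const void *buf, size_t len, size_t align)
{
    size_t off = (size_t)((uintptr_t)buf & (align - 1));
    size_t prefix = off ? align - off : 0;
    return prefix < len ? prefix : len;     /* tiny buffers: everything is "prefix" */
}

int main(void)
{
    char buffer[256];
    const void *p = buffer + 3;             /* deliberately misaligned start */
    size_t pre = unaligned_prefix(p, 200, 64);
    printf("first GET/send: %zu bytes, aligned GET: %zu bytes\n", pre, 200 - pre);
    return 0;
}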


Thanks in advance for any feedback,
Sylvain


Re: [OMPI devel] RFC 1/1: improvements to the "notifier" framework and ORTE WDC

2010-03-30 Thread Sylvain Jeaugey

On Mon, 29 Mar 2010, Abhishek Kulkarni wrote:


#define ORTE_NOTIFIER_DEFINE_EVENT(eventstr, associated_text) {
 static int event = -1;
 if (OPAL_UNLIKELY(event == -1)) {
event = opal_sos_create_new_event(eventstr, associated_text);
 }
 ..
}

This is a good suggestion, but then I think we end up relying on run-time 
generation of the event numbers

Yes.

and have to pay the extra cost of looking up the event in a 
list/array/hash each time we log the event.
No. Of course not, that's the point of the "static int" here. The 
"create_new_event" function will be only called once ; the event is then 
stored and used directly whenever we enter this code again.


But yes, I'm adding an "if", which may cost a little more than just the 
counter increment.


From what I understand, and from the discussions that took place when this 
proposal was first put up on the devel list, since the event tracing 
hooks could lie in the critical path, we want the overhead to be as low as 
possible. By manually defining the unique identifiers, we can generate the 
event tracing macro at compile-time and have a minimal tracing impact.
Not in the critical path. And from my point on view not on error pathes 
too. I prefer to talk about some "slow path" : not critical, but slow.


Sylvain


On Mon, 29 Mar 2010, Ralph Castain wrote:


 Hi Abhishek
 I'm confused by the WDC wiki page, specifically the part about the
 new ORTE_NOTIFIER_DEFINE_EVENT macro. Are you saying
 that I (as the developer) have to provide this macro with a unique
 notifier id? So that would mean that ORTE/OMPI would
 have to maintain a global notifier id counter to ensure it is unique?

 If so, that seems really cumbersome. Could you please clarify?

 Thanks
 Ralph

 On Mar 29, 2010, at 8:57 AM, Abhishek Kulkarni wrote:


   ==
   [RFC 1/2]
   ==

   WHAT: Merge improvements to the "notifier" framework from the OPAL
   SOS
   and the ORTE WDC mercurial branches into the SVN trunk.

   WHY: Some improvements and interface changes were put into the ORTE
      notifier framework during the development of the OPAL SOS[1] and
      ORTE WDC[2] branches.

   WHERE: Mostly restricted to ORTE notifier files and files using the
    notifier interface in OMPI.

   TIMEOUT: The weekend of April 2-3.

   REFERENCE MERCURIAL REPOS:
   * SOS development: http://bitbucket.org/jsquyres/opal-sos-fixed/
   * WDC development: http://bitbucket.org/derbeyn/orte-wdc-fixed/

   ==

   BACKGROUND:

   The notifier interface and its components underwent a host of
   improvements and changes during the development of the SOS[1] and
   the
   WDC[2] branches.  The ORTE WDC (Warning Data Capture) branch 
enables

   accounting of events through the use of notifier interface, whereas
   OPAL SOS uses the notifier interface by setting up callbacks to
   relay
   out logged events.

   Some of the improvements include:

   - added more severity levels.
   - "ftb" notifier improvements.
   - "command" notifier improvements.
   - added "file" notifier component
   - changes in the notifier modules selection
   - activate only a subset of the callbacks
   (i.e. any combination of log, help, log_peer)
   - define different output media for any given callback (e.g.
   log_peer
   can be redirected to the syslog and smtp, while the show_help can 
be

   sent to the hnp).
   - ORTE_NOTIFIER_LOG_EVENT() (that accounts and warns about unusual
   events)

   Much more information is available on these two wiki pages:

   [1] http://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages
   [2] http://svn.open-mpi.org/trac/ompi/wiki/ORTEWDC

   NOTE: This is first of a two-part RFC to bring the SOS and WDC
   branches
   to the trunk. This only brings in the "notifier" changes from the
   SOS
   branch, while the rest of the branch will be brought over after the
   timeout of the second RFC.

   ==






Re: [OMPI devel] RFC 1/1: improvements to the "notifier" framework and ORTE WDC

2010-03-29 Thread Sylvain Jeaugey

Hi Ralph,

For now, I think that yes, this is a unique identifier. However, in my 
opinion, this could be improved in the future by replacing it with a unique 
string.


Something like :

#define ORTE_NOTIFIER_DEFINE_EVENT(eventstr, associated_text) {
static int event = -1;
if (OPAL_UNLIKELY(event == -1)) {
event = opal_sos_create_new_event(eventstr, associated_text);
}
..
}

This would move the event numbering to the OPAL layer, making it 
transparent to the developer.
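
A standalone toy showing the lazy one-time registration idiom sketched above 
(all names made up for this example, thread safety ignored):

#include <stdio.h>

#define MAX_EVENTS 64
static const char *event_names[MAX_EVENTS];
static int event_count = 0;

/* Toy registrar: records the string and hands back a run-time generated id. */
static int create_new_event(const char *name)
{
    event_names[event_count] = name;
    return event_count++;
}

#define LOG_EVENT(name)                                                    \
    do {                                                                   \
        static int event_id = -1;                                          \
        if (event_id == -1) {                                              \
            event_id = create_new_event(name);   /* runs only once */      \
        }                                                                  \
        printf("event %d (%s) fired\n", event_id, event_names[event_id]);  \
    } while (0)

int main(void)
{
    for (int i = 0; i < 3; i++) {
        LOG_EVENT("ib_retry_exceeded");   /* registered on the first pass only */
    }
    LOG_EVENT("oob_timeout");             /* a second call site gets its own id */
    return 0;
}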


Just my 2 cents ...

Sylvain

On Mon, 29 Mar 2010, Ralph Castain wrote:


Hi Abhishek
I'm confused by the WDC wiki page, specifically the part about the new 
ORTE_NOTIFIER_DEFINE_EVENT macro. Are you saying
that I (as the developer) have to provide this macro with a unique notifier id? 
So that would mean that ORTE/OMPI would
have to maintain a global notifier id counter to ensure it is unique?

If so, that seems really cumbersome. Could you please clarify?

Thanks
Ralph

On Mar 29, 2010, at 8:57 AM, Abhishek Kulkarni wrote:


  ==
  [RFC 1/2]
  ==

  WHAT: Merge improvements to the "notifier" framework from the OPAL SOS
  and the ORTE WDC mercurial branches into the SVN trunk.

  WHY: Some improvements and interface changes were put into the ORTE
     notifier framework during the development of the OPAL SOS[1] and
     ORTE WDC[2] branches.

  WHERE: Mostly restricted to ORTE notifier files and files using the
   notifier interface in OMPI.

  TIMEOUT: The weekend of April 2-3.

  REFERENCE MERCURIAL REPOS:
  * SOS development: http://bitbucket.org/jsquyres/opal-sos-fixed/
  * WDC development: http://bitbucket.org/derbeyn/orte-wdc-fixed/

  ==

  BACKGROUND:

  The notifier interface and its components underwent a host of
  improvements and changes during the development of the SOS[1] and the
  WDC[2] branches.  The ORTE WDC (Warning Data Capture) branch enables
  accounting of events through the use of notifier interface, whereas
  OPAL SOS uses the notifier interface by setting up callbacks to relay
  out logged events.

  Some of the improvements include:

  - added more severity levels.
  - "ftb" notifier improvements.
  - "command" notifier improvements.
  - added "file" notifier component
  - changes in the notifier modules selection
  - activate only a subset of the callbacks
  (i.e. any combination of log, help, log_peer)
  - define different output media for any given callback (e.g. log_peer
  can be redirected to the syslog and smtp, while the show_help can be
  sent to the hnp).
  - ORTE_NOTIFIER_LOG_EVENT() (that accounts and warns about unusual
  events)

  Much more information is available on these two wiki pages:

  [1] http://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages
  [2] http://svn.open-mpi.org/trac/ompi/wiki/ORTEWDC

  NOTE: This is first of a two-part RFC to bring the SOS and WDC branches
  to the trunk. This only brings in the "notifier" changes from the SOS
  branch, while the rest of the branch will be brought over after the
  timeout of the second RFC.

  ==





Re: [OMPI devel] RFC: s/ENABLE_MPI_THREADS/ENABLE_THREAD_SAFETY/g

2010-02-09 Thread Sylvain Jeaugey
While we're at it, why not call the option giving MPI_THREAD_MULTIPLE 
support --enable-thread-multiple ?


About ORTE and OPAL, if you have --enable-thread-multiple=yes, it may 
force the usage of --enable-thread-safety to configure OPAL and/or ORTE.


I know there are other projects using ORTE and OPAL, but the vast majority 
of users are still using OMPI and were already confused by 
--enable-mpi-threads. Switching to --enable-multi-threads or 
--enable-thread-safety will surely confuse them one more time.


Sylvain

On Mon, 8 Feb 2010, Barrett, Brian W wrote:


Well, does --disable-multi-threads disable progress threads?  And do you want 
to disable thread support in ORTE because you don't want MPI_THREAD_MULTIPLE?  
Perhaps a third option is a rational way to go?

Brian

On Feb 8, 2010, at 6:54 PM, Jeff Squyres wrote:


How about

 --enable-mpi-threads  ==>  --enable-multi-threads
   ENABLE_MPI_THREADS  ==>ENABLE_MULTI_THREADS

Essentially, s/mpi/multi/ig.  This gives us "progress thread" support and "multi 
thread" support.  Similar, but different.

Another possibility instead of "mpi" could be "concurrent".



On Jan 28, 2010, at 9:24 PM, Barrett, Brian W wrote:


Jeff -

I think the idea is ok, but I think the name needs some thought.  There are 
currently two ways to have the lower layers be thread safe -- enabling MPI 
threads or progress threads.  The two can be done independently -- you can 
disable MPI threads and still enable thread support by enabling progress 
threads.

So either that behavior would need to change or we need a better name (in my 
opinion...).

Brian

On Jan 28, 2010, at 8:53 PM, Jeff Squyres wrote:


WHAT: Rename --enable-mpi-threads and ENABLE_MPI_THREADS to 
--enable-thread-safety and ENABLE_THREAD_SAFETY, respectively 
(--enable-mpi-threads will be maintained as a synonym to --enable-thread-safety 
for backwards compat, of course).

WHY: Other projects are starting to use ORTE and OPAL without OMPI.  The fact that thread 
safety in OPAL and ORTE requires a configure switch with "mpi" in the name is 
very non-intuitive.

WHERE: A bunch of places in the code; see attached patch.

WHEN: Next Friday COB

TIMEOUT: COB, Friday, Feb 5, 2010



More details:

Cisco is starting to investigate using ORTE and OPAL in various threading scenarios -- 
without the OMPI layer.  The fact that you need to enable thread safety in ORTE/OPAL with 
a configure switch that has the word "mpi" in it is extremely counter-intuitive 
(it bit some of our engineers very badly, and they were mighty annoyed!!).

Since this functionality actually has nothing to do with MPI (it's actually the 
other way around -- MPI_THREAD_MULTIPLE needs this functionality), we really 
should rename this switch and the corresponding AC_DEFINE -- I suggest:

--enable|disable-thread-safety
ENABLE_THREAD_SAFETY

Of course, we need to keep the configure switch "--enable|disable-mpi-threads" 
for backwards compatibility, so that can be a synonym to --enable-thread-safety.

See the attached patch (the biggest change is in 
opal/config/opal_config_threads.m4).  If there are no objections, I'll commit 
this next Friday COB.

--
Jeff Squyres
jsquy...@cisco.com
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
 Brian W. Barrett
 Dept. 1423: Scalable System Software
 Sandia National Laboratories





___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




--
Jeff Squyres
jsquy...@cisco.com

For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
 Brian W. Barrett
 Dept. 1423: Scalable System Software
 Sandia National Laboratories





___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




[OMPI devel] VT config.h.in

2010-01-19 Thread Sylvain Jeaugey

Hi list,

The file ompi/contrib/vt/vt/config.h.in seems to have been added to the 
repository, but it is also created by autogen.sh.


Is this normal?

The result is that when I commit after autogen, I have my patches polluted 
with diffs in this file.


Sylvain


Re: [OMPI devel] MALLOC_MMAP_MAX (and MALLOC_MMAP_THRESHOLD)

2010-01-18 Thread Sylvain Jeaugey

On Jan 17, 2010, at 11:31 AM, Ashley Pittman wrote:
Tuning the libc malloc implementation using the options they provide to 
do so is valid and provides real benefit to a lot of applications.  For 
the record we used to disable mmap based allocations by default on 
Quadrics systems and I can't think of a single case of people who needed 
to re-enable it.
It happened to me once. An application couldn't run (not enough memory) 
because of the Quadrics stack setting the malloc options.


On Sun, 17 Jan 2010, Barrett, Brian W wrote:
I'm glad that you've been so fortunate.  Unfortunately, I have a large 
application base in which that is not the case, and we have had to use 
mmap based allocation, otherwise memory usage would essentially grow 
without bound.  So we go back to the age-old debate of correctness vs. 
performance.
It would be interesting (though difficult) to know which proportion of 
applications are suffering from these settings.


For my part, it is less than 1% (but not 0!).

Sylvain




Re: [OMPI devel] MALLOC_MMAP_MAX (and MALLOC_MMAP_THRESHOLD)

2010-01-08 Thread Sylvain Jeaugey

On Thu, 7 Jan 2010, Eugene Loh wrote:

Could someone tell me how these settings are used in OMPI or give any 
guidance on how they should or should not be used?
This is a very good question :-) As is this whole e-mail, though it's hard 
(in my opinion) to give it a Good (TM) answer.


This means that if you loop over the elements of multiple large arrays 
(which is common in HPC), you can generate a lot of cache conflicts, 
depending on the cache associativity.
On the other hand, high buffer alignment sometimes gives better 
performance (e.g. Infiniband QDR bandwidth).


There are multiple reasons one might want to modify the behavior of the 
memory allocator, including high cost of mmap calls, wanting to register 
memory for faster communications, and now this cache-conflict issue.  The 
usual solution is


setenv MALLOC_MMAP_MAX_ 0
setenv MALLOC_TRIM_THRESHOLD_ -1

or the equivalent mallopt() calls.
But yes, this set of settings is the number one tweak on HPC code that I'm 
aware of.
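
For reference, a minimal sketch of the mallopt() equivalent of those two 
environment settings (glibc <malloc.h>; error handling omitted, and the 
values are simply the ones quoted above):

#include <malloc.h>   /* glibc mallopt() and the M_* tunables */
#include <stdlib.h>

int main(void)
{
    /* never satisfy allocations with mmap(): keep everything on the heap */
    mallopt(M_MMAP_MAX, 0);
    /* never trim the heap back to the OS, so freed memory stays reusable */
    mallopt(M_TRIM_THRESHOLD, -1);

    void *p = malloc(64 * 1024 * 1024);  /* now served by brk, not mmap */
    free(p);
    return 0;
}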



This issue becomes an MPI issue for at least three reasons:

*)  MPI may care about these settings due to memory registration and pinning. 
(I invite you to explain to me what I mean.  I'm talking over my head here.)
Avoiding mmap is good since it prevents us from calling munmap (a function we 
need to hack to prevent data corruption).


*)  (Related to the previous bullet), MPI performance comparisons may reflect 
these effects.  Specifically, in comparing performance of OMPI, Intel MPI, 
Scali/Platform MPI, and MVAPICH2, some tests (such as HPCC and SPECmpi) have 
shown large performance differences between the various MPIs when, it seems, 
none were actually spending much time in MPI.  Rather, some MPI 
implementations were turning off large-malloc mmaps and getting good 
performance (and sadly OMPI looked bad in comparison).
I don't think this bullet is related to the previous one. The first one is 
a good reason, this one is typically the Bad reason. Bad, but 
unfortunately true: competitors' MPI libraries are faster because ... 
they do much more than MPI (accelerating malloc being the main difference). 
Which I think is Bad, because all these settings should be left in the 
developer's hands. You'll always find an application where these settings 
will waste memory and prevent it from running.


*)  These settings seem to be desirable for HPC codes since they don't do 
much allocation/deallocation and they do tend to have loop nests that wade 
through multiple large arrays at once.  For best "out of the box" 
performance, a software stack should turn these settings on for HPC.  Codes 
don't typically identify themselves as "HPC", but some indicators include 
Fortran, OpenMP, and MPI.
In practice, I agree. Most HPC codes benefit from it. But I also ran into 
codes where the memory waste was a problem.


I don't know the full scope of the problem, but I've run into this with at 
least HPCC STREAM (which shouldn't depend on MPI at all, but OMPI looks much 
slower than Scali/Platform on some tests) and SPECmpi (primarily one or two 
codes, though it depends also on problem size).
I also had those codes in mind. That's also why I don't like those MPI 
"benchmarks", since they benchmark much more than MPI. They hence 
encourage MPI providers to incorporate into their libraries things that 
have (more or less) nothing to do with MPI.


But again, yes, from the (basic) user's point of view, library X seems 
faster than library Y. When there is nothing left to improve in MPI, start 
optimizing the rest ... maybe we should reimplement a faster libc inside 
MPI :-)


Sylvain


[OMPI devel] Thread safety levels

2010-01-05 Thread Sylvain Jeaugey

Hi list,

I'm currently playing with thread levels in Open MPI and I'm quite 
surprised by the current code.


First, the C interface :
at ompi/mpi/c/init_thread.c:56 we have :
#if OPAL_ENABLE_MPI_THREADS
*provided = MPI_THREAD_MULTIPLE;
#else
*provided = MPI_THREAD_SINGLE;
#endif
prior to the call to ompi_mpi_init() which will in turn override the 
"provided" value. Should we remove these 5 lines ?


Then at ompi/runtime/ompi_mpi_init.c:372, we have -I guess- the real code 
which is :


ompi_mpi_thread_requested = requested;
if (OPAL_HAVE_THREAD_SUPPORT == 0) {
ompi_mpi_thread_provided = *provided = MPI_THREAD_SINGLE;
ompi_mpi_main_thread = NULL;
} else if (OPAL_ENABLE_MPI_THREADS == 1) {
ompi_mpi_thread_provided = *provided = requested;
ompi_mpi_main_thread = opal_thread_get_self();
} else {
if (MPI_THREAD_MULTIPLE == requested) {
ompi_mpi_thread_provided = *provided = MPI_THREAD_SERIALIZED;
} else {
ompi_mpi_thread_provided = *provided = requested;
}
ompi_mpi_main_thread = opal_thread_get_self();
}

This code seems ok to me provided that :
 * (OPAL_ENABLE_MPI_THREADS == 1) means "Open MPI configured to provide 
thread multiple",
 * (OPAL_HAVE_THREAD_SUPPORT == 0) means "we do not have threads at all", 
though even if we do not have threads at compile time, that in no way 
prevents us from providing THREAD_FUNNELED or THREAD_SERIALIZED.


The reality seems different at opal/include/opal_config_bottom.h:70 :

/* Do we have posix or solaris thread lib */
#define OPAL_HAVE_THREADS (OPAL_HAVE_POSIX_THREADS || OPAL_HAVE_SOLARIS_THREADS)
/* Do we have thread support? */
#define OPAL_HAVE_THREAD_SUPPORT (OPAL_ENABLE_MPI_THREADS || 
OPAL_ENABLE_PROGRESS_THREADS)

"we do not have threads at all" seems to me to be OPAL_HAVE_THREADS and 
not OPAL_HAVE_THREAD_SUPPORT. What do you think ? Maybe 
OPAL_HAVE_THREAD_SUPPORT should be renamed, too (seems misleading to me).


The result is that the current default configuration of Open MPI has 
OPAL_HAVE_THREAD_SUPPORT defined to 0 and Open MPI always returns 
THREAD_SINGLE, even if it is perfectly capable of THREAD_FUNNELED and 
THREAD_SERIALIZED.
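
To make the point concrete, here is a small compilable sketch (an illustration, 
not a proposed patch: the MPI_THREAD_* values and OPAL_* macros are stand-ins so 
it builds on its own) of what the returned level could be if the "no threads at 
all" case were keyed on OPAL_HAVE_THREADS instead of OPAL_HAVE_THREAD_SUPPORT:

#include <stdio.h>

/* Stand-ins for the real OPAL/OMPI definitions. */
enum { MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED,
       MPI_THREAD_SERIALIZED, MPI_THREAD_MULTIPLE };
#define OPAL_HAVE_THREADS       1   /* a posix/solaris thread lib was found */
#define OPAL_ENABLE_MPI_THREADS 0   /* built without THREAD_MULTIPLE support */

static int provide(int requested)
{
    if (OPAL_ENABLE_MPI_THREADS)
        return requested;                /* THREAD_MULTIPLE is fully supported */
    if (MPI_THREAD_MULTIPLE == requested)
        return MPI_THREAD_SERIALIZED;    /* best promise we can make */
    return requested;                    /* SINGLE/FUNNELED/SERIALIZED are fine
                                            even without a thread library */
}

int main(void)
{
    /* OPAL_HAVE_THREADS would only decide whether a main-thread pointer can
     * be recorded, not whether the level is downgraded to THREAD_SINGLE. */
    printf("requested MULTIPLE -> provided %d (have threads: %d)\n",
           provide(MPI_THREAD_MULTIPLE), OPAL_HAVE_THREADS);
    printf("requested FUNNELED -> provided %d\n", provide(MPI_THREAD_FUNNELED));
    return 0;
}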


Sylvain






Re: [OMPI devel] Crash when using MPI_REAL8

2009-12-08 Thread Sylvain Jeaugey
Thanks Rainer for the patch. I confirm it solves my testcase as well as 
the real application that triggered the bug.


Sylvain

On Mon, 7 Dec 2009, Rainer Keller wrote:


Hello Sylvain,

On Friday 04 December 2009 02:27:22 pm Sylvain Jeaugey wrote:

There is definitely something wrong in types.

Yes, the new ids for optional Fortran datatypes are wrong.

So, as with other Fortran types, IMHO they need to map to C types, aka the IDs
should map. Therefore, we should _not_ increase the number of predefined types
-- these are fixed as representable by C...

The below patch does just that and fixes Your testcase!

George, what do You think? Could You check, please?

Best regards,
Rainer

PS: Yes, you're perfectly right that the number of Fortran tests (esp. with
regard to optional ddts) is too low.
Several features of MPI (MPI-2) are not well represented in MTT.
--

Rainer Keller, PhD  Tel: +1 (865) 241-6293
Oak Ridge National Lab  Fax: +1 (865) 241-4811
PO Box 2008 MS 6164   Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008AIM/Skype: rusraink




Re: [OMPI devel] Crash when using MPI_REAL8

2009-12-04 Thread Sylvain Jeaugey

There is definitely something wrong in types.

OMPI_DATATYPE_MAX_PREDEFINED is set to 45, while there are 55 predefined 
types. When accessing ompi_op_ddt_map[ddt->id] with MPI_REAL8 
(ddt->id=54), we're reading the ompi_mpi_op_bxor struct.


Depending on various things (padding, uninitialized memory), we may get 0 
and not crash. If you're not lucky, you get a random value and crash soon 
afterwards.
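
As a standalone illustration of the failure mode (not OMPI code: the table size 
and the index 54 are simply the values discussed in this thread), an explicit 
bounds check turns the silent out-of-range read into an immediate, diagnosable 
abort:

#include <assert.h>
#include <stdio.h>

#define MAX_PREDEFINED 45           /* size the op map was built for */
static int op_map[MAX_PREDEFINED];  /* stand-in for ompi_op_ddt_map */

/* Refuse to read past the end of the table instead of returning whatever
 * happens to live after it in memory. */
static int op_map_lookup(int ddt_id)
{
    assert(ddt_id >= 0 && ddt_id < MAX_PREDEFINED);
    return op_map[ddt_id];
}

int main(void)
{
    printf("%d\n", op_map_lookup(3));   /* fine */
    printf("%d\n", op_map_lookup(54));  /* MPI_REAL8's id in the trunk: aborts */
    return 0;
}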


So, I extended things a bit and it seems to fix my problem. I'm not sure 
all types are now handled, I just added some that are not defined.


Sylvain

diff -r e82b914000bd -r 1a40aee2925c ompi/datatype/ompi_datatype.h
--- a/ompi/datatype/ompi_datatype.h Thu Dec 03 04:46:31 2009 +
+++ b/ompi/datatype/ompi_datatype.h Fri Dec 04 19:59:26 2009 +0100
@@ -57,7 +57,7 @@
 #define OMPI_DATATYPE_FLAG_DATA_FORTRAN  0xC000
 #define OMPI_DATATYPE_FLAG_DATA_LANGUAGE 0xC000

-#define OMPI_DATATYPE_MAX_PREDEFINED 45
+#define OMPI_DATATYPE_MAX_PREDEFINED 55

 #if OMPI_DATATYPE_MAX_PREDEFINED > OPAL_DATATYPE_MAX_SUPPORTED
 #error Need to increase the number of supported dataypes by OPAL (value 
OPAL_DATATYPE_MAX_SUPPORTED).
diff -r e82b914000bd -r 1a40aee2925c ompi/op/op.c
--- a/ompi/op/op.c  Thu Dec 03 04:46:31 2009 +
+++ b/ompi/op/op.c  Fri Dec 04 19:59:26 2009 +0100
@@ -137,6 +137,14 @@
 ompi_op_ddt_map[OMPI_DATATYPE_MPI_2INTEGER] = OMPI_OP_BASE_TYPE_2INTEGER;
 ompi_op_ddt_map[OMPI_DATATYPE_MPI_LONG_DOUBLE_INT] = 
OMPI_OP_BASE_TYPE_LONG_DOUBLE_INT;
 ompi_op_ddt_map[OMPI_DATATYPE_MPI_WCHAR] = OMPI_OP_BASE_TYPE_WCHAR;
+ompi_op_ddt_map[OMPI_DATATYPE_MPI_INTEGER2] = OMPI_OP_BASE_TYPE_INTEGER2;
+ompi_op_ddt_map[OMPI_DATATYPE_MPI_INTEGER4] = OMPI_OP_BASE_TYPE_INTEGER4;
+ompi_op_ddt_map[OMPI_DATATYPE_MPI_INTEGER8] = OMPI_OP_BASE_TYPE_INTEGER8;
+ompi_op_ddt_map[OMPI_DATATYPE_MPI_INTEGER16] = OMPI_OP_BASE_TYPE_INTEGER16;
+ompi_op_ddt_map[OMPI_DATATYPE_MPI_REAL2] = OMPI_OP_BASE_TYPE_REAL2;
+ompi_op_ddt_map[OMPI_DATATYPE_MPI_REAL4] = OMPI_OP_BASE_TYPE_REAL4;
+ompi_op_ddt_map[OMPI_DATATYPE_MPI_REAL8] = OMPI_OP_BASE_TYPE_REAL8;
+ompi_op_ddt_map[OMPI_DATATYPE_MPI_REAL16] = OMPI_OP_BASE_TYPE_REAL16;

 /* Create the intrinsic ops */

diff -r e82b914000bd -r 1a40aee2925c opal/datatype/opal_datatype.h
--- a/opal/datatype/opal_datatype.h Thu Dec 03 04:46:31 2009 +
+++ b/opal/datatype/opal_datatype.h Fri Dec 04 19:59:26 2009 +0100
@@ -56,7 +56,7 @@
  *
  * XXX TODO Adapt to whatever the OMPI-layer needs
  */
-#define OPAL_DATATYPE_MAX_SUPPORTED  46
+#define OPAL_DATATYPE_MAX_SUPPORTED  56


 /* flags for the datatypes. */

On Fri, 4 Dec 2009, Sylvain Jeaugey wrote:

For the record, and to try to explain why all MTT tests may have missed this 
"bug", configuring without --enable-debug makes the bug disappear.


Still trying to figure out why.

Sylvain

On Thu, 3 Dec 2009, Sylvain Jeaugey wrote:


Hi list,

I hope this time I won't be the only one to suffer this bug :)

It is very simple indeed, just perform an allreduce with MPI_REAL8 
(fortran) and you should get a crash in ompi/op/op.h:411. Tested with trunk 
and v1.5, working fine on v1.3.


From what I understand, in the trunk, MPI_REAL8 has now a fixed id (in 
ompi/datatype/ompi_datatype_internal.h), but operations do not have an 
index going as far as 54 (0x36), leading to a crash when looking for 
op->o_func.intrinsic.fns[ompi_op_ddt_map[ddt->id]] in ompi_op_is_valid() 
(or, if I disable mpi_param_check, in ompi_op_reduce()).


Here is a reproducer, just in case :
program main
use mpi
integer ierr
real(8) myreal, realsum
call MPI_INIT(ierr)
call MPI_ALLREDUCE(myreal, realsum, 1, MPI_REAL8, MPI_SUM, MPI_COMM_WORLD, 
ierr)

call MPI_FINALIZE(ierr)
stop
end

Has anyone an idea on how to fix this ? Or am I doing something wrong ?

Thanks for any help,
Sylvain




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Crash when using MPI_REAL8

2009-12-04 Thread Sylvain Jeaugey
For the record, and to try to explain why all MTT tests may have missed 
this "bug", configuring without --enable-debug makes the bug disappear.


Still trying to figure out why.

Sylvain

On Thu, 3 Dec 2009, Sylvain Jeaugey wrote:


Hi list,

I hope this time I won't be the only one to suffer this bug :)

It is very simple indeed, just perform an allreduce with MPI_REAL8 (fortran) 
and you should get a crash in ompi/op/op.h:411. Tested with trunk and v1.5, 
working fine on v1.3.


From what I understand, in the trunk, MPI_REAL8 has now a fixed id (in 
ompi/datatype/ompi_datatype_internal.h), but operations do not have an index 
going as far as 54 (0x36), leading to a crash when looking for 
op->o_func.intrinsic.fns[ompi_op_ddt_map[ddt->id]] in ompi_op_is_valid() (or, 
if I disable mpi_param_check, in ompi_op_reduce()).


Here is a reproducer, just in case :
program main
use mpi
integer ierr
real(8) myreal, realsum
call MPI_INIT(ierr)
call MPI_ALLREDUCE(myreal, realsum, 1, MPI_REAL8, MPI_SUM, MPI_COMM_WORLD, 
ierr)

call MPI_FINALIZE(ierr)
stop
end

Has anyone an idea on how to fix this ? Or am I doing something wrong ?

Thanks for any help,
Sylvain





Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-03 Thread Sylvain Jeaugey
Too bad. But no problem, that's very nice of you to have spent so much 
time on this.


I wish I knew why our experiments are so different, maybe we will find out 
eventually ...


Sylvain

On Wed, 2 Dec 2009, Ralph Castain wrote:


I'm sorry, Sylvain - I simply cannot replicate this problem (tried yet another 
slurm system):

./configure --prefix=blah --with-platform=contrib/platform/iu/odin/debug

[rhc@odin ~]$ salloc -N 16 tcsh
salloc: Granted job allocation 75294
[rhc@odin mpi]$ mpirun -pernode ./hello
Hello, World, I am 1 of 16
Hello, World, I am 7 of 16
Hello, World, I am 15 of 16
Hello, World, I am 4 of 16
Hello, World, I am 13 of 16
Hello, World, I am 3 of 16
Hello, World, I am 5 of 16
Hello, World, I am 8 of 16
Hello, World, I am 0 of 16
Hello, World, I am 9 of 16
Hello, World, I am 12 of 16
Hello, World, I am 2 of 16
Hello, World, I am 6 of 16
Hello, World, I am 10 of 16
Hello, World, I am 14 of 16
Hello, World, I am 11 of 16
[rhc@odin mpi]$ setenv ORTE_RELAY_DELAY 1
[rhc@odin mpi]$ mpirun -pernode ./hello
[odin.cs.indiana.edu:15280] [[28699,0],0] delaying relay by 1 seconds
[odin.cs.indiana.edu:15280] [[28699,0],0] delaying relay by 1 seconds
[odin.cs.indiana.edu:15280] [[28699,0],0] delaying relay by 1 seconds
[odin.cs.indiana.edu:15280] [[28699,0],0] delaying relay by 1 seconds
Hello, World, I am 2 of 16
Hello, World, I am 0 of 16
Hello, World, I am 3 of 16
Hello, World, I am 1 of 16
Hello, World, I am 4 of 16
Hello, World, I am 10 of 16
Hello, World, I am 7 of 16
Hello, World, I am 12 of 16
Hello, World, I am 6 of 16
Hello, World, I am 8 of 16
Hello, World, I am 5 of 16
Hello, World, I am 13 of 16
Hello, World, I am 11 of 16
Hello, World, I am 14 of 16
Hello, World, I am 9 of 16
Hello, World, I am 15 of 16
[odin.cs.indiana.edu:15280] [[28699,0],0] delaying relay by 1 seconds
[rhc@odin mpi]$ setenv ORTE_RELAY_DELAY 2
[rhc@odin mpi]$ mpirun -pernode ./hello
[odin.cs.indiana.edu:15302] [[28781,0],0] delaying relay by 2 seconds
[odin.cs.indiana.edu:15302] [[28781,0],0] delaying relay by 2 seconds
[odin.cs.indiana.edu:15302] [[28781,0],0] delaying relay by 2 seconds
[odin.cs.indiana.edu:15302] [[28781,0],0] delaying relay by 2 seconds
Hello, World, I am 2 of 16
Hello, World, I am 3 of 16
Hello, World, I am 4 of 16
Hello, World, I am 7 of 16
Hello, World, I am 6 of 16
Hello, World, I am 0 of 16
Hello, World, I am 1 of 16
Hello, World, I am 10 of 16
Hello, World, I am 5 of 16
Hello, World, I am 9 of 16
Hello, World, I am 8 of 16
Hello, World, I am 14 of 16
Hello, World, I am 13 of 16
Hello, World, I am 12 of 16
Hello, World, I am 11 of 16
Hello, World, I am 15 of 16
[odin.cs.indiana.edu:15302] [[28781,0],0] delaying relay by 2 seconds
[rhc@odin mpi]$

Sorry I don't have more time to continue pursuing this. I have no idea what is 
going on with your system(s), but it clearly is something peculiar to what you 
are doing or the system(s) you are running on.

Ralph


On Dec 2, 2009, at 1:56 AM, Sylvain Jeaugey wrote:


Ok, so I tried with RHEL5 and I get the same (even at 6 nodes) : when setting 
ORTE_RELAY_DELAY to 1, I get the deadlock systematically with the typical stack.

Without my "reproducer patch", 80 nodes was the lower bound to reproduce the 
bug (and you needed a couple of runs to get it). But since this is a race condition, your 
mileage may vary on a different cluster.

With the patch however, I hit it every time. I'll continue to try different 
configurations (e.g. without slurm ...) to see if I can reproduce it on more 
common configurations.

Sylvain

On Mon, 30 Nov 2009, Sylvain Jeaugey wrote:


Ok. Maybe I should try on a RHEL5 then.

About the compilers, I've tried with both gcc and intel and it doesn't seem to 
make a difference.

On Mon, 30 Nov 2009, Ralph Castain wrote:


Interesting. The only difference I see is the FC11 - I haven't seen anyone 
running on that OS yet. I wonder if that is the source of the trouble? Do we 
know that our code works on that one? I know we had problems in the past with 
FC9, for example, that required fixes.
Also, what compiler are you using? I wonder if there is some optimization issue 
here, or some weird interaction between FC11 and the compiler.
On Nov 30, 2009, at 8:48 AM, Sylvain Jeaugey wrote:

Hi Ralph,
I'm also puzzled :-)
Here is what I did today :
* download the latest nightly build (openmpi-1.7a1r22241)
* untar it
* patch it with my "ORTE_RELAY_DELAY" patch
* build it directly on the cluster (running FC11) with :
./configure --platform=contrib/platform/lanl/tlcc/debug-nopanasas --prefix=
make && make install
* deactivate oob_tcp_if_include=ib0 in openmpi-mca-params.conf (IPoIB is broken 
on my machine) and run with :
salloc -N 10 mpirun ./helloworld
And .. still the same behaviour : ok by default, deadlock with the typical 
stack when setting ORTE_RELAY_DELAY to 1.
About my previous e-mail, I was wrong about all components having a 0 priority : it was 
based on default parameters reported by "ompi

[OMPI devel] Crash when using MPI_REAL8

2009-12-03 Thread Sylvain Jeaugey

Hi list,

I hope this time I won't be the only one to suffer this bug :)

It is very simple indeed, just perform an allreduce with MPI_REAL8 
(fortran) and you should get a crash in ompi/op/op.h:411. Tested with 
trunk and v1.5, working fine on v1.3.


From what I understand, in the trunk, MPI_REAL8 has now a fixed id (in 
ompi/datatype/ompi_datatype_internal.h), but operations do not have an 
index going as far as 54 (0x36), leading to a crash when looking for 
op->o_func.intrinsic.fns[ompi_op_ddt_map[ddt->id]] in ompi_op_is_valid() 
(or, if I disable mpi_param_check, in ompi_op_reduce()).


Here is a reproducer, just in case :
program main
 use mpi
 integer ierr
 real(8) myreal, realsum
 call MPI_INIT(ierr)
 call MPI_ALLREDUCE(myreal, realsum, 1, MPI_REAL8, MPI_SUM, MPI_COMM_WORLD, 
ierr)
 call MPI_FINALIZE(ierr)
 stop
end

Has anyone an idea on how to fix this ? Or am I doing something wrong ?

Thanks for any help,
Sylvain




Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-12-02 Thread Sylvain Jeaugey
Ok, so I tried with RHEL5 and I get the same (even at 6 nodes) : when 
setting ORTE_RELAY_DELAY to 1, I get the deadlock systematically with the 
typical stack.


Without my "reproducer patch", 80 nodes was the lower bound to reproduce 
the bug (and you needed a couple of runs to get it). But since this is a 
race condition, your mileage may vary on a different cluster.


With the patch however, I hit it every time. I'll continue to try different 
configurations (e.g. without slurm ...) to see if I can reproduce it on 
more common configurations.


Sylvain

On Mon, 30 Nov 2009, Sylvain Jeaugey wrote:


Ok. Maybe I should try on a RHEL5 then.

About the compilers, I've tried with both gcc and intel and it doesn't seem 
to make a difference.


On Mon, 30 Nov 2009, Ralph Castain wrote:

Interesting. The only difference I see is the FC11 - I haven't seen anyone 
running on that OS yet. I wonder if that is the source of the trouble? Do 
we know that our code works on that one? I know we had problems in the past 
with FC9, for example, that required fixes.


Also, what compiler are you using? I wonder if there is some optimization 
issue here, or some weird interaction between FC11 and the compiler.


On Nov 30, 2009, at 8:48 AM, Sylvain Jeaugey wrote:


Hi Ralph,

I'm also puzzled :-)

Here is what I did today :
* download the latest nightly build (openmpi-1.7a1r22241)
* untar it
* patch it with my "ORTE_RELAY_DELAY" patch
* build it directly on the cluster (running FC11) with :
 ./configure --platform=contrib/platform/lanl/tlcc/debug-nopanasas 
--prefix=

 make && make install

* deactivate oob_tcp_if_include=ib0 in openmpi-mca-params.conf (IPoIB is 
broken on my machine) and run with :

 salloc -N 10 mpirun ./helloworld

And .. still the same behaviour : ok by default, deadlock with the typical 
stack when setting ORTE_RELAY_DELAY to 1.


About my previous e-mail, I was wrong about all components having a 0 
priority : it was based on default parameters reported by "ompi_info -a | 
grep routed". It seems that the truth is not always in ompi_info ...


Sylvain

On Fri, 27 Nov 2009, Ralph Castain wrote:



On Nov 27, 2009, at 8:23 AM, Sylvain Jeaugey wrote:


Hi Ralph,

I tried with the trunk and it makes no difference for me.


Strange



Looking at potential differences, I found out something strange. The bug 
may have something to do with the "routed" framework. I can reproduce 
the bug with binomial and direct, but not with cm and linear (you 
disabled the build of the latter in your configure options -- why?).


You won't with cm because there is no relay. Likewise, direct doesn't 
have a relay - so I'm really puzzled how you can see this behavior when 
using the direct component???


I disable components in my build to save memory. Every component we open 
costs us memory that may or may not be recoverable during the course of 
execution.




Btw, all components have a 0 priority and none is defined to be the 
default component. Which one is the default then ? binomial (as the 
first in alphabetical order) ?


I believe you must have a severely corrupted version of the code. The 
binomial component has priority 70 so it will be selected as the default.


Linear has priority 40, though it will only be selected if you say 
^binomial.


CM and radix have special selection code in them so they will only be 
selected when specified.


Direct and slave have priority 0 to ensure they will only be selected 
when specified




Can you check which one you are using and try with binomial explicitly 
chosen?


I am using binomial for all my tests

From what you are describing, I think you either have a corrupted copy 
of the code, are picking up mis-matched versions, or something strange 
as your experiences don't match what anyone else is seeing.


Remember, the phase you are discussing here has nothing to do with the 
native launch environment. This is dealing with the relative timing of 
the application launch versus relaying the launch message itself - i.e., 
the daemons are already up and running before any of this starts. Thus, 
this "problem" has nothing to do with how we launch the daemons. So, if 
it truly were a problem in the code, we would see it on every environment 
- torque, slurm, ssh, etc.


We routinely launch jobs spanning hundreds to thousands of nodes without 
problem. If this timing problem was as you have identified, then we would 
see this constantly. Yet nobody is seeing it, and I cannot reproduce it 
even with your reproducer.


I honestly don't know what to suggest at this point. Any chance you are 
picking up mis-matched OMPI versions on your backend nodes or something? 
Tried fresh checkouts of the code? Is this a code base you have modified, 
or are you seeing this with the "stock" code from the repo?


Just fishing at this point - can't find anything wrong! :-/
Ralph




Thanks for your time,
Sylvain

On Thu, 26 N

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-30 Thread Sylvain Jeaugey

Ok. Maybe I should try on a RHEL5 then.

About the compilers, I've tried with both gcc and intel and it doesn't 
seem to make a difference.


On Mon, 30 Nov 2009, Ralph Castain wrote:

Interesting. The only difference I see is the FC11 - I haven't seen 
anyone running on that OS yet. I wonder if that is the source of the 
trouble? Do we know that our code works on that one? I know we had 
problems in the past with FC9, for example, that required fixes.


Also, what compiler are you using? I wonder if there is some 
optimization issue here, or some weird interaction between FC11 and the 
compiler.


On Nov 30, 2009, at 8:48 AM, Sylvain Jeaugey wrote:


Hi Ralph,

I'm also puzzled :-)

Here is what I did today :
* download the latest nightly build (openmpi-1.7a1r22241)
* untar it
* patch it with my "ORTE_RELAY_DELAY" patch
* build it directly on the cluster (running FC11) with :
 ./configure --platform=contrib/platform/lanl/tlcc/debug-nopanasas --prefix=
 make && make install

* deactivate oob_tcp_if_include=ib0 in openmpi-mca-params.conf (IPoIB is broken 
on my machine) and run with :
 salloc -N 10 mpirun ./helloworld

And .. still the same behaviour : ok by default, deadlock with the typical 
stack when setting ORTE_RELAY_DELAY to 1.

About my previous e-mail, I was wrong about all components having a 0 priority : it was 
based on default parameters reported by "ompi_info -a | grep routed". It seems 
that the truth is not always in ompi_info ...

Sylvain

On Fri, 27 Nov 2009, Ralph Castain wrote:



On Nov 27, 2009, at 8:23 AM, Sylvain Jeaugey wrote:


Hi Ralph,

I tried with the trunk and it makes no difference for me.


Strange



Looking at potential differences, I found out something strange. The bug may have 
something to do with the "routed" framework. I can reproduce the bug with 
binomial and direct, but not with cm and linear (you disabled the build of the latter in 
your configure options -- why?).


You won't with cm because there is no relay. Likewise, direct doesn't have a 
relay - so I'm really puzzled how you can see this behavior when using the 
direct component???

I disable components in my build to save memory. Every component we open costs 
us memory that may or may not be recoverable during the course of execution.



Btw, all components have a 0 priority and none is defined to be the default 
component. Which one is the default then ? binomial (as the first in 
alphabetical order) ?


I believe you must have a severely corrupted version of the code. The binomial 
component has priority 70 so it will be selected as the default.

Linear has priority 40, though it will only be selected if you say ^binomial.

CM and radix have special selection code in them so they will only be selected 
when specified.

Direct and slave have priority 0 to ensure they will only be selected when 
specified



Can you check which one you are using and try with binomial explicitly chosen?


I am using binomial for all my tests


From what you are describing, I think you either have a corrupted copy of the 
code, are picking up mis-matched versions, or something strange as your 
experiences don't match what anyone else is seeing.


Remember, the phase you are discussing here has nothing to do with the native launch 
environment. This is dealing with the relative timing of the application launch versus 
relaying the launch message itself - i.e., the daemons are already up and running before 
any of this starts. Thus, this "problem" has nothing to do with how we launch 
the daemons. So, if it truly were a problem in the code, we would see it on every 
environment - torque, slurm, ssh, etc.

We routinely launch jobs spanning hundreds to thousands of nodes without 
problem. If this timing problem was as you have identified, then we would see 
this constantly. Yet nobody is seeing it, and I cannot reproduce it even with 
your reproducer.

I honestly don't know what to suggest at this point. Any chance you are picking up 
mis-matched OMPI versions on your backend nodes or something? Tried fresh checkouts of 
the code? Is this a code base you have modified, or are you seeing this with the 
"stock" code from the repo?

Just fishing at this point - can't find anything wrong! :-/
Ralph




Thanks for your time,
Sylvain

On Thu, 26 Nov 2009, Ralph Castain wrote:


Hi Sylvain

Well, I hate to tell you this, but I cannot reproduce the "bug" even with this code in 
ORTE no matter what value of ORTE_RELAY_DELAY I use. The system runs really slow as I increase the 
delay, but it completes the job just fine. I ran jobs across 16 nodes on a slurm machine, 1-4 ppn, 
a "hello world" app that calls MPI_Init immediately upon execution.

So I have to conclude this is a problem in your setup/config. Are you sure you 
didn't --enable-progress-threads?? That is the only way I can recreate this 
behavior.

I plan to modify the relay/message processin

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-30 Thread Sylvain Jeaugey

Hi Ralph,

I'm also puzzled :-)

Here is what I did today :
 * download the latest nightly build (openmpi-1.7a1r22241)
 * untar it
 * patch it with my "ORTE_RELAY_DELAY" patch
 * build it directly on the cluster (running FC11) with :
  ./configure --platform=contrib/platform/lanl/tlcc/debug-nopanasas 
--prefix=

  make && make install

 * deactivate oob_tcp_if_include=ib0 in openmpi-mca-params.conf (IPoIB is 
broken on my machine) and run with :

  salloc -N 10 mpirun ./helloworld

And .. still the same behaviour : ok by default, deadlock with the typical 
stack when setting ORTE_RELAY_DELAY to 1.


About my previous e-mail, I was wrong about all components having a 0 
priority : it was based on default parameters reported by "ompi_info -a | 
grep routed". It seems that the truth is not always in ompi_info ...


Sylvain

On Fri, 27 Nov 2009, Ralph Castain wrote:



On Nov 27, 2009, at 8:23 AM, Sylvain Jeaugey wrote:


Hi Ralph,

I tried with the trunk and it makes no difference for me.


Strange



Looking at potential differences, I found out something strange. The bug may have 
something to do with the "routed" framework. I can reproduce the bug with 
binomial and direct, but not with cm and linear (you disabled the build of the latter in 
your configure options -- why?).


You won't with cm because there is no relay. Likewise, direct doesn't have a 
relay - so I'm really puzzled how you can see this behavior when using the 
direct component???

I disable components in my build to save memory. Every component we open costs 
us memory that may or may not be recoverable during the course of execution.



Btw, all components have a 0 priority and none is defined to be the default 
component. Which one is the default then ? binomial (as the first in 
alphabetical order) ?


I believe you must have a severely corrupted version of the code. The binomial 
component has priority 70 so it will be selected as the default.

Linear has priority 40, though it will only be selected if you say ^binomial.

CM and radix have special selection code in them so they will only be selected 
when specified.

Direct and slave have priority 0 to ensure they will only be selected when 
specified



Can you check which one you are using and try with binomial explicitly chosen?


I am using binomial for all my tests


From what you are describing, I think you either have a corrupted copy of the 
code, are picking up mis-matched versions, or something strange as your 
experiences don't match what anyone else is seeing.


Remember, the phase you are discussing here has nothing to do with the native launch 
environment. This is dealing with the relative timing of the application launch versus 
relaying the launch message itself - i.e., the daemons are already up and running before 
any of this starts. Thus, this "problem" has nothing to do with how we launch 
the daemons. So, if it truly were a problem in the code, we would see it on every 
environment - torque, slurm, ssh, etc.

We routinely launch jobs spanning hundreds to thousands of nodes without 
problem. If this timing problem was as you have identified, then we would see 
this constantly. Yet nobody is seeing it, and I cannot reproduce it even with 
your reproducer.

I honestly don't know what to suggest at this point. Any chance you are picking up 
mis-matched OMPI versions on your backend nodes or something? Tried fresh checkouts of 
the code? Is this a code base you have modified, or are you seeing this with the 
"stock" code from the repo?

Just fishing at this point - can't find anything wrong! :-/
Ralph




Thanks for your time,
Sylvain

On Thu, 26 Nov 2009, Ralph Castain wrote:


Hi Sylvain

Well, I hate to tell you this, but I cannot reproduce the "bug" even with this code in 
ORTE no matter what value of ORTE_RELAY_DELAY I use. The system runs really slow as I increase the 
delay, but it completes the job just fine. I ran jobs across 16 nodes on a slurm machine, 1-4 ppn, 
a "hello world" app that calls MPI_Init immediately upon execution.

So I have to conclude this is a problem in your setup/config. Are you sure you 
didn't --enable-progress-threads?? That is the only way I can recreate this 
behavior.

I plan to modify the relay/message processing method anyway to clean it up. But 
there doesn't appear to be anything wrong with the current code.
Ralph

On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote:


Hi Ralph,

Thanks for your efforts. I will look at our configuration and see how it may 
differ from ours.

Here is a patch which helps reproducing the bug even with a small number of 
nodes.

diff -r b622b9e8f1ac orte/orted/orted_comm.c
--- a/orte/orted/orted_comm.c   Wed Nov 18 09:27:55 2009 +0100
+++ b/orte/orted/orted_comm.c   Fri Nov 20 14:47:39 2009 +0100
@@ -126,6 +126,13 @@
   ORTE_ERROR_LOG(ret);
   goto CLEANUP;
   }
+{ /* Add delay to reproduce

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-27 Thread Sylvain Jeaugey

Hi Ralph,

I tried with the trunk and it makes no difference for me.

Looking at potential differences, I found out something strange. The bug 
may have something to do with the "routed" framework. I can reproduce the 
bug with binomial and direct, but not with cm and linear (you disabled the 
build of the latter in your configure options -- why?).


Btw, all components have a 0 priority and none is defined to be the 
default component. Which one is the default then ? binomial (as the first 
in alphabetical order) ?


Can you check which one you are using and try with binomial explicitly 
chosen?


Thanks for your time,
Sylvain

On Thu, 26 Nov 2009, Ralph Castain wrote:


Hi Sylvain

Well, I hate to tell you this, but I cannot reproduce the "bug" even with this code in 
ORTE no matter what value of ORTE_RELAY_DELAY I use. The system runs really slow as I increase the 
delay, but it completes the job just fine. I ran jobs across 16 nodes on a slurm machine, 1-4 ppn, 
a "hello world" app that calls MPI_Init immediately upon execution.

So I have to conclude this is a problem in your setup/config. Are you sure you 
didn't --enable-progress-threads?? That is the only way I can recreate this 
behavior.

I plan to modify the relay/message processing method anyway to clean it up. But 
there doesn't appear to be anything wrong with the current code.
Ralph

On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote:


Hi Ralph,

Thanks for your efforts. I will look at our configuration and see how it may 
differ from ours.

Here is a patch which helps reproducing the bug even with a small number of 
nodes.

diff -r b622b9e8f1ac orte/orted/orted_comm.c
--- a/orte/orted/orted_comm.c   Wed Nov 18 09:27:55 2009 +0100
+++ b/orte/orted/orted_comm.c   Fri Nov 20 14:47:39 2009 +0100
@@ -126,6 +126,13 @@
ORTE_ERROR_LOG(ret);
goto CLEANUP;
}
+{ /* Add delay to reproduce bug */
+char * str = getenv("ORTE_RELAY_DELAY");
+int sec = str ? atoi(str) : 0;
+if (sec) {
+sleep(sec);
+}
+}
}

CLEANUP:

Just set ORTE_RELAY_DELAY to 1 (second) and you should reproduce the bug.

During our experiments, the bug disappeared when we added a delay before 
calling MPI_Init. So, configurations where processes are launched slowly or 
take some time before MPI_Init should be immune to this bug.

We usually reproduce the bug with one ppn (faster to spawn).

Sylvain

On Thu, 19 Nov 2009, Ralph Castain wrote:


Hi Sylvain

I've spent several hours trying to replicate the behavior you described on 
clusters up to a couple of hundred nodes (all running slurm), without success. 
I'm becoming increasingly convinced that this is a configuration issue as 
opposed to a code issue.

I have enclosed the platform file I use below. Could you compare it to your 
configuration? I'm wondering if there is something critical about the config 
that may be causing the problem (perhaps we have a problem in our default 
configuration).

Also, is there anything else you can tell us about your configuration? How many 
ppn triggers it, or do you always get the behavior every time you launch over a 
certain number of nodes?

Meantime, I will look into this further. I am going to introduce a "slow down" 
param that will force the situation you encountered - i.e., will ensure that the relay is 
still being sent when the daemon receives the first collective input. We can then use 
that to try and force replication of the behavior you are encountering.

Thanks
Ralph

enable_dlopen=no
enable_pty_support=no
with_blcr=no
with_openib=yes
with_memory_manager=no
enable_mem_debug=yes
enable_mem_profile=no
enable_debug_symbols=yes
enable_binaries=yes
with_devel_headers=yes
enable_heterogeneous=no
enable_picky=yes
enable_debug=yes
enable_shared=yes
enable_static=yes
with_slurm=yes
enable_contrib_no_build=libnbc,vt
enable_visibility=yes
enable_memchecker=no
enable_ipv6=no
enable_mpi_f77=no
enable_mpi_f90=no
enable_mpi_cxx=no
enable_mpi_cxx_seek=no
enable_mca_no_build=pml-dr,pml-crcp2,crcp
enable_io_romio=no

On Nov 19, 2009, at 8:08 AM, Ralph Castain wrote:



On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote:


Thank you Ralph for this precious help.

I setup a quick-and-dirty patch basically postponing process_msg (hence 
daemon_collective) until the launch is done. In process_msg, I therefore 
requeue a process_msg handler and return.


That is basically the idea I proposed, just done in a slightly different place



In this "all-must-be-non-blocking-and-done-through-opal_progress" algorithm, I 
don't think that blocking calls like the one in daemon_collective should be allowed. This 
also applies to the blocking one in send_relay. [Well, actually, one is okay, 2 may lead 
to interlocking.]


Well, that would be problematic - you will find "progressed_wait" used 
repeatedly in the code. Removing t

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-20 Thread Sylvain Jeaugey

Hi Ralph,

Thanks for your efforts. I will look at our configuration and see how it 
may differ from ours.


Here is a patch which helps reproducing the bug even with a small number 
of nodes.


diff -r b622b9e8f1ac orte/orted/orted_comm.c
--- a/orte/orted/orted_comm.c   Wed Nov 18 09:27:55 2009 +0100
+++ b/orte/orted/orted_comm.c   Fri Nov 20 14:47:39 2009 +0100
@@ -126,6 +126,13 @@
 ORTE_ERROR_LOG(ret);
 goto CLEANUP;
 }
+{ /* Add delay to reproduce bug */
+char * str = getenv("ORTE_RELAY_DELAY");
+int sec = str ? atoi(str) : 0;
+if (sec) {
+sleep(sec);
+}
+}
 }

 CLEANUP:

Just set ORTE_RELAY_DELAY to 1 (second) and you should reproduce the bug.

During our experiments, the bug disappeared when we added a delay before 
calling MPI_Init. So, configurations where processes are launched slowly 
or take some time before MPI_Init should be immune to this bug.
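
A sketch of the kind of delayed-init test this refers to (illustrative only; the 
2-second value is arbitrary):

#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    sleep(2);                 /* a delay before MPI_Init masks the race */
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}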


We usually reproduce the bug with one ppn (faster to spawn).

Sylvain

On Thu, 19 Nov 2009, Ralph Castain wrote:


Hi Sylvain

I've spent several hours trying to replicate the behavior you described on 
clusters up to a couple of hundred nodes (all running slurm), without success. 
I'm becoming increasingly convinced that this is a configuration issue as 
opposed to a code issue.

I have enclosed the platform file I use below. Could you compare it to your 
configuration? I'm wondering if there is something critical about the config 
that may be causing the problem (perhaps we have a problem in our default 
configuration).

Also, is there anything else you can tell us about your configuration? How many 
ppn triggers it, or do you always get the behavior every time you launch over a 
certain number of nodes?

Meantime, I will look into this further. I am going to introduce a "slow down" 
param that will force the situation you encountered - i.e., will ensure that the relay is 
still being sent when the daemon receives the first collective input. We can then use 
that to try and force replication of the behavior you are encountering.

Thanks
Ralph

enable_dlopen=no
enable_pty_support=no
with_blcr=no
with_openib=yes
with_memory_manager=no
enable_mem_debug=yes
enable_mem_profile=no
enable_debug_symbols=yes
enable_binaries=yes
with_devel_headers=yes
enable_heterogeneous=no
enable_picky=yes
enable_debug=yes
enable_shared=yes
enable_static=yes
with_slurm=yes
enable_contrib_no_build=libnbc,vt
enable_visibility=yes
enable_memchecker=no
enable_ipv6=no
enable_mpi_f77=no
enable_mpi_f90=no
enable_mpi_cxx=no
enable_mpi_cxx_seek=no
enable_mca_no_build=pml-dr,pml-crcp2,crcp
enable_io_romio=no

On Nov 19, 2009, at 8:08 AM, Ralph Castain wrote:



On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote:


Thank you Ralph for this precious help.

I setup a quick-and-dirty patch basically postponing process_msg (hence 
daemon_collective) until the launch is done. In process_msg, I therefore 
requeue a process_msg handler and return.


That is basically the idea I proposed, just done in a slightly different place



In this "all-must-be-non-blocking-and-done-through-opal_progress" algorithm, I 
don't think that blocking calls like the one in daemon_collective should be allowed. This 
also applies to the blocking one in send_relay. [Well, actually, one is okay, 2 may lead 
to interlocking.]


Well, that would be problematic - you will find "progressed_wait" used 
repeatedly in the code. Removing them all would take a -lot- of effort and a major 
rewrite. I'm not yet convinced it is required. There may be something strange in how you 
are setup, or your cluster - like I said, this is the first report of a problem we have 
had, and people with much bigger slurm clusters have been running this code every day for 
over a year.



If you have time doing a nicer patch, it would be great and I would be happy to 
test it. Otherwise, I will try to implement your idea properly next week (with 
my limited knowledge of orted).


Either way is fine - I'll see if I can get to it.

Thanks
Ralph



For the record, here is the patch I'm currently testing at large scale :

diff -r ec68298b3169 -r b622b9e8f1ac orte/mca/grpcomm/bad/grpcomm_bad_module.c
--- a/orte/mca/grpcomm/bad/grpcomm_bad_module.c Mon Nov 09 13:29:16 2009 +0100
+++ b/orte/mca/grpcomm/bad/grpcomm_bad_module.c Wed Nov 18 09:27:55 2009 +0100
@@ -687,14 +687,6 @@
    opal_list_append(&orte_local_jobdata, &jobdat->super);
   }

-/* it may be possible to get here prior to having actually finished 
processing our
- * local launch msg due to the race condition between different nodes and 
when
- * they start their individual procs. Hence, we have to first ensure that 
we
- * -have- finished processing the launch msg, or else we won't know whether
- * or not to wait before sending this on
- */
-ORTE_PROGRESSED_WAIT(jobdat->launch_msg_processed, 0, 1);

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-19 Thread Sylvain Jeaugey

Thank you Ralph for this precious help.

I setup a quick-and-dirty patch basically postponing process_msg (hence 
daemon_collective) until the launch is done. In process_msg, I therefore 
requeue a process_msg handler and return.


In this "all-must-be-non-blocking-and-done-through-opal_progress" 
algorithm, I don't think that blocking calls like the one in 
daemon_collective should be allowed. This also applies to the blocking one 
in send_relay. [Well, actually, one is okay, 2 may lead to interlocking.]
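
To illustrate why two nested blocking waits inside a single-threaded progress 
engine can interlock, here is a deliberately tiny toy model. It contains no ORTE 
code, only the shape of the problem, and a recursion cutoff stands in for the 
real hang so that it terminates:

#include <stdbool.h>
#include <stdio.h>

/* Toy progress engine: callbacks run from inside progress(), the way
 * daemon_collective runs from inside opal_progress(). */
static bool send_complete        = false;  /* the relay send finishes later  */
static bool launch_msg_processed = false;  /* set after send_relay() returns */
static bool collective_pending   = true;   /* a collective arrived early     */
static int  nesting              = 0;

static void daemon_collective(void);

static void progress(void)
{
    send_complete = true;             /* the send does complete eventually... */
    if (collective_pending) {         /* ...but a collective gets dispatched  */
        collective_pending = false;
        daemon_collective();
    }
}

static void daemon_collective(void)
{
    /* Blocks until the launch message is processed -- which only happens
     * after send_relay() returns, and send_relay() cannot return because we
     * are still inside its own progress loop: the interlock. */
    while (!launch_msg_processed) {
        if (++nesting > 3) {          /* stand-in for the real infinite wait */
            puts("interlocked: stuck below send_relay's progress loop");
            return;
        }
        progress();
    }
}

static void send_relay(void)
{
    while (!send_complete)            /* blocking wait, serviced by progress() */
        progress();
    launch_msg_processed = true;      /* in the real hang, never reached */
}

int main(void)
{
    send_relay();
    return 0;
}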


If you have time doing a nicer patch, it would be great and I would be 
happy to test it. Otherwise, I will try to implement your idea properly 
next week (with my limited knowledge of orted).


For the record, here is the patch I'm currently testing at large scale :

diff -r ec68298b3169 -r b622b9e8f1ac orte/mca/grpcomm/bad/grpcomm_bad_module.c
--- a/orte/mca/grpcomm/bad/grpcomm_bad_module.c Mon Nov 09 13:29:16 2009 +0100
+++ b/orte/mca/grpcomm/bad/grpcomm_bad_module.c Wed Nov 18 09:27:55 2009 +0100
@@ -687,14 +687,6 @@
  opal_list_append(&orte_local_jobdata, &jobdat->super);
 }

-/* it may be possible to get here prior to having actually finished 
processing our
- * local launch msg due to the race condition between different nodes and 
when
- * they start their individual procs. Hence, we have to first ensure that 
we
- * -have- finished processing the launch msg, or else we won't know whether
- * or not to wait before sending this on
- */
-ORTE_PROGRESSED_WAIT(jobdat->launch_msg_processed, 0, 1);
-
 /* unpack the collective type */
 n = 1;
  if (ORTE_SUCCESS != (rc = opal_dss.unpack(data, &jobdat->collective_type, 
&n, ORTE_GRPCOMM_COLL_T))) {
@@ -894,6 +886,28 @@

  proc = &mev->sender;
 buf = mev->buffer;
+
+jobdat = NULL;
+for (item = opal_list_get_first(&orte_local_jobdata);
 item != opal_list_get_end(&orte_local_jobdata);
+ item = opal_list_get_next(item)) {
+jobdat = (orte_odls_job_t*)item;
+
+/* is this the specified job? */
+if (jobdat->jobid == proc->jobid) {
+break;
+}
+}
+if (NULL == jobdat || jobdat->launch_msg_processed != 1) {
+/* it may be possible to get here prior to having actually finished 
processing our
+ * local launch msg due to the race condition between different nodes 
and when
+ * they start their individual procs. Hence, we have to first ensure 
that we
+ * -have- finished processing the launch msg. Requeue this event until 
it is done.
+ */
+int tag = mev->tag;
+ORTE_MESSAGE_EVENT(proc, buf, tag, process_msg);
+return;
+}

 /* is the sender a local proc, or a daemon relaying the collective? */
 if (ORTE_PROC_MY_NAME->jobid == proc->jobid) {

Sylvain

On Thu, 19 Nov 2009, Ralph Castain wrote:

Very strange. As I said, we routinely launch jobs spanning several 
hundred nodes without problem. You can see the platform files for that 
setup in contrib/platform/lanl/tlcc


That said, it is always possible you are hitting some kind of race 
condition we don't hit. In looking at the code, one possibility would be 
to make all the communications flow through the daemon cmd processor in 
orte/orted_comm.c. This is the way it used to work until I reorganized 
the code a year ago for other reasons that never materialized.


Unfortunately, the daemon collective has to wait until the local launch 
cmd has been completely processed so it can know whether or not to wait 
for contributions from local procs before sending along the collective 
message, so this kinda limits our options.


About the only other thing you could do would be to not send the relay 
at all until -after- processing the local launch cmd. You can then 
remove the "wait" in the daemon collective as you will know how many 
local procs are involved, if any.


I used to do it that way and it guarantees it will work. The negative is 
that we lose some launch speed as the next nodes in the tree don't get 
the launch message until this node finishes launching all its procs.


The way around that, of course, would be to:

1.  process the launch message, thus extracting the number of any local 
procs and setting up all data structures...but do -not- launch the procs 
at this time (as this is what takes all the time)


2. send the relay - the daemon collective can now proceed without a 
"wait" in it


3. now launch the local procs

It would be a fairly simple reorganization of the code in the 
orte/mca/odls area. I can do it this weekend if you like, or you can do 
it - either way is fine, but if you do it, please contribute it back to 
the trunk.


Ralph


On Nov 19, 2009, at 1:39 AM, Sylvain Jeaugey wrote:


I would say I use the default settings, i.e. I don't set anything "special" at 
configure.

I'm launching my processes with SLURM (salloc + mpirun).

Sylvain

On Wed, 18 Nov 2009, 

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-19 Thread Sylvain Jeaugey
I would say I use the default settings, i.e. I don't set anything 
"special" at configure.


I'm launching my processes with SLURM (salloc + mpirun).

Sylvain

On Wed, 18 Nov 2009, Ralph Castain wrote:


How did you configure OMPI?

What launch mechanism are you using - ssh?

On Nov 17, 2009, at 9:01 AM, Sylvain Jeaugey wrote:


I don't think so, and I'm not doing it explicitly at least. How do I know?

Sylvain

On Tue, 17 Nov 2009, Ralph Castain wrote:


We routinely launch across thousands of nodes without a problem...I have never 
seen it stick in this fashion.

Did you build and/or are using ORTE threaded by any chance? If so, that 
definitely won't work.

On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:


Hi all,

We are currently experiencing problems at launch on the 1.5 branch on 
a relatively large number of nodes (at least 80). Some processes are not spawned 
and orted processes are deadlocked.

When MPI processes are calling MPI_Init before send_relay is complete, the 
send_relay function and the daemon_collective function are doing a nice 
interlock :

Here is the scenario :

send_relay

performs the send tree :

orte_rml_oob_send_buffer

> orte_rml_oob_send
  > opal_wait_condition
Waiting on completion from send thus calling opal_progress()
> opal_progress()
But since a collective request arrived from the network, entered :
  > daemon_collective
However, daemon_collective is waiting for the job to be initialized (wait on 
jobdat->launch_msg_processed) before continuing, thus calling :
> opal_progress()

At this time, the send may complete, but since we will never go back to 
orte_rml_oob_send, we will never perform the launch (setting 
jobdat->launch_msg_processed to 1).

I may try to solve the bug (this is quite a top priority problem for me), but 
maybe people who are more familiar with orted than I am may propose a nice and 
clean solution ...

For those who like real (and complete) gdb stacks, here they are :
#0  0x003b7fed4f38 in poll () from /lib64/libc.so.6
#1  0x7fd0de5d861a in poll_dispatch (base=0x930230, arg=0x91a4b0, 
tv=0x7fff0d977880) at poll.c:167
#2  0x7fd0de5d586f in opal_event_base_loop (base=0x930230, flags=1) at 
event.c:823
#3  0x7fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#4  0x7fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#5  0x7fd0dd340a02 in daemon_collective (sender=0x97af50, data=0x97b010) at 
grpcomm_bad_module.c:696
#6  0x7fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97af20) at 
grpcomm_bad_module.c:901
#7  0x7fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#8  0x7fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at 
event.c:839
#9  0x7fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#10 0x7fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#11 0x7fd0dd340a02 in daemon_collective (sender=0x979700, data=0x9676e0) at 
grpcomm_bad_module.c:696
#12 0x7fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x9796d0) at 
grpcomm_bad_module.c:901
#13 0x7fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#14 0x7fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at 
event.c:839
#15 0x7fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#16 0x7fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#17 0x7fd0dd340a02 in daemon_collective (sender=0x97b420, data=0x97b4e0) at 
grpcomm_bad_module.c:696
#18 0x7fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97b3f0) at 
grpcomm_bad_module.c:901
#19 0x7fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#20 0x7fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at 
event.c:839
#21 0x7fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#22 0x7fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#23 0x7fd0dd969a8a in opal_condition_wait (c=0x97b210, m=0x97b1a8) at 
../../../../opal/threads/condition.h:99
#24 0x7fd0dd96a4bf in orte_rml_oob_send (peer=0x7fff0d9785a0, 
iov=0x7fff0d978530, count=1, tag=1, flags=16) at rml_oob_send.c:153
#25 0x7fd0dd96ac4d in orte_rml_oob_send_buffer (peer=0x7fff0d9785a0, 
buffer=0x7fff0d9786b0, tag=1, flags=0) at rml_oob_send.c:270
#26 0x7fd0de86ed2a in send_relay (buf=0x7fff0d9786b0) at 
orted/orted_comm.c:127
#27 0x7fd0de86f6de in orte_daemon_cmd_processor (fd=-1, opal_event=1, 
data=0x965fc0) at orted/orted_comm.c:308
#28 0x7fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#29 0x7fd0de5d597a in opal_event_base_loop (base=0x930230, flags=0) at 
event.c:839
#30 0x7fd0de5d556b in opal_event_loop (flags=0) at event.c:746
#31 0x7fd0de5d5418 in opal_event_dispatch () at event.c:682
#32 0x7fd0de86e339 in orte_daemon (argc=19, argv=0x7fff0d979ca8) at 
orted/orted_main.c:769
#33 0x004008e2 in main (argc=19, argv=0x7fff0d979ca8) at orted.c:62

Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-17 Thread Sylvain Jeaugey
I don't think so, and I'm not doing it explicitly at least. How do I 
know ?


Sylvain

On Tue, 17 Nov 2009, Ralph Castain wrote:


We routinely launch across thousands of nodes without a problem...I have never 
seen it stick in this fashion.

Did you build and/or are you using ORTE threaded by any chance? If so, that 
definitely won't work.

On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:


Hi all,

We are currently experiencing problems at launch on the 1.5 branch on 
relatively large number of nodes (at least 80). Some processes are not spawned 
and orted processes are deadlocked.

When MPI processes are calling MPI_Init before send_relay is complete, the 
send_relay function and the daemon_collective function are doing a nice 
interlock :

Here is the scenario :

send_relay
  performs the send tree :
  > orte_rml_oob_send_buffer
    > orte_rml_oob_send
      > opal_wait_condition
        Waiting on completion from the send, thus calling opal_progress()
        > opal_progress()
          But since a collective request arrived from the network, entered :
          > daemon_collective
            However, daemon_collective is waiting for the job to be initialized
            (wait on jobdat->launch_msg_processed) before continuing, thus calling :
            > opal_progress()

At this time, the send may complete, but since we will never go back to 
orte_rml_oob_send, we will never perform the launch (setting 
jobdat->launch_msg_processed to 1).

I may try to solve the bug (this is quite a top priority problem for me), but 
maybe people who are more familiar with orted than I am may propose a nice and 
clean solution ...

For those who like real (and complete) gdb stacks, here they are :
#0  0x003b7fed4f38 in poll () from /lib64/libc.so.6
#1  0x7fd0de5d861a in poll_dispatch (base=0x930230, arg=0x91a4b0, 
tv=0x7fff0d977880) at poll.c:167
#2  0x7fd0de5d586f in opal_event_base_loop (base=0x930230, flags=1) at 
event.c:823
#3  0x7fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#4  0x7fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#5  0x7fd0dd340a02 in daemon_collective (sender=0x97af50, data=0x97b010) at 
grpcomm_bad_module.c:696
#6  0x7fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97af20) at 
grpcomm_bad_module.c:901
#7  0x7fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#8  0x7fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at 
event.c:839
#9  0x7fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#10 0x7fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#11 0x7fd0dd340a02 in daemon_collective (sender=0x979700, data=0x9676e0) at 
grpcomm_bad_module.c:696
#12 0x7fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x9796d0) at 
grpcomm_bad_module.c:901
#13 0x7fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#14 0x7fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at 
event.c:839
#15 0x7fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#16 0x7fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#17 0x7fd0dd340a02 in daemon_collective (sender=0x97b420, data=0x97b4e0) at 
grpcomm_bad_module.c:696
#18 0x7fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97b3f0) at 
grpcomm_bad_module.c:901
#19 0x7fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#20 0x7fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at 
event.c:839
#21 0x7fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#22 0x7fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#23 0x7fd0dd969a8a in opal_condition_wait (c=0x97b210, m=0x97b1a8) at 
../../../../opal/threads/condition.h:99
#24 0x7fd0dd96a4bf in orte_rml_oob_send (peer=0x7fff0d9785a0, 
iov=0x7fff0d978530, count=1, tag=1, flags=16) at rml_oob_send.c:153
#25 0x7fd0dd96ac4d in orte_rml_oob_send_buffer (peer=0x7fff0d9785a0, 
buffer=0x7fff0d9786b0, tag=1, flags=0) at rml_oob_send.c:270
#26 0x7fd0de86ed2a in send_relay (buf=0x7fff0d9786b0) at 
orted/orted_comm.c:127
#27 0x7fd0de86f6de in orte_daemon_cmd_processor (fd=-1, opal_event=1, 
data=0x965fc0) at orted/orted_comm.c:308
#28 0x7fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#29 0x7fd0de5d597a in opal_event_base_loop (base=0x930230, flags=0) at 
event.c:839
#30 0x7fd0de5d556b in opal_event_loop (flags=0) at event.c:746
#31 0x7fd0de5d5418 in opal_event_dispatch () at event.c:682
#32 0x7fd0de86e339 in orte_daemon (argc=19, argv=0x7fff0d979ca8) at 
orted/orted_main.c:769
#33 0x004008e2 in main (argc=19, argv=0x7fff0d979ca8) at orted.c:62

Thanks in advance,
Sylvain
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




[OMPI devel] Deadlocks with new (routed) orted launch algorithm

2009-11-17 Thread Sylvain Jeaugey

Hi all,

We are currently experiencing problems at launch on the 1.5 branch on 
relatively large number of nodes (at least 80). Some processes are not 
spawned and orted processes are deadlocked.


When MPI processes are calling MPI_Init before send_relay is complete, the 
send_relay function and the daemon_collective function are doing a nice 
interlock :


Here is the scenario :

send_relay
  performs the send tree :
  > orte_rml_oob_send_buffer
    > orte_rml_oob_send
      > opal_wait_condition
        Waiting on completion from the send, thus calling opal_progress()
        > opal_progress()
          But since a collective request arrived from the network, entered :
          > daemon_collective
            However, daemon_collective is waiting for the job to be initialized
            (wait on jobdat->launch_msg_processed) before continuing, thus calling :
            > opal_progress()

At this time, the send may complete, but since we will never go back to 
orte_rml_oob_send, we will never perform the launch (setting 
jobdat->launch_msg_processed to 1).
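
To make the interlock more concrete, here is a minimal stand-alone sketch of 
the control flow described above (this is not Open MPI code: progress(), 
daemon_collective() and the flags are illustrative stand-ins for 
opal_progress(), the grpcomm callback and jobdat->launch_msg_processed, and a 
recursion guard is added so the sketch terminates instead of spinning forever):

#include <stdio.h>

static int send_complete = 0;        /* completion the outer condition-wait polls for      */
static int launch_msg_processed = 0; /* only set once send_relay() has returned            */
static int collective_pending = 1;   /* a daemon collective already arrived from the wire  */
static int depth = 0;                /* guard so this sketch terminates                    */

static void progress(void);

static void daemon_collective(void)
{
    /* Waits for the job to be initialized -- but that only happens after
     * send_relay() returns, which it cannot do while we sit in this frame. */
    while (!launch_msg_processed) {
        if (++depth > 3) {
            puts("interlock: progress() re-entered, nobody can set launch_msg_processed");
            return;  /* the real code has no such escape and spins here forever */
        }
        progress();
    }
}

static void progress(void)
{
    if (collective_pending) {        /* the incoming collective is dispatched first... */
        collective_pending = 0;
        daemon_collective();
    }
    send_complete = 1;               /* ...so the completion is detected here, but in the
                                      * real code the stack never unwinds back down to the
                                      * condition-wait sitting in send_relay()             */
}

static void send_relay(void)
{
    while (!send_complete)           /* models opal_condition_wait() driving opal_progress() */
        progress();
    launch_msg_processed = 1;        /* the launch that daemon_collective() is waiting for */
}

int main(void)
{
    send_relay();
    return 0;
}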


I may try to solve the bug (this is quite a top priority problem for me), 
but maybe people who are more familiar with orted than I am may propose a 
nice and clean solution ...


For those who like real (and complete) gdb stacks, here they are :
#0  0x003b7fed4f38 in poll () from /lib64/libc.so.6
#1  0x7fd0de5d861a in poll_dispatch (base=0x930230, arg=0x91a4b0, 
tv=0x7fff0d977880) at poll.c:167
#2  0x7fd0de5d586f in opal_event_base_loop (base=0x930230, flags=1) at 
event.c:823
#3  0x7fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#4  0x7fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#5  0x7fd0dd340a02 in daemon_collective (sender=0x97af50, data=0x97b010) at 
grpcomm_bad_module.c:696
#6  0x7fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97af20) at 
grpcomm_bad_module.c:901
#7  0x7fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#8  0x7fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at 
event.c:839
#9  0x7fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#10 0x7fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#11 0x7fd0dd340a02 in daemon_collective (sender=0x979700, data=0x9676e0) at 
grpcomm_bad_module.c:696
#12 0x7fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x9796d0) at 
grpcomm_bad_module.c:901
#13 0x7fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#14 0x7fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at 
event.c:839
#15 0x7fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#16 0x7fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#17 0x7fd0dd340a02 in daemon_collective (sender=0x97b420, data=0x97b4e0) at 
grpcomm_bad_module.c:696
#18 0x7fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97b3f0) at 
grpcomm_bad_module.c:901
#19 0x7fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#20 0x7fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at 
event.c:839
#21 0x7fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#22 0x7fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#23 0x7fd0dd969a8a in opal_condition_wait (c=0x97b210, m=0x97b1a8) at 
../../../../opal/threads/condition.h:99
#24 0x7fd0dd96a4bf in orte_rml_oob_send (peer=0x7fff0d9785a0, 
iov=0x7fff0d978530, count=1, tag=1, flags=16) at rml_oob_send.c:153
#25 0x7fd0dd96ac4d in orte_rml_oob_send_buffer (peer=0x7fff0d9785a0, 
buffer=0x7fff0d9786b0, tag=1, flags=0) at rml_oob_send.c:270
#26 0x7fd0de86ed2a in send_relay (buf=0x7fff0d9786b0) at 
orted/orted_comm.c:127
#27 0x7fd0de86f6de in orte_daemon_cmd_processor (fd=-1, opal_event=1, 
data=0x965fc0) at orted/orted_comm.c:308
#28 0x7fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#29 0x7fd0de5d597a in opal_event_base_loop (base=0x930230, flags=0) at 
event.c:839
#30 0x7fd0de5d556b in opal_event_loop (flags=0) at event.c:746
#31 0x7fd0de5d5418 in opal_event_dispatch () at event.c:682
#32 0x7fd0de86e339 in orte_daemon (argc=19, argv=0x7fff0d979ca8) at 
orted/orted_main.c:769
#33 0x004008e2 in main (argc=19, argv=0x7fff0d979ca8) at orted.c:62

Thanks in advance,
Sylvain


Re: [OMPI devel] [OMPI users] cartofile

2009-10-13 Thread Sylvain Jeaugey

We worked a bit on it and yes, there is some work to do :

* The syntax used to describe the various components is far from being 
consistent from one usage to another ("SOCKET", "NODE", ...). We managed to 
make things work by reading the various out-of-date example files - but mainly 
the code.


* The auto-detect component does not seem to do anything. We implemented 
it, and planned to release it. For now the code is heavily based on Linux 
kernel functionality but is missing the needed ifdefs.


Also, we wrote a patch to dump the detected (or read) topology in 
graphviz format.


Not much time to work on this right now, but if anyone wants to work on 
it, we may help.


Sylvain

On Tue, 13 Oct 2009, Ralph Castain wrote:


Here is where OMPI uses it:

ompi/mca/btl/openib/btl_openib_component.c:1918:static opal_carto_graph_t 
*host_topo;
ompi/mca/btl/openib/btl_openib_component.c:1923:opal_carto_base_node_t 
*device_node;
ompi/mca/btl/openib/btl_openib_component.c:1931:device_node = 
opal_carto_base_find_node(host_topo, device);
ompi/mca/btl/openib/btl_openib_component.c:1941: 
opal_carto_base_node_t *slot_node;
ompi/mca/btl/openib/btl_openib_component.c:1951:slot_node = 
opal_carto_base_find_node(host_topo, slot);
ompi/mca/btl/openib/btl_openib_component.c:1958:distance = 
opal_carto_base_spf(host_topo, slot_node, device_node);
ompi/mca/btl/openib/btl_openib_component.c:1989: 
opal_carto_base_get_host_graph(_topo, "Infiniband");
ompi/mca/btl/openib/btl_openib_component.c:1998: 
opal_carto_base_free_graph(host_topo);

ompi/mca/btl/sm/btl_sm.c:118:opal_carto_graph_t *topo;
ompi/mca/btl/sm/btl_sm.c:123:opal_carto_node_distance_t *dist;
ompi/mca/btl/sm/btl_sm.c:124:opal_carto_base_node_t *slot_node;
ompi/mca/btl/sm/btl_sm.c:129:if (OMPI_SUCCESS != 
opal_carto_base_get_host_graph(, "Memory")) {
ompi/mca/btl/sm/btl_sm.c:134: opal_value_array_init(, 
sizeof(opal_carto_node_distance_t));
ompi/mca/btl/sm/btl_sm.c:157: slot_node = opal_carto_base_find_node(topo, 
myslot);
ompi/mca/btl/sm/btl_sm.c:163: opal_carto_base_get_nodes_distance(topo, 
slot_node, "Memory", );
ompi/mca/btl/sm/btl_sm.c:168: dist = (opal_carto_node_distance_t *) 
opal_value_array_get_item(, 0);

ompi/mca/btl/sm/btl_sm.c:175: opal_carto_base_free_graph(topo);

No idea if it is of any value or not. I don't know of anyone who has ever 
written a carto file for a system, has any idea how to do so, or why they 
should. Looking at the code, it wouldn't appear to have any value on any of 
the machines at LANL, but I may be missing something - not a lot of help 
around to understand it.


On Oct 13, 2009, at 7:08 AM, Terry Dontje wrote:

After rereading the manpage for the umpteenth time I agree with Eugene that 
the information provided on cartofile is next to useless.   Ok, so you 
describe what your node looks like but what does mpirun or libmpi do with 
that information?  Other than the option to provide the cartofile it isn't 
obvious how a user or libmpi uses this information.


I've looked on the faq and wiki and have not found anything yet on how one 
"currently" uses cartofile.


--td

Eugene Loh wrote:
This e-mail was on the users alias... see 
http://www.open-mpi.org/community/lists/users/2009/09/10710.php


There wasn't much response, so let me ask another question.  How about if 
we remove the cartofile section from the DESCRIPTION section of the OMPI 
mpirun man page?  It's a lot of text that illustrates how to create a 
cartofile without saying anything about why one would want to go to the 
trouble.  What does this impact?  What does it change?  What's the 
motivation for doing this stuff?  What's this stuff good for?


Another alternative could be to move the cartofile description to a FAQ 
page.


The mpirun man page is rather long and I was thinking that if we could 
remove some "low impact" stuff out, we could improve the overall 
signal-to-noise ratio of the page.


In any case, I personally would like to know what cartofiles are good for.

Eugene Loh wrote:
Thank you, but I don't understand who is consuming this information for 
what.  E.g., the mpirun man page describes the carto file, but doesn't 
give users any indication whether they should be worrying about this.


Lenny Verkhovsky wrote:

Hi Eugene,
A carto file is a file with a static graph topology of your node.
In opal/mca/carto/file/carto_file.h you can see an example.
(Yes, I know, it should be in the help/man list :) )
Basically it describes a map of your node and its internal interconnections.
Hopefully it will be discovered automatically someday,
but for now you can describe your node manually.
Best regards Lenny.

On Thu, Sep 17, 2009 at 12:38 AM, Eugene Loh wrote:


  I feel like I should know, but what's a cartofile?  I guess you
  supply "topological" information about a host, but I can't tell
  how this information is used by, say, mpirun.



Re: [OMPI devel] Deadlock with comm_create since cid allocator change

2009-09-21 Thread Sylvain Jeaugey

You were faster to fix the bug than I was to send my bug report :-)

So I confirm : this fixes the problem.

Thanks !
Sylvain

On Mon, 21 Sep 2009, Edgar Gabriel wrote:

what version of OpenMPI did you use? Patch #21970 should have fixed this 
issue on the trunk...


Thanks
Edgar

Sylvain Jeaugey wrote:

Hi list,

We are currently experiencing deadlocks when using communicators other than 
MPI_COMM_WORLD. So we made a very simple reproducer (Comm_create then 
MPI_Barrier on the communicator - see end of e-mail).


We can reproduce the deadlock only with openib and with at least 8 cores 
(no success with sm) and after ~20 runs on average. Using a larger number of 
cores greatly increases the occurrence of the deadlock. When the deadlock 
occurs, every even process is stuck in MPI_Finalize and every odd process 
is in MPI_Barrier.


So we tracked the bug in the changesets and found out that this patch seems 
to have introduced the bug :


user:brbarret
date:Tue Aug 25 15:13:31 2009 +
summary: Per discussion in ticket #2009, temporarily disable the block 
CID allocation

algorithms until they properly reuse CIDs.

Reverting to the non multi-thread cid allocator makes the deadlock 
disappear.


I tried to dig further and understand why this makes a difference, with no 
luck.


If anyone can figure out what's happening, that would be great ...

Thanks,
Sylvain

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
int rank, numTasks;
int range[3];
MPI_Comm testComm, dupComm;
MPI_Group orig_group, new_group;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numTasks);
MPI_Comm_group(MPI_COMM_WORLD, &orig_group);
range[0] = 0; /* first rank */
range[1] = numTasks - 1; /* last rank */
range[2] = 1; /* stride */
MPI_Group_range_incl(orig_group, 1, &range, &new_group);
MPI_Comm_create(MPI_COMM_WORLD, new_group, &testComm);
MPI_Barrier(testComm);
MPI_Finalize();
return 0;
}

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




[OMPI devel] Deadlock with comm_create since cid allocator change

2009-09-21 Thread Sylvain Jeaugey

Hi list,

We are currently experiencing deadlocks when using communicators other 
than MPI_COMM_WORLD. So we made a very simple reproducer (Comm_create then 
MPI_Barrier on the communicator - see end of e-mail).


We can reproduce the deadlock only with openib and with at least 8 cores 
(no success with sm) and after ~20 runs on average. Using a larger number of 
cores greatly increases the occurrence of the deadlock. When the deadlock 
occurs, every even process is stuck in MPI_Finalize and every odd process 
is in MPI_Barrier.


So we tracked the bug in the changesets and found out that this patch seems 
to have introduced the bug :


user:brbarret
date:Tue Aug 25 15:13:31 2009 +
summary: Per discussion in ticket #2009, temporarily disable the block CID 
allocation
algorithms until they properly reuse CIDs.

Reverting to the non multi-thread cid allocator makes the deadlock 
disappear.


I tried to dig further and understand why this makes a difference, with no 
luck.


If anyone can figure out what's happening, that would be great ...

Thanks,
Sylvain

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
int rank, numTasks;
int range[3];
MPI_Comm testComm, dupComm;
MPI_Group orig_group, new_group;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numTasks);
MPI_Comm_group(MPI_COMM_WORLD, &orig_group);
range[0] = 0; /* first rank */
range[1] = numTasks - 1; /* last rank */
range[2] = 1; /* stride */
MPI_Group_range_incl(orig_group, 1, &range, &new_group);
MPI_Comm_create(MPI_COMM_WORLD, new_group, &testComm);
MPI_Barrier(testComm);
MPI_Finalize();
return 0;
}



Re: [OMPI devel] Deadlock on openib when using hindexed types

2009-09-04 Thread Sylvain Jeaugey

Ok, I was wrong, the fix works.

Actually, I rebuilt with the latest trunk but openib support was somehow 
dropped. I was running on tcp.


Which brings us to the next issue : tcp is actually not working (I don't 
know why I was convinced that tcp worked). The fix fixed the problem for 
openib, but if I'm not mistaken (again !) tcp still hangs.


Sylvain

On Fri, 4 Sep 2009, Sylvain Jeaugey wrote:


Hi Rolf,

I was indeed running a more than 4 weeks old trunk, but after pulling the 
latest version (and checking the patch was in the code), it seems to make no 
difference.


However, I know where to look at now, thanks !

Sylvain

On Fri, 4 Sep 2009, Rolf Vandevaart wrote:

I think you are running into a bug that we saw also and we recently fixed. 
We would see a hang when we were sending from a contiguous type to a 
non-contiguous type using a single port over openib.  The problem was that 
the state of the request on the sending side was not being properly updated 
in that case. The reason we see it with only one port vs two is because 
different protocols are used depending on the number of ports.


Don Kerr found and fixed the problem in both the trunk and the branch.

Trunk:  https://svn.open-mpi.org/trac/ompi/changeset/21775
1.3 Branch: https://svn.open-mpi.org/trac/ompi/changeset/21833

If you are running the latest bits and still seeing the problem, then I 
guess it is something else.


Rolf

On 09/04/09 04:40, Sylvain Jeaugey wrote:

Hi all,

We're currently working with romio and we hit a problem when exchanging 
data with hindexed types with the openib btl.


The attached reproducer (adapted from romio) is working fine on tcp, 
blocks on openib when using 1 port but works if we use 2 ports (!). I 
tested it against the trunk and the 1.3.3 release with the same 
conclusions.


The basic idea is : processes 0..3 send contiguous data to process 0. 0 
receives these buffers with an hindexed datatype which scatters data at 
different offsets.


Receiving in a contiguous manner works, but receiving with an hindexed 
datatype makes the remote sends block. Yes, the remote send, not the 
receive. The receive is working fine and data is correctly scattered on 
the buffer, but the senders on the other node are stuck in the Wait().


I tried not using MPI_BOTTOM, which changed nothing. It seems that the 
problem only occurs when STRIPE*NB (the size of the send) is higher than 
12k -namely the RDMA threshold- but I didn't manage to remove the deadlock 
by increasing the RDMA threshold.


I've tried to do some debugging, but I'm a bit lost on where the 
non-contiguous types are handled and how they affect btl communication.


So, if anyone has a clue on where I should look, I'm interested !

Thanks,
Sylvain




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--

=
rolf.vandeva...@sun.com
781-442-3043
=
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel






Re: [OMPI devel] RFC - "system-wide-only" MCA parameters

2009-09-04 Thread Sylvain Jeaugey
Understood. So, let's say that we're only implementing a hurdle to 
discourage users from doing things wrong. I guess the efficiency of this 
will reside in the message displayed to the user ("You are about to break 
the entire machine and you will be fined if you try to circumvent this in 
any way").


Maybe the warning message should be set by administrators 
($OMPI/.../no-override.txt). It would certainly be more efficient :)


Sylvain

On Fri, 4 Sep 2009, Ralph Castain wrote:

I fear you all misunderstood me. This isn't a case of sabotage or nasty 
users, but simply people who do something that they don't realize can cause a 
problem.


Our example is quite simple. We have IB network for MPI messages, and several 
Ethernet NICs that are dedicated to system-level functions (e.g., RM 
communications, file system support). If the users access the TCP BTL, that 
code will utilize whatever TCP interface it wants - including the 
system-level ones.


The obvious solution is to set the btl_tcp_include param in the default MCA 
param file. However, in their ignorance, users will do an ompi_info, see that 
param, see the available interfaces, and set it improperly.


Your solution won't solve that problem. When users encounter such obstacles, 
it is because they are trying to run normally (i.e., using defaults) and 
encountering problems - which usually have nothing to do with constraints 
imposed in the default params. They talk to each other and discover that 
"joe" built his own version of OMPI and was able to run. Presto - they use 
his, which doesn't have the same protections as the production version.


And now they make a mistake that causes a problem.

So this isn't a security issue, nor a problem where somebody is trying to be 
stupid or do bad things. It is an inherent "problem" in OMPI that is caused 
by our desire to provide "flexibility" and "control" to the users, as opposed 
to deliberately restricting "control" to the sys admins.


My intent was not to argue that this isn't worth doing, but simply to warn 
you that similar attempts met with failure to fully achieve the desired goal.



On Sep 4, 2009, at 7:59 AM, Nadia Derbey wrote:


On Fri, 2009-09-04 at 07:50 -0600, Ralph Castain wrote:

Let me point out the obvious since this has plagued us at LANL with
regard to this concept. If a user wants to do something different, all
they have to do is download and build their own copy of OMPI.

Amazingly enough, that is exactly what they do. When we build our
production versions, we actually "no-build" modules we don't want them
using (e.g., certain BTL's that use privileged network interfaces) so
even MCA params won't let them do something undesirable.

No good - they just try until they realize it won't work, then
download and build their own version...and merrily hose the system.

My point here: this concept can help, but it should in no way be
viewed as a solution to the problem you are trying to solve. It is at
best a minor obstacle as we made it very simple for a user to
circumvent such measures.

Which is why I never made the effort to actually implement what was in
that ticket. It was decided that it really wouldn't help us here, and
would only result in further encouraging user-owned builds.


Ralph,

Let's forget those people who intentionally do bad things: it's true
that they will always find a way to bypass whatever has been done...

We are not talking about security here, but there are client sites where
people do not want to care about some mca params values and where those
system-wide params should not be *unintentionally* set to different
values.

Regards,
Nadia




:-(


On Sep 4, 2009, at 12:42 AM, Jeff Squyres wrote:


On Sep 4, 2009, at 8:26 AM, Nadia Derbey wrote:


Can the file name ( openmpi-priv-mca-params.conf ) also be

configurable ?

No, it isn't, presently, but this can be changed if needed.




If it's configurable, it must be configurable at configure time --
not run time -- otherwise, a user could just give a different
filename at runtime and get around all the "privileged" values.

--
Jeff Squyres
jsquy...@cisco.com

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
Nadia Derbey 

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] RFC - "system-wide-only" MCA parameters

2009-09-04 Thread Sylvain Jeaugey

Looks like users at LANL are not very nice ;)

Indeed, this is no hard security, only a way to prevent users from making 
mistakes. We often give users special tuning for their application and 
when they see their application is going faster, they start messing with 
every parameter hoping that it will go even faster.


So, this feature is to prevent the dumb user from breaking everything, not 
to prevent real sabotage.


Sylvain

On Fri, 4 Sep 2009, Ralph Castain wrote:

Let me point out the obvious since this has plagued us at LANL with regard to 
this concept. If a user wants to do something different, all they have to do 
is download and build their own copy of OMPI.


Amazingly enough, that is exactly what they do. When we build our production 
versions, we actually "no-build" modules we don't want them using (e.g., 
certain BTL's that use privileged network interfaces) so even MCA params 
won't let them do something undesirable.


No good - they just try until they realize it won't work, then download and 
build their own version...and merrily hose the system.


My point here: this concept can help, but it should in no way be viewed as a 
solution to the problem you are trying to solve. It is at best a minor 
obstacle as we made it very simple for a user to circumvent such measures.


Which is why I never made the effort to actually implement what was in that 
ticket. It was decided that it really wouldn't help us here, and would only 
result in further encouraging user-owned builds.


:-(


On Sep 4, 2009, at 12:42 AM, Jeff Squyres wrote:


On Sep 4, 2009, at 8:26 AM, Nadia Derbey wrote:


Can the file name ( openmpi-priv-mca-params.conf ) also be configurable ?


No, it isn't, presently, but this can be changed if needed.




If it's configurable, it must be configurable at configure time -- not run 
time -- otherwise, a user could just give a different filename at runtime 
and get around all the "privileged" values.


--
Jeff Squyres
jsquy...@cisco.com

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] Deadlock on openib when using hindexed types

2009-09-04 Thread Sylvain Jeaugey

Hi Rolf,

I was indeed running a more than 4 weeks old trunk, but after pulling the 
latest version (and checking the patch was in the code), it seems to make 
no difference.


However, I know where to look at now, thanks !

Sylvain

On Fri, 4 Sep 2009, Rolf Vandevaart wrote:

I think you are running into a bug that we saw also and we recently fixed. 
We would see a hang when we were sending from a contiguous type to a 
non-contiguous type using a single port over openib.  The problem was that 
the state of the request on the sending side was not being properly updated 
in that case. The reason we see it with only one port vs two is because 
different protocols are used depending on the number of ports.


Don Kerr found and fixed the problem in both the trunk and the branch.

Trunk:  https://svn.open-mpi.org/trac/ompi/changeset/21775
1.3 Branch: https://svn.open-mpi.org/trac/ompi/changeset/21833

If you are running the latest bits and still seeing the problem, then I guess 
it is something else.


Rolf

On 09/04/09 04:40, Sylvain Jeaugey wrote:

Hi all,

We're currently working with romio and we hit a problem when exchanging 
data with hindexed types with the openib btl.


The attached reproducer (adapted from romio) is working fine on tcp, blocks 
on openib when using 1 port but works if we use 2 ports (!). I tested it 
against the trunk and the 1.3.3 release with the same conclusions.


The basic idea is : processes 0..3 send contiguous data to process 0. 0 
receives these buffers with an hindexed datatype which scatters data at 
different offsets.


Receiving in a contiguous manner works, but receiving with an hindexed 
datatype makes the remote sends block. Yes, the remote send, not the 
receive. The receive is working fine and data is correctly scattered on the 
buffer, but the senders on the other node are stuck in the Wait().


I tried not using MPI_BOTTOM, which changed nothing. It seems that the 
problem only occurs when STRIPE*NB (the size of the send) is higher than 
12k -namely the RDMA threshold- but I didn't manage to remove the deadlock 
by increasing the RDMA threshold.


I've tried to do some debugging, but I'm a bit lost on where the 
non-contiguous types are handled and how they affect btl communication.


So, if anyone has a clue on where I should look, I'm interested !

Thanks,
Sylvain




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--

=
rolf.vandeva...@sun.com
781-442-3043
=
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] RFC - "system-wide-only" MCA parameters

2009-09-04 Thread Sylvain Jeaugey

On Fri, 4 Sep 2009, Jeff Squyres wrote:

I haven't looked at the code deeply, so forgive me if I'm parsing this wrong: 
is the code actually reading the file into one list and then moving the 
values to another list?  If so, that seems a little hackish.  Can't it just 
read directly to the target list?
On the basic approach, I would have another suggestion that reduces parsing 
and is maybe a bit less hackish : do not introduce another file but only a 
keyword indicating that further overriding is disabled ("fixed", 
"restricted", "read-only" ?).


You would therefore write in your configuration file something like:
notifier_threshold_severity=notice fixed
or more generally :
key=value flags

Maybe we don't have a way to differentiate flags at the end with the 
current parser, so maybe a leading "!" or "%" or any other strong 
character would be simpler to implement while still ensuring 
backward compatibility.
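
As a rough illustration of how such lines could be parsed (this is not the 
real MCA parameter file parser; the struct, names and sscanf-based approach 
are purely illustrative):

#include <stdio.h>
#include <string.h>

struct mca_line { char key[64]; char value[64]; int fixed; };

static int parse_line(const char *line, struct mca_line *out)
{
    char flag[16] = "";
    out->fixed = 0;
    /* %63[^= ] : key up to '=' or space ; %63s : value token ; optional flag */
    int n = sscanf(line, " %63[^= ] = %63s %15s", out->key, out->value, flag);
    if (n < 2)
        return -1;                       /* not a key=value line */
    if (n == 3 && strcmp(flag, "fixed") == 0)
        out->fixed = 1;
    return 0;
}

int main(void)
{
    struct mca_line p;
    if (parse_line("notifier_threshold_severity = notice fixed", &p) == 0)
        printf("%s = %s%s\n", p.key, p.value, p.fixed ? " (fixed)" : "");
    return 0;
}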


Sylvain


Re: [OMPI devel] RFC - "system-wide-only" MCA parameters

2009-09-04 Thread Sylvain Jeaugey

On Fri, 4 Sep 2009, Jeff Squyres wrote:


--
*** Checking versions
checking for SVN version... done
checking Open MPI version... 1.4a1hgf11244ed72b5
up to changeset c4b117c5439b
checking Open MPI release date... Unreleased developer copy
checking Open MPI Subversion repository version... hgf11244ed72b5
up to changeset c4b117c5439b
checking for SVN version... done
...etc.
--

Do you see this, or do you get a single-line version number?

I get the same. The reason is simple :

$ hg tip
changeset:   9:f11244ed72b5
tag: tip
user:Nadia Derbey 
date:Thu Sep 03 14:21:47 2009 +0200
summary: up to changeset c4b117c5439b

$ hg -v tip | grep changeset | cut -d: -f3 # done by configure
f11244ed72b5
up to changeset c4b117c5439b

So yes, if anyone includes the word "changeset" in the commit message, 
you'll have the same bug :-)


So,
hg -R "$srcdir" tip | head -1 | grep "^changeset:" | cut -d: -f3
would certainly be safer.

Sylvain


[OMPI devel] Deadlock on openib when using hindexed types

2009-09-04 Thread Sylvain Jeaugey

Hi all,

We're currently working with romio and we hit a problem when exchanging 
data with hindexed types with the openib btl.


The attached reproducer (adapted from romio) is working fine on tcp, 
blocks on openib when using 1 port but works if we use 2 ports (!). I 
tested it against the trunk and the 1.3.3 release with the same 
conclusions.


The basic idea is : processes 0..3 send contiguous data to process 0. 0 
receives these buffers with an hindexed datatype which scatters data at 
different offsets.


Receiving in a contiguous manner works, but receiving with an hindexed 
datatype makes the remote sends block. Yes, the remote send, not the 
receive. The receive is working fine and data is correctly scattered on 
the buffer, but the senders on the other node are stuck in the Wait().


I tried not using MPI_BOTTOM, which changed nothing. It seems that the 
problem only occurs when STRIPE*NB (the size of the send) is higher than 
12k -namely the RDMA threshold- but I didn't manage to remove the 
deadlock by increasing the RDMA threshold.
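
The pattern boils down to something like the following condensed sketch (not 
the attached reproducer; block sizes, offsets and the tag are illustrative, 
the point is only "contiguous send, hindexed receive, message larger than the 
12k RDMA threshold"):

#include <mpi.h>
#include <stdlib.h>

#define BLOCK (16 * 1024)  /* comfortably above a ~12k eager/RDMA threshold */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Every rank sends one contiguous block to rank 0. */
    char *sendbuf = calloc(BLOCK, 1);
    MPI_Request sreq;
    MPI_Isend(sendbuf, BLOCK, MPI_CHAR, 0, 42, MPI_COMM_WORLD, &sreq);

    if (rank == 0) {
        /* Rank 0 receives each block through an hindexed type that splits it
         * into two pieces landing at non-contiguous offsets of recvbuf. */
        char *recvbuf = malloc((size_t)2 * BLOCK * nprocs);
        MPI_Request *rreq = malloc(nprocs * sizeof(MPI_Request));
        for (int i = 0; i < nprocs; i++) {
            int lens[2] = { BLOCK / 2, BLOCK / 2 };
            MPI_Aint disps[2];
            MPI_Get_address(recvbuf + (size_t)(2 * i) * BLOCK, &disps[0]);
            MPI_Get_address(recvbuf + (size_t)(2 * i + 1) * BLOCK, &disps[1]);
            MPI_Datatype t;
            MPI_Type_create_hindexed(2, lens, disps, MPI_CHAR, &t);
            MPI_Type_commit(&t);
            MPI_Irecv(MPI_BOTTOM, 1, t, i, 42, MPI_COMM_WORLD, &rreq[i]);
            MPI_Type_free(&t);  /* kept alive internally until the receive completes */
        }
        MPI_Waitall(nprocs, rreq, MPI_STATUSES_IGNORE);
        free(rreq);
        free(recvbuf);
    }

    MPI_Wait(&sreq, MPI_STATUS_IGNORE);  /* the remote senders reportedly hang here */
    free(sendbuf);
    MPI_Finalize();
    return 0;
}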


I've tried to do some debugging, but I'm a bit lost on where the 
non-contiguous types are handled and how they affect btl communication.


So, if anyone has a clue on where I should look, I'm interested !

Thanks,
Sylvain

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

typedef struct {
long long *offsets;
int *lens;
MPI_Aint *mem_ptrs;
int count;
} ADIOI_Access;

#define STRIDE	190
#define NB	129
#define NPROCS	4
#define SIZE	(STRIDE*NB*NPROCS)
char buf1[SIZE], buf2[SIZE], my_procname[MPI_MAX_PROCESSOR_NAME];

int main(int argc, char **argv) {
int myrank, nprocs, i, j, k, my_procname_len, value, buf_idx = 0, nprocs_recv, nprocs_send;
ADIOI_Access *others_req;
MPI_Datatype *recv_types;
MPI_Request *requests;
MPI_Status *statuses;

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (nprocs != NPROCS) {
printf("This program must be run with exactly 4 processes\n");
goto exit;
}

MPI_Get_processor_name(my_procname, &my_procname_len);
printf("Process %d running on %s\n", myrank, my_procname);
MPI_Barrier(MPI_COMM_WORLD);

for (i=0; i

Re: [OMPI devel] RFC: convert send to ssend

2009-08-24 Thread Sylvain Jeaugey

For the record, I see a big interest in this.

Sometimes, you have to answer calls for tender featuring applications that 
must work with no code change, even if the code is completely not 
MPI-compliant.


That's sad, but true (no pun intended :-))
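
As an illustration of the kind of non-compliant code such a mode would catch 
(a made-up example, not taken from any real application): both ranks in a 
pair do a standard send before posting the matching receive. With eager 
buffering this usually appears to work; converted to synchronous sends it 
reliably deadlocks, which is exactly the red flag a send-to-ssend debug mode 
would raise.

#include <mpi.h>
#include <stdio.h>

/* Run with an even number of ranks; ranks are paired 0<->1, 2<->3, ... */
int main(int argc, char **argv)
{
    int rank, peer, out, in = -1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = rank ^ 1;
    out  = rank;

    /* Unsafe pattern: a standard send is allowed to block until the matching
     * receive is posted, and here both sides of the pair send first. */
    MPI_Send(&out, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(&in, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %d from rank %d\n", rank, in, peer);
    MPI_Finalize();
    return 0;
}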

Sylvain

On Mon, 24 Aug 2009, George Bosilca wrote:

Do people know that there exist tools for checking MPI code correctness? 
Many, many tools and most of them are freely available.


Personally I don't see any interest in doing this, absolutely no interest. 
There is basically no added value to our MPI, except for a very limited 
number of users, and these users, if they manage to write a parallel 
application that needs this checking, I'm sure will greatly benefit from a 
real tool to help them correct their MPI code.


As a side note, a very similar effect can be obtained by decreasing the eager 
size of the BTLs to be equal to the size of the match header, which is about 
24 bytes.


george.

On Aug 24, 2009, at 11:11 , Samuel K. Gutierrez wrote:


Hi Jeff,

Sounds good to me.

Samuel K. Gutierrez


Jeff Squyres wrote:
The debug builds already have quite a bit of performance overhead.  It 
might be desirable to change this RFC to have a similar tri-state as the 
MPI parameter checking:


- compiled out
- compiled in, always check
- compiled in, use MCA parameter to determine whether to check

Adapting that to this RFC, perhaps something like this:

- compiled out
- compiled in, always convert standard send to sync send
- compiled in, use MCA parameter to determine whether to convert standard 
-> sync


And we can leave the default as "compiled out".

Howzat?


On Aug 23, 2009, at 9:07 PM, Samuel K. Gutierrez wrote:


Hi all,

How about exposing this functionality as a run-time parameter that is only
available in debug builds?  This will make debugging easier and won't
impact the performance of optimized builds.  Just an idea...

Samuel K. Gutierrez



- "Jeff Squyres"  wrote:


Does anyone have any suggestions?  Or are we stuck
with compile-time checking?


I didn't see this until now, but I'd be happy with
just a compile time option so we could produce an
install just for debugging purposes and have our
users explicitly select it with modules.

I have to say that this is of interest to us as we're
trying to help a researcher at one of our member uni's
to track down a bug where a message appears to go missing.

cheers!
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] Improvement of openmpi.spec

2009-08-06 Thread Sylvain Jeaugey

Hi Jeff,

Thanks for reviewing my changes !

On Thu, 6 Aug 2009, Jeff Squyres wrote:


-Source: openmpi-%{version}.tar.$EXTENSION
+Source: %{name}-%{version}.tar.$EXTENSION

Does this mean that you're looking for a different tarball name?  I'm not 
sure that's good; the tarball should be an openmpi tarball, regardless of 
what name it gets installed under (e.g., OFED builds an OMPI tarball 3-4 
different ways [one for each compiler] and changes %name, but uses the same 
tarball.  How about another param (hey, we've got something like 100, so 
what's 101? ;-) ) for the tarball that defaults to "openmpi"?  They if you 
want to have a differently-named tarball, you can.
Well, maybe we could live with an openmpi tarball ... it was just to be 
consistent. When I build bullmpi-a.b.c.src.rpm, I somehow expect the tar 
file to be bullmpi-a.b.c.tar.gz.



-%setup -q -n openmpi-%{version}
+%setup -q -n %{name}-%{version}

Ditto for this.

-%dir %{_libdir}/openmpi
+%dir %{_libdir}/%{name}

Hmm -- is this right?  I thought that the name "openmpi" in this directory 
path came from OMPI's configure script, not from the RPM spec...?  Or is the 
RPM build command passing --pkgname or somesuch to OMPI's configure to 
override the built-in name?
Hum, I guess you're right, this is indeed not something to change. Sorry 
about that.


Sylvain


On Jul 31, 2009, at 11:51 AM, Sylvain Jeaugey wrote:


Hi all,

We had to apply a little set of modifications to the openmpi.spec file to 
help us integrate openmpi in our cluster distribution.


So here is a patch which, as the changelog suggests, does a couple of 
"improvements" :

- Fix a typo in Summary
- Replace openmpi by %{name} in a couple of places
- Add an %{opt_prefix} option to be able to install in a specific path 
(e.g. in /opt/<vendor>/mpi/<name>-<version>/ instead of 
/opt/<name>-<version>)


The patch is done with "hg extract" but should apply on the SVN trunk.

Sylvain___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
jsquy...@cisco.com

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



[OMPI devel] Improvement of openmpi.spec

2009-07-31 Thread Sylvain Jeaugey

Hi all,

We had to apply a little set of modifications to the openmpi.spec file to 
help us integrate openmpi in our cluster distribution.


So here is a patch which, as the changelog suggests, does a couple of 
"improvements" :

 - Fix a typo in Summary
 - Replace openmpi by %{name} in a couple of places
 - Add an %{opt_prefix} option to be able to install in a specific path 
(e.g. in /opt/<vendor>/mpi/<name>-<version>/ instead of 
/opt/<name>-<version>)


The patch is done with "hg extract" but should apply on the SVN trunk.

Sylvain# HG changeset patch
# User Sylvain Jeaugey <sylvain.jeau...@bull.net>
# Date 1249043994 -7200
# Node ID c0ba098845e0d93abeb0e3915cb8aa41a73525cf
# Parent  d5402dd00ab21be9afedc590c9e2f2f7da5d2ba8
Fixed typo, replaced hardcoded openmpi by %{name} and added %{opt_prefix} for vendor specific install paths.

diff -r d5402dd00ab2 -r c0ba098845e0 contrib/dist/linux/openmpi.spec
--- a/contrib/dist/linux/openmpi.spec	Thu Jul 30 06:40:10 2009 +0200
+++ b/contrib/dist/linux/openmpi.spec	Fri Jul 31 14:39:54 2009 +0200
@@ -58,6 +58,8 @@
 # instead of the default /usr/
 # type: bool (0/1)
 %{!?install_in_opt: %define install_in_opt 0}
+# type: string (prefix for installation)
+%{!?opt_prefix: %define opt_prefix /opt}
 
 # Define this if you want this RPM to install environment setup
 # shell scripts.
@@ -170,7 +172,6 @@
 %define use_mpi_selector 1
 %endif
 
-
 #
 #
 # Configuration Logic
@@ -178,22 +179,22 @@
 #
 
 %if %{install_in_opt}
-%define _prefix /opt/%{name}/%{version}
-%define _sysconfdir /opt/%{name}/%{version}/etc
-%define _libdir /opt/%{name}/%{version}/lib
-%define _includedir /opt/%{name}/%{version}/include
-%define _mandir /opt/%{name}/%{version}/man
+%define _prefix %{opt_prefix}/%{name}/%{version}
+%define _sysconfdir %{opt_prefix}/%{name}/%{version}/etc
+%define _libdir %{opt_prefix}/%{name}/%{version}/lib
+%define _includedir %{opt_prefix}/%{name}/%{version}/include
+%define _mandir %{opt_prefix}/%{name}/%{version}/man
 # Note that the name "openmpi" is hard-coded in
 # opal/mca/installdirs/config for pkgdatadir; there is currently no
 # easy way to have OMPI change this directory name internally.  So we
 # just hard-code that name here as well (regardless of the value of
 # %{name} or %{_name}).
-%define _pkgdatadir /opt/%{name}/%{version}/share/openmpi
+%define _pkgdatadir %{opt_prefix}/%{name}/%{version}/share/openmpi
 # Per advice from Doug Ledford at Red Hat, docdir is supposed to be in
 # a fixed location.  But if you're installing a package in /opt, all
 # bets are off.  So feel free to install it anywhere in your tree.  He
 # suggests $prefix/doc.
-%define _defaultdocdir /opt/%{name}/%{version}/doc
+%define _defaultdocdir %{opt_prefix}/%{name}/%{version}/doc
 %endif
 
 %if !%{build_debuginfo_rpm}
@@ -229,19 +230,19 @@
 #
 #
 
-Summary: A powerful implementaion of MPI
+Summary: A powerful implementation of MPI
 Name: %{?_name:%{_name}}%{!?_name:openmpi}
 Version: $VERSION
 Release: 1
 License: BSD
 Group: Development/Libraries
-Source: openmpi-%{version}.tar.$EXTENSION
+Source: %{name}-%{version}.tar.$EXTENSION
 Packager: %{?_packager:%{_packager}}%{!?_packager:%{_vendor}}
 Vendor: %{?_vendorinfo:%{_vendorinfo}}%{!?_vendorinfo:%{_vendor}}
 Distribution: %{?_distribution:%{_distribution}}%{!?_distribution:%{_vendor}}
 Prefix: %{_prefix}
 Provides: mpi
-BuildRoot: /var/tmp/%{name}-%{version}-%{release}-root
+BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root
 %if %{disable_auto_requires}
 AutoReq: no
 %endif
@@ -345,7 +346,7 @@
 # there that are not meant to be packaged.
 rm -rf $RPM_BUILD_ROOT
 
-%setup -q -n openmpi-%{version}
+%setup -q -n %{name}-%{version}
 
 #
 #
@@ -616,11 +617,11 @@
 %{_sysconfdir}
 %endif
 # If %{instal_in_opt}, then we're instaling OMPI to
-# /opt/openmpi/.  But be sure to also explicitly mention
-# /opt/openmpi so that it can be removed by RPM when everything under
+# %{opt_prefix}/openmpi/.  But be sure to also explicitly mention
+# %{opt_prefix}/openmpi so that it can be removed by RPM when everything under
 # there is also removed.
 %if %{install_in_opt}
-%dir /opt/%{name}
+%dir %{opt_prefix}/%{name}
 %endif
 # If we're installing the modulefile, get that, too
 %if %{install_modulefile}
@@ -652,13 +653,13 @@
 %{_sysconfdir}
 %endif
 # If %{instal_in_opt}, then we're instaling OMPI to
-# /opt/openmpi/.  But be sure to also explicitly mention
-# /opt/openmpi so that it can be removed by RPM when everything under
-# there is also removed.  Also list /opt/openmpi//share so
+# %{opt_prefix}/.  But be sure to also explicitly mention
+# %{opt_prefix}/openmpi so that it can be removed by RPM when everything under
+# there is also remov

Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support

2009-07-22 Thread Sylvain Jeaugey

Hi Jeff,

I'm interested in joining the effort, since we will likely have the same 
problem with SLURM's cpuset support.


On Wed, 22 Jul 2009, Jeff Squyres wrote:

But as to why it's getting EINVAL, that could be wonky.  We might want to 
take this to the PLPA list and have you run some small, non-MPI examples to 
ensure that PLPA is parsing your /sys tree properly, etc.
I don't see the /sys implication here. Can you be more precise on which 
files are read to determine placement ?


IIRC, when you are inside a cpuset, you can see all cpus (/sys should be 
unmodified) but calling sched_setaffinity with a mask containing a cpu 
outside the cpuset will return EINVAL. The only solution I see to solve 
this would be to get the "allowed" cpus with sched_getaffinity, 
which should be set according to the cpuset mask.
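
As an illustration, something along these lines (Linux-specific, and not the 
actual PLPA code) reports exactly the set of cpus a process is allowed to 
bind within, whether or not it runs inside a cpuset:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    /* The kernel returns the mask already restricted to the current cpuset. */
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) {
        perror("sched_getaffinity");
        return 1;
    }
    printf("%d usable CPU(s):", CPU_COUNT(&allowed));
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &allowed))
            printf(" %d", cpu);
    printf("\n");
    return 0;
}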


Sylvain


Re: [OMPI devel] Use of OPAL_PREFIX to relocate a lib

2009-06-19 Thread Sylvain Jeaugey

On Thu, 18 Jun 2009, Jeff Squyres wrote:


On Jun 18, 2009, at 11:25 AM, Sylvain Jeaugey wrote:


My problem seems related to library generation through RPM, not to
1.3.2, nor to the patch.



I'm not sure I understand -- is there something we need to fix in our SRPM?


I need to dig a bit, but here is the thing : I generated an RPM from the 
official openmpi-1.3.2-1.src.rpm (with some defines like install-in-opt, 
...) and the OPAL_PREFIX trick doesn't seem to work with it.


But don't take too much time on this, I'll find out why and maybe this is 
just me building it the wrong way.


Sylvain


Re: [OMPI devel] Use of OPAL_PREFIX to relocate a lib

2009-06-18 Thread Sylvain Jeaugey

Ok, never mind.

My problem seems related to library generation through RPM, not to 
1.3.2, nor to the patch.


Sylvain

On Thu, 18 Jun 2009, Sylvain Jeaugey wrote:


Hi all,

Until Open MPI 1.3 (maybe 1.3.1), I used to find it convenient to be able to 
move a library from its "normal" place (either /usr or /opt) to somewhere 
else (i.e. my NFS home account) to be able to try things only on my account.


So, I used to set OPAL_PREFIX to the root of the Open MPI directory and all 
went fine.


I don't know if relocation was intended in the first place, but with 1.3.2, 
this seems to be broken.


It may have something to do with this patch (and maybe others) :

# HG changeset patch
# User bosilca
# Date 1159647750 0
# Node ID c7152b893f1ce1bc54eea2dc3f06c7e359011fdd
# Parent  676a8fbdbb161f0b84a1c6bb12e2324c8a749c56
All the OPAL_ defines from the install_dirs.h contain ABSOLUTE path. 
Therefore,

there is no need to prepend OPAL_PREFIX to them.

diff -r 676a8fbdbb16 -r c7152b893f1c opal/tools/wrappers/opal_wrapper.c
--- a/opal/tools/wrappers/opal_wrapper.cFri Sep 29 23:58:58 2006 
+
+++ b/opal/tools/wrappers/opal_wrapper.cSat Sep 30 20:22:30 2006 
+

@@ -561,9 +561,9 @@
if (0 != strcmp(OPAL_INCLUDEDIR, "/usr/include")) {
char *line;
#if defined(__WINDOWS__)
-asprintf(&line, OPAL_INCLUDE_PATTERN OPAL_PREFIX "\"\\%s\"", OPAL_INCLUDEDIR);
+asprintf(&line, OPAL_INCLUDE_PATTERN "\"\\%s\"", OPAL_INCLUDEDIR);
#else
-asprintf(&line, OPAL_INCLUDE_PATTERN OPAL_PREFIX"/%s", OPAL_INCLUDEDIR);
+asprintf(&line, OPAL_INCLUDE_PATTERN "/%s", OPAL_INCLUDEDIR);
#endif  /* defined(__WINDOWS__) */
opal_argv_append_nosize(_flags, line);
free(line);

George, is there a rationale behind this patch for disabling relocation of 
libraries ? Do you think reverting only this patch would bring back the 
relocation functionality ?


TIA,

Sylvain



[OMPI devel] Use of OPAL_PREFIX to relocate a lib

2009-06-18 Thread Sylvain Jeaugey

Hi all,

Until Open MPI 1.3 (maybe 1.3.1), I used to find it convenient to be able 
to move a library from its "normal" place (either /usr or /opt) to 
somewhere else (i.e. my NFS home account) to be able to try things only on 
my account.


So, I used to set OPAL_PREFIX to the root of the Open MPI directory and 
all went fine.


I don't know if relocation was intended in the first place, but with 
1.3.2, this seems to be broken.


It may have something to do with this patch (and maybe others) :

# HG changeset patch
# User bosilca
# Date 1159647750 0
# Node ID c7152b893f1ce1bc54eea2dc3f06c7e359011fdd
# Parent  676a8fbdbb161f0b84a1c6bb12e2324c8a749c56
All the OPAL_ defines from the install_dirs.h contain ABSOLUTE path. 
Therefore,

there is no need to prepend OPAL_PREFIX to them.

diff -r 676a8fbdbb16 -r c7152b893f1c opal/tools/wrappers/opal_wrapper.c
--- a/opal/tools/wrappers/opal_wrapper.cFri Sep 29 23:58:58 2006 +
+++ b/opal/tools/wrappers/opal_wrapper.cSat Sep 30 20:22:30 2006 +
@@ -561,9 +561,9 @@
 if (0 != strcmp(OPAL_INCLUDEDIR, "/usr/include")) {
 char *line;
 #if defined(__WINDOWS__)
-asprintf(&line, OPAL_INCLUDE_PATTERN OPAL_PREFIX "\"\\%s\"", OPAL_INCLUDEDIR);
+asprintf(&line, OPAL_INCLUDE_PATTERN "\"\\%s\"", OPAL_INCLUDEDIR);
#else
-asprintf(&line, OPAL_INCLUDE_PATTERN OPAL_PREFIX"/%s", OPAL_INCLUDEDIR);
+asprintf(&line, OPAL_INCLUDE_PATTERN "/%s", OPAL_INCLUDEDIR);
 #endif  /* defined(__WINDOWS__) */
 opal_argv_append_nosize(_flags, line);
 free(line);

George, is there a rationale behind this patch for disabling relocation of 
libraries ? Do you think reverting only this patch would bring back the 
relocation functionality ?


TIA,

Sylvain


Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-12 Thread Sylvain Jeaugey

Hi Ralph,

I managed to get a deadlock after a whole night, but not the same one you 
have : after a quick analysis, process 0 seems to be blocked in the very 
first send through shared memory. Still maybe a bug, but not the same as 
yours IMO.


I also figured out that libnuma support was not in my library, so I 
rebuilt the lib and this doesn't seem to change anything : same execution 
speed, same memory footprint, and of course same the-bug-does-not-appear 
:-(.


So, no luck so far in reproducing your problem. I guess you're the only 
one to be able to progress on this (since you seem to have a real 
reproducer).


Sylvain

On Wed, 10 Jun 2009, Sylvain Jeaugey wrote:

Hum, very glad that padb works with Open MPI, I couldn't live without it. In 
my opinion, the best debug tool for parallel applications, and more 
importantly, the only one that scales.


About the issue, I couldn't reproduce it on my platform (tried 2 nodes with 2 
to 8 processes each, nodes are twin 2.93 GHz Nehalem, IB is Mellanox QDR).


So my feeling about that is that it may be very hardware related. Especially 
if you use the hierarch component, some transactions will be done through 
RDMA on one side and read directly through shared memory on the other side, 
which can, depending on the hardware, produce very different timings and 
bugs. Did you try with a different collective component (i.e. not hierarch) ? 
Or with another interconnect ? [Yes, of course, if it is a race condition, we 
might well avoid the bug because timings will be different, but that's still 
information]


Perhaps all I'm saying makes no sense or you already thought about this; 
anyway, if you want me to try different things, just let me know.


Sylvain

On Wed, 10 Jun 2009, Ralph Castain wrote:


Hi Ashley

Thanks! I would definitely be interested and will look at the tool. 
Meantime, I have filed a bunch of data on this in
ticket #1944, so perhaps you might take a glance at that and offer some 
thoughts?


https://svn.open-mpi.org/trac/ompi/ticket/1944

Will be back after I look at the tool.

Thanks again
Ralph


On Wed, Jun 10, 2009 at 8:51 AM, Ashley Pittman <ash...@pittman.co.uk> 
wrote:


  Ralph,

  If I may say this is exactly the type of problem the tool I have been
  working on recently aims to help with and I'd be happy to help you
  through it.

  Firstly I'd say of the three collectives you mention, MPI_Allgather,
  MPI_Reduce and MPI_Bcast, one exhibits a many-to-many, one a many-to-one
  and the last a one-to-many communication pattern.  The scenario of a
  root process falling behind and getting swamped in comms is a plausible
  one for MPI_Reduce only but doesn't hold water with the other two.  You
  also don't mention if the loop is over a single collective or if you
  have a loop calling a number of different collectives each iteration.

  padb, the tool I've been working on, has the ability to look at parallel
  jobs and report on the state of collective comms and should help narrow
  you down on erroneous processes and those simply blocked waiting for
  comms.  I'd recommend using it to look at maybe four or five instances
  where the application has hung and look for any common features between
  them.

  Let me know if you are willing to try this route and I'll talk; the code
  is downloadable from http://padb.pittman.org.uk and if you want the full
  collective functionality you'll need to patch openmpi with the patch from
  http://padb.pittman.org.uk/extensions.html

  Ashley,

  --

  Ashley Pittman, Bath, UK.

  Padb - A parallel job inspection tool for cluster computing
  http://padb.pittman.org.uk

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Sylvain Jeaugey
Hum, very glad that padb works with Open MPI, I couldn't live without it. 
In my opinion, the best debug tool for parallel applications, and more 
importantly, the only one that scales.


About the issue, I couldn't reproduce it on my platform (tried 2 nodes 
with 2 to 8 processes each, nodes are twin 2.93 GHz Nehalem, IB is 
Mellanox QDR).


So my feeling about that is that it may be very hardware related. 
Especially if you use the hierarch component, some transactions will be 
done through RDMA on one side and read directly through shared memory on 
the other side, which can, depending on the hardware, produce very 
different timings and bugs. Did you try with a different collective 
component (i.e. not hierarch) ? Or with another interconnect ? [Yes, of 
course, if it is a race condition, we might well avoid the bug because 
timings will be different, but that's still information]


Perhaps all I'm saying makes no sense or you already thought about 
this; anyway, if you want me to try different things, just let me know.


Sylvain

On Wed, 10 Jun 2009, Ralph Castain wrote:


Hi Ashley

Thanks! I would definitely be interested and will look at the tool. Meantime, I 
have filed a bunch of data on this in
ticket #1944, so perhaps you might take a glance at that and offer some 
thoughts?

https://svn.open-mpi.org/trac/ompi/ticket/1944

Will be back after I look at the tool.

Thanks again
Ralph


On Wed, Jun 10, 2009 at 8:51 AM, Ashley Pittman  wrote:

  Ralph,

  If I may say this is exactly the type of problem the tool I have been
  working on recently aims to help with and I'd be happy to help you
  through it.

  Firstly I'd say of the three collectives you mention, MPI_Allgather,
  MPI_Reduce and MPI_Bcast one exhibit a many-to-many, one a many-to-one
  and the last a many-to-one communication pattern.  The scenario of a
  root process falling behind and getting swamped in comms is a plausible
  one for MPI_Reduce only but doesn't hold water with the other two.  You
  also don't mention if the loop is over a single collective or if you
  have a loop calling a number of different collectives each iteration.

  padb, the tool I've been working on has the ability to look at parallel
  jobs and report on the state of collective comms and should help narrow
  you down on erroneous processes and those simply blocked waiting for
  comms.  I'd recommend using it to look at maybe four or five instances
  where the application has hung and look for any common features between
  them.

  Let me know if you are willing to try this route and I'll talk you
  through it; the code is downloadable from http://padb.pittman.org.uk and
  if you want the full collective functionality you'll need to patch Open
  MPI with the patch from
  http://padb.pittman.org.uk/extensions.html

  Ashley,

  --

  Ashley Pittman, Bath, UK.

  Padb - A parallel job inspection tool for cluster computing
  http://padb.pittman.org.uk

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-10 Thread Sylvain Jeaugey
 there are other response options 
than hardwiring putting the process to sleep. You could let someone know so 
a human can decide what, if anything, to do about it, or provide a hook so 
that people can explore/utilize different response strategies...or both!


HTH
Ralph


On Tue, Jun 9, 2009 at 6:52 AM, Sylvain Jeaugey <sylvain.jeau...@bull.net> 
wrote:

I understand your point of view, and mostly share it.

I think the biggest point in my example is that sleep occurs only after (I 
was wrong in my previous e-mail) 10 minutes of inactivity, and this value 
is fully configurable. I didn't intend to call sleep after 2 seconds. Plus, 
as said before, I planned to have the library do show_help() when this 
happens (something like : "Open MPI couldn't receive a message for 10 
minutes, lowering pressure") so that the application that really needs more 
than 10 minutes to receive a message can increase it.


Looking at the tick rate code, I couldn't see how changing it would make 
CPU usage drop. If I understand correctly your e-mail, you block in the 
kernel using poll(), is that right ? So, you may well lose 10 us because 
of that kernel call, but this is a lot less than the 1 ms I'm currently 
losing with usleep. This makes sense - although being hard to implement 
since all btl must have this ability.


Thanks for your comments, I will continue to think about it.

Sylvain


On Tue, 9 Jun 2009, Ralph Castain wrote:

My concern with any form of sleep is with the impact on the proc - since 
opal_progress might not be running in a separate thread, won't the sleep 
apply to the process as a whole? In that case, the process isn't free to 
continue computing.


I can envision applications that might call down into the MPI library and 
have opal_progress not find anything, but there is nothing wrong. The 
application could continue computations just fine. I would hate to see us 
put the process to sleep just because the MPI library wasn't busy enough.


Hence my suggestion to just change the tick rate. It would definitely cause 
a higher latency for the first message that arrived while in this state, 
which is bothersome, but would meet the stated objective without 
interfering with the process itself.


LANL has also been looking at this problem of stalled jobs, but from a 
different approach. We monitor (using a separate job) progress in terms of 
output files changing in size plus other factors as specified by the user. 
If we don't see any progress in those terms over some time, then we kill 
the job. We chose that path because of the concerns expressed above - e.g., 
on our RR machine, intense computations can be underway on the Cell blades 
while the Opteron MPI processes wait for us to reach a communication point. 
We -want- those processes spinning away so that, when the comm starts, it 
can proceed as quickly as possible.


Just some thoughts...
Ralph


On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote:

Sylvain Jeaugey wrote:
Hi Ralph,

I'm entirely convinced that MPI doesn't have to save power in a normal 
scenario. The idea is just that if an MPI process is blocked (i.e. has not 
performed progress for -say- 5 minutes, the default in my implementation), we 
stop busy polling and have the process drop from 100% CPU usage to 0%.


I do not call sleep() but usleep(). The result is quite the same, but it 
hurts performance less in case of an (unexpected) restart.


However, the goal of my RFC was also to know if there was a more clean way 
to achieve my goal, and from what I read, I guess I should look at the 
"tick" rate instead of trying to do my own delaying.


One way around this is to make all blocked communications (even SM) use 
poll to block for incoming messages.  Jeff and I have discussed this and 
had many false starts on it.  The biggest issue is coming up with a way to 
have blocks on the SM btl converted to the system poll call without 
requiring a socket write for every packet.
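
A minimal sketch of the kind of approach being discussed here - a "doorbell"
pipe that senders only write to when the receiver has flagged that it is
blocked in poll(). All names are hypothetical; this is not the actual SM btl
code:

/* Hypothetical sketch only: block in poll() on a "doorbell" pipe so that
 * senders pay for a write() only when the receiver is actually asleep,
 * not for every packet. */
#include <fcntl.h>
#include <poll.h>
#include <stdatomic.h>
#include <unistd.h>

struct sm_doorbell {
    int fds[2];                    /* pipe(): fds[0] read end, fds[1] write end */
    atomic_int receiver_blocked;
};

static int sm_doorbell_init(struct sm_doorbell *db)
{
    if (pipe(db->fds) != 0) return -1;
    fcntl(db->fds[0], F_SETFL, O_NONBLOCK);   /* so draining never blocks */
    atomic_init(&db->receiver_blocked, 0);
    return 0;
}

/* Receiver side: called when busy polling has found nothing for a while. */
static void sm_wait(struct sm_doorbell *db)
{
    atomic_store(&db->receiver_blocked, 1);
    /* A real implementation must re-check the shared queues here to avoid
     * a lost wakeup before committing to the sleep. */
    struct pollfd pfd = { .fd = db->fds[0], .events = POLLIN };
    poll(&pfd, 1, -1);                        /* sleep in the kernel, ~0% CPU */
    atomic_store(&db->receiver_blocked, 0);
    char c;
    while (read(db->fds[0], &c, 1) == 1)      /* drain the doorbell */
        ;
}

/* Sender side: after enqueuing a fragment in shared memory. */
static void sm_notify(struct sm_doorbell *db)
{
    if (atomic_load(&db->receiver_blocked))   /* common case: no syscall at all */
        (void)write(db->fds[1], "x", 1);
}

The write() only happens in the rare case where the receiver really went to
sleep, which is exactly the "no socket write for every packet" property
mentioned above.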


The usleep solution works but is kind of ugly IMO.  I think when I looked 
at doing that the overhead increased significantly for certain 
communications.  Maybe not for toy benchmarks but for less synchronized 
processes I saw the usleep adding overhead where I didn't want it to.


--td
Don't worry, I was quite expecting the configure-in requirement. However, I 
don't think my patch is good for inclusion, it is only an example to 
describe what I want to achieve.


Thanks a lot for your comments,
Sylvain

On Mon, 8 Jun 2009, Ralph Castain wrote:

I'm not entirely convinced this actually achieves your goals, but I can see 
some potential benefits. I'm also not sure that power consumption is that 
big of an issue that MPI needs to begin chasing "power saver" modes of 
operation, but that can be a separate debate some day.


I'm assuming you don't mean that you actually call "sleep()" as this would 
be very bad - I'm assuming you just change the opal_progress "tick" r

Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-09 Thread Sylvain Jeaugey

On Tue, 9 Jun 2009, Ralph Castain wrote:


2. instead of putting things to sleep or even adjusting the loop rate, you 
might want to consider using the orte_notifier
capability and notify the system that the job may be stalled. Or perhaps adding 
an API to the orte_errmgr framework to
notify it that nothing has been received for awhile, and let people implement 
different strategies for detecting what might
be "wrong" and what they want to do about it.
Great remark. What is really needed here is the information of "nothing 
received for X minutes". Just having the information somewhere should be 
sufficient. We often see users asking if their application is still 
progressing, and this should answer their questions. This would also 
address the need of administrators to stop deadlocked runs during the 
night.


I guess I'll redirect my work on this and couple it with our current 
effort on logging and administration tools coupling.


Thanks a lot guys !

Sylvain


My point with this second bullet is that there are other response options than 
hardwiring putting the process to sleep. You
could let someone know so a human can decide what, if anything, to do about it, 
or provide a hook so that people can
explore/utilize different response strategies...or both!

HTH
Ralph


On Tue, Jun 9, 2009 at 6:52 AM, Sylvain Jeaugey <sylvain.jeau...@bull.net> 
wrote:
  I understand your point of view, and mostly share it.

  I think the biggest point in my example is that sleep occurs only after 
(I was wrong in my previous e-mail) 10
  minutes of inactivity, and this value is fully configurable. I didn't 
intend to call sleep after 2 seconds.
  Plus, as said before, I planned to have the library do show_help() when this 
happens (something like : "Open
  MPI couldn't receive a message for 10 minutes, lowering pressure") so 
that the application that really needs
  more than 10 minutes to receive a message can increase it.

  Looking at the tick rate code, I couldn't see how changing it would make 
CPU usage drop. If I understand
  correctly your e-mail, you block in the kernel using poll(), is that 
  right ? So, you may well lose 10 us
  because of that kernel call, but this is a lot less than the 1 ms I'm 
  currently losing with usleep. This makes
  sense - although being hard to implement since all btl must have this 
ability.

  Thanks for your comments, I will continue to think about it.

  Sylvain


On Tue, 9 Jun 2009, Ralph Castain wrote:

  My concern with any form of sleep is with the impact on the proc - since 
opal_progress might not be
  running in a separate thread, won't the sleep apply to the process as a 
whole? In that case, the process
  isn't free to continue computing.

  I can envision applications that might call down into the MPI library and 
have opal_progress not find
  anything, but there is nothing wrong. The application could continue 
computations just fine. I would hate
  to see us put the process to sleep just because the MPI library wasn't 
busy enough.

  Hence my suggestion to just change the tick rate. It would definitely 
cause a higher latency for the
  first message that arrived while in this state, which is bothersome, but 
would meet the stated objective
  without interfering with the process itself.

  LANL has also been looking at this problem of stalled jobs, but from a 
different approach. We monitor
  (using a separate job) progress in terms of output files changing in size 
plus other factors as specified
  by the user. If we don't see any progress in those terms over some time, 
then we kill the job. We chose
  that path because of the concerns expressed above - e.g., on our RR 
machine, intense computations can be
  underway on the Cell blades while the Opteron MPI processes wait for us 
to reach a communication point.
  We -want- those processes spinning away so that, when the comm starts, it 
can proceed as quickly as
  possible.

  Just some thoughts...
  Ralph


  On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote:

Sylvain Jeaugey wrote:
  Hi Ralph,

  I'm entirely convinced that MPI doesn't have to save power in 
a normal scenario.
  The idea is just that if an MPI process is blocked (i.e. has 
not performed
  progress for -say- 5 minutes, the default in my implementation),
  we stop busy polling and have the process drop from 100% CPU usage to 0%.

  I do not call sleep() but usleep(). The result is quite the same, but it
  hurts performance less in case of an (unexpected) restart.

  However, the goal of my RFC was also to know if there was a 
more clean way to
  achieve my goal, and from what I read, I guess I should look at the 
"tick" rate
  

Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-09 Thread Sylvain Jeaugey

I understand your point of view, and mostly share it.

I think the biggest point in my example is that sleep occurs only after (I 
was wrong in my previous e-mail) 10 minutes of inactivity, and this value 
is fully configurable. I didn't intend to call sleep after 2 seconds. 
Plus, as said before, I planned to have the library do show_help() when 
this happens (something like : "Open MPI couldn't receive a message for 10 
minutes, lowering pressure") so that the application that really needs 
more than 10 minutes to receive a message can increase it.


Looking at the tick rate code, I couldn't see how changing it would make 
CPU usage drop. If I understand correctly your e-mail, you block in the 
kernel using poll(), is that right ? So, you may well lose 10 us because 
of that kernel call, but this is a lot less than the 1 ms I'm currently 
losing with usleep. This makes sense - although being hard to implement 
since all btl must have this ability.


Thanks for your comments, I will continue to think about it.

Sylvain

On Tue, 9 Jun 2009, Ralph Castain wrote:

My concern with any form of sleep is with the impact on the proc - since 
opal_progress might not be running in a separate thread, won't the sleep 
apply to the process as a whole? In that case, the process isn't free to 
continue computing.


I can envision applications that might call down into the MPI library and 
have opal_progress not find anything, but there is nothing wrong. The 
application could continue computations just fine. I would hate to see us put 
the process to sleep just because the MPI library wasn't busy enough.


Hence my suggestion to just change the tick rate. It would definitely cause a 
higher latency for the first message that arrived while in this state, which 
is bothersome, but would meet the stated objective without interfering with 
the process itself.


LANL has also been looking at this problem of stalled jobs, but from a 
different approach. We monitor (using a separate job) progress in terms of 
output files changing in size plus other factors as specified by the user. If 
we don't see any progress in those terms over some time, then we kill the 
job. We chose that path because of the concerns expressed above - e.g., on 
our RR machine, intense computations can be underway on the Cell blades while 
the Opteron MPI processes wait for us to reach a communication point. We 
-want- those processes spinning away so that, when the comm starts, it can 
proceed as quickly as possible.


Just some thoughts...
Ralph


On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote:


Sylvain Jeaugey wrote:

Hi Ralph,

I'm entirely convinced that MPI doesn't have to save power in a normal 
scenario. The idea is just that if an MPI process is blocked (i.e. has not 
performed progress for -say- 5 minutes, the default in my implementation), we 
stop busy polling and have the process drop from 100% CPU usage to 0%.


I do not call sleep() but usleep(). The result is quite the same, but it 
hurts performance less in case of an (unexpected) restart.


However, the goal of my RFC was also to know if there was a more clean way 
to achieve my goal, and from what I read, I guess I should look at the 
"tick" rate instead of trying to do my own delaying.


One way around this is to make all blocked communications (even SM) use 
poll to block for incoming messages.  Jeff and I have discussed this and 
had many false starts on it.  The biggest issue is coming up with a way to 
have blocks on the SM btl converted to the system poll call without 
requiring a socket write for every packet.


The usleep solution works but is kind of ugly IMO.  I think when I looked 
at doing that the overhead increased significantly for certain 
communications.  Maybe not for toy benchmarks but for less synchronized 
processes I saw the usleep adding overhead where I didn't want it to.


--td
Don't worry, I was quite expecting the configure-in requirement. However, 
I don't think my patch is good for inclusion, it is only an example to 
describe what I want to achieve.


Thanks a lot for your comments,
Sylvain

On Mon, 8 Jun 2009, Ralph Castain wrote:

I'm not entirely convinced this actually achieves your goals, but I can 
see some potential benefits. I'm also not sure that power consumption is 
that big of an issue that MPI needs to begin chasing "power saver" modes 
of operation, but that can be a separate debate some day.


I'm assuming you don't mean that you actually call "sleep()" as this 
would be very bad - I'm assuming you just change the opal_progress "tick" 
rate instead. True? If not, and you really call "sleep", then I would 
have to oppose adding this to the code base pending discussion with 
others who can corroborate that this won't cause problems.


Either way, I could live with this so long as it was done as a 
"configure-in" capability. Just having the params default to a value that 

Re: [OMPI devel] Multi-rail on openib

2009-06-09 Thread Sylvain Jeaugey

On Mon, 8 Jun 2009, NiftyOMPI Tom Mitchell wrote:

Dual rail does double the number of switch ports. If you want to 
address switch failure each rail must connect to a different switch. 
If you do not want to have isolated fabrics you must have some 
additional ports on all switches to connect the two fabrics and enough 
of them to maintain sufficient bandwidth and connectivity when a switch 
fails. Thus, You are doubling the fabric unless I am missing something.
Well, it is pretty much research for now. But yes, we want each port to be 
connected to a different switch so that both cable and switch failures can 
be survived.


Open MPI currently needs to have connected fabrics, but maybe that's 
something we will like to change in the future, having two separate rails. 
(Btw Pasha, will your current work enable this ?)


Is your second set of switches so minimally connected that the second 
tree can be installed with a small switch count?
That's the idea, yes. For example, you could have a primary QDR fat-tree 
network and a failover non fat-tree DDR one (potentially recycled from a 
previous machine).



What are the odds when port 1 fails that port 2 is going to
be live.  Cable/ connector errors would be the most likely
case where port 2 would be live.  In general if port 1 fails
I would expect port 2 to have issues too.
Well, depending on the errors you want to be able to survive, you may have 
2 cards, in which case there is no reason why port1 failure would cause 
port2 to fail too. But in all cases, switches and cable errors are a 
concern to us.


Sylvain


Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-09 Thread Sylvain Jeaugey

Hi Ralph,

I'm entirely convinced that MPI doesn't have to save power in a normal 
scenario. The idea is just that if an MPI process is blocked (i.e. has not 
performed progress for -say- 5 minutes, the default in my implementation), we 
stop busy polling and have the process drop from 100% CPU usage to 0%.


I do not call sleep() but usleep(). The result is quite the same, but it 
hurts performance less in case of an (unexpected) restart.


However, the goal of my RFC was also to know if there was a more clean way 
to achieve my goal, and from what I read, I guess I should look at the 
"tick" rate instead of trying to do my own delaying.


Don't worry, I was quite expecting the configure-in requirement. However, 
I don't think my patch is good for inclusion, it is only an example to 
describe what I want to achieve.


Thanks a lot for your comments,
Sylvain

On Mon, 8 Jun 2009, Ralph Castain wrote:

I'm not entirely convinced this actually achieves your goals, but I can see 
some potential benefits. I'm also not sure that power consumption is that big 
of an issue that MPI needs to begin chasing "power saver" modes of operation, 
but that can be a separate debate some day.


I'm assuming you don't mean that you actually call "sleep()" as this would be 
very bad - I'm assuming you just change the opal_progress "tick" rate 
instead. True? If not, and you really call "sleep", then I would have to 
oppose adding this to the code base pending discussion with others who can 
corroborate that this won't cause problems.


Either way, I could live with this so long as it was done as a "configure-in" 
capability. Just having the params default to a value that causes the system 
to behave similarly to today isn't enough - we still wind up adding logic 
into a very critical timing loop for no reason. A simple configure option of 
--enable-mpi-progress-monitoring would be sufficient to protect the code.


HTH
Ralph


On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote:

What : when nothing has been received for a very long time - e.g. 5 
minutes, stop busy polling in opal_progress and switch to a usleep-based 
one.


Why : when we have long waits, and especially when an application is 
deadlock'ed, detecting it is not easy and a lot of power is wasted until 
the end of the time slice (if there is one).


Where : an example of how it could be implemented is available at 
http://bitbucket.org/jeaugeys/low-pressure-opal-progress/


Principle
=

opal_progress() ensures the progression of MPI communication. The current 
algorithm is a loop calling progress on all registered components. If the 
program is blocked, the loop will busy-poll indefinitely.


Going to sleep after a certain amount of time with nothing received is 
interesting for two things :
- Administrators can easily detect whether a job is deadlocked : all the 
processes are in sleep(). Currently, all processors are using 100% cpu and 
it is very hard to know if progression is still happening or not.

- When there is nothing to receive, power usage is highly reduced.

However, it could hurt performance in some cases, typically if we go to 
sleep just before the message arrives. This will highly depend on the 
parameters you give to the sleep mechanism.


At first, we can start with the following assumption : if the sleep takes T 
usec, then sleeping only after 10000 x T of inactivity should slow down 
Receives by a factor less than 0.01 %.


However, other processes may suffer from you being late, and be delayed by 
T usec (which may represent more than 0.01% for them).


So, the goal of this mechanism is mainly to detect far-too-long-waits and 
should hardly ever be used in normal MPI jobs. It could also trigger a 
warning message when starting to sleep, or at least a trace in the 
notifier.


Details of Implementation
=

Three parameters fully control the behaviour of this mechanism :
* opal_progress_sleep_count : number of unsuccessful opal_progress() calls 
before we start the timer (to prevent latency impact). It defaults to -1, 
which completely deactivates the sleep (and is therefore equivalent to the 
former code). A value of 1000 can be thought of as a starting point to 
enable this mechanism.
* opal_progress_sleep_trigger : time to wait before going to 
low-pressure-powersave mode. Default : 600 (in seconds) = 10 minutes.
* opal_progress_sleep_duration : time we sleep at each further unsuccessful 
call to opal_progress(). Default : 1000 (in us) = 1 ms.


The duration is big enough to make the process show 0% CPU in top, but low 
enough to preserve a good trigger/duration ratio.


The trigger is voluntarily high to keep a good trigger/duration ratio. 
Indeed, to prevent delays from causing chain reactions, trigger should be 
higher than duration * numprocs.
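
To make the principle concrete, here is a simplified sketch of what the
modified progress loop could look like. This is not the actual
opal_progress() code; progress_components() stands in for the real
per-component progress calls, and the three variables correspond to the MCA
parameters described above:

/* Simplified sketch of a low-pressure progress loop (not the real
 * opal_progress() implementation). */
#include <stdint.h>
#include <time.h>
#include <unistd.h>

extern int progress_components(void);    /* assumed: returns #events completed */

static int64_t sleep_count    = 1000;    /* calls before starting the timer; -1 disables */
static int64_t sleep_trigger  = 600;     /* seconds of inactivity before lowering pressure */
static int64_t sleep_duration = 1000;    /* usec slept per further idle call */

void low_pressure_progress(void)
{
    static int64_t idle_calls = 0;
    static time_t  idle_since = 0;

    int events = progress_components();
    if (events > 0) {                     /* activity: reset everything */
        idle_calls = 0;
        idle_since = 0;
        return;
    }
    if (sleep_count < 0 || ++idle_calls < sleep_count)
        return;                           /* disabled, or not idle long enough yet */
    if (idle_since == 0) {
        idle_since = time(NULL);          /* start the inactivity timer */
        return;
    }
    if (time(NULL) - idle_since >= sleep_trigger)
        usleep((useconds_t)sleep_duration);   /* low-pressure mode: ~0% CPU */
}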


Possible Improvements & Pitfalls


* Trigger could be set automatically at max(trigger, duration

Re: [OMPI devel] problem in the ORTE notifier framework

2009-06-08 Thread Sylvain Jeaugey

Ralph,

Sorry for answering on this old thread, but it seems that my answer was 
blocked in the "postponed" folder.


About the if-then, I thought it was 1 cycle. I mean, if you don't break 
the pipeline, i.e. use likely() or builtin_expect() or something like that 
to be sure that the compiler will generate assembly in the right way, it 
shouldn't be more than 1 cycle, perhaps less on some architectures like 
Itanium [however, my multi-architecture view is somewhat limited to x86 
and ia64, so I may be wrong].


So, in these if-then cases where we know which branch is the more likely 
to be used, I don't think that 1 CPU cycle is really a problem, especially 
if we are already in a slow code path.


Is there a multi-compiler,multi-arch,multi-os reason not to use likely() 
directives ?
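
For reference, the kind of hint being referred to looks like this (a generic
sketch; send_fragment() is a made-up example, not Open MPI code):

/* Generic branch-prediction hints of the kind mentioned above (GCC-style
 * __builtin_expect; other compilers fall back to plain conditions). */
#if defined(__GNUC__)
#  define likely(x)   __builtin_expect(!!(x), 1)
#  define unlikely(x) __builtin_expect(!!(x), 0)
#else
#  define likely(x)   (x)
#  define unlikely(x) (x)
#endif

/* Hypothetical hot path: the error/notify branch is marked unlikely so the
 * fall-through stays on the predicted path and costs about one cycle when
 * it is not taken. */
static int send_fragment(int rc)
{
    if (unlikely(rc != 0)) {
        /* slow path: count the event, maybe notify */
        return -1;
    }
    return 0;    /* fast path */
}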


Sylvain

On Wed, 27 May 2009, Ralph Castain wrote:


While that is a good way of minimizing the impact of the counter, you still have to do an 
"if-then" to check if the counter
exceeds the threshold. This "if-then" also has to get executed every time, and 
generally consumes more than a few cycles.

To be clear: it isn't the output that is the concern. The output only occurs as 
an exception case, essentially equivalent
to dealing with an error, so it can be "slow". The concern is with the impact 
of testing to see if the output needs to be
generated as this testing occurs every time we transit the code.

I think Jeff and I are probably closer to agreement on design than it might 
seem, and may be close to what you might also
have had in mind. Basically, I was thinking of a macro like this:

ORTE_NOTIFIER_VERBOSE(api, counter, threshold,...)

#if WANT_NOTIFIER_VERBOSE
opal_atomic_increment(counter);
if (counter > threshold) {
    orte_notifier.api(...)
}
#endif

You would set the specific thresholds for each situation via MCA params, so 
this could be tuned to fit specific needs.
Those who don't want the penalty can just build normally - those who want this 
level of information can enable it.
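
Spelled out, the macro sketched above could look roughly like this. It is a
hedged elaboration of the pseudo-code in this mail, not the code that was
eventually committed; opal_atomic_increment and orte_notifier are used just
as written above, and the call-site names are invented:

/* Possible elaboration of the ORTE_NOTIFIER_VERBOSE idea.  It is compiled
 * in only when WANT_NOTIFIER_VERBOSE is set at configure time; otherwise
 * it expands to nothing and costs zero cycles. */
#if WANT_NOTIFIER_VERBOSE
#define ORTE_NOTIFIER_VERBOSE(api, counter, threshold, ...)        \
    do {                                                           \
        opal_atomic_increment(counter);                            \
        if ((counter) > (threshold)) {                             \
            orte_notifier.api(__VA_ARGS__);                        \
        }                                                          \
    } while (0)
#else
#define ORTE_NOTIFIER_VERBOSE(api, counter, threshold, ...)        \
    do { } while (0)
#endif

/* Hypothetical call site in a slow path, with the threshold coming from
 * an MCA parameter:
 *
 *   ORTE_NOTIFIER_VERBOSE(log, retry_count, retry_threshold,
 *                         "too many retries on peer %d", peer);
 */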

We can then see just how much penalty is involved in real world situations. My 
guess is that it won't be that big, but it's
hard to know without seeing how frequently we actually insert this code.

Hope that makes sense
Ralph


On Wed, May 27, 2009 at 1:25 AM, Sylvain Jeaugey <sylvain.jeau...@bull.net> 
wrote:
  About performance, I may miss something, but our first goal was to track 
already slow paths.

  We imagined that it could be possible to add at the beginning (or end) of this 
"bad path" just one line that
  would basically do an atomic inc. So, in terms of CPU cycles, something 
like 1 for the inc and maybe 1 jump
  before. Are a couple of cycles really an issue in slow paths (which take 
at least hundreds of cycles), or do
  you fear out-of-cache memory accesses - or something else ?

  As for outputs, they indeed are slow (and can slow down considerably an 
application if not synchronized), but
  aggregation on the head node should solve our problems. And if not, we 
can also disable outputs at runtime.

  So, in my opinion, no application should notice a difference (unless you 
tune the framework to output every
  warning).

  Sylvain


On Tue, 26 May 2009, Jeff Squyres wrote:

  Nadia --

  Sorry I didn't get to jump in on the other thread earlier.

  We have made considerable changes to the notifier framework in a branch to better 
support "SOS"
  functionality:

   https://www.open-mpi.org/hg/auth/hgwebdir.cgi/jsquyres/opal-sos

  Cisco and Indiana U. have been working on this branch for a while.  A 
description of the SOS stuff is
  here:

   https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages

  As for setting up an external web server with hg, don't bother -- just 
get an account at bitbucket.org.
   They're free and allow you to host hg repositories there.  I've used 
bitbucket to collaborate on code
  before it hits OMPI's SVN trunk with both internal and external OMPI 
developers.

  We can certainly move the opal-sos repo to bitbucket (or branch again off 
opal-sos to bitbucket --
  whatever makes more sense) to facilitate collaborating with you.

  Back on topic...

  I'd actually suggest a combination of what has been discussed in the 
other thread.  The notifier can be
  the mechanism that actually sends the output message, but it doesn't have 
to be the mechanism that tracks
  the stats and decides when to output a message.  That can be separate 
logic, and therefore be more
  fine-grained (and potentially even specific to the MPI layer).

  The Big Question will be how to do this with zero performance impact when it 
is not being used. This has
  always been the difficult issue when trying to implement any kind of 
monitoring inside the core OMPI
  performance-sensitive paths.  Even adding individual branch

Re: [OMPI devel] Multi-rail on openib

2009-06-08 Thread Sylvain Jeaugey

Hi Tom,

Yes, there is a goal in mind, and definitely not performance : we are 
working on device failover, i.e. when a network adapter or switch fails, 
use the remaining one. We don't intend to improve performance with 
multi-rail (which as you said, will not happen unless you have a DDR card 
with PCI Exp 8x Gen2 and a very nice routing - and money to pay for the 
doubled network :)).


The goal here is to use port 1 of each card as a primary way of 
communication with a fat tree and port 2 as a failover solution with a 
very light network, just to avoid aborting the MPI app or at least reach a 
checkpoint.


Don't worry, another team is working on opensm, so that routing stays 
optimal.


Thanks for your warnings however, it's true that a lot of people see these 
"double port IB cards" as "doubled performance".


Sylvain

On Fri, 5 Jun 2009, Nifty Tom Mitchell wrote:


On Fri, Jun 05, 2009 at 09:52:39AM -0400, Jeff Squyres wrote:


See this FAQ entry for a description:

http://www.open-mpi.org/faq/?category=openfabrics#ofa-port-wireup

Right now, there's no way to force a particular connection pattern on
the openib btl at run-time.  The startup sequence has gotten
sufficiently complicated / muddied over the years that it would be quite
difficult to do so.  Pasha is in the middle of revamping parts of the
openib startup (see http://bitbucket.org/pasha/ompi-ofacm/); it *may* be
desirable to fully clean up the full openib btl startup sequence when
he's all finished.


On Jun 5, 2009, at 9:48 AM, Mouhamed Gueye wrote:


Hi all,

I am working on  multi-rail IB and I was wondering how connections are
established between ports.  I have two hosts, each with 2 ports on the
same IB card, connected to the same switch.



Is there a goal in mind?

In general multi-rail cards run into bandwidth and congestion issues
with the host bus.  If your card's system side interface cannot support
the bandwidth of twin IB links then it is possible that bandwidth would
be reduced by the interaction.

If the host bus and memory system is fast enough then
work with the vendor.

In addition to system bandwidth the subnet manager may need to be enhanced
to be multi-port card aware.   Since IB fabric routes are static it is possible
to route or use pairs of links in an identical enough way that there is
little bandwidth gain when multiple switches are involved.

Your two host case may be simple enough to explore
and/or generate illuminating or misleading results.
It is a good place to start.

Start with a look at opensm and the fabric then watch how Open MPI
or your applications use the resulting LIDs.  If you are using IB directly
and not MPI then the list of protocol choices grows dramatically but still
centers on LIDs as assigned by the subnet manager (see opensm).

How many CPU cores (ranks) are you working with?

Do be specific about the IB hardware and associated firmware;
there are multiple choices out there and the vendor may be able to help...

--
T o m  M i t c h e l l
Found me a new hat, now what?

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] problem in the ORTE notifier framework

2009-05-28 Thread Sylvain Jeaugey
To be more complete, we pull Hg from 
http://www.open-mpi.org/hg/hgwebdir.cgi/ompi-svn-mirror/ ; are we 
mistaken ?


If not, the code in v1.3 seems to be different from the code in the trunk 
...


Sylvain

On Thu, 28 May 2009, Nadia Derbey wrote:


On Tue, 2009-05-26 at 17:24 -0600, Ralph Castain wrote:

First, to answer Nadia's question: you will find that the init
function for the module is already called when it is selected - see
the code in orte/mca/base/notifier_base_select.c, lines 72-76 (in the
trunk).


Strange? Our repository is a clone of the trunk?



It's true that if I "hg update" to v1.3 I see that the fix is there.

Regards,
Nadia


It would be a good idea to tie into the sos work to avoid conflicts
when it all gets merged back together, assuming that isn't a big
problem for you.

As for Jeff's suggestion: dealing with the performance hit problem is
why I suggested ORTE_NOTIFIER_VERBOSE, modeled after the
OPAL_OUTPUT_VERBOSE model. The idea was to compile it in -only- when
the system is built for it - maybe using a --with-notifier-verbose
configuration option. Frankly, some organizations would happily pay a
small performance penalty for the benefits.

I would personally recommend that the notifier framework keep the
stats so things can be compact and self-contained. We still get
atomicity by allowing each framework/component/whatever specify the
threshold. Creating yet another system to do nothing more than track
error/warning frequencies to decide whether or not to notify seems
wasteful.

Perhaps worth a phone call to decide path forward?


On Tue, May 26, 2009 at 1:06 PM, Jeff Squyres 
wrote:
Nadia --

Sorry I didn't get to jump in on the other thread earlier.

We have made considerable changes to the notifier framework in
a branch to better support "SOS" functionality:


 https://www.open-mpi.org/hg/auth/hgwebdir.cgi/jsquyres/opal-sos

Cisco and Indiana U. have been working on this branch for a
while.  A description of the SOS stuff is here:

   https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages

As for setting up an external web server with hg, don't bother
-- just get an account at bitbucket.org.  They're free and
allow you to host hg repositories there.  I've used bitbucket
to collaborate on code before it hits OMPI's SVN trunk with
both internal and external OMPI developers.

We can certainly move the opal-sos repo to bitbucket (or
branch again off opal-sos to bitbucket -- whatever makes more
sense) to facilitate collaborating with you.

Back on topic...

I'd actually suggest a combination of what has been discussed
in the other thread.  The notifier can be the mechanism that
actually sends the output message, but it doesn't have to be
the mechanism that tracks the stats and decides when to output
a message.  That can be separate logic, and therefore be more
fine-grained (and potentially even specific to the MPI layer).

The Big Question will be how to do this with zero performance
impact when it is not being used. This has always been the
difficult issue when trying to implement any kind of
monitoring inside the core OMPI performance-sensitive paths.
 Even adding individual branches has met with resistance (in
performance-critical code paths)...





On May 26, 2009, at 10:59 AM, Nadia Derbey wrote:



Hi,

While having a look at the notifier framework under
orte, I noticed that
the way it is written, the init routine for the
selected module cannot
be called.

Attached is a small patch that fixes this issue.

Regards,
Nadia





--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Nadia Derbey 

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] problem in the ORTE notifier framework

2009-05-27 Thread Sylvain Jeaugey
I thought an if-then was 1 cycle. I mean, if you don't break the pipeline, 
i.e. use likely() or builtin_expect() or something like that to be sure 
that the compiler will generate assembly in the right way, it shouldn't be 
more than 1 cycle, perhaps less on some architectures like Itanium. But my 
multi-architecture view is somewhat limited to x86 and ia64, so I may be 
wrong. I'm personally much more sensitive to cache misses which can easily 
make the atomic-inc take hundreds of cycles if the event is out of the 
cache.


ORTE_NOTIFIER_VERBOSE(api, counter, threshold,...) is also very close to 
what we had in mind : a one-line, single call to track events. Good.


We will continue to dig in this direction using the opal-sos branch. 
Thanks a lot,

Sylvain

On Wed, 27 May 2009, Ralph Castain wrote:


While that is a good way of minimizing the impact of the counter, you still have to do an 
"if-then" to check if the counter
exceeds the threshold. This "if-then" also has to get executed every time, and 
generally consumes more than a few cycles.

To be clear: it isn't the output that is the concern. The output only occurs as 
an exception case, essentially equivalent
to dealing with an error, so it can be "slow". The concern is with the impact 
of testing to see if the output needs to be
generated as this testing occurs every time we transit the code.

I think Jeff and I are probably closer to agreement on design than it might 
seem, and may be close to what you might also
have had in mind. Basically, I was thinking of a macro like this:

ORTE_NOTIFIER_VERBOSE(api, counter, threshold,...)

#if WANT_NOTIFIER_VERBOSE
opal_atomic_increment(counter);
if (counter > threshold) {
    orte_notifier.api(...)
}
#endif

You would set the specific thresholds for each situation via MCA params, so 
this could be tuned to fit specific needs.
Those who don't want the penalty can just build normally - those who want this 
level of information can enable it.

We can then see just how much penalty is involved in real world situations. My 
guess is that it won't be that big, but it's
hard to know without seeing how frequently we actually insert this code.

Hope that makes sense
Ralph


On Wed, May 27, 2009 at 1:25 AM, Sylvain Jeaugey <sylvain.jeau...@bull.net> 
wrote:
  About performance, I may miss something, but our first goal was to track 
already slow paths.

  We imagined that it could be possible to add at the beginning (or end) of this 
"bad path" just one line that
  would basically do an atomic inc. So, in terms of CPU cycles, something 
like 1 for the inc and maybe 1 jump
  before. Are a couple of cycles really an issue in slow paths (which take 
at least hundreds of cycles), or do
  you fear out-of-cache memory accesses - or something else ?

  As for outputs, they indeed are slow (and can slow down considerably an 
application if not synchronized), but
  aggregation on the head node should solve our problems. And if not, we 
can also disable outputs at runtime.

  So, in my opinion, no application should notice a difference (unless you 
tune the framework to output every
  warning).

  Sylvain


On Tue, 26 May 2009, Jeff Squyres wrote:

  Nadia --

  Sorry I didn't get to jump in on the other thread earlier.

  We have made considerable changes to the notifier framework in a branch to better 
support "SOS"
  functionality:

   https://www.open-mpi.org/hg/auth/hgwebdir.cgi/jsquyres/opal-sos

  Cisco and Indiana U. have been working on this branch for a while.  A 
description of the SOS stuff is
  here:

   https://svn.open-mpi.org/trac/ompi/wiki/ErrorMessages

  As for setting up an external web server with hg, don't bother -- just 
get an account at bitbucket.org.
   They're free and allow you to host hg repositories there.  I've used 
bitbucket to collaborate on code
  before it hits OMPI's SVN trunk with both internal and external OMPI 
developers.

  We can certainly move the opal-sos repo to bitbucket (or branch again off 
opal-sos to bitbucket --
  whatever makes more sense) to facilitate collaborating with you.

  Back on topic...

  I'd actually suggest a combination of what has been discussed in the 
other thread.  The notifier can be
  the mechanism that actually sends the output message, but it doesn't have 
to be the mechanism that tracks
  the stats and decides when to output a message.  That can be separate 
logic, and therefore be more
  fine-grained (and potentially even specific to the MPI layer).

  The Big Question will be how to do this with zero performance impact when it 
is not being used. This has
  always been the difficult issue when trying to implement any kind of 
monitoring inside the core OMPI
  performance-sensitive paths.  Even adding individual branches has met 
with resistance (in
  performance

Re: [OMPI devel] Device failover in dr pml (fwd)

2009-04-16 Thread Sylvain Jeaugey
Well, if reviving means making device failover work, then yes, in a way we 
revived it ;)


We are currently mostly running experiments to figure out how to get device 
failover working. No big fixes for now, and that's why we are posting here 
before going further.


From what I understand, Rolf's work seems very close to what we want to do 
and we'd better work with him on making ob1 able to do device failover 
rather than trying to work on dr.


This sounds good to me : there is no reason why ob1 couldn't invalidate a 
device (e.g. if we send a signal). However, replaying lost sends still 
seems to be needed if we want to be able to handle a network failure. 
Clearly, ob1 doesn't support this yet.


Thanks a lot for your advices, we will continue to think about it and come 
back to you.


Sylvain

On Wed, 15 Apr 2009, Ralph Castain wrote:

Last anyone knew, the dr pml was dead - way out of date and unmaintained. I 
gather that you folks have revived it and sync'd it back up to the current 
ob1 module?


I don't think anyone really cares what is done with the dr module itself. 
There are others working on failover modules, and there is a new separate 
checksum module that just aborts if it detects an error.


So I would guess you are welcome to do whatever you want to it. I suspect the 
others working on failover may speak up here too.



On Apr 15, 2009, at 6:47 AM, Mouhamed Gueye wrote:


Hi all,

We are currently working on the dr pml component and specifically on device 
failover. The failover mechanism seems to work fine on different components, 
but if we want to do it on different modules of the same component - say 2 
Infiniband rails - the code seems to be broken.


Actually, when the first openib module fails, the progress function of the 
openib component is deregistered and progress is no longer made on any 
openib module. We managed to circumvent this by keeping the progress 
function registered as long as an openib module might be using it, and it seems to work 
fine.


So I have a few questions :

1. Is there already work in progress to support multi-module failover on the 
dr pml ?

2. Do you think this is the correct way to handle multi-module failover ?

Also, the fact that the "dr" component includes many things like checksumming 
bothers us a bit (we'd like to keep the performance overhead as low as possible 
when including device failover). So,


3. Do you plan to fork this component to a "df (device failover) only" one ? 
(we would like to, but maybe this is not the right way to go)


That's all for now,
Mouhamed
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] SM init failures

2009-03-31 Thread Sylvain Jeaugey
Sorry to continue off-topic but going to System V shm would be for me 
like going back in the past.


System V shared memory used to be the main way to do shared memory on 
MPICH and from my (little) experience, this was truly painful :
 - Cleanup issues : does shmctl(IPC_RMID) solve _all_ cases ? (even kill 
-9 ?)
 - Naming issues : shm segments identified by a 32-bit key, potentially 
causing conflicts between applications or layers of the same application 
on one node
 - Space issues : the total shm size on a system is bounded by 
/proc/sys/kernel/shmmax, needing admin configuration and causing conflicts 
between MPI applications running on the same node


Mmap'ed files can have a comprehensive name, preventing naming issues. If we 
are on Linux, they can be allocated in /dev/shm to prevent filesystem 
traffic, and space is not limited.
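
A minimal sketch of that approach, assuming a made-up naming scheme (this is
not the actual sm mpool code):

/* Sketch only: create a per-job backing file under /dev/shm with a
 * descriptive name, size it, and map it.  The naming scheme is invented
 * for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

static void *create_sm_segment(const char *jobid, int node_rank, size_t size)
{
    char path[256];
    /* Comprehensive name: no 32-bit key collisions between jobs or layers. */
    snprintf(path, sizeof(path), "/dev/shm/ompi-sm-%s-%d", jobid, node_rank);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)size) != 0) {        /* set the logical size */
        close(fd);
        unlink(path);
        return NULL;
    }
    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    /* unlink(path) at finalize (or once all local peers have attached)
     * takes care of cleanup, even if a process is later killed. */
    return (base == MAP_FAILED) ? NULL : base;
}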


Sylvain

On Mon, 30 Mar 2009, Tim Mattox wrote:


I've been lurking on this conversation, and I am again left with the impression
that the underlying shared memory configuration based on sharing a file
is flawed.  Why not use a System V shared memory segment without a
backing file as I described in ticket #1320?

On Mon, Mar 30, 2009 at 1:34 PM, George Bosilca  wrote:

Then it looks like the safest solution is to use either the ftruncate or the
lseek method and then touch the first byte of all memory pages.
Unfortunately, I see two problems with this. First, there is a clear
performance hit on the startup time. And second, we will have to find a
pretty smart way to do this or we will completely break the memory affinity
stuff.

 george.

On Mar 30, 2009, at 13:24 , Iain Bason wrote:



On Mar 30, 2009, at 12:05 PM, Jeff Squyres wrote:


But don't we need the whole area to be zero filled?


It will be zero-filled on demand using the lseek/touch method.  However,
the OS may not reserve space for the skipped pages or disk blocks.  Thus one
could still get out of memory or file system full errors at arbitrary
points.  Presumably one could also get segfaults from an mmap'ed segment
whose pages couldn't be allocated when the demand came.
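
For reference, a sketch of the two file-extension methods plus the
page-touching step being weighed here (simplified; error handling and the
surrounding mpool code are omitted):

/* Simplified sketch of the alternatives discussed above.  Both ways of
 * extending the file leave pages zero-filled on demand; neither guarantees
 * that blocks are actually reserved, hence the possible late failures. */
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

static int extend_with_ftruncate(int fd, off_t size)
{
    return ftruncate(fd, size);              /* logical size set in one call */
}

static int extend_with_lseek_touch(int fd, off_t size)
{
    if (lseek(fd, size - 1, SEEK_SET) == (off_t)-1)
        return -1;
    /* Write the last byte so the file really has the requested length. */
    return (write(fd, "", 1) == 1) ? 0 : -1;
}

/* Optionally touch the first byte of every page after mmap() so allocation
 * failures show up at startup rather than at an arbitrary later point, at
 * the cost of startup time and, as noted above, of complicating memory
 * affinity. */
static void touch_pages(volatile char *base, size_t size, size_t page_size)
{
    for (size_t off = 0; off < size; off += page_size)
        base[off] = 0;
}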

Iain

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
   I'm a bright... http://www.the-brights.net/

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel