Re: [OMPI devel] Intel C (icc) 11.0.083 compile problem

2009-06-17 Thread Paul H. Hargrove

Jeff Squyres wrote:
[snip]
Erm -- that's weird.  So when you extract the tarballs, 
atomic-amd64-linux.s is non-empty (as it should be), but after a 
failed build, its file length is 0?


Notice that during the build process, we sym link atomic-amd64-linux.s 
to atomic-asm.S (I see that happening in your build output as well).  
So if the compiler is barfing when compiling atomic-asm.S, perhaps 
it's also wiping out the file...?  That would be darn weird, though...

[snip]

Hmm.  Not a solution to the original problem, but might I suggest that
any case where the build can overwrite a source file is a serious
problem.  Two possible ways come to mind to address that:
1) Either the configure or the make process could write-protect the source
file at some point prior to making the symlink.
2) The make process could copy, rather than symlink, the file (with a
dependency that would trigger a re-copy if the source file is updated).


The write-protect approach has the advantage that it would let us see a 
make failure at the point that something is trying (erroneously) to 
write/truncate the file.
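
For illustration, here is a minimal make-rule sketch of the two ideas.  The
paths are taken from the build log in this thread; neither rule is copied
from the actual OMPI build system, so treat the details as assumptions:

# (1) write-protect the generated source before the existing symlink step:
#       chmod a-w ../../opal/asm/generated/atomic-amd64-linux.s
#       ln -s "../../opal/asm/generated/atomic-amd64-linux.s" atomic-asm.S
#
# (2) copy instead of symlinking, with a dependency that refreshes the copy
#     whenever the generated file changes:
atomic-asm.S: ../../opal/asm/generated/atomic-amd64-linux.s
	cp -f ../../opal/asm/generated/atomic-amd64-linux.s atomic-asm.S

With (2), a compiler that truncates atomic-asm.S only damages the working
copy, and the next make run simply re-copies it from the generated file.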


-Paul

--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group Tel: +1-510-495-2352
HPC Research Department   Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory 



Re: [OMPI devel] Intel C (icc) 11.0.083 compile problem

2009-06-17 Thread Jeff Squyres

I am trying to build Open MPI 1.3.2 with ifort 11.0.074 and icc/icpc
11.0.083 (the Intel compilers) on a quad-core AMD Opteron workstation
running CentOS 4.4. I have no problems on this same machine if I use
ifort with gcc/g++ instead of icc/icpc. Configure seems to work ok even
though icc and icpc are detected as GNU compilers.

CC=icc CXX=icpc FC=ifort F77=ifort ./configure --disable-shared
--enable-static --prefix=/opt/intelsoft/openmpi/openmpi-1.3.2


Greetings Dave.  I tried this configuration earlier this morning and  
had no problem.  :-(


(also, I'm not sure what happened, but somehow your attachments came  
through as uncompressed and inline, meaning everyone on the list got a  
3MB+ email)



However, when I run 'make' it has trouble in the opal/asm directory:

libtool: compile:  icc -DHAVE_CONFIG_H -I. -I../../opal/include
-I../../orte/include -I../../ompi/include
-I../../opal/mca/paffinity/linux/plpa/src/libplpa -I../.. -O3 -DNDEBUG
-finline-functions -fno-strict-aliasing -restrict -MT atomic-asm.lo -MD
-MP -MF .deps/atomic-asm.Tpo -c atomic-asm.S -o atomic-asm.o
Unknown flag -x
Unknown flag -a
Unknown flag -s
Unknown flag -s
Unknown flag -e
Unknown flag -m
Unknown flag -b
Unknown flag -l
Unknown flag -e
Unknown flag -r
Unknown flag --
Unknown flag -w
Unknown flag -i
Unknown flag -t
Unknown flag -h
Unknown flag --
Unknown flag -c
Unknown flag -p
Unknown flag -p
Unknown flag -F


Hmm.  I find it odd that that -xassembler... flag does not appear in  
OMPI's output -- it leads me to believe that it's somehow being  
inserted under the covers by icc (or something else?).  When I built  
with icc 11.0 v083 this morning, here's the relevant parts from my  
"make" output:


libtool: compile:  icc -DHAVE_CONFIG_H -I. -I../../opal/include
-I../../orte/include -I../../ompi/include
-I../../opal/mca/paffinity/linux/plpa/src/libplpa -I../.. -O3 -DNDEBUG
-finline-functions -fno-strict-aliasing -restrict -pthread
-fvisibility=hidden -MT asm.lo -MD -MP -MF .deps/asm.Tpo -c asm.c -o asm.o

rm -f atomic-asm.S
ln -s "../../opal/asm/generated/atomic-amd64-linux.s" atomic-asm.S
depbase=`echo atomic-asm.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;\
/bin/sh ../../libtool   --mode=compile icc -DHAVE_CONFIG_H -I. -I../../opal/include
-I../../orte/include -I../../ompi/include
-I../../opal/mca/paffinity/linux/plpa/src/libplpa -I../.. -O3 -DNDEBUG
-finline-functions -fno-strict-aliasing -restrict -MT atomic-asm.lo -MD
-MP -MF $depbase.Tpo -c -o atomic-asm.lo atomic-asm.S &&\
mv -f $depbase.Tpo $depbase.Plo
libtool: compile:  icc -DHAVE_CONFIG_H -I. -I../../opal/include
-I../../orte/include -I../../ompi/include
-I../../opal/mca/paffinity/linux/plpa/src/libplpa -I../.. -O3 -DNDEBUG
-finline-functions -fno-strict-aliasing -restrict -MT atomic-asm.lo -MD
-MP -MF .deps/atomic-asm.Tpo -c atomic-asm.S -o atomic-asm.o
/bin/sh ../../libtool --tag=CC   --mode=link icc  -O3 -DNDEBUG
-finline-functions -fno-strict-aliasing -restrict -pthread
-fvisibility=hidden  -export-dynamic   -o libasm.la  asm.lo atomic-asm.lo
-lnsl -lutil

libtool: link: ar cru .libs/libasm.a  asm.o atomic-asm.o
libtool: link: ranlib .libs/libasm.a
libtool: link: ( cd ".libs" && rm -f "libasm.la" && ln -s "../libasm.la" "libasm.la" )



I can't find any hint of the reported "Unknown flags". What's more is
the opal/asm/generated/atomic-amd64-linux.s file is now empty (file size
is zero) thus breaking subsequent builds (i.e. with gcc). In order to
get the file back I have to re-extract from the source tarball. If I
execute 'make' again (no 'make clean') the compilation will complete
successfully but will make an empty libasm.a:


Erm -- that's weird.  So when you extract the tarballs, atomic-amd64-linux.s
is non-empty (as it should be), but after a failed build, its file length is 0?


Notice that during the build process, we sym link atomic-amd64-linux.s  
to atomic-asm.S (I see that happening in your build output as well).   
So if the compiler is barfing when compiling atomic-asm.S, perhaps  
it's also wiping out the file...?  That would be darn weird, though...
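
(For what it's worth, opening the symlink for writing would indeed zero the
real file.  A quick shell illustration; the ":" redirection here merely
stands in for whatever the icc driver might be doing, which is purely a
guess at this point:

$ echo "asm here" > atomic-amd64-linux.s
$ ln -s atomic-amd64-linux.s atomic-asm.S
$ : > atomic-asm.S               # truncate through the link
$ wc -c < atomic-amd64-linux.s
0

So any tool that opens atomic-asm.S with O_TRUNC follows the link and
empties the generated source.)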



However, even after I get Open MPI to compile, 'make check' will give
the following results:

libtool: link: icc -DOMPI_DISABLE_INLINE_ASM -O3 -DNDEBUG
-finline-functions -fno-strict-aliasing -restrict -pthread
-fvisibility=hidden -o atomic_barrier_noinline
atomic_barrier_noinline-atomic_barrier_noinline.o -Wl,--export-dynamic
../../opal/asm/.libs/libasm.a -lnsl -lutil -pthread
ipo: warning #11010: file format not recognized for
../../opal/asm/.libs/libasm.a, possible linker script
atomic_barrier_noinline-atomic_barrier_noinline.o(.text+0x29): In
function `main':
: undefined reference to `opal_atomic_mb'


Yes, this is not surprising if the .s file is empty -- it makes an  
empty .o file, and therefore those symbols just aren't defined.


--
Jeff Squyres
Cisco Systems



[OMPI devel] Intel C (icc) 11.0.083 compile problem

2009-06-17 Thread David Robertson

Hello,

I am trying to build Open MPI 1.3.2 with ifort 11.0.074 and icc/icpc 
11.0.083 (the Intel compilers) on a quad-core AMD Opteron workstation 
running CentOS 4.4. I have no problems on this same machine if I use 
ifort with gcc/g++ instead of icc/icpc. Configure seems to work ok even 
though icc and icpc are detected as GNU compilers.


CC=icc CXX=icpc FC=ifort F77=ifort ./configure --disable-shared 
--enable-static --prefix=/opt/intelsoft/openmpi/openmpi-1.3.2


However, when I run 'make' it has trouble in the opal/asm directory:

libtool: compile:  icc -DHAVE_CONFIG_H -I. -I../../opal/include 
-I../../orte/include -I../../ompi/include 
-I../../opal/mca/paffinity/linux/plpa/src/libplpa -I../.. -O3 -DNDEBUG 
-finline-functions -fno-strict-aliasing -restrict -MT atomic-asm.lo -MD 
-MP -MF .deps/atomic-asm.Tpo -c atomic-asm.S -o atomic-asm.o

Unknown flag -x
Unknown flag -a
Unknown flag -s
Unknown flag -s
Unknown flag -e
Unknown flag -m
Unknown flag -b
Unknown flag -l
Unknown flag -e
Unknown flag -r
Unknown flag --
Unknown flag -w
Unknown flag -i
Unknown flag -t
Unknown flag -h
Unknown flag --
Unknown flag -c
Unknown flag -p
Unknown flag -p
Unknown flag -F
Cannot open source file .deps/atomic-asm.Tpo
Extra name /tmp/icc2ioudZ.s ignored
No input file for -M flag
mv: cannot stat `.deps/atomic-asm.Tpo': No such file or directory
make[2]: *** [atomic-asm.lo] Error 1
make[2]: Leaving directory `/usr/local/src/openmpi-1.3.2/opal/asm'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/usr/local/src/openmpi-1.3.2/opal'
make: *** [all-recursive] Error 1

I can't find any hint of the reported "Unknown flags". What's more is 
the opal/asm/generated/atomic-amd64-linux.s file is now empty (file size 
is zero) thus breaking subsequent builds (i.e. with gcc). In order to 
get the file back I have to re-extract from the source tarball. If I 
execute 'make' again (no 'make clean') the compilation will complete 
successfully but will make an empty libasm.a:



bash-3.00$ nm opal/asm/.libs/libasm.a

asm.o:

atomic-asm.o:


However, even after I get Open MPI to compile, 'make check' will give 
the following results:


libtool: link: icc -DOMPI_DISABLE_INLINE_ASM -O3 -DNDEBUG 
-finline-functions -fno-strict-aliasing -restrict -pthread 
-fvisibility=hidden -o atomic_barrier_noinline 
atomic_barrier_noinline-atomic_barrier_noinline.o -Wl,--export-dynamic 
../../opal/asm/.libs/libasm.a -lnsl -lutil -pthread
ipo: warning #11010: file format not recognized for 
../../opal/asm/.libs/libasm.a, possible linker script
atomic_barrier_noinline-atomic_barrier_noinline.o(.text+0x29): In 
function `main':

: undefined reference to `opal_atomic_mb'
atomic_barrier_noinline-atomic_barrier_noinline.o(.text+0x2e): In 
function `main':

: undefined reference to `opal_atomic_rmb'
atomic_barrier_noinline-atomic_barrier_noinline.o(.text+0x33): In 
function `main':

: undefined reference to `opal_atomic_wmb'
make[3]: *** [atomic_barrier_noinline] Error 1
make[3]: Leaving directory `/usr/local/src/openmpi-1.3.2/test/asm'
make[2]: *** [check-am] Error 2
make[2]: Leaving directory `/usr/local/src/openmpi-1.3.2/test/asm'
make[1]: *** [check-recursive] Error 1
make[1]: Leaving directory `/usr/local/src/openmpi-1.3.2/test'
make: *** [check-recursive] Error 1

I have attached the output of 'configure' (conf.log1), config.log, the 
output of the first 'make' (m.log1), and the output of 'make check' 
(check.log1).


NOTE: If I use the CFLAG '-no-gcc' configure fails.

Thanks,
Dave
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking how to create a ustar tar archive... gnutar


== Configuring Open MPI


*** Checking versions
checking Open MPI version... 1.3.2
checking Open MPI release date... Apr 21, 2009
checking Open MPI Subversion repository version... r21054
checking Open Run-Time Environment version... 1.3.2
checking Open Run-Time Environment release date... Apr 21, 2009
checking Open Run-Time Environment Subversion repository version... r21054
checking Open Portable Access Layer version... 1.3.2
checking Open Portable Access Layer release date... Apr 21, 2009
checking Open Portable Access Layer Subversion repository version... r21054

*** Initialization, setup
configure: builddir: /usr/local/src/openmpi-1.3.2
configure: srcdir: /usr/local/src/openmpi-1.3.2
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
installing to directory "/opt/intelsoft/openmpi/openmpi-1.3.2"

*** Configuration options
checking whether to run code coverage... no
checking whether to compile with branch probabilities... no
checking whether to debug memory usage... no
ch

Re: [OMPI devel] connect management for multirail (Open-)MX

2009-06-17 Thread George Bosilca
Yes, in Open MPI the connections are usually created on demand. As far
as I know there are a few devices that do not abide by this "law", but
MX is not one of them.


To be more precise on how the connections are established, if we say
that each node has two rails and we're doing a ping-pong, the first
message from p0 to p1 will connect the first NIC, and the second
message the second NIC (here I made the assumption that both networks
are similar). Moreover, in MX the connection is not symmetric, so your
(1) and (2) might happen simultaneously.
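
A tiny C sketch of that on-demand, round-robin behaviour (the names rail_t,
rail_connect and send_message are invented for illustration; this is not the
real btl_mx code or the MX API):

#include <stdbool.h>
#include <stdio.h>

#define NUM_RAILS 2

typedef struct {
    int  id;
    bool connected;            /* set after the blocking connect */
} rail_t;

static rail_t rails[NUM_RAILS] = { { 0, false }, { 1, false } };

/* Stand-in for the blocking per-rail connect (mx_connect in real life). */
static void rail_connect(rail_t *r, int peer)
{
    printf("connecting rail %d to peer %d (blocking)\n", r->id, peer);
    r->connected = true;
}

/* Messages are load-balanced round-robin over the rails; the first message
 * that lands on a not-yet-connected rail triggers that rail's connect. */
static void send_message(int peer, int msg)
{
    rail_t *r = &rails[msg % NUM_RAILS];
    if (!r->connected)
        rail_connect(r, peer);
    printf("message %d -> rail %d\n", msg, r->id);
}

int main(void)
{
    for (int msg = 0; msg < 4; msg++)
        send_message(1, msg);
    return 0;
}

With two similar rails this connects rail 0 on the first message and rail 1
on the second, which matches the (1)/(3) ordering Brice describes; nothing
in the scheme forces the peer's (2) to complete before (3) starts.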


Does the code contain an MPI_Barrier ? If yes, this might be why you  
see the sequence (1), (2), (3) and (4) ...


  george.

On Jun 17, 2009, at 12:13 , Brice Goglin wrote:


Thanks for the answer. So if I understand correctly, the connection
order is decided dynamically depending on when each peer has some
messages to send and how the upper level load-balances them. There
shouldn't be anything preventing (1) and (2) from happening at the same
time then. And I wonder why I always see 1,2,3,4 with MX (using IMB) and
not with Open-MX...

Brice



George Bosilca wrote:

Brice,

The connection mechanism in the MX BTL suffers from a big problem on
multi-rail (if all NICs are identical). If the rails are connected
using the same mapper, they will have identical IDs. Unfortunately,
these IDs are supposed to be unique in order to guarantee the
connection ordering (0 to 0, 1 to 1 and so on based on the mapper's
MAC). However, the outcome I saw in the past in this case is not a
deadlock but a poor distribution of the data over the two NICs: one
will be overloaded while the other will not be used at all.

There is no answer from a peer when we connect the MX BTLs. If the
steps are the ones you described in your email, then I guess both of
the peers try to connect to the other simultaneously. Now, when you
have multiple rails, we treat them at the upper level as independent
devices, and we will try to load balance the messages over all of
them. The step (3) seems to indicate that another message (MPI) has
been sent, and because of the load balancing scheme we try to connect
the second device (rail in this context). In MX this works because we
use the blocking function (mx_connect).

 george.

On Jun 17, 2009, at 08:23 , Brice Goglin wrote:


Hello,

I am debugging some sort of deadlock when doing multirail over Open-MX.
What I am seeing with 2 processes and 2 boards per node with *MX* is:

1) process 0 rail 0 connects to process 1 rail 0
2) p1r0 connects back to p0r0
3) p0 rail 1 connects to p1 rail 1
4) p1r1 connects back to p0r1
For some reason, with *Open-MX*, process 0 seems to start (3) before
process 1 has finished (2). It probably causes a deadlock because p1 is
polling on rail 0 for (2), while (3) needs somebody to poll on rail 1
for the connect handshake.

So, the question is: is there anything in OMPI (1.3) guaranteeing that
the above 4 steps will occur in some specified order? If so, Open-MX is
probably doing something wrong that breaks the order. If not, adding a
progression thread to Open-MX might be the only solution...

thanks,
Brice

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] connect management for multirail (Open-)MX

2009-06-17 Thread Brice Goglin
Thanks for the answer. So if I understand correctly, the connection
order is decided dynamically depending on when each peer has some
messages to send and how the upper level load-balances them. There
shouldn't be anything preventing (1) and (2) from happening at the same
time then. And I wonder why I always see 1,2,3,4 with MX (using IMB) and
not with Open-MX...

Brice



George Bosilca wrote:
> Brice,
>
> The connection mechanism in the MX BTL suffers from a big problem on
> multi-rail (if all NICs are identical). If the rails are connected
> using the same mapper, they will have identical IDs. Unfortunately,
> these IDs are supposed to be unique in order to guarantee the
> connection ordering (0 to 0, 1 to 1 and so on based on the mapper's
> MAC). However, the outcome I saw in the past in this case is not a
> deadlock but a poor distribution of the data over the two NICs: one
> will be overloaded while the other will not be used at all.
>
> There is no answer from a peer when we connect the MX BTLs. If the
> steps are the ones you described in your email, then I guess both of
> the peers try to connect to the other simultaneously. Now, when you
> have multiple rails, we treat them at the upper level as independent
> devices, and we will try to load balance the messages over all of
> them. The step (3) seems to indicate that another message (MPI) has
> been sent, and because of the load balancing scheme we try to connect
> the second device (rail in this context). In MX this works because we
> use the blocking function (mx_connect).
>
>   george.
>
> On Jun 17, 2009, at 08:23 , Brice Goglin wrote:
>
>> Hello,
>>
>> I am debugging some sort of deadlock when doing multirail over Open-MX.
>> What I am seeing with 2 processes and 2 boards per node with *MX* is:
>> 1) process 0 rail 0 connects to process 1 rail 0
>> 2) p1r0 connects back to p0r0
>> 3) p0 rail 1 connects to p1 rail 1
>> 4) p1r1 connects back to p0r1
>> For some reason, with *Open-MX*, process 0 seems to start (3) before
>> process 1 has finished (2). It probably causes a deadlock because p1 is
>> polling on rail 0 for (2), while (3) needs somebody to poll on rail 1
>> for the connect handshake.
>>
>> So, the question is: is there anything in OMPI (1.3) guaranteeing that
>> the above 4 steps will occur in some specified order? If so, Open-MX is
>> probably doing something wrong that breaks the order. If not, adding a
>> progression thread to Open-MX might be the only solution...
>>
>> thanks,
>> Brice
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] connect management for multirail (Open-)MX

2009-06-17 Thread George Bosilca

Brice,

The connection mechanism in the MX BTL suffers from a big problem on
multi-rail (if all NICs are identical). If the rails are connected
using the same mapper, they will have identical IDs. Unfortunately,
these IDs are supposed to be unique in order to guarantee the
connection ordering (0 to 0, 1 to 1 and so on based on the mapper's
MAC). However, the outcome I saw in the past in this case is not a
deadlock but a poor distribution of the data over the two NICs: one
will be overloaded while the other will not be used at all.


There is no answer from a peer when we connect the MX BTLs. If the  
steps are the ones you described in your email, then I guess both of  
the peers try to connect to the other simultaneously. Now, when you  
have multiple rails, we treat them at the upper level as independent  
devices, and we will try to load balance the messages over all of  
them. The step (3) seems to indicate that another message (MPI) has  
been sent, and because of the load balancing scheme we try to connect  
the second device (rail in this context). In MX this works because we  
use the blocking function (mx_connect).


  george.

On Jun 17, 2009, at 08:23 , Brice Goglin wrote:


Hello,

I am debugging some sort of deadlock when doing multirail over Open-MX.
What I am seeing with 2 processes and 2 boards per node with *MX* is:
1) process 0 rail 0 connects to process 1 rail 0
2) p1r0 connects back to p0r0
3) p0 rail 1 connects to p1 rail 1
4) p1r1 connects back to p0r1
For some reason, with *Open-MX*, process 0 seems to start (3) before
process 1 has finished (2). It probably causes a deadlock because p1 is
polling on rail 0 for (2), while (3) needs somebody to poll on rail 1
for the connect handshake.

So, the question is: is there anything in OMPI (1.3) guaranteeing that
the above 4 steps will occur in some specified order? If so, Open-MX is
probably doing something wrong that breaks the order. If not, adding a
progression thread to Open-MX might be the only solution...

thanks,
Brice

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] 1.3.3 Release Schedule

2009-06-17 Thread Brad Benton
On Wed, Jun 17, 2009 at 6:45 AM, Jeff Squyres  wrote:

> Looks good to me.  Brad -- can you add this to the wiki in the 1.3 series
> page?


done: https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3.3

--brad



>
>
> On Jun 16, 2009, at 10:37 PM, Brad Benton wrote:
>
>  All:
>>
>> We are close to releasing 1.3.3.  This is the current plan:
>>  - Evening of 6/16: collect MTT runs on the current branch w/the current
>> 1.3.3 features & fixes
>>  - If all goes well with the overnight MTT runs, roll a release candidate
>> on 6/17
>>  - Put 1.3.3rc1 through its paces over the next couple of days
>>  - If all goes well with rc1, release 1.3.3 on Friday, June 19
>>
>> 1.3.3 will include support for Windows as its major new feature, as well
>> as a number of defect fixes.
>>
>> 1.3.3 will be the final feature release in the 1.3 series.  As such, with
>> the new feature/stable numbering
>> scheme, the next release in the series will contain defect fixes only and
>> will transition to 1.4.  This
>> will be the stable/maintenance branch.  The plan is for it to follow the
>> 1.3.3 release by a fairly short time
>> (4-6 weeks), and subsequent releases in the series will take place as need
>> be depending on the bug
>> fix volume & criticality.
>>
>> Thanks,
>> --brad
>> 1.3/1.4 co-release mgr
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI devel] Just a suggestion about a formation of new openMPI student mailing list

2009-06-17 Thread Leo P.
Hi Eugene,

I was just thinking about Ubuntu's MOTU initiative
[https://wiki.ubuntu.com/MOTU/Mentoring] when I talked about a mentoring program
for openMPI.

Also, I thought the user mailing list was for talking about user-level programs,
not things related to core openMPI functions and so on.

And yes, I have observed how ad hoc relationships spring up in the openMPI
community. :)

Regards,
Leo. P





From: Eugene Loh 
To: Open MPI Developers 
Sent: Wednesday, 17 June, 2009 8:44:07 PM
Subject: Re: [OMPI devel] Just a suggestion about a formation of new openMPI 
student mailing list

Leo P. wrote:
I found the openMPI community filled with co-operative and helpful people,
and would like to thank them through this email [Nik, Eugene, Ralph,
Mitchel and others].

You are very gracious.

Also, I would like to suggest one or maybe two things.

1. First of all, I would like to suggest a different mailing list for
students like me who want to learn about openMPI, since questions from
someone like me are going to be simpler than those of other professional
developers. Maybe the students on the student mailing list can solve
them; if not, we can post on the developers mailing list. I think this
will limit the email on the developers list.

I think there is already such a list.  It's the "users" (rather than
"devel") list.

2. Secondly, if developers could volunteer to become mentors for students
(particularly thesis students like me :) ), I think they would benefit
a lot.

Perhaps some of those relationships spring up "ad hoc" on the mail
list, as you have already observed.




Re: [OMPI devel] Just a suggestion about a formation of new openMPI student mailing list

2009-06-17 Thread Eugene Loh




Leo P. wrote:

I found the openMPI community filled with co-operative and helpful people,
and would like to thank them through this email [Nik, Eugene, Ralph,
Mitchel and others].
  

You are very gracious.

Also, I would like to suggest one or maybe two things.

1. First of all, I would like to suggest a different mailing list for
students like me who want to learn about openMPI, since questions from
someone like me are going to be simpler than those of other professional
developers. Maybe the students on the student mailing list can solve
them; if not, we can post on the developers mailing list. I think this
will limit the email on the developers list.
  

I think there is already such a list.  It's the "users" (rather than
"devel") list.

2. Secondly, if developers could volunteer to become mentors for students
(particularly thesis students like me :) ), I think they would benefit
a lot.
  

Perhaps some of those relationships spring up "ad hoc" on the mail
list, as you have already observed.




[OMPI devel] Just a suggestion about a formation of new openMPI student mailing list

2009-06-17 Thread Leo P.
Hi everyone, 

I found the openMPI community filled with co-operative and helpful people, and
would like to thank them through this email [Nik, Eugene, Ralph, Mitchel and
others].

Also, I would like to suggest one or maybe two things.

1. First of all, I would like to suggest a different mailing list for students
like me who want to learn about openMPI, since questions from someone like me
are going to be simpler than those of other professional developers. Maybe the
students on the student mailing list can solve them; if not, we can post on the
developers mailing list. I think this will limit the email on the developers
list.

2. Secondly, if developers could volunteer to become mentors for students
(particularly thesis students like me :) ), I think they would benefit a lot.


Regards,
Leo P.



[OMPI devel] Fault Tolerant OpenMPI

2009-06-17 Thread 刚 王
Hi All,

I'm studying fault-tolerant MPI. Does OpenMPI support failure auto-detection,
notification, and MPI library rebuilding like Harness+FT-MPI?

Many thanks.

Gang Wang




Re: [OMPI devel] [RFC] Low pressure OPAL progress

2009-06-17 Thread Ashley Pittman
On Tue, 2009-06-09 at 07:28 -0400, Terry Dontje wrote:
> The biggest issue is coming up with a 
> way to have blocks on the SM btl converted to the system poll call 
> without requiring a socket write for every packet.

For what it's worth, you don't need a socket write for every (local) packet;
all you need is to send your local peers a message when you are about to
sleep.  This can be implemented with a shared-memory word, so no extra
communication is required.  The sender can then send a message using whatever
means it does currently, check if the bit is set, and send a "wakeup" message
via a socket if the remote process is sleeping.

You need to be careful to get the ordering right or you end up with
deadlocks, and you need to establish a "remote wakeup" mechanism, although
this is easily done with sockets.  You don't even need to communicate
over the socket; all it's for is to cause your peer to return from
poll/select so it can query the shared-memory state.  Signals would also
likely work; however, they tend to present other problems in my
experience.
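
A rough, self-contained C sketch of that scheme (hypothetical names such as
doorbell_t; this is not the actual sm BTL code, just the pattern described
above):

#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

typedef struct {
    volatile int sleeping;   /* would live in the shared-memory segment */
    int wake_fd[2];          /* socketpair: [0] polled by sleeper, [1] written by sender */
} doorbell_t;

static doorbell_t db;

/* Receiver: advertise the intent to sleep (real code would re-check the SM
 * queues here), then block in poll() until a doorbell byte arrives. */
static void *receiver(void *arg)
{
    (void)arg;
    db.sleeping = 1;
    __sync_synchronize();                 /* publish the flag before blocking */
    struct pollfd pfd = { .fd = db.wake_fd[0], .events = POLLIN };
    poll(&pfd, 1, -1);
    db.sleeping = 0;
    char c;
    read(db.wake_fd[0], &c, 1);           /* drain the doorbell byte */
    printf("woken up, back to polling shared memory\n");
    return NULL;
}

/* Sender: enqueue into shared memory as usual (omitted), then write one
 * byte to the socket only if the peer said it went to sleep. */
static void wake_peer_if_sleeping(void)
{
    __sync_synchronize();
    if (db.sleeping)
        write(db.wake_fd[1], "x", 1);
}

int main(void)
{
    socketpair(AF_UNIX, SOCK_STREAM, 0, db.wake_fd);
    pthread_t t;
    pthread_create(&t, NULL, receiver, NULL);
    sleep(1);                             /* let the receiver block in poll() */
    wake_peer_if_sleeping();
    pthread_join(t, NULL);
    return 0;
}

The ordering caveat is exactly the flag-then-recheck step: the sleeper has to
set the flag and re-scan the queues before blocking, otherwise a message that
arrives in that window is never followed by a doorbell.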

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



[OMPI devel] connect management for multirail (Open-)MX

2009-06-17 Thread Brice Goglin
Hello,

I am debugging some sort of deadlock when doing multirail over Open-MX.
What I am seeing with 2 processes and 2 boards per node with *MX* is:
1) process 0 rail 0 connects to process 1 rail 0
2) p1r0 connects back to p0r0
3) p0 rail 1 connects to p1 rail 1
4) p1r1 connects back to p0r1
For some reason, with *Open-MX*, process 0 seems to start (3) before
process 1 has finished (2). It probably causes a deadlock because p1 is
polling on rail 0 for (2), while (3) needs somebody to poll on rail 1
for the connect handshake.

So, the question is: is there anything in OMPI (1.3) guaranteeing that
the above 4 steps will occur in some specified order? If so, Open-MX is
probably doing something wrong that breaks the order. If not, adding a
progression thread to Open-MX might be the only solution...

thanks,
Brice



Re: [OMPI devel] 1.3.3 Release Schedule

2009-06-17 Thread Jeff Squyres
Looks good to me.  Brad -- can you add this to the wiki in the 1.3  
series page?


On Jun 16, 2009, at 10:37 PM, Brad Benton wrote:


All:

We are close to releasing 1.3.3.  This is the current plan:
  - Evening of 6/16: collect MTT runs on the current branch w/the  
current 1.3.3 features & fixes
  - If all goes well with the overnight MTT runs, roll a release  
candidate on 6/17

  - Put 1.3.3rc1 through its paces over the next couple of days
  - If all goes well with rc1, release 1.3.3 on Friday, June 19

1.3.3 will include support for Windows as its major new feature, as  
well as a number of defect fixes.


1.3.3 will be the final feature release in the 1.3 series.  As such,  
with the new feature/stable numbering
scheme, the next release in the series will contain defect fixes  
only and will transition to 1.4.  This
will be the stable/maintenance branch.  The plan is for it to follow  
the 1.3.3 release by a fairly short time
(4-6 weeks), and subsequent releases in the series will take place
as need be depending on the bug fix volume & criticality.

Thanks,
--brad
1.3/1.4 co-release mgr
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems