[OMPI devel] Insufficient lockable memory leads to osu_bibw hang using OpenIB BTL

2008-07-15 Thread Mark Debbage
The osu_bibw micro-benchmark from Ohio State's OMB 3.1 suite hangs when
run over OpenMPI 1.2.5 from OFED 1.3 using the OpenIB BTL if there is
insufficient lockable memory. 128MB of lockable memory gives a hang
when the test gets to 4MB messages, while 512MB is sufficient for it
to pass. I observed this with InfiniPath and Mellanox adapter cards,
and see the same behavior with 1.2.6 too. I know the general advice 
is to use an unlimited or very large setting (per the FAQ), but there
are reasons for clusters to set finite user limits.

For each message size in the loop, osu_bibw posts 64 non-blocking
sends and then 64 non-blocking receives on both ranks, followed by
a wait for them all to complete. 64 is the default window size
(the number of concurrent messages). For 4MB messages that is 256MB
of data to be sent, which more than exhausts the 128MB of lockable
memory on these systems. The OpenIB BTL does ib_reg_mr for as many
of the sends as it can and the rest wait on a pending list. The
ib_reg_mr calls for the posted receives then all fail as well due
to the ulimit check, so all of them end up on a pending list too.
As a result neither rank actually gets to do an ib_post_recv,
neither side can make progress, and the benchmark hangs without completing
a single 4MB message! This contrasts with the uni-directional 
osu_bw where one side does sends and the other does receives 
and progress can be made.
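
For reference, the pattern is roughly the following (a simplified,
self-contained sketch of what is described above, not the actual OMB 3.1
source; tag and buffer handling are stripped down):

/* bibw_sketch.c -- simplified sketch of the osu_bibw inner loop
 * (illustrative only, not the actual OMB 3.1 source).  With WINDOW = 64
 * and SIZE = 4MB, each rank asks the BTL to register 64 x 4MB for the
 * sends plus another 64 x 4MB for the receives before anything completes. */
#include <mpi.h>
#include <stdlib.h>

#define WINDOW 64
#define SIZE   (4 * 1024 * 1024)

int main(int argc, char **argv)
{
    MPI_Request sreq[WINDOW], rreq[WINDOW];
    char *sbuf = malloc(SIZE), *rbuf = malloc(SIZE);
    int i, rank, peer;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;                      /* run with exactly 2 ranks */

    /* Both ranks post the whole send window, then the whole receive
     * window, and only then wait -- so every buffer has to be
     * registered (pinned) up front. */
    for (i = 0; i < WINDOW; i++)
        MPI_Isend(sbuf, SIZE, MPI_CHAR, peer, 100, MPI_COMM_WORLD, &sreq[i]);
    for (i = 0; i < WINDOW; i++)
        MPI_Irecv(rbuf, SIZE, MPI_CHAR, peer, 100, MPI_COMM_WORLD, &rreq[i]);
    MPI_Waitall(WINDOW, sreq, MPI_STATUSES_IGNORE);
    MPI_Waitall(WINDOW, rreq, MPI_STATUSES_IGNORE);

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}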

This is admittedly a hard problem to solve in the general case.
It is unfortunate that this leads to a hang, rather than a
message advising the user to check ulimits. Maybe there should
be a warning the first time that the ulimit is exceeded to
alert the user to the problem. One solution would be to divide
the ulimit up into separate limits for sending and receiving,
so that excessive sending does not block all receiving. This
would require OpenMPI to keep track of the ulimit usage
separately for send and receive.
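
To illustrate the idea, the accounting would look something like this
(a sketch of the suggestion only, not existing OpenMPI code; the names
are made up):

/* Sketch of split send/receive registration accounting -- not existing
 * OpenMPI code, just an illustration of the suggestion above.  Each
 * direction gets its own budget, so a flood of pending sends can no
 * longer starve every receive registration. */
#include <stddef.h>

enum reg_dir { REG_SEND, REG_RECV };

static size_t reg_budget[2];   /* e.g. ulimit -l split between the two */
static size_t reg_used[2];

/* Returns 1 if the registration fits in this direction's budget (the
 * caller then does the real ib_reg_mr), or 0 if the fragment must go
 * on that direction's pending list. */
static int try_account_registration(enum reg_dir dir, size_t len)
{
    if (reg_used[dir] + len > reg_budget[dir])
        return 0;
    reg_used[dir] += len;
    return 1;
}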

In this particular synthetic benchmark there turns out to be
a straightforward workaround. The benchmark actually sends
from the same buffer 64 times over, and receives into another
buffer 64 times over (all posted concurrently). Thus there are 
really only two 4MB buffers at play here, yet the kernel IB
code charges the user separately for all 64 registrations of
each buffer, even though those pages are already locked. In fact,
the Linux implementation of mlock (over)charges in the same way,
so I guess that choice is intentional and that the additional
complexity of spotting the duplicated locked pages wasn't
considered worthwhile.
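
Concretely, with the default window of 64 and 4MB messages, each rank
gets charged 64 x 4MB = 256MB for the sends plus another 64 x 4MB =
256MB for the receives -- 512MB of accounting against ulimit -l -- even
though only two distinct 4MB buffers (8MB of pages) are actually locked.
That matches the observation above that 128MB hangs while 512MB passes.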

This leads to the workaround of using --mca mpi_leave_pinned 1.
This turns on the code in the OpenIB BTL that caches the descriptors
so that there is only 1 ib_reg_mr for the send buffer and 1 ib_reg_mr
for the receive buffer, and all the others hit the descriptor
cache. This saves the day and the benchmark runs without problem.
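
For example, the invocation is just (host names illustrative):

mpirun --mca mpi_leave_pinned 1 -np 2 -H host1,host2 ./osu_bibw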

If this were the default option, it might save users much consternation.
For this workaround, note that there is no need for the descriptors
to be left pinned after the send/recv completes; all that is needed
is the caching while they are posted. So one could
default to having the descriptor caching mechanism enabled even when 
mpi_leave_pinned is off. Also note that this is still a workaround 
that happens to be sufficient for the osu_bibw case but isn't a 
general panacea. osu_bibw and osu_bw are "broken" anyway in that 
it is illegal to post multiple concurrent receives in the same 
receive buffer. I believe this is done to minimize CPU cache 
effects and maximize measured bandwidth. Anyway, having multiple
posted sends from the same send buffer is reasonable (e.g., a broadcast),
so caching those descriptors and reducing lockable memory usage 
seems like a good idea to me. Although osu_bibw is very synthetic,
it is conceivable that other real codes with large messages could
see the same hang (e.g., just an MPI_Sendrecv of a message larger than ulimit -l?).
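
For example, something along these lines would exercise that case (a
hypothetical illustration of the question above, not a test I have run;
pick MSG larger than ulimit -l):

/* Hypothetical illustration of the MPI_Sendrecv question above -- not a
 * test that was actually run.  If MSG is larger than ulimit -l, the BTL
 * cannot register either the send or the receive buffer in full. */
#include <mpi.h>
#include <stdlib.h>

#define MSG (256 * 1024 * 1024)   /* choose something above ulimit -l */

int main(int argc, char **argv)
{
    char *sbuf = malloc(MSG), *rbuf = malloc(MSG);
    int rank, peer;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;               /* run with exactly 2 ranks */

    MPI_Sendrecv(sbuf, MSG, MPI_CHAR, peer, 0,
                 rbuf, MSG, MPI_CHAR, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}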

Cheers,

Mark.


Re: [OMPI devel] Segfault in 1.3 branch

2008-07-15 Thread Jeff Squyres
To be clear -- this looks like a different issue than what Pasha was  
reporting.



On Jul 15, 2008, at 8:55 AM, Rolf vandeVaart wrote:



Lenny, I opened a ticket for something that looks the same as this.  
Maybe you can add your details to it.


https://svn.open-mpi.org/trac/ompi/ticket/1386

Rolf

Lenny Verkhovsky wrote:


I guess it should be here, sorry.

/home/USERS/lenny/OMPI_ORTE_18850/bin/mpirun -np 2 -H  
witch2,witch3 ./IMB-MPI1_18850 PingPong

#---
# Intel (R) MPI Benchmark Suite V3.0v modified by Voltaire, MPI-1  
part

#---
# Date : Tue Jul 15 15:11:30 2008
# Machine : x86_64
# System : Linux
# Release : 2.6.16.46-0.12-smp
# Version : #1 SMP Thu May 17 14:00:09 UTC 2007
# MPI Version : 2.0
# MPI Thread Environment: MPI_THREAD_SINGLE

#
# Minimum message length in bytes: 0
# Maximum message length in bytes: 67108864
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong
[witch3:32461] *** Process received signal ***
[witch3:32461] Signal: Segmentation fault (11)
[witch3:32461] Signal code: Address not mapped (1)
[witch3:32461] Failing at address: 0x20
[witch3:32461] [ 0] /lib64/libpthread.so.0 [0x2b514fcedc10]
[witch3:32461] [ 1] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/ 
mca_pml_ob1.so [0x2b51510b416a]
[witch3:32461] [ 2] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/ 
mca_pml_ob1.so [0x2b51510b4661]
[witch3:32461] [ 3] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/ 
mca_pml_ob1.so [0x2b51510b180e]
[witch3:32461] [ 4] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/ 
mca_btl_openib.so [0x2b5151811c22]
[witch3:32461] [ 5] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/ 
mca_btl_openib.so [0x2b51518132e9]
[witch3:32461] [ 6] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/ 
mca_bml_r2.so [0x2b51512c412f]
[witch3:32461] [ 7] /home/USERS/lenny/OMPI_ORTE_18850/lib/libopen- 
pal.so.0(opal_progress+0x5a) [0x2b514f71268a]
[witch3:32461] [ 8] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/ 
mca_pml_ob1.so [0x2b51510af0f5]
[witch3:32461] [ 9] /home/USERS/lenny/OMPI_ORTE_18850/lib/libmpi.so. 
0(PMPI_Recv+0x13b) [0x2b514f47941b]

[witch3:32461] [10] ./IMB-MPI1_18850(IMB_pingpong+0x1a1) [0x4073cd]
[witch3:32461] [11] ./IMB-MPI1_18850(IMB_warm_up+0x2d) [0x405e49]
[witch3:32461] [12] ./IMB-MPI1_18850(main+0x394) [0x4034d4]
[witch3:32461] [13] /lib64/libc.so.6(__libc_start_main+0xf4)  
[0x2b514fe14154]

[witch3:32461] [14] ./IMB-MPI1_18850 [0x4030a9]
[witch3:32461] *** End of error message ***
mpirun: killing job...

--
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--
 witch2
 witch3


On 7/15/08, Pavel Shamis (Pasha) wrote:



   It looks like a new issue to me, Pasha. Possibly a side
   consequence of the
   IOF change made by Jeff and I the other day. From what I can
   see, it looks
   like your app was a simple "hello" - correct?

   Yep, it is a simple hello application.

   If you look at the error, the problem occurs when mpirun is
   trying to route
   a message. Since the app is clearly running at this time, the
   problem is
   probably in the IOF. The error message shows that mpirun is
   attempting to
   route a message to a jobid that doesn't exist. We have a test
   in the RML
   that forces an "abort" if that occurs.

   I would guess that there is either a race condition or memory
   corruption
   occurring somewhere, but I have no idea where.

   This may be the "new hole in the dyke" I cautioned about in
   earlier notes
   regarding the IOF... :-)

   Still, given that this hits rarely, it probably is a more
   acceptable bug to
   leave in the code than the one we just fixed (duplicated  
stdin)...


   It is not such a rare issue; there were 19 failures in my MTT run
   (http://www.open-mpi.org/mtt/index.php?do_redir=765).

   Pasha

   Ralph



   On 7/14/08 1:11 AM, "Pavel Shamis (Pasha)" wrote:


   Please see http://www.open-mpi.org/mtt/index.php?do_redir=764

   The error is not consistent. It takes a lot of iterations
   to reproduce it.
   In my MTT testing I have seen it a few times.

   Is it a known issue?

   Regards,
   Pasha



   

[OMPI devel] New trac milestone: v1.4

2008-07-15 Thread Jeff Squyres

I created a new trac milestone today: v1.4.

So if you have stuff that you know won't make it for v1.3, but you  
definitely want it in the next version, go ahead and label it for v1.4  
(as opposed to the amorphous "Future" milestone, which means "we'll do  
it someday -- possibly v1.4, possibly later").


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Segfault in 1.3 branch

2008-07-15 Thread Pavel Shamis (Pasha)

I opened ticket for the bug:
https://svn.open-mpi.org/trac/ompi/ticket/1389

Ralph Castain wrote:

It looks like a new issue to me, Pasha. Possibly a side consequence of the
IOF change made by Jeff and I the other day. From what I can see, it looks
like your app was a simple "hello" - correct?

If you look at the error, the problem occurs when mpirun is trying to route
a message. Since the app is clearly running at this time, the problem is
probably in the IOF. The error message shows that mpirun is attempting to
route a message to a jobid that doesn't exist. We have a test in the RML
that forces an "abort" if that occurs.

I would guess that there is either a race condition or memory corruption
occurring somewhere, but I have no idea where.

This may be the "new hole in the dyke" I cautioned about in earlier notes
regarding the IOF... :-)

Still, given that this hits rarely, it probably is a more acceptable bug to
leave in the code than the one we just fixed (duplicated stdin)...

Ralph



On 7/14/08 1:11 AM, "Pavel Shamis (Pasha)"  wrote:

  

Please see http://www.open-mpi.org/mtt/index.php?do_redir=764

The error is not consistent. It takes a lot of iterations to reproduce it.
In my MTT testing I have seen it a few times.

Is it a known issue?

Regards,
Pasha





  




Re: [OMPI devel] ompi_ignore dr pml?

2008-07-15 Thread Andrew Friedley
The UD BTL currently relies on DR for reliability, though in the near
future the UD BTL is planned to have its own reliability support -- so I'm fine
with DR going away.


Andrew

Jeff Squyres wrote:

Should we .ompi_ignore dr?

It's not complete and no one wants to support it.  I'm thinking that we 
shouldn't even include it in v1.3.


Thoughts?





Re: [OMPI devel] Segfault in 1.3 branch

2008-07-15 Thread Rolf vandeVaart


Lenny, I opened a ticket for something that looks the same as this. 
Maybe you can add your details to it.


https://svn.open-mpi.org/trac/ompi/ticket/1386

Rolf

Lenny Verkhovsky wrote:


I guess it should be here, sorry.

/home/USERS/lenny/OMPI_ORTE_18850/bin/mpirun -np 2 -H witch2,witch3 
./IMB-MPI1_18850 PingPong

#---
# Intel (R) MPI Benchmark Suite V3.0v modified by Voltaire, MPI-1 part
#---
# Date : Tue Jul 15 15:11:30 2008
# Machine : x86_64
# System : Linux
# Release : 2.6.16.46-0.12-smp
# Version : #1 SMP Thu May 17 14:00:09 UTC 2007
# MPI Version : 2.0
# MPI Thread Environment: MPI_THREAD_SINGLE

#
# Minimum message length in bytes: 0
# Maximum message length in bytes: 67108864
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong
[witch3:32461] *** Process received signal ***
[witch3:32461] Signal: Segmentation fault (11)
[witch3:32461] Signal code: Address not mapped (1)
[witch3:32461] Failing at address: 0x20
[witch3:32461] [ 0] /lib64/libpthread.so.0 [0x2b514fcedc10]
[witch3:32461] [ 1] 
/home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_pml_ob1.so 
[0x2b51510b416a]
[witch3:32461] [ 2] 
/home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_pml_ob1.so 
[0x2b51510b4661]
[witch3:32461] [ 3] 
/home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_pml_ob1.so 
[0x2b51510b180e]
[witch3:32461] [ 4] 
/home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_btl_openib.so 
[0x2b5151811c22]
[witch3:32461] [ 5] 
/home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_btl_openib.so 
[0x2b51518132e9]
[witch3:32461] [ 6] 
/home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_bml_r2.so 
[0x2b51512c412f]
[witch3:32461] [ 7] 
/home/USERS/lenny/OMPI_ORTE_18850/lib/libopen-pal.so.0(opal_progress+0x5a) 
[0x2b514f71268a]
[witch3:32461] [ 8] 
/home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_pml_ob1.so 
[0x2b51510af0f5]
[witch3:32461] [ 9] 
/home/USERS/lenny/OMPI_ORTE_18850/lib/libmpi.so.0(PMPI_Recv+0x13b) 
[0x2b514f47941b]

[witch3:32461] [10] ./IMB-MPI1_18850(IMB_pingpong+0x1a1) [0x4073cd]
[witch3:32461] [11] ./IMB-MPI1_18850(IMB_warm_up+0x2d) [0x405e49]
[witch3:32461] [12] ./IMB-MPI1_18850(main+0x394) [0x4034d4]
[witch3:32461] [13] /lib64/libc.so.6(__libc_start_main+0xf4) 
[0x2b514fe14154]

[witch3:32461] [14] ./IMB-MPI1_18850 [0x4030a9]
[witch3:32461] *** End of error message ***
mpirun: killing job...

--
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--
  witch2
  witch3


On 7/15/08, Pavel Shamis (Pasha) wrote:



It looks like a new issue to me, Pasha. Possibly a side
consequence of the
IOF change made by Jeff and I the other day. From what I can
see, it looks
like your app was a simple "hello" - correct?
 


Yep, it is a simple hello application.

If you look at the error, the problem occurs when mpirun is
trying to route
a message. Since the app is clearly running at this time, the
problem is
probably in the IOF. The error message shows that mpirun is
attempting to
route a message to a jobid that doesn't exist. We have a test
in the RML
that forces an "abort" if that occurs.

I would guess that there is either a race condition or memory
corruption
occurring somewhere, but I have no idea where.

This may be the "new hole in the dyke" I cautioned about in
earlier notes
regarding the IOF... :-)

Still, given that this hits rarely, it probably is a more
acceptable bug to
leave in the code than the one we just fixed (duplicated stdin)...
 


It is not such a rare issue; there were 19 failures in my MTT run
(http://www.open-mpi.org/mtt/index.php?do_redir=765).

Pasha

Ralph



On 7/14/08 1:11 AM, "Pavel Shamis (Pasha)" wrote:

 


Please see http://www.open-mpi.org/mtt/index.php?do_redir=764

The error is not consistent. It takes a lot of iterations
to reproduce it.
In my MTT testing I have seen it a few times.

Is it a known issue?

Regards,
Pasha
   





Re: [OMPI devel] IBCM error

2008-07-15 Thread Pavel Shamis (Pasha)



Guess what - we don't always put them out there because - tada - we don't
use them! What goes out on the backend is a stripped down version of
libraries we require. Given the huge number of libraries people provide
(looking at the bigger, beyond OMPI picture), it consumes a lot of limited
disk space to install every library on every node. So sometimes we build our
own rpm's to pick up only what we need.

As long as --without-rdmacm --without-ibcm are present, then we are happy.

  

FYI
I recently added options that allow enable/disable all the *cm stuff:

 --enable-openib-ibcmEnable Open Fabrics IBCM support in openib BTL
 (default: enabled)
 --enable-openib-rdmacm  Enable Open Fabrics RDMACM support in openib BTL
 (default: enabled)
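
So disabling them at configure time should just be the usual autoconf
negation, e.g. (illustrative):

  ./configure --disable-openib-ibcm --disable-openib-rdmacm ...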




Re: [OMPI devel] IBCM error

2008-07-15 Thread Ralph Castain



On 7/15/08 5:05 AM, "Jeff Squyres"  wrote:

> On Jul 14, 2008, at 3:04 PM, Ralph H. Castain wrote:
> 
>> I've been quietly following this discussion, but now feel a need to
>> jump
>> in here. I really must disagree with the idea of building either
>> IBCM or
>> RDMACM support by default. Neither of these has been proven to
>> reliably
>> work, or to be advantageous. Our own experiences in testing them
>> have been
>> slightly negative at best. When they did work, they were slower, didn't
>> scale well, and were unreliable.
> 
> Minor clarification: we did not test RDMACM on RoadRunner.

Just for further clarification - I did, and it wasn't a particularly good
experience. Encountered several problems, none of them overwhelming, hence
my comments.

> 
> We only tested IBCM at scale (not RDMACM) and ran into a variety of
> issues -- most of which were bugs in Open MPI's use of IBCM -- that
> culminated in the ib_cm_listen() problem.  That problem is currently
> unsolved, and I agree that it unfortunately currently makes OMPI's
> IBCM support fairly useless.  Bonk.
> 
> IBCM was thought to be a nice thing: a cheap/fast way to make IB
> connections that would get OOB out of the picture.  If the
> ib_cm_listen() problem is fixed, it may still be (Sean had an
> interesting suggestion; we'll see where it goes).  But I totally agree
> that it is somewhat of an unknown quantity at this point.  I also
> agree that the IBCM support in OMPI is not *necessary* because OOB
> works just fine (especially with the scalability improvements in v1.3).
> 
> RDMACM, on the other hand, is *necessary* for iWARP connections.  We
> know it won't scale well because of ARP issues, to which the iWARP
> vendors are publishing their own solutions (pre-populating ARP caches,
> etc.).  Even when built and installed, RDMACM will not be used by
> default for IB hardware (you have to specifically ask for it).  Since
> it's necessary for iWARP, I think we need to build and install it by
> default.  Most importantly: production IB users won't be disturbed.

If it is necessary for iWARP, then fine - so long as it is only used if
specifically requested.

However, I would also ask that we be able to -not- build it upon request so
we can be certain a user doesn't attempt to use it by mistake ("gee, that
looks interesting - let Mikey try it!"). Ditto for ibcm support.

This way, we can experiment with it and continue to learn the problems
without forcing our production people to deal with problem tickets because a
user tried something that has known problems.

> 
>> I'm not trying to rain on anyone's parade. These are worthwhile in the
>> long term. However, they clearly need further work to be "ready for
>> prime
>> time".
>> 
>> Accordingly, I would recommend that they -only- be built if
>> specifically
>> requested. Remember, most of our users just build blindly. It makes no
>> sense to have them build support for what can only be classed as an
>> experimental capability at this time.
> 
> I defer to Mellanox for a decision about the IBCM CPC.
> 
> But for the RDMACM, per above, I am still in favor of building and
> installing it by default.

Like I said, no problem - but give me a configure option so I can -not-
build it too.

> 
>> Also, note that the OFED install is less-than-reliable wrt IBCM and
>> RDMACM.
> 
> True; the OFED install is less-than-reliable w.r.t. IBCM per the
> previously-discussed issue of not necessarily creating the /dev/
> infiniband/ucm* devices.  There's a ticket open on the OpenFabrics
> bugzilla about it.  I wish it would get fixed.  :-)
> 
> But I've not seen any problems with OFED's RDMACM installation.
> 
> The only issue I've seen with RDMACM is when sites consciously chose
> to put the RDMACM libraries and/or modules on the head node (and
> therefore OMPI built support for it), but then chose not put them out
> on back-end compute nodes.  Keep in mind that this is *not* the
> default OFED installation pattern -- a human has to go manually modify
> a file to make it do that (I don't believe that there's even a menu
> option for that installation mode; you have to go hand-edit an OFED
> installation configuration file or simply choose not to put / remove
> certain RPMs out on back-end nodes).

Guess what - we don't always put them out there because - tada - we don't
use them! What goes out on the backend is a stripped down version of
libraries we require. Given the huge number of libraries people provide
(looking at the bigger, beyond OMPI picture), it consumes a lot of limited
disk space to install every library on every node. So sometimes we build our
own rpm's to pick up only what we need.

As long as --without-rdmacm --without-ibcm are present, then we are happy.

> 
>> We have spent considerable time chasing down installation problems
>> that allowed the system to build, but then caused it to crash-and-
>> burn if
>> we attempted to use it. We have gained experience at knowing when/
>> where 

Re: [OMPI devel] IBCM error

2008-07-15 Thread Jeff Squyres

On Jul 14, 2008, at 3:04 PM, Ralph H. Castain wrote:

I've been quietly following this discussion, but now feel a need to  
jump
in here. I really must disagree with the idea of building either  
IBCM or
RDMACM support by default. Neither of these has been proven to  
reliably
work, or to be advantageous. Our own experiences in testing them  
have been

slightly negative at best. When they did work, they were slower, didn't
scale well, and were unreliable.


Minor clarification: we did not test RDMACM on RoadRunner.

We only tested IBCM at scale (not RDMACM) and ran into a variety of  
issues -- most of which were bugs in Open MPI's use of IBCM -- that  
culminated in the ib_cm_listen() problem.  That problem is currently  
unsolved, and I agree that it unfortunately currently makes OMPI's  
IBCM support fairly useless.  Bonk.


IBCM was thought to be a nice thing: a cheap/fast way to make IB  
connections that would get OOB out of the picture.  If the  
ib_cm_listen() problem is fixed, it may still be (Sean had an  
interesting suggestion; we'll see where it goes).  But I totally agree  
that it is somewhat of an unknown quantity at this point.  I also  
agree that the IBCM support in OMPI is not *necessary* because OOB  
works just fine (especially with the scalability improvements in v1.3).


RDMACM, on the other hand, is *necessary* for iWARP connections.  We  
know it won't scale well because of ARP issues, to which the iWARP  
vendors are publishing their own solutions (pre-populating ARP caches,  
etc.).  Even when built and installed, RDMACM will not be used by  
default for IB hardware (you have to specifically ask for it).  Since  
it's necessary for iWARP, I think we need to build and install it by  
default.  Most importantly: production IB users won't be disturbed.
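
(For the record, "specifically ask for it" means selecting the CPC at
run time, roughly along the lines of

  mpirun --mca btl_openib_cpc_include rdmacm ...

-- illustrative; the exact parameter name should be checked against the
current openib BTL.)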



I'm not trying to rain on anyone's parade. These are worthwhile in the
long term. However, they clearly need further work to be "ready for  
prime

time".

Accordingly, I would recommend that they -only- be built if  
specifically

requested. Remember, most of our users just build blindly. It makes no
sense to have them build support for what can only be classed as an
experimental capability at this time.


I defer to Mellanox for a decision about the IBCM CPC.

But for the RDMACM, per above, I am still in favor of building and  
installing it by default.



Also, note that the OFED install is less-than-reliable wrt IBCM and
RDMACM.


True; the OFED install is less-than-reliable w.r.t. IBCM per the  
previously-discussed issue of not necessarily creating the /dev/ 
infiniband/ucm* devices.  There's a ticket open on the OpenFabrics  
bugzilla about it.  I wish it would get fixed.  :-)


But I've not seen any problems with OFED's RDMACM installation.

The only issue I've seen with RDMACM is when sites consciously chose  
to put the RDMACM libraries and/or modules on the head node (and  
therefore OMPI built support for it), but then chose not put them out  
on back-end compute nodes.  Keep in mind that this is *not* the  
default OFED installation pattern -- a human has to go manually modify  
a file to make it do that (I don't believe that there's even a menu  
option for that installation mode; you have to go hand-edit an OFED  
installation configuration file or simply choose not to put / remove  
certain RPMs out on back-end nodes).



We have spent considerable time chasing down installation problems
that allowed the system to build, but then caused it to crash-and- 
burn if
we attempted to use it. We have gained experience at knowing when/ 
where to
look now, but that doesn't lessen the reputation impact OMPI is  
getting as

a "buggy, cantankerous beast" according to our sys admins.


Isn't the whole point of pre-release test versions to find and fix
such bugs?  ;-)



Not a reputation we should be encouraging.

Turning this off by default allows those more adventurous souls to  
explore
this capability, while letting our production-oriented customers  
install

and run in peace.



Pasha was recommending that IBCM be built by default *but not used by  
default*.  So production users would still be able to run in peace --  
OOB will still be the default.  I see it pretty much like SLURM  
support: it's built by default, but it won't activate itself unless  
relevant.  But like I said above, I defer to Mellanox for IBCM.  :-)


Just my $0.002...

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] IBCM error

2008-07-15 Thread Pavel Shamis (Pasha)


I need to check on this.  You may want to look at section A3.2.3 of 
the spec.
If you set the first byte (network order) to 0x00, and the 2nd byte 
to 0x01,
then you hit a 'reserved' range that probably isn't being used 
currently.


If you don't care what the service ID is, you can specify 0, and the 
kernel will
assign one.  The assigned value can be retrieved by calling 
ib_cm_attr_id().

(I'm assuming that you communicate the IDs out of band somehow.)



Ok; we'll need to check into this.  I don't remember the ordering -- 
we might actually be communicating the IDs before calling 
ib_cm_listen() (since we were simply using the PIDs, we could do that).


Thanks for the tip!  Pasha -- can you look into this?
It looks like the modex message is prepared during the query stage, so
the order looks OK.
Unfortunately, on my machines the ibcm module does not create
"/dev/infiniband/ucm*" so I cannot test the functionality.


Regards,
Pasha.



Re: [OMPI devel] Segfault in 1.3 branch

2008-07-15 Thread Pavel Shamis (Pasha)



It looks like a new issue to me, Pasha. Possibly a side consequence of the
IOF change made by Jeff and I the other day. From what I can see, it looks
like your app was a simple "hello" - correct?
  

Yep, it is a simple hello application.

If you look at the error, the problem occurs when mpirun is trying to route
a message. Since the app is clearly running at this time, the problem is
probably in the IOF. The error message shows that mpirun is attempting to
route a message to a jobid that doesn't exist. We have a test in the RML
that forces an "abort" if that occurs.

I would guess that there is either a race condition or memory corruption
occurring somewhere, but I have no idea where.

This may be the "new hole in the dyke" I cautioned about in earlier notes
regarding the IOF... :-)

Still, given that this hits rarely, it probably is a more acceptable bug to
leave in the code than the one we just fixed (duplicated stdin)...
  
It is not such a rare issue; there were 19 failures in my MTT run
(http://www.open-mpi.org/mtt/index.php?do_redir=765).


Pasha

Ralph



On 7/14/08 1:11 AM, "Pavel Shamis (Pasha)"  wrote:

  

Please see http://www.open-mpi.org/mtt/index.php?do_redir=764

The error is not consistent. It takes a lot of iterations to reproduce it.
In my MTT testing I have seen it a few times.

Is it a known issue?

Regards,
Pasha



