Re: [OMPI devel] RFC: sm Latency

2009-01-20 Thread Patrick Geoffray
Eugene,

All my remarks are related to the receive side. I think the send side
optimizations are fine, but don't take my word for it.

Eugene Loh wrote:
> To recap:
> 1) The work is already done.

How do you do "directed polling" with ANY_TAG ? How do you ensure you
check all incoming queues from time to time to prevent flow control
(especially if the queues are small for scaling) ? What about the
one-sided that Brian mentioned where there is no corresponding receive
to tell you which queue to poll ?
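
(For concreteness, a minimal sketch of the bookkeeping such "directed
polling" implies, with a periodic sweep to keep the other queues drained;
the names, the stub FIFO type, and the sweep interval below are hypothetical
illustrations, not the actual sm BTL code.)

  #include <stdbool.h>
  #include <stddef.h>

  /* Stand-ins only; the real per-peer FIFOs and drain logic live in the BTL. */
  typedef struct { int unused; } sm_fifo_t;

  static bool poll_one_fifo(sm_fifo_t *fifo)
  {
      (void)fifo;
      return false;            /* a real version would dequeue pending fragments */
  }

  #define SWEEP_INTERVAL 128   /* arbitrary: how often to visit every queue */

  /* Poll the queue named by a posted receive, but periodically sweep all
   * queues so a sender blocked on a small, full FIFO is never starved. */
  bool progress_directed(sm_fifo_t **fifos, size_t nfifos, int posted_src)
  {
      static unsigned calls = 0;
      bool progress = false;

      if (posted_src >= 0) {                      /* a specific source was named */
          progress = poll_one_fifo(fifos[posted_src]);
      }
      if (posted_src < 0 || ++calls % SWEEP_INTERVAL == 0) {
          for (size_t i = 0; i < nfifos; ++i) {   /* ANY_SOURCE, or sweep time */
              progress = poll_one_fifo(fifos[i]) || progress;
          }
      }
      return progress;
  }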

If you want to handle all the constraints, a single-queue model is much
less work in the end, IMHO.

> 2) The single-queue model addresses only one of the RFC's issues.

The single-queue model addresses not only the latency overhead when
scaling, but also the exploding memory footprint. In many ways, these
problems are the same that plagued the RDMA QP model, and the only
solution was using shared receive queues.

From experience, the linear overhead of polling N queues very quickly
becomes greater than all the optimizations you can do on the send side.

> 3) I'm a fan of the single-queue model, but it's just a separate discussion.

No problem. You are the one doing the real work here, the rest is
armchair quarterbacking :-)

Patrick


Re: [OMPI devel] RFC: sm Latency

2009-01-20 Thread Richard Graham



On 1/20/09 8:53 PM, "Jeff Squyres"  wrote:

> This all sounds really great to me.  I agree with most of what has
> been said -- e.g., benchmarks *are* important.  Improving them can
> even sometimes have the side effect of improving real applications.  ;-)
> 
> My one big concern is the moving of architectural boundaries of making
> the btl understand MPI match headers.  But even there, I'm torn:
> 
> 1. I understand why it is better -- performance-wise -- to do this.
> And the performance improvement results are hard to argue with.  We
> took a similar approach with ORTE; ORTE is now OMPI-specific, and
> many, many things have become better (from the OMPI perspective, at
> least).
> 
> 2. We all have the knee-jerk reaction that we don't want to have the
> BTLs know anything about MPI semantics because they've always been
> that way and it has been a useful abstraction barrier.  Now there's
> even a project afoot to move the BTLs out into a separate layer that
> cannot know about MPI (so that other things can be built upon it).
> But are we sacrificing potential MPI performance here?  I think that's
> one important question.
> 
> Eugene: you mentioned that there are other possibilities to having the
> BTL understand match headers, such as a callback into the PML.  Have
> you tried this approach to see what the performance cost would be,
> perchance?

How is this different from the way matching is done today ?

Rich
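
One hypothetical shape for the "callback into the PML" alternative Jeff
mentions is sketched below; the names and signature are invented for
illustration and are not the existing PML/BTL interfaces.

  #include <stdbool.h>
  #include <stddef.h>

  /* Hypothetical: the PML registers this probe with the BTL at init time.
   * The BTL never interprets the match header itself; it only learns
   * whether a pre-posted receive matches and the fast path can be taken. */
  typedef bool (*pml_match_probe_fn_t)(const void *match_hdr, size_t hdr_len,
                                       void **recv_req_out);

  static pml_match_probe_fn_t pml_match_probe = NULL;

  void sm_btl_register_match_probe(pml_match_probe_fn_t fn)
  {
      pml_match_probe = fn;
  }

  /* Called from the BTL progress loop when a fragment arrives (sketch). */
  int sm_btl_try_fast_recv(const void *frag, size_t len)
  {
      void *req = NULL;

      if (NULL != pml_match_probe && pml_match_probe(frag, len, &req)) {
          /* matched a pre-posted receive: deliver straight into req here */
          return 0;
      }
      return -1;   /* decline: fall back to the normal PML receive path */
  }

Compared with teaching the BTL the match-header layout, the abstraction
stays intact and the price is one extra indirect call per fragment, which is
presumably what the question about performance cost is trying to quantify.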

> 
> I'd like to see George's reaction to this RFC, and Brian's (if he has
> time).
> 
> 
> On Jan 20, 2009, at 8:04 PM, Eugene Loh wrote:
> 
>> Patrick Geoffray wrote:
>> 
>>> Eugene Loh wrote:
>>> 
>>> 
> replace the fifo's with a single link list per process in shared
> memory, with senders to this process adding match envelopes
> atomically, with each process reading its own link list (multiple
> 
> 
 *) Doesn't strike me as a "simple" change.
 
 
>>> Actually, it's much simpler than trying to optimize/scale the N^2
>>> implementation, IMHO.
>>> 
>>> 
>> 1) The version I talk about is already done. Check my putbacks.
>> "Already
>> done" is easier! :^)
>> 
>> 2) The two ideas are largely orthogonal. The RFC talks about a variety
>> of things: cleaning up the sendi function, moving the sendi call up
>> higher in the PML, bypassing the PML receive-request structure
>> (similar
>> to sendi), and stream-lining the data convertors in common cases. Only
>> one part of the RFC (directed polling) overlaps with having a single
>> FIFO per receiver.
>> 
 *) Not sure this addresses all-to-all well.  E.g., let's say you
 post a
 receive for a particular source.  Do you then wade through a long
 FIFO
 to look for your match?
 
 
>>> The tradeoff is between demultiplexing by the sender, which costs in
>>> time
>>> and in space, or by the receiver, which costs an atomic inc. ANY_TAG
>>> forces you to demultiplex on the receive side anyway. Regarding
>>> all-to-all, it won't be more expensive if the receives are pre-
>>> posted,
>>> and they should be.
>>> 
>>> 
>> Not sure I understand this paragraph. I do, however, think there are
>> great benefits to the single-receiver-queue model. It implies
>> congestion
>> on the receiver side in the many-to-one case, but if a single receiver
>> is reading all those messages anyhow, message-processing is already
>> going to throttle the message rate. The extra "bottleneck" at the FIFO
>> might never be seen.
>> 
 What the RFC talks about is not the last SM development we'll ever
 need.  It's only supposed to be one step forward from where we are
 today.  The "single queue per receiver" approach has many
 advantages,
 but I think it's a different topic.
 
 
>>> But is this intermediate step worth it or should we (well,
>>> you :-) ) go
>>> directly for the single queue model ?
>>> 
>> To recap:
>> 1) The work is already done.
>> 2) The single-queue model addresses only one of the RFC's issues.
>> 3) I'm a fan of the single-queue model, but it's just a separate
>> discussion.
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 




Re: [OMPI devel] RFC: sm Latency

2009-01-20 Thread Richard Graham



On 1/20/09 2:08 PM, "Eugene Loh"  wrote:

> Richard Graham wrote:
>> First, the performance improvements look
>> really nice.
>> A few questions:
>>   - How much of an abstraction violation does this introduce?
> Doesn't need to be much of an abstraction violation at all if, by that, we
> mean teaching the BTL about the match header.  Just need to make some choices
> and I flagged that one for better visibility.
> 
>>> >> I really don't see how teaching the btl about matching will help much (it
>>> will save a subroutine call).  As I understand
>>> >> the proposal you aim to selectively pull items out of the fifo's – this
>>> will break the fifo's, as they assume contiguous
>>> >> entries.  Logic to manage holes will need to be added.
> 
>> This looks like the btl needs to start "knowing" about MPI level semantics.
> That's one option.  There are other options.
> 
>>> >> Such as ?
> 
>> Currently, the btl purposefully is ulp agnostic.
> What's ULP?
>>>  >>  Upper Level Protocol
> 
>> I ask for 2 reasons
>>- you mention having the btl look at the match header (if I understood
>> correctly)
>>  
> Right, both to know if there is a match when the user had MPI_ANY_TAG and to
> extract values to populate the MPI_Status variable.  There are other
> alternatives, like calling back the PML.
>>- not clear to me what you mean by returning the header to the list if
>> the irecv does not complete.  If it does not complete, why not just pass the
>> header back for further processing, if all this is happening at the pml level
>> ?
>>  
> I was trying to read the FIFO to see what's on there.  If it's something we
> can handle, we take it and handle it.  If it's too complicated, then we just
> balk and tell the upper layer that we're declining any possible action.  That
> just seemed to me to be the cleanest approach.
> 
>>> >> see the note above.  The fifo logic would have to be changed to manage
>>> non-contiguous entries.
> 
> Here's an analogy.  Let's say you have a house problem.  You don't know how
> bad it is.  You think you might have to hire an expensive contractor to do
> lots of work, but some local handyman thinks he can do it quickly and cheaply.
> So, you have the handyman look at the job to decide how involved it is.  Let's
> say the handyman discovers that it is, indeed, a big job.  How would you like
> things left at this point? Two options:
> 
> *) Handyman says this is a big job.  Bring in the expensive contractor and big
> equipment.  I left everything as I found it.  Or,
> 
> *) Handyman says, "I took apart the this-and-this and I bought a bunch of
> supplies.  I ripped out the south wall.  The water to the house is turned off.
> Etc."  You (and whoever has to come in to actually do the work) would probably
> prefer that nothing had been started.
> 
> I thought it was cleaner to go the "do the whole job or don't do any of it"
> approach.
>>   - The measurements seem to be very dual process specific.  Have you looked
>> at the impact of these changes on other applications at the same process
>> count ?  "Real" apps would be interesting, but even hpl would be a good
>> start. 
>>  
> Many measurements are for np=2.  There are also np>2 HPCC pingpong results
> though.  (HPCC pingpong measures pingpong between two processes while np-2
> processes sit in wait loops.)  HPCC also measures "ring" results... these are
> point-to-point with all np processes working.
> 
> HPL is pretty insensitive to point-to-point performance.  It either shows
> basically DGEMM performance or something is broken.
> 
> I haven't looked at "real" apps.
> 
> Let me be blunt about one thing:  much of this is motivated by simplistic
> (HPCC) benchmarks.  This is for two reasons:
> 
> 1) These benchmarks are highly visible.
> 2) It's hard to tune real apps when you know the primitives need work.
> 
> Looking at real apps is important and I'll try to get to that.
> 
>>> >> don't disagree here at all.  Just want to make sure that aiming at these
>>> important benchmarks does not
>>> >> harm what is really more important – the day to day use.
> 
>>   The current sm implementation is aimed only at small smp node count, which
>> was really the only relevant type of systems when this code was written 5
>> years ago.  For large core counts there is a rather simple change that could
>> be put in that is simple to implement, and will give you flat scaling for the
>> sort of tests you are running.  If you replace the fifo's with a single link
>> list per process in shared memory, with senders to this process adding match
>> envelopes atomically, with each process reading its own link list (multiple
>> writers and single reader in non-threaded situation) there will be only one
>> place to poll, regardless of the number of procs involved in the run.  One
>> still needs other optimizations to lower the absolute latency – perhaps what
>> you have suggested.  If one really has all 

Re: [OMPI devel] RFC: sm Latency

2009-01-20 Thread Brian Barrett
I unfortunately don't have time to look in depth at the patch.  But my  
concern is that currently (today, not at some made up time in the  
future, maybe), we use the BTLs for more than just MPI point-to- 
point.  The rdma one-sided component (which was added for 1.3 and  
hopefully will be the default for 1.4) sends messages directly over  
the btls.  It would be interesting to know how that is handled.


Brian


On Jan 20, 2009, at 6:53 PM, Jeff Squyres wrote:

This all sounds really great to me.  I agree with most of what has  
been said -- e.g., benchmarks *are* important.  Improving them can  
even sometimes have the side effect of improving real  
applications.  ;-)


My one big concern is the moving of architectural boundaries of  
making the btl understand MPI match headers.  But even there, I'm  
torn:


1. I understand why it is better -- performance-wise -- to do this.   
And the performance improvement results are hard to argue with.  We  
took a similar approach with ORTE; ORTE is now OMPI-specific, and  
many, many things have become better (from the OMPI perspective, at  
least).


2. We all have the knee-jerk reaction that we don't want to have the  
BTLs know anything about MPI semantics because they've always been  
that way and it has been a useful abstraction barrier.  Now there's  
even a project afoot to move the BTLs out into a separate layer that
cannot know about MPI (so that other things can be built upon it).   
But are we sacrificing potential MPI performance here?  I think  
that's one important question.


Eugene: you mentioned that there are other possibilities to having  
the BTL understand match headers, such as a callback into the PML.   
Have you tried this approach to see what the performance cost would  
be, perchance?


I'd like to see George's reaction to this RFC, and Brian's (if he  
has time).



On Jan 20, 2009, at 8:04 PM, Eugene Loh wrote:


Patrick Geoffray wrote:


Eugene Loh wrote:



replace the fifo’s with a single link list per process in shared
memory, with senders to this process adding match envelopes
atomically, with each process reading its own link list (multiple



*) Doesn't strike me as a "simple" change.



Actually, it's much simpler than trying to optimize/scale the N^2
implementation, IMHO.


1) The version I talk about is already done. Check my putbacks.  
"Already

done" is easier! :^)

2) The two ideas are largely orthogonal. The RFC talks about a  
variety

of things: cleaning up the sendi function, moving the sendi call up
higher in the PML, bypassing the PML receive-request structure  
(similar
to sendi), and stream-lining the data convertors in common cases.  
Only

one part of the RFC (directed polling) overlaps with having a single
FIFO per receiver.

*) Not sure this addresses all-to-all well.  E.g., let's say you  
post a
receive for a particular source.  Do you then wade through a long  
FIFO

to look for your match?


The tradeoff is between demultiplexing by the sender, which costs
in time

and in space, or by the receiver, which costs an atomic inc. ANY_TAG
forces you to demultiplex on the receive side anyway. Regarding
all-to-all, it won't be more expensive if the receives are pre- 
posted,

and they should be.



Not sure I understand this paragraph. I do, however, think there are
great benefits to the single-receiver-queue model. It implies  
congestion
on the receiver side in the many-to-one case, but if a single  
receiver

is reading all those messages anyhow, message-processing is already
going to throttle the message rate. The extra "bottleneck" at the  
FIFO

might never be seen.


What the RFC talks about is not the last SM development we'll ever
need.  It's only supposed to be one step forward from where we are
today.  The "single queue per receiver" approach has many  
advantages,

but I think it's a different topic.


But is this intermediate step worth it or should we (well,  
you :-) ) go

directly for the single queue model ?


To recap:
1) The work is already done.
2) The single-queue model addresses only one of the RFC's issues.
3) I'm a fan of the single-queue model, but it's just a separate  
discussion.

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel






Re: [OMPI devel] RFC: sm Latency

2009-01-20 Thread Jeff Squyres

On Jan 20, 2009, at 8:53 PM, Jeff Squyres wrote:

This all sounds really great to me.  I agree with most of what has  
been said -- e.g., benchmarks *are* important.  Improving them can  
even sometimes have the side effect of improving real  
applications.  ;-)


My one big concern is the moving of architectural boundaries of  
making the btl understand MPI match headers.  But even there, I'm  
torn:


1. I understand why it is better -- performance-wise -- to do this.   
And the performance improvement results are hard to argue with.  We  
took a similar approach with ORTE; ORTE is now OMPI-specific, and  
many, many things have become better (from the OMPI perspective, at  
least).


2. We all have the knee-jerk reaction that we don't want to have the  
BTLs know anything about MPI semantics because they've always been  
that way and it has been a useful abstraction barrier.  Now there's  
even a project afoot to move the BTLs out into a separate layer that
cannot know about MPI (so that other things can be built upon it).   
But are we sacrificing potential MPI performance here?  I think  
that's one important question.


Eugene: you mentioned that there are other possibilities to having  
the BTL understand match headers, such as a callback into the PML.   
Have you tried this approach to see what the performance cost would  
be, perchance?


I neglected to say: the point of asking this question is an attempt to  
quantify the performance gain of having the BTL understand the match  
header.  Specifically: is it a noticeable/important performance gain  
to have change our age-old abstraction barrier?  Or is another  
approach just as good, performance-wise?


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] RFC: sm Latency

2009-01-20 Thread Jeff Squyres
This all sounds really great to me.  I agree with most of what has  
been said -- e.g., benchmarks *are* important.  Improving them can  
even sometimes have the side effect of improving real applications.  ;-)


My one big concern is the moving of architectural boundaries of making  
the btl understand MPI match headers.  But even there, I'm torn:


1. I understand why it is better -- performance-wise -- to do this.   
And the performance improvement results are hard to argue with.  We  
took a similar approach with ORTE; ORTE is now OMPI-specific, and  
many, many things have become better (from the OMPI perspective, at  
least).


2. We all have the knee-jerk reaction that we don't want to have the  
BTLs know anything about MPI semantics because they've always been  
that way and it has been a useful abstraction barrier.  Now there's  
even a project afoot to move the BTLs out into a separate layer that
cannot know about MPI (so that other things can be built upon it).   
But are we sacrificing potential MPI performance here?  I think that's  
one important question.


Eugene: you mentioned that there are other possibilities to having the  
BTL understand match headers, such as a callback into the PML.  Have  
you tried this approach to see what the performance cost would be,  
perchance?


I'd like to see George's reaction to this RFC, and Brian's (if he has  
time).



On Jan 20, 2009, at 8:04 PM, Eugene Loh wrote:


Patrick Geoffray wrote:


Eugene Loh wrote:



replace the fifo’s with a single link list per process in shared
memory, with senders to this process adding match envelopes
atomically, with each process reading its own link list (multiple



*) Doesn't strike me as a "simple" change.



Actually, it's much simpler than trying to optimize/scale the N^2
implementation, IMHO.


1) The version I talk about is already done. Check my putbacks.  
"Already

done" is easier! :^)

2) The two ideas are largely orthogonal. The RFC talks about a variety
of things: cleaning up the sendi function, moving the sendi call up
higher in the PML, bypassing the PML receive-request structure  
(similar

to sendi), and stream-lining the data convertors in common cases. Only
one part of the RFC (directed polling) overlaps with having a single
FIFO per receiver.

*) Not sure this addresses all-to-all well.  E.g., let's say you  
post a
receive for a particular source.  Do you then wade through a long  
FIFO

to look for your match?


The tradeoff is between demultiplexing by the sender, which costs in
time

and in space, or by the receiver, which costs an atomic inc. ANY_TAG
forces you to demultiplex on the receive side anyway. Regarding
all-to-all, it won't be more expensive if the receives are pre- 
posted,

and they should be.



Not sure I understand this paragraph. I do, however, think there are
great benefits to the single-receiver-queue model. It implies  
congestion

on the receiver side in the many-to-one case, but if a single receiver
is reading all those messages anyhow, message-processing is already
going to throttle the message rate. The extra "bottleneck" at the FIFO
might never be seen.


What the RFC talks about is not the last SM development we'll ever
need.  It's only supposed to be one step forward from where we are
today.  The "single queue per receiver" approach has many  
advantages,

but I think it's a different topic.


But is this intermediate step worth it or should we (well,  
you :-) ) go

directly for the single queue model ?


To recap:
1) The work is already done.
2) The single-queue model addresses only one of the RFC's issues.
3) I'm a fan of the single-queue model, but it's just a separate  
discussion.

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems




Re: [OMPI devel] RFC: sm Latency

2009-01-20 Thread Eugene Loh
Patrick Geoffray wrote:

>Eugene Loh wrote:
>  
>
>>>replace the fifo’s with a single link list per process in shared 
>>>memory, with senders to this process adding match envelopes 
>>>atomically, with each process reading its own link list (multiple 
>>>  
>>>
>>*) Doesn't strike me as a "simple" change.
>>
>>
>Actually, it's much simpler than trying to optimize/scale the N^2
>implementation, IMHO.
>  
>
1) The version I talk about is already done. Check my putbacks. "Already
done" is easier! :^)

2) The two ideas are largely orthogonal. The RFC talks about a variety
of things: cleaning up the sendi function, moving the sendi call up
higher in the PML, bypassing the PML receive-request structure (similar
to sendi), and stream-lining the data convertors in common cases. Only
one part of the RFC (directed polling) overlaps with having a single
FIFO per receiver.
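
As a rough illustration of the send-immediate (sendi) fast path mentioned
above -- not the actual mca_btl prototype, and with an invented slot
structure and eager limit:

  #include <string.h>
  #include <stddef.h>

  /* Invented shared-memory slot; the real sm BTL fragments look different. */
  #define SM_EAGER_LIMIT 256

  typedef struct {
      char   data[SM_EAGER_LIMIT];
      size_t len;
      volatile int in_use;
  } sm_slot_t;

  /* Try to send "immediately": if the message is small and a slot is free,
   * write header + payload in one shot and succeed; otherwise decline so
   * the caller falls back to the regular PML send path. */
  int sm_sendi_sketch(sm_slot_t *slot, const void *buf, size_t len)
  {
      if (len > SM_EAGER_LIMIT || slot->in_use) {
          return -1;                 /* decline */
      }
      memcpy(slot->data, buf, len);  /* a real sendi also writes the match
                                        header and issues a store fence */
      slot->len = len;
      slot->in_use = 1;              /* publish the fragment to the receiver */
      return 0;
  }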

>>*) Not sure this addresses all-to-all well.  E.g., let's say you post a 
>>receive for a particular source.  Do you then wade through a long FIFO 
>>to look for your match?
>>
>>
>The tradeoff is between demultiplexing by the sender, which costs in time
>and in space, or by the receiver, which costs an atomic inc. ANY_TAG
>forces you to demultiplex on the receive side anyway. Regarding
>all-to-all, it won't be more expensive if the receives are pre-posted,
>and they should be.
>  
>
Not sure I understand this paragraph. I do, however, think there are
great benefits to the single-receiver-queue model. It implies congestion
on the receiver side in the many-to-one case, but if a single receiver
is reading all those messages anyhow, message-processing is already
going to throttle the message rate. The extra "bottleneck" at the FIFO
might never be seen.

>>What the RFC talks about is not the last SM development we'll ever 
>>need.  It's only supposed to be one step forward from where we are 
>>today.  The "single queue per receiver" approach has many advantages, 
>>but I think it's a different topic.
>>
>>
>But is this intermediate step worth it or should we (well, you :-) ) go
>directly for the single queue model ?
>
To recap:
1) The work is already done.
2) The single-queue model addresses only one of the RFC's issues.
3) I'm a fan of the single-queue model, but it's just a separate discussion.


[OMPI devel] RFC: Adding OMPI_CHECK_WITHDIR checks

2009-01-20 Thread Jeff Squyres

What: Adding OMPI_CHECK_WITHDIR checks in various .m4 files

Why: Help prevent user errors via --with-=DIR configure options

Where: config/*m4 and */mca/*/*/configure.m4 files, affecting the  
following environments:

- bproc (***)
- gm (***)
- loadleveler (***)
- lsf
- mx (***)
- open fabrics
- portals (***)
- psm (***)
- tm
- udapl
- elan (***)
- sctp
- blcr (***)
- libnuma
- valgrind
===> I could not easily test the (***) environments

When: For OMPI v1.4 (could be convinced to make it for v1.3.1)

Timeout: COB Thursday, Jan 29, 2009



The intent for OMPI v1.3's new OMPI_CHECK_WITHDIR m4 macro was to fix  
a case where a user was doing the following:


  ./configure --with-openib=/path/to/nonexistent/OFED/installation

...but configure succeeded anyway because the sysadmins had installed  
OFED into /usr.  Hence, the user was getting something unexpected.


OMPI_CHECK_WITHDIR does a very basic sanity check on directories  
provided by --with-=DIR configure options.  Specifically, it  
checks if the directory exists and if a token file exists in that  
directory (specifically, it calls "ls ", so wildcards are  
acceptable).  If either of those tests fails, configure aborts with an
appropriate error message.  This macro was used in the openib BTL  
configure stuff, but we didn't add it anywhere else.  I'm now adding  
it everywhere we have a --with-=DIR, which are in various .m4  
files in the environments described above.


Here's the hg where I added OMPI_CHECK_WITHDIR to all the environments  
listed above, but was unable to test the (***) environments:


http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/ 
ompi_check_withdir/


We could bring this patch to v1.3.1 or we could wait until v1.4.  I  
don't really care either way.


I plan to bring this work into the trunk next Thursday COB; it would  
be great if those who have the (*) environments could pull down the hg  
tree before then and give it a whirl so we can fix any problems  
beforehand.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Make Error: io_romio_ad_wait.c

2009-01-20 Thread Jeff Squyres

On Jan 18, 2009, at 8:59 PM, Jeremy Espenshade wrote:

libtool: compile:  ppc_4xx-gcc -DHAVE_CONFIG_H -I. -I../../adio/ 
include -DOMPI_BUILDING=1 -I/home/jeremy/Desktop/openmpi-1.2.8/ompi/ 
mca/io/romio/romio/../../../../.. -I/home/jeremy/Desktop/ 
openmpi-1.2.8/ompi/mca/io/romio/romio/../../../../../opal/include - 
I../../../../../../../opal/include -I../../../../../../../ompi/ 
include -I/home/jeremy/Desktop/openmpi-1.2.8/ompi/mca/io/romio/romio/ 
include -I/home/jeremy/Desktop/openmpi-1.2.8/ompi/mca/io/romio/romio/ 
adio/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing - 
pthread -D_LARGEFILE64_SOURCE -D_FILE_OFFSET_BITS=64 - 
DHAVE_ROMIOCONF_H -DHAVE_ROMIOCONF_H -I../../include -MT  
io_romio_ad_wait.lo -MD -MP -MF .deps/io_romio_ad_wait.Tpo -c  
io_romio_ad_wait.c  -fPIC -DPIC -o .libs/io_romio_ad_wait.o

io_romio_ad_wait.c: In function 'ADIOI_GEN_IOComplete':
io_romio_ad_wait.c:59: warning: passing argument 1 of 'aio_suspend'  
makes pointer from integer without a cast
io_romio_ad_wait.c:59: warning: passing argument 2 of 'aio_suspend'  
makes integer from pointer without a cast
io_romio_ad_wait.c:59: error: too few arguments to function  
'aio_suspend'
io_romio_ad_wait.c:62: error: 'tmp1' undeclared (first use in this  
function)
io_romio_ad_wait.c:62: error: (Each undeclared identifier is  
reported only once

io_romio_ad_wait.c:62: error: for each function it appears in.)



This looks like a prototype mismatch with the aio_suspend() library  
function.


What is the prototype of this function on your system?  The prototype  
is the same on several systems that I have checked (RHEL4, Debian  
somethingorother with kernel 2.6.18, OS X Leopard):


 #include <aio.h>

 int
 aio_suspend(const struct aiocb *const list[], int nent,
 const struct timespec *timeout);

Is it different on your system?
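
For reference, a minimal self-contained call that matches the three-argument
prototype above; the file path is arbitrary, error handling is abbreviated,
and on glibc this links with -lrt.

  #include <aio.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      char buf[64];
      struct aiocb cb;
      const struct aiocb *list[1];

      memset(&cb, 0, sizeof(cb));
      cb.aio_fildes = open("/etc/hosts", O_RDONLY);
      cb.aio_buf    = buf;
      cb.aio_nbytes = sizeof(buf);

      if (aio_read(&cb) != 0) {          /* start the asynchronous read */
          perror("aio_read");
          return 1;
      }

      list[0] = &cb;
      aio_suspend(list, 1, NULL);        /* block until the request completes */

      printf("read %zd bytes\n", aio_return(&cb));
      close(cb.aio_fildes);
      return 0;
  }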

FWIW, I notice that in the upgraded ROMIO in the just-released OMPI  
v1.3, it doesn't use the call to aio_complete at all.  So you might  
want to try that...?


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] RFC: sm Latency

2009-01-20 Thread Patrick Geoffray
Hi Eugene,

Eugene Loh wrote:
>> replace the fifo’s with a single link list per process in shared 
>> memory, with senders to this process adding match envelopes 
>> atomically, with each process reading its own link list (multiple 


> *) Doesn't strike me as a "simple" change.

Actually, it's much simpler than trying to optimize/scale the N^2
implementation, IMHO.

> *) Not sure this addresses all-to-all well.  E.g., let's say you post a 
> receive for a particular source.  Do you then wade through a long FIFO 
> to look for your match?

The tradeoff is between demultiplexing by the sender, which costs in time
and in space, or by the receiver, which costs an atomic inc. ANY_TAG
forces you to demultiplex on the receive side anyway. Regarding
all-to-all, it won't be more expensive if the receives are pre-posted,
and they should be.

> What the RFC talks about is not the last SM development we'll ever 
> need.  It's only supposed to be one step forward from where we are 
> today.  The "single queue per receiver" approach has many advantages, 
> but I think it's a different topic.

But is this intermediate step worth it or should we (well, you :-) ) go
directly for the single queue model ?

Patrick


Re: [OMPI devel] -display-map

2009-01-20 Thread Greg Watson

Looks good now. Thanks!

Greg

On Jan 20, 2009, at 12:00 PM, Ralph Castain wrote:

I'm embarrassed to admit that I never actually implemented the xml  
option for tag-output...this has been rectified with r20302.


Let me know if that works for you - sorry for confusion.

Ralph


On Jan 20, 2009, at 8:08 AM, Greg Watson wrote:


Ralph,

The encapsulation is not quite right yet. I'm seeing this:

[1,0]n = 0
[1,1]n = 0

but it should be:

n = 0
n = 0

Thanks,

Greg

On Jan 20, 2009, at 9:20 AM, Ralph Castain wrote:

You need to add --tag-output - this is a separate option as it  
applies both to xml and non-xml situations.


If you like, I can force tag-output "on" by default whenever -xml  
is specified.


Ralph


On Jan 16, 2009, at 12:52 PM, Greg Watson wrote:


Ralph,

Is there something I need to do to enable stdout/err  
encapsulation (apart from -xml)? Here's what I see:


$ mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map - 
np 5 /Users/greg/Documents/workspace1/testMPI/Debug/testMPI


















n = 0
n = 0
n = 0
n = 0
n = 0

On Jan 15, 2009, at 1:13 PM, Ralph Castain wrote:

Okay, it is in the trunk as of r20284 - I'll file the request to  
have it moved to 1.3.1.


Let me know if you get a chance to test the stdout/err stuff in  
the trunk - we should try and iterate it so any changes can make  
1.3.1 as well.


Thanks!
Ralph


On Jan 15, 2009, at 11:03 AM, Greg Watson wrote:


Ralph,

I think the second form would be ideal and would simplify  
things greatly.


Greg

On Jan 15, 2009, at 10:53 AM, Ralph Castain wrote:

Here is what I was able to do - note that the resolve messages  
are associated with the specific hostname, not the overall map:











Will that work for you? If you like, I can remove the name=  
field from the noderesolve element since the info is specific  
to the host element that contains it. In other words, I can  
make it look like this:











if that would help.

Ralph


On Jan 14, 2009, at 7:57 AM, Ralph Castain wrote:

We -may- be able to do a more formal XML output at some  
point. The problem will be the natural interleaving of stdout/ 
err from the various procs due to the async behavior of MPI.  
Mpirun receives fragmented output in the forwarding system,  
limited by the buffer sizes and the amount of data we can  
read at any one "bite" from the pipes connecting us to the  
procs. So even though the user -thinks- they output a single  
large line of stuff, it may show up at mpirun as a series of  
fragments. Hence, it gets tricky to know how to put  
appropriate XML brackets around it.


Given this input about when you actually want resolved name  
info, I can at least do something about that area. Won't be  
in 1.3.0, but should make 1.3.1.


As for XML-tagged stdout/err: the OMPI community asked me not  
to turn that feature "on" for 1.3.0 as they felt it hasn't  
been adequately tested yet. The code is present, but cannot  
be activated in 1.3.0. However, I believe it is activated on  
the trunk when you do --xml --tagged-output, so perhaps some  
testing will help us debug and validate it adequately for  
1.3.1?


Thanks
Ralph


On Jan 14, 2009, at 7:02 AM, Greg Watson wrote:


Ralph,

The only time we use the resolved names is when we get a  
map, so we consider them part of the map output.


If quasi-XML is all that will ever be possible with 1.3,  
then you may as well leave as-is and we will attempt to  
clean it up in Eclipse. It would be nice if a future version  
of ompi could output correct XML (including stdout) as this  
would vastly simplify the parsing we need to do.


Regards,

Greg

On Jan 13, 2009, at 3:30 PM, Ralph Castain wrote:

Hmmm...well, I can't do either for 1.3.0 as it is departing  
this afternoon.


The first option would be very hard to do. I would have to  
expose the display-map option across the code base and  
check it prior to printing anything about resolving node  
names. I guess I should ask: do you only want noderesolve  
statements when we are displaying the map? Right now, I  
will output them regardless.


The second option could be done. I could check if any  
"display" option has been specified, and output the   
root at that time (likewise for the end). Anything we  
output in-between would be encapsulated between the two,  
but that would include any user output to stdout and/or  
stderr - which for 1.3.0 is not in xml.


Any thoughts?

Ralph

PS. Guess I should clarify that I was not striving for true  
XML interaction here, but rather a quasi-XML format that  
would help you to filter the output. I have no problem 

Re: [OMPI devel] RFC: sm Latency

2009-01-20 Thread Eugene Loh




Richard Graham wrote:

  First, the performance improvements look
really nice.
A few questions:
  - How much of an abstraction violation does this introduce?
Doesn't need to be much of an abstraction violation at all if, by that,
we mean teaching the BTL about the match header.  Just need to make
some choices and I flagged that one for better visibility.
This looks like the btl needs to start
“knowing” about MPI level semantics.
That's one option.  There are other options.
Currently, the btl purposefully is ulp
agnostic.
What's ULP?
I ask for 2 reasons
   - you mention having the btl look at the match header (if I
understood correctly)
  
Right, both to know if there is a match when the user had MPI_ANY_TAG
and to extract values to populate the MPI_Status variable.  There are
other alternatives, like calling back the PML.
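
(A hedged sketch of what that per-fragment test amounts to: the header and
posted-receive structures below are invented stand-ins, while the wildcards
and MPI_Status fields are the standard MPI ones.)

  #include <mpi.h>
  #include <stdbool.h>

  typedef struct {            /* invented match header */
      int src;                /* sender rank */
      int tag;
  } match_hdr_t;

  typedef struct {            /* invented posted-receive descriptor */
      int src;                /* may be MPI_ANY_SOURCE */
      int tag;                /* may be MPI_ANY_TAG    */
  } posted_recv_t;

  bool try_match(const match_hdr_t *hdr, const posted_recv_t *recv,
                 MPI_Status *status)
  {
      if (recv->src != MPI_ANY_SOURCE && recv->src != hdr->src) {
          return false;
      }
      if (recv->tag != MPI_ANY_TAG && recv->tag != hdr->tag) {
          return false;
      }
      /* matched: whoever parses the header fills the user-visible status */
      status->MPI_SOURCE = hdr->src;
      status->MPI_TAG    = hdr->tag;
      return true;
  }
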
   - not clear to me what you mean by
returning the header to the list if the irecv does not complete.  If it
does not complete, why not just pass the header back for further
processing, if all this is happening at the pml level ?
  
I was trying to read the FIFO to see what's on there.  If it's
something we can handle, we take it and handle it.  If it's too
complicated, then we just balk and tell the upper layer that we're
declining any possible action.  That just seemed to me to be the
cleanest approach.

Here's an analogy.  Let's say you have a house problem.  You don't know
how bad it is.  You think you might have to hire an expensive
contractor to do lots of work, but some local handyman thinks he can do
it quickly and cheaply.  So, you have the handyman look at the job to
decide how involved it is.  Let's say the handyman discovers that it
is, indeed, a big job.  How would you like things left at this point? 
Two options:

*) Handyman says this is a big job.  Bring in the expensive contractor
and big equipment.  I left everything as I found it.  Or,

*) Handyman says, "I took apart the this-and-this and I bought a bunch
of supplies.  I ripped out the south wall.  The water to the house is
turned off.  Etc."  You (and whoever has to come in to actually do the
work) would probably prefer that nothing had been started.

I thought it was cleaner to go the "do the whole job or don't do any of
it" approach.
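
(A small sketch of that "all or nothing" contract, with structures invented
purely for illustration.)

  #include <stddef.h>

  typedef struct {            /* invented fragment header */
      int    simple;          /* contiguous data, no convertor gymnastics */
      size_t len;
  } frag_hdr_t;

  typedef struct {            /* invented FIFO: head is NULL when empty */
      frag_hdr_t *head;
  } fifo_t;

  enum { FAST_PATH_DONE = 0, FAST_PATH_DECLINED = -1 };

  /* Consume the head entry only when the cheap path applies; otherwise
   * leave the FIFO exactly as it was found and let the PML take over. */
  int try_fast_path(fifo_t *fifo)
  {
      frag_hdr_t *h = fifo->head;          /* peek, don't consume */

      if (NULL == h || !h->simple) {
          return FAST_PATH_DECLINED;       /* nothing was touched */
      }

      /* ... copy h's payload straight into the user buffer here ... */
      fifo->head = NULL;                   /* pop only after finishing */
      return FAST_PATH_DONE;
  }
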
  - The measurements seem to be very dual
process specific.  Have you looked at the impact of these changes on
other applications at the same process count ?  “Real” apps would be
interesting, but even hpl would be a good start. 
  
Many measurements are for np=2.  There are also np>2 HPCC pingpong
results though.  (HPCC pingpong measures pingpong between two processes
while np-2 processes sit in wait loops.)  HPCC also measures "ring"
results... these are point-to-point with all np processes working.

HPL is pretty insensitive to point-to-point performance.  It either
shows basically DGEMM performance or something is broken.

I haven't looked at "real" apps.

Let me be blunt about one thing:  much of this is motivated by
simplistic (HPCC) benchmarks.  This is for two reasons:

1) These benchmarks are highly visible.
2) It's hard to tune real apps when you know the primitives need work.

Looking at real apps is important and I'll try to get to that.
  The current sm implementation is aimed only
at small smp node count, which was really the only relevant type of
systems when this code was written 5 years ago.  For large core counts
there is a rather simple change that could be put in that is simple to
implement, and will give you flat scaling for the sort of tests you are
running.  If you replace the fifo’s with a single link list per process
in shared memory, with senders to this process adding match envelopes
atomically, with each process reading its own link list (multiple
writers and single reader in non-threaded situation) there will be only
one place to poll, regardless of the number of procs involved in the
run.  One still needs other optimizations to lower the absolute latency
– perhaps what you have suggested.  If one really has all N procs
trying to write to the same fifo at once, performance will stink
because of contention, but most apps don’t have that behaviour.
  
Okay.  Yes, I am a fan of that approach.  But:

*) Doesn't strike me as a "simple" change.
*) Not sure this addresses all-to-all well.  E.g., let's say you post a
receive for a particular source.  Do you then wade through a long FIFO
to look for your match?

What the RFC talks about is not the last SM development we'll ever
need.  It's only supposed to be one step forward from where we are
today.  The "single queue per receiver" approach has many advantages,
but I think it's a different topic.
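
Since the single-queue idea keeps coming up in this thread, here is a
minimal C11 sketch of the multiple-writer/single-reader list Richard
describes. It uses plain pointers and invented names for clarity; a real
shared-memory version would use offsets rather than pointers and recycle
envelopes through a free list, and none of this is Open MPI code.

  #include <stdatomic.h>
  #include <stddef.h>

  typedef struct envelope {
      _Atomic(struct envelope *) next;
      int src;
      int tag;
      /* ... payload, or a pointer/offset to it ... */
  } envelope_t;

  typedef struct {
      _Atomic(envelope_t *) head;    /* newest envelope first */
  } recv_queue_t;

  /* Sender side: lock-free push, safe with many concurrent writers. */
  void queue_push(recv_queue_t *q, envelope_t *e)
  {
      envelope_t *old = atomic_load(&q->head);
      do {
          atomic_store_explicit(&e->next, old, memory_order_relaxed);
      } while (!atomic_compare_exchange_weak(&q->head, &old, e));
  }

  /* Receiver side: the single reader detaches the whole list in one atomic
   * swap and then walks it privately (reversing it if FIFO order matters). */
  envelope_t *queue_take_all(recv_queue_t *q)
  {
      return atomic_exchange(&q->head, (envelope_t *)NULL);
  }

The compare-and-swap on the push is the per-sender atomic cost Patrick
refers to, and the receiver polls exactly one location regardless of how
many peers are in the job.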




[OMPI devel] trac report 14

2009-01-20 Thread Jeff Squyres

Has now been updated to include v1.3.1 tickets:

https://svn.open-mpi.org/trac/ompi/report/14

Enjoy.

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] -display-map

2009-01-20 Thread Ralph Castain
I'm embarrassed to admit that I never actually implemented the xml  
option for tag-output...this has been rectified with r20302.


Let me know if that works for you - sorry for confusion.

Ralph


On Jan 20, 2009, at 8:08 AM, Greg Watson wrote:


Ralph,

The encapsulation is not quite right yet. I'm seeing this:

[1,0]n = 0
[1,1]n = 0

but it should be:

n = 0
n = 0

Thanks,

Greg

On Jan 20, 2009, at 9:20 AM, Ralph Castain wrote:

You need to add --tag-output - this is a separate option as it  
applies both to xml and non-xml situations.


If you like, I can force tag-output "on" by default whenever -xml  
is specified.


Ralph


On Jan 16, 2009, at 12:52 PM, Greg Watson wrote:


Ralph,

Is there something I need to do to enable stdout/err encapsulation  
(apart from -xml)? Here's what I see:


$ mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map -np  
5 /Users/greg/Documents/workspace1/testMPI/Debug/testMPI


















n = 0
n = 0
n = 0
n = 0
n = 0

On Jan 15, 2009, at 1:13 PM, Ralph Castain wrote:

Okay, it is in the trunk as of r20284 - I'll file the request to  
have it moved to 1.3.1.


Let me know if you get a chance to test the stdout/err stuff in  
the trunk - we should try and iterate it so any changes can make  
1.3.1 as well.


Thanks!
Ralph


On Jan 15, 2009, at 11:03 AM, Greg Watson wrote:


Ralph,

I think the second form would be ideal and would simplify things  
greatly.


Greg

On Jan 15, 2009, at 10:53 AM, Ralph Castain wrote:

Here is what I was able to do - note that the resolve messages  
are associated with the specific hostname, not the overall map:











Will that work for you? If you like, I can remove the name=  
field from the noderesolve element since the info is specific  
to the host element that contains it. In other words, I can  
make it look like this:











if that would help.

Ralph


On Jan 14, 2009, at 7:57 AM, Ralph Castain wrote:

We -may- be able to do a more formal XML output at some point.  
The problem will be the natural interleaving of stdout/err  
from the various procs due to the async behavior of MPI.  
Mpirun receives fragmented output in the forwarding system,  
limited by the buffer sizes and the amount of data we can read  
at any one "bite" from the pipes connecting us to the procs.  
So even though the user -thinks- they output a single large  
line of stuff, it may show up at mpirun as a series of  
fragments. Hence, it gets tricky to know how to put  
appropriate XML brackets around it.


Given this input about when you actually want resolved name  
info, I can at least do something about that area. Won't be in  
1.3.0, but should make 1.3.1.


As for XML-tagged stdout/err: the OMPI community asked me not  
to turn that feature "on" for 1.3.0 as they felt it hasn't  
been adequately tested yet. The code is present, but cannot be  
activated in 1.3.0. However, I believe it is activated on the  
trunk when you do --xml --tagged-output, so perhaps some  
testing will help us debug and validate it adequately for 1.3.1?


Thanks
Ralph


On Jan 14, 2009, at 7:02 AM, Greg Watson wrote:


Ralph,

The only time we use the resolved names is when we get a map,  
so we consider them part of the map output.


If quasi-XML is all that will ever be possible with 1.3, then  
you may as well leave as-is and we will attempt to clean it  
up in Eclipse. It would be nice if a future version of ompi  
could output correct XML (including stdout) as this would  
vastly simplify the parsing we need to do.


Regards,

Greg

On Jan 13, 2009, at 3:30 PM, Ralph Castain wrote:

Hmmm...well, I can't do either for 1.3.0 as it is departing  
this afternoon.


The first option would be very hard to do. I would have to  
expose the display-map option across the code base and check  
it prior to printing anything about resolving node names. I  
guess I should ask: do you only want noderesolve statements  
when we are displaying the map? Right now, I will output  
them regardless.


The second option could be done. I could check if any  
"display" option has been specified, and output the   
root at that time (likewise for the end). Anything we output  
in-between would be encapsulated between the two, but that  
would include any user output to stdout and/or stderr -  
which for 1.3.0 is not in xml.


Any thoughts?

Ralph

PS. Guess I should clarify that I was not striving for true  
XML interaction here, but rather a quasi-XML format that  
would help you to filter the output. I have no problem  
trying to get to something more formally correct, but it  
could be tricky in some 

Re: [OMPI devel] === CREATE FAILURE (v1.3) ===

2009-01-20 Thread Jeff Squyres
This problem has been fixed (thankfully, it occurred after the v1.3  
tarballs were made).


The problem is that ftp.gnu.org has disabled repository downloads of  
config.guess and config.sub while some git vulnerability is being  
fixed.  Hence, the scripts that we downloaded while making the  
tarballs [intentionally] have a syntax error that makes the script un- 
runnable.


We've made OMPI's distscript a little more resilient now -- it checks
that the config.guess/config.sub are runnable  before schlepping them  
into the new tarball.




On Jan 20, 2009, at 10:09 AM, MPI Team wrote:



ERROR: Command returned a non-zero exist status (v1.3):
   make distcheck

Start time: Tue Jan 20 10:00:08 EST 2009
End time:   Tue Jan 20 10:09:20 EST 2009

==============================================================================

[... previous lines snipped ...]
  && ../configure --srcdir=.. --prefix="$dc_install_base" \
 \
  && make  \
  && make  dvi \
  && make  check \
  && make  install \
  && make  installcheck \
  && make  uninstall \
  && make  distuninstallcheck_dir="$dc_install_base" \
distuninstallcheck \
  && chmod -R a-w "$dc_install_base" \
  && ({ \
   (cd ../.. && umask 077 && mkdir "$dc_destdir") \
   && make  DESTDIR="$dc_destdir" install \
   && make  DESTDIR="$dc_destdir" uninstall \
   && make  DESTDIR="$dc_destdir" \
distuninstallcheck_dir="$dc_destdir" distuninstallcheck; \
  } || { rm -rf "$dc_destdir"; exit 1; }) \
  && rm -rf "$dc_destdir" \
  && make  dist \
  && rm -rf openmpi-1.3.1a0r20299.tar.gz  
openmpi-1.3.1a0r20299.tar.bz2 \

  && make  distcleancheck
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking how to create a ustar tar archive... gnutar

============================================================================
== Configuring Open MPI
============================================================================


*** Checking versions
checking Open MPI version... 1.3.1a0r20299
checking Open MPI release date... Unreleased developer copy
checking Open MPI Subversion repository version... r20299
checking Open Run-Time Environment version... 1.3.1a0r20299
checking Open Run-Time Environment release date... Unreleased  
developer copy
checking Open Run-Time Environment Subversion repository version...  
r20299

checking Open Portable Access Layer version... 1.3.1a0r20299
checking Open Portable Access Layer release date... Unreleased  
developer copy
checking Open Portable Access Layer Subversion repository version...  
r20299


*** Initialization, setup
configure: builddir: /home/mpiteam/openmpi/nightly-tarball-build- 
root/v1.3/create-r20299/ompi/openmpi-1.3.1a0r20299/_build
configure: srcdir: /home/mpiteam/openmpi/nightly-tarball-build-root/ 
v1.3/create-r20299/ompi/openmpi-1.3.1a0r20299

configure: Detected VPATH build
configure: error: cannot run /bin/sh ../config/config.sub
make: *** [distcheck] Error 1
==============================================================================


Your friendly daemon,
Cyrador
___
testing mailing list
test...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/testing



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] When can I use OOB channel?

2009-01-20 Thread Ralph Castain

Ah - no problem! Glad it was simple.

Be aware that the RML is the layer responsible for routing OOB  
messages. So if you go through the OOB interface, you lose all message  
routing - which means forcing open additional connections and  
potentially confusing the system.


We should undoubtedly document that somewhere so others don't  
mistakenly use the OOB interfaces directly.


Thanks for bringing this up!
Ralph

On Jan 20, 2009, at 8:06 AM, Timothy Hayes wrote:


Hi Ralph,

I'm quite embarrassed, I misread the function prototype and was  
passing in the actual proc_name rather than a pointer to it! It  
didn't complain when I was compiling so I didn't think twice. It was  
silly mistake on my part in any case! That RML tip is still handy  
though, thanks.


Cheers
Tim

2009/1/20 Ralph Castain 
You should be able to use the OOB by that point in the system.
However, that is the incorrect entry point for sending messages -  
you need to enter via the RML. The correct call is to  
orte_rml.send_nb.


Or, if you are going to send a buffer instead of an iovec, then the  
call would be to orte_rml.send_buffer_nb.


Ralph



On Jan 19, 2009, at 1:01 PM, Timothy Hayes wrote:

Hello

I'm in the midst of writing a BTL component, all is going well  
although today I ran into something unexpected. In the  
mca_btl_base_module_add_procs_fn_t function, I'm trying to call  
mca_oob_tcp_send_nb() which is returning -12 (ORTE_ERR_UNREACH). Is  
this normal or have I done something wrong? Is there a way around  
this? It would be great if I could call this function in that  
particular area of code.


Kind regards
Tim Hayes
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] -display-map

2009-01-20 Thread Greg Watson

Ralph,

The encapsulation is not quite right yet. I'm seeing this:

[1,0]n = 0
[1,1]n = 0

but it should be:

n = 0
n = 0

Thanks,

Greg

On Jan 20, 2009, at 9:20 AM, Ralph Castain wrote:

You need to add --tag-output - this is a separate option as it  
applies both to xml and non-xml situations.


If you like, I can force tag-output "on" by default whenever -xml is  
specified.


Ralph


On Jan 16, 2009, at 12:52 PM, Greg Watson wrote:


Ralph,

Is there something I need to do to enable stdout/err encapsulation  
(apart from -xml)? Here's what I see:


$ mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map -np  
5 /Users/greg/Documents/workspace1/testMPI/Debug/testMPI


















n = 0
n = 0
n = 0
n = 0
n = 0

On Jan 15, 2009, at 1:13 PM, Ralph Castain wrote:

Okay, it is in the trunk as of r20284 - I'll file the request to  
have it moved to 1.3.1.


Let me know if you get a chance to test the stdout/err stuff in  
the trunk - we should try and iterate it so any changes can make  
1.3.1 as well.


Thanks!
Ralph


On Jan 15, 2009, at 11:03 AM, Greg Watson wrote:


Ralph,

I think the second form would be ideal and would simplify things  
greatly.


Greg

On Jan 15, 2009, at 10:53 AM, Ralph Castain wrote:

Here is what I was able to do - note that the resolve messages  
are associated with the specific hostname, not the overall map:











Will that work for you? If you like, I can remove the name=  
field from the noderesolve element since the info is specific to  
the host element that contains it. In other words, I can make it  
look like this:











if that would help.

Ralph


On Jan 14, 2009, at 7:57 AM, Ralph Castain wrote:

We -may- be able to do a more formal XML output at some point.  
The problem will be the natural interleaving of stdout/err from  
the various procs due to the async behavior of MPI. Mpirun  
receives fragmented output in the forwarding system, limited by  
the buffer sizes and the amount of data we can read at any one  
"bite" from the pipes connecting us to the procs. So even  
though the user -thinks- they output a single large line of  
stuff, it may show up at mpirun as a series of fragments.  
Hence, it gets tricky to know how to put appropriate XML  
brackets around it.


Given this input about when you actually want resolved name  
info, I can at least do something about that area. Won't be in  
1.3.0, but should make 1.3.1.


As for XML-tagged stdout/err: the OMPI community asked me not  
to turn that feature "on" for 1.3.0 as they felt it hasn't been  
adequately tested yet. The code is present, but cannot be  
activated in 1.3.0. However, I believe it is activated on the  
trunk when you do --xml --tagged-output, so perhaps some  
testing will help us debug and validate it adequately for 1.3.1?


Thanks
Ralph


On Jan 14, 2009, at 7:02 AM, Greg Watson wrote:


Ralph,

The only time we use the resolved names is when we get a map,  
so we consider them part of the map output.


If quasi-XML is all that will ever be possible with 1.3, then  
you may as well leave as-is and we will attempt to clean it up  
in Eclipse. It would be nice if a future version of ompi could  
output correct XML (including stdout) as this would vastly  
simplify the parsing we need to do.


Regards,

Greg

On Jan 13, 2009, at 3:30 PM, Ralph Castain wrote:

Hmmm...well, I can't do either for 1.3.0 as it is departing  
this afternoon.


The first option would be very hard to do. I would have to  
expose the display-map option across the code base and check  
it prior to printing anything about resolving node names. I  
guess I should ask: do you only want noderesolve statements  
when we are displaying the map? Right now, I will output them  
regardless.


The second option could be done. I could check if any  
"display" option has been specified, and output the   
root at that time (likewise for the end). Anything we output  
in-between would be encapsulated between the two, but that  
would include any user output to stdout and/or stderr - which  
for 1.3.0 is not in xml.


Any thoughts?

Ralph

PS. Guess I should clarify that I was not striving for true  
XML interaction here, but rather a quasi-XML format that  
would help you to filter the output. I have no problem trying  
to get to something more formally correct, but it could be  
tricky in some places to achieve it due to the inherent async  
nature of the beast.



On Jan 13, 2009, at 12:17 PM, Greg Watson wrote:


Ralph,

The XML is looking better now, but there is still one  
problem. To be valid, there needs to be only one root  

Re: [OMPI devel] When can I use OOB channel?

2009-01-20 Thread Timothy Hayes
Hi Ralph,

I'm quite embarrassed, I misread the function prototype and was passing in
the actual proc_name rather than a pointer to it! It didn't complain when I
was compiling so I didn't think twice. It was silly mistake on my part in
any case! That RML tip is still handy though, thanks.

Cheers
Tim

2009/1/20 Ralph Castain 

> You should be able to use the OOB by that point in the system. However,
> that is the incorrect entry point for sending messages - you need to enter
> via the RML. The correct call is to orte_rml.send_nb.
>
> Or, if you are going to send a buffer instead of an iovec, then the call
> would be to orte_rml.send_buffer_nb.
>
> Ralph
>
>
>
> On Jan 19, 2009, at 1:01 PM, Timothy Hayes wrote:
>
>   Hello
>>
>> I'm in the midst of writing a BTL component, all is going well although
>> today I ran into something unexpected. In the
>> mca_btl_base_module_add_procs_fn_t function, I'm trying to call
>> mca_oob_tcp_send_nb() which is returning -12 (ORTE_ERR_UNREACH). Is this
>> normal or have I done something wrong? Is there a way around this? It would
>> be great if I could call this function in that particular area of code.
>>
>> Kind regards
>> Tim Hayes
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>


Re: [OMPI devel] -display-map

2009-01-20 Thread Greg Watson
I don't think there's any reason we'd want stdout/err not to be  
encapsulated, so forcing tag-output makes sense.


Greg

On Jan 20, 2009, at 9:20 AM, Ralph Castain wrote:

You need to add --tag-output - this is a separate option as it  
applies both to xml and non-xml situations.


If you like, I can force tag-output "on" by default whenever -xml is  
specified.


Ralph


On Jan 16, 2009, at 12:52 PM, Greg Watson wrote:


Ralph,

Is there something I need to do to enable stdout/err encapsulation  
(apart from -xml)? Here's what I see:


$ mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map -np  
5 /Users/greg/Documents/workspace1/testMPI/Debug/testMPI


















n = 0
n = 0
n = 0
n = 0
n = 0

On Jan 15, 2009, at 1:13 PM, Ralph Castain wrote:

Okay, it is in the trunk as of r20284 - I'll file the request to  
have it moved to 1.3.1.


Let me know if you get a chance to test the stdout/err stuff in  
the trunk - we should try and iterate it so any changes can make  
1.3.1 as well.


Thanks!
Ralph


On Jan 15, 2009, at 11:03 AM, Greg Watson wrote:


Ralph,

I think the second form would be ideal and would simplify things  
greatly.


Greg

On Jan 15, 2009, at 10:53 AM, Ralph Castain wrote:

Here is what I was able to do - note that the resolve messages  
are associated with the specific hostname, not the overall map:











Will that work for you? If you like, I can remove the name=  
field from the noderesolve element since the info is specific to  
the host element that contains it. In other words, I can make it  
look like this:











if that would help.

Ralph


On Jan 14, 2009, at 7:57 AM, Ralph Castain wrote:

We -may- be able to do a more formal XML output at some point.  
The problem will be the natural interleaving of stdout/err from  
the various procs due to the async behavior of MPI. Mpirun  
receives fragmented output in the forwarding system, limited by  
the buffer sizes and the amount of data we can read at any one  
"bite" from the pipes connecting us to the procs. So even  
though the user -thinks- they output a single large line of  
stuff, it may show up at mpirun as a series of fragments.  
Hence, it gets tricky to know how to put appropriate XML  
brackets around it.


Given this input about when you actually want resolved name  
info, I can at least do something about that area. Won't be in  
1.3.0, but should make 1.3.1.


As for XML-tagged stdout/err: the OMPI community asked me not  
to turn that feature "on" for 1.3.0 as they felt it hasn't been  
adequately tested yet. The code is present, but cannot be  
activated in 1.3.0. However, I believe it is activated on the  
trunk when you do --xml --tagged-output, so perhaps some  
testing will help us debug and validate it adequately for 1.3.1?


Thanks
Ralph


On Jan 14, 2009, at 7:02 AM, Greg Watson wrote:


Ralph,

The only time we use the resolved names is when we get a map,  
so we consider them part of the map output.


If quasi-XML is all that will ever be possible with 1.3, then  
you may as well leave as-is and we will attempt to clean it up  
in Eclipse. It would be nice if a future version of ompi could  
output correct XML (including stdout) as this would vastly  
simplify the parsing we need to do.


Regards,

Greg

On Jan 13, 2009, at 3:30 PM, Ralph Castain wrote:

Hmmm...well, I can't do either for 1.3.0 as it is departing  
this afternoon.


The first option would be very hard to do. I would have to  
expose the display-map option across the code base and check  
it prior to printing anything about resolving node names. I  
guess I should ask: do you only want noderesolve statements  
when we are displaying the map? Right now, I will output them  
regardless.


The second option could be done. I could check if any  
"display" option has been specified, and output the   
root at that time (likewise for the end). Anything we output  
in-between would be encapsulated between the two, but that  
would include any user output to stdout and/or stderr - which  
for 1.3.0 is not in xml.


Any thoughts?

Ralph

PS. Guess I should clarify that I was not striving for true  
XML interaction here, but rather a quasi-XML format that  
would help you to filter the output. I have no problem trying  
to get to something more formally correct, but it could be  
tricky in some places to achieve it due to the inherent async  
nature of the beast.



On Jan 13, 2009, at 12:17 PM, Greg Watson wrote:


Ralph,

The XML is looking better now, but there is still one  
problem. To be valid, there needs to be only one root  
element, but 

Re: [OMPI devel] -display-map

2009-01-20 Thread Ralph Castain
You need to add --tag-output - this is a separate option as it applies  
both to xml and non-xml situations.


If you like, I can force tag-output "on" by default whenever -xml is  
specified.


Ralph


On Jan 16, 2009, at 12:52 PM, Greg Watson wrote:


Ralph,

Is there something I need to do to enable stdout/err encapsulation  
(apart from -xml)? Here's what I see:


$ mpirun -mca orte_show_resolved_nodenames 1 -xml -display-map -np  
5 /Users/greg/Documents/workspace1/testMPI/Debug/testMPI

n = 0
n = 0
n = 0
n = 0
n = 0

On Jan 15, 2009, at 1:13 PM, Ralph Castain wrote:

Okay, it is in the trunk as of r20284 - I'll file the request to  
have it moved to 1.3.1.


Let me know if you get a chance to test the stdout/err stuff in the  
trunk - we should try and iterate it so any changes can make 1.3.1  
as well.


Thanks!
Ralph


On Jan 15, 2009, at 11:03 AM, Greg Watson wrote:


Ralph,

I think the second form would be ideal and would simplify things  
greatly.


Greg

On Jan 15, 2009, at 10:53 AM, Ralph Castain wrote:

Here is what I was able to do - note that the resolve messages  
are associated with the specific hostname, not the overall map:

Will that work for you? If you like, I can remove the name= field  
from the noderesolve element since the info is specific to the  
host element that contains it. In other words, I can make it look  
like this:

if that would help.

Ralph


On Jan 14, 2009, at 7:57 AM, Ralph Castain wrote:

We -may- be able to do a more formal XML output at some point.  
The problem will be the natural interleaving of stdout/err from  
the various procs due to the async behavior of MPI. Mpirun  
receives fragmented output in the forwarding system, limited by  
the buffer sizes and the amount of data we can read at any one  
"bite" from the pipes connecting us to the procs. So even though  
the user -thinks- they output a single large line of stuff, it  
may show up at mpirun as a series of fragments. Hence, it gets  
tricky to know how to put appropriate XML brackets around it.


Given this input about when you actually want resolved name  
info, I can at least do something about that area. Won't be in  
1.3.0, but should make 1.3.1.


As for XML-tagged stdout/err: the OMPI community asked me not to  
turn that feature "on" for 1.3.0 as they felt it hasn't been  
adequately tested yet. The code is present, but cannot be  
activated in 1.3.0. However, I believe it is activated on the  
trunk when you do --xml --tagged-output, so perhaps some testing  
will help us debug and validate it adequately for 1.3.1?


Thanks
Ralph


On Jan 14, 2009, at 7:02 AM, Greg Watson wrote:


Ralph,

The only time we use the resolved names is when we get a map,  
so we consider them part of the map output.


If quasi-XML is all that will ever be possible with 1.3, then  
you may as well leave as-is and we will attempt to clean it up  
in Eclipse. It would be nice if a future version of ompi could  
output correct XML (including stdout) as this would vastly  
simplify the parsing we need to do.


Regards,

Greg

On Jan 13, 2009, at 3:30 PM, Ralph Castain wrote:

Hmmm...well, I can't do either for 1.3.0 as it is departing  
this afternoon.


The first option would be very hard to do. I would have to  
expose the display-map option across the code base and check  
it prior to printing anything about resolving node names. I  
guess I should ask: do you only want noderesolve statements  
when we are displaying the map? Right now, I will output them  
regardless.


The second option could be done. I could check if any  
"display" option has been specified, and output the   
root at that time (likewise for the end). Anything we output  
in-between would be encapsulated between the two, but that  
would include any user output to stdout and/or stderr - which  
for 1.3.0 is not in xml.


Any thoughts?

Ralph

PS. Guess I should clarify that I was not striving for true  
XML interaction here, but rather a quasi-XML format that would  
help you to filter the output. I have no problem trying to get  
to something more formally correct, but it could be tricky in  
some places to achieve it due to the inherent async nature of  
the beast.



On Jan 13, 2009, at 12:17 PM, Greg Watson wrote:


Ralph,

The XML is looking better now, but there is still one  
problem. To be valid, there needs to be only one root  
element, but currently you don't have any (or many). So  
rather than:
the XML should 

Re: [OMPI devel] OpenMPI rpm build 1.3rc3r20226 build failed

2009-01-20 Thread Jonathan Billings
I believe the situation that is causing the error has to do with GCC's
FORTIFY_SOURCE.  I'm building under CentOS 5.2 using the 1.3 src.rpm
available on the website:

% gcc -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib -I../extlib/otf/otflib 
-I../extlib/otf/otflib -D_GNU_SOURCE -DBINDIR=\"/usr/bin\" 
-DDATADIR=\"/usr/share\" -DRFG  -DVT_MEMHOOK -DVT_IOWRAP -O2 -g -pipe -Wall 
-Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector 
--param=ssp-buffer-size=4 -m64 -mtune=generic -MT vt_iowrap.o -MD -MP -MF 
.deps/vt_iowrap.Tpo -c -o vt_iowrap.o vt_iowrap.c 
vt_iowrap.c:1242: error: expected declaration specifiers or ‘...’
before numeric constant
vt_iowrap.c:1243: error: conflicting types for ‘__fprintf_chk’

% 
(compile fails)

I cd into the appropriate area, and re-run the gcc without
-D_FORTIFY_SOURCE=2:

% gcc -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib -I../extlib/otf/otflib 
-I../extlib/otf/otflib -D_GNU_SOURCE -DBINDIR=\"/usr/bin\" 
-DDATADIR=\"/usr/share\" -DRFG  -DVT_MEMHOOK -DVT_IOWRAP -O2 -g -pipe -Wall 
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic 
-MT vt_iowrap.o -MD -MP -MF .deps/vt_iowrap.Tpo -c -o vt_iowrap.o vt_iowrap.c

% 
(compile succeeds)

If I change the spec file to remove the -D_FORTIFY_SOURCE define from
RPM_OPT_FLAGS (like what is done for non-gcc compilers) the build
succeeds.

It appears that the additional define is more strict and causes
problems when the default RPM build environment is kept intact.
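
For reference, the error pair can be reproduced outside VampirTrace. Below is a minimal sketch (a hypothetical file, not the actual vt_iowrap.c, and assuming a glibc/gcc combination where fprintf becomes a fortify macro at -O with _FORTIFY_SOURCE=2, as the CentOS 5.2 toolchain above appears to be):

/* fortify_clash.c -- hedged reproducer sketch.
 *   gcc -O2 -Wp,-D_FORTIFY_SOURCE=2 -c fortify_clash.c   (fails as above)
 *   gcc -O2 -c fortify_clash.c                           (compiles)
 * With the fortify level set, <stdio.h> can define fprintf as a macro that
 * expands to __fprintf_chk(stream, __USE_FORTIFY_LEVEL - 1, ...).  The
 * re-declaration below is then rewritten into an invalid declaration of
 * __fprintf_chk, giving the "before numeric constant" and "conflicting
 * types for '__fprintf_chk'" diagnostics. */
#include <stdio.h>

/* An I/O-interposing tracer typically re-declares the symbols it wraps: */
int fprintf(FILE *stream, const char *format, ...);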

-- 
Jonathan Billings 
The College of Language, Science, and the Arts
LS IT - Research Systems and Support


Re: [OMPI devel] RFC: sm Latency

2009-01-20 Thread Graham, Richard L.
If all write to the same destination at the same time - yes.  On older systems 
you could start to see degradation around 6 procs, but things held up OK 
further out.  My guess is that you want one such queue per n procs, where n 
might be 8 (have to experiment), so polling costs are low and memory contention 
is manageable.

Rich
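
For concreteness, here is a minimal sketch of the multiple-writer / single-reader list described in the quoted message below. C11 atomics are used purely for illustration; a real implementation would live in shared memory and use the opal atomics, and none of this is from an actual patch:

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* One list head per receiving process (or per group of n receivers).
 * Senders push match envelopes with an atomic compare-and-swap; the owning
 * process is the only reader, so it can drain the list without locks. */
typedef struct envelope {
    int src, tag;
    struct envelope *next;
} envelope_t;

typedef struct {
    _Atomic(envelope_t *) head;
} recv_list_t;

static void push_envelope(recv_list_t *list, envelope_t *env)
{
    envelope_t *old = atomic_load_explicit(&list->head, memory_order_relaxed);
    do {
        env->next = old;
    } while (!atomic_compare_exchange_weak_explicit(&list->head, &old, env,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

/* Single reader: detach everything with one atomic exchange, then walk it. */
static envelope_t *drain(recv_list_t *list)
{
    return atomic_exchange_explicit(&list->head, NULL, memory_order_acquire);
}

int main(void)
{
    recv_list_t list = { NULL };
    for (int i = 0; i < 3; i++) {               /* stand-ins for three senders */
        envelope_t *e = malloc(sizeof *e);
        e->src = i;
        e->tag = 42;
        push_envelope(&list, e);
    }
    for (envelope_t *e = drain(&list); e != NULL; ) {
        envelope_t *next = e->next;
        printf("envelope from src %d, tag %d\n", e->src, e->tag);
        free(e);
        e = next;
    }
    return 0;
}

In shared memory the links would be offsets rather than raw pointers, and the push is LIFO, so per-sender FIFO ordering still has to be reconstructed (or envelopes kept per sender); the point is only that the receiver polls one location no matter how many senders there are.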

- Original Message -
From: devel-boun...@open-mpi.org 
To: Open MPI Developers 
Sent: Tue Jan 20 06:56:53 2009
Subject: Re: [OMPI devel] RFC: sm Latency

Richard Graham wrote:
> First, the performance improvements look really nice.
> A few questions:
> - How much of an abstraction violation does this introduce ? This
> looks like the btl needs to start “knowing” about MPI level semantics.
> Currently, the btl purposefully is ulp agnostic. I ask for 2 reasons
> - you mention having the btl look at the match header (if I understood
> correctly)
> - not clear to me what you mean by returning the header to the list if
> the irecv does not complete. If it does not complete, why not just
> pass the header back for further processing, if all this is happening
> at the pml level ?
> - The measurements seem to be very dual process specific. Have you
> looked at the impact of these changes on other applications at the
> same process count ? “Real” apps would be interesting, but even hpl
> would be a good start.
> The current sm implementation is aimed only at small smp node count,
> which was really the only relevant type of systems when this code was
> written 5 years ago. For large core counts there is a rather simple
> change that could be put in that is simple to implement, and will give
> you flat scaling for the sort of tests you are running. If you replace
> the fifo’s with a single link list per process in shared memory, with
> senders to this process adding match envelopes atomically, with each
> process reading its own link list (multiple writers and single reader
> in non-threaded situation) there will be only one place to poll,
> regardless of the number of procs involved in the run. One still needs
> other optimizations to lower the absolute latency – perhaps what you
> have suggested. If one really has all N procs trying to write to the
> same fifo at once, performance will stink because of contention, but
> most apps don’t have that behaviour.
If I remember correctly you can get a slowdown with the method you mention
above even with a handful (4-6 processes) writing to the same destination.

--td

> Rich
>
>
> On 1/17/09 1:48 AM, "Eugene Loh"  wrote:
>
>
>
> 
> *RFC: sm Latency*
> *WHAT:* Introducing optimizations to reduce ping-pong latencies
> over the sm BTL.
>
> *WHY:* This is a visible benchmark of MPI performance. We can
> improve shared-memory latencies from 30% (if hardware latency is
> the limiting factor) to 2× or more (if MPI software overhead is
> the limiting factor). At high process counts, the improvement can
> be 10× or more.
>
> *WHERE:* Somewhat in the sm BTL, but very importantly also in the
> PML. Changes can be seen in ssh://www.open-mpi.org/~tdd/hg/fastpath.
>
> *WHEN:* Upon acceptance. In time for OMPI 1.4.
>
> *TIMEOUT:* February 6, 2009.
> 
> This RFC is being submitted by eugene@sun.com.
> *WHY (details)*
> The sm BTL typically has the lowest hardware latencies of any
> BTL. Therefore, any OMPI software overhead we otherwise tolerate
> becomes glaringly obvious in sm latency measurements.
>
> In particular, MPI pingpong latencies are oft-cited performance
> benchmarks, popular indications of the quality of an MPI
> implementation. Competitive vendor MPIs optimize this metric
> aggressively, both for np=2 pingpongs and for pairwise pingpongs
> for high np (like the popular HPCC performance test suite).
>
> Performance reported by HPCC include:
>
> * MPI_Send()/MPI_Recv() pingpong latency.
> * MPI_Send()/MPI_Recv() pingpong latency as the number of
>   connections grows.
> * MPI_Sendrecv() latency.
>
> The slowdown of latency as the number of sm connections grows
> becomes increasingly important on large SMPs and ever more
> prevalent many-core nodes.
>
> Other MPI implementations, such as Scali and Sun HPC ClusterTools
> 6, introduced such optimizations years ago.
>
> Performance measurements indicate that the speedups we can expect
> in OMPI with these optimizations range from 30% (np=2 measurements
> where hardware is the bottleneck) to 2× (np=2 measurements where
> software is the bottleneck) to over 10× (large np).
> *WHAT (details)*
> Introduce an optimized "fast path" for "immediate" sends and
> receives. Several actions are recommended here.
> 

Re: [OMPI devel] RFC: sm Latency

2009-01-20 Thread Terry Dontje
Richard Graham wrote:
> First, the performance improvements look really nice.
> A few questions:
> - How much of an abstraction violation does this introduce ? This
> looks like the btl needs to start “knowing” about MPI level semantics.
> Currently, the btl purposefully is ulp agnostic. I ask for 2 reasons
> - you mention having the btl look at the match header (if I understood
> correctly)
> - not clear to me what you mean by returning the header to the list if
> the irecv does not complete. If it does not complete, why not just
> pass the header back for further processing, if all this is happening
> at the pml level ?
> - The measurements seem to be very dual process specific. Have you
> looked at the impact of these changes on other applications at the
> same process count ? “Real” apps would be interesting, but even hpl
> would be a good start.
> The current sm implementation is aimed only at small smp node count,
> which was really the only relevant type of systems when this code was
> written 5 years ago. For large core counts there is a rather simple
> change that could be put in that is simple to implement, and will give
> you flat scaling for the sort of tests you are running. If you replace
> the fifo’s with a single link list per process in shared memory, with
> senders to this process adding match envelopes atomically, with each
> process reading its own link list (multiple writers and single reader
> in non-threaded situation) there will be only one place to poll,
> regardless of the number of procs involved in the run. One still needs
> other optimizations to lower the absolute latency – perhaps what you
> have suggested. If one really has all N procs trying to write to the
> same fifo at once, performance will stink because of contention, but
> most apps don’t have that behaviour.
If I remember correctly you can get a slowdown with the method you mention
above even with a handful (4-6 processes) writing to the same destination.

--td

> Rich
>
>
> On 1/17/09 1:48 AM, "Eugene Loh"  wrote:
>
>
>
> 
> *RFC: sm Latency*
> *WHAT:* Introducing optimizations to reduce ping-pong latencies
> over the sm BTL.
>
> *WHY:* This is a visible benchmark of MPI performance. We can
> improve shared-memory latencies from 30% (if hardware latency is
> the limiting factor) to 2× or more (if MPI software overhead is
> the limiting factor). At high process counts, the improvement can
> be 10× or more.
>
> *WHERE:* Somewhat in the sm BTL, but very importantly also in the
> PML. Changes can be seen in ssh://www.open-mpi.org/~tdd/hg/fastpath.
>
> *WHEN:* Upon acceptance. In time for OMPI 1.4.
>
> *TIMEOUT:* February 6, 2009.
> 
> This RFC is being submitted by eugene@sun.com.
> *WHY (details)*
> The sm BTL typically has the lowest hardware latencies of any
> BTL. Therefore, any OMPI software overhead we otherwise tolerate
> becomes glaringly obvious in sm latency measurements.
>
> In particular, MPI pingpong latencies are oft-cited performance
> benchmarks, popular indications of the quality of an MPI
> implementation. Competitive vendor MPIs optimize this metric
> aggressively, both for np=2 pingpongs and for pairwise pingpongs
> for high np (like the popular HPCC performance test suite).
>
> Performance reported by HPCC include:
>
> * MPI_Send()/MPI_Recv() pingpong latency.
> * MPI_Send()/MPI_Recv() pingpong latency as the number of
>   connections grows.
> * MPI_Sendrecv() latency.
>
> The slowdown of latency as the number of sm connections grows
> becomes increasingly important on large SMPs and ever more
> prevalent many-core nodes.
>
> Other MPI implementations, such as Scali and Sun HPC ClusterTools
> 6, introduced such optimizations years ago.
>
> Performance measurements indicate that the speedups we can expect
> in OMPI with these optimizations range from 30% (np=2 measurements
> where hardware is the bottleneck) to 2× (np=2 measurements where
> software is the bottleneck) to over 10× (large np).
> *WHAT (details)*
> Introduce an optimized "fast path" for "immediate" sends and
> receives. Several actions are recommended here.
> *1. Invoke the sm BTL sendi (send-immediate) function*
> Each BTL is allowed to define a "send immediate" (sendi)
> function. A BTL is not required to do so, however, in which case
> the PML calls the standard BTL send function.
>
> A sendi function has already been written for sm, but it has not
> been used due to insufficient testing.
>
> The function should be reviewed, commented in, tested, and used.
>
> The changes are:
>
> * *File*: ompi/mca/btl/sm/btl_sm.c
> * *Declaration/Definition*: