Re: [OMPI devel] openib btl header caching

2007-08-13 Thread Gleb Natapov
On Mon, Aug 13, 2007 at 03:59:28PM -0400, Richard Graham wrote:
> 
> 
> 
> On 8/13/07 3:52 PM, "Gleb Natapov"  wrote:
> 
> > On Mon, Aug 13, 2007 at 09:12:33AM -0600, Galen Shipman wrote:
> > Here are the items we have identified:
> > 
> All those things sound very promising. Is there a tmp branch where you
> are going to work on this?
> 
> tmp/latency
> 
> Some changes have already gone in - mainly trying to remove as much as
> possible from the isend/send path, before moving on to the list below.  Do
> you have cycles to help with this?
I am very interested, though I am not sure about cycles. I'll be back from
my vacation next week and will look over this list one more time to see
where I can help.

> 
> Rich
> 
> > 
> > 1) remove 0 byte optimization of not initializing the convertor
> >   This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an
> > "if" in mca_pml_ob1_send_request_start_copy
> > +++
> > Measure the convertor initialization before taking any other action.
> > 
> > 
> > 2) get rid of mca_pml_ob1_send_request_start_prepare and
> > mca_pml_ob1_send_request_start_copy by removing the
> > MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send
> > return OMPI_SUCCESS if the fragment can be marked as completed and
> > OMPI_NOT_ON_WIRE if the fragment cannot be marked as complete. This
> > solves another problem: with IB, if there are a bunch of isends
> > outstanding we end up buffering them all in the btl, marking
> > completion, and never getting them on the wire because the BTL runs
> > out of credits; we never get credits back until finalize because we
> > never call progress since the requests are complete.  There is one
> > issue here: start_prepare calls prepare_src and start_copy calls
> > alloc. I think we can work around this by just always using
> > prepare_src; the OpenIB BTL will give a fragment off the free list
> > anyway because the fragment is less than the eager limit.
> > +++
> > Make the BTL return different return codes for the send. If the
> > fragment is gone, then the PML is responsible for marking the MPI
> > request as completed and so on. Only the updated BTLs will get any
> > benefit from this feature. Add a flag to the descriptor indicating
> > whether the BTL is allowed to free the fragment.
> > 
> > Add a three-level flag:
> > - BTL_HAVE_OWNERSHIP : the fragment can be released by the BTL after
> > the send, and then it reports back a special return code to the PML
> > - BTL_HAVE_OWNERSHIP_AFTER_CALLBACK : the fragment will be released
> > by the BTL once the completion callback has been triggered.
> > - PML_HAVE_OWNERSHIP : the BTL is not allowed to release the fragment
> > at all (the PML is responsible for this).
> > 
> > Return codes:
> > - done and there will be no callbacks
> > - not done, wait for a callback later
> > - error state
> > 
> > 
> > 3) Change the remote callback function (and tag value based on what
> > data we are sending); don't use mca_pml_ob1_recv_frag_callback for
> > everything!
> >   I think we need:
> > 
> >   mca_pml_ob1_recv_frag_match
> >   mca_pml_ob1_recv_frag_rndv
> >   mca_pml_ob1_recv_frag_rget
> > 
> >   mca_pml_ob1_recv_match_ack_copy
> >   mca_pml_ob1_recv_match_ack_pipeline
> > 
> >   mca_pml_ob1_recv_copy_frag
> >   mca_pml_ob1_recv_put_request
> >   mca_pml_ob1_recv_put_fin
> > +++
> > Passing the callback as a parameter to the match function will save
> > us two switches. Add more registrations in the BTL in order to jump
> > directly to the correct function (the first 3 require a match while
> > the others don't). Split the tag 4 & 4 bits so each layer has 4 bits
> > of tags [i.e. the first 4 bits are for the protocol tag and the lower
> > 4 bits are up to the protocol] and the registration table will still
> > be local to each component.
> > 
> > 
> > 4) Get rid of mca_pml_ob1_recv_request_progress; this does the same
> > switch on hdr->hdr_common.hdr_type as mca_pml_ob1_recv_frag_callback!
> >   I think what we can do here is modify mca_pml_ob1_recv_frag_match
> > to take a function pointer for what it should call on a successful
> > match.
> >   So based on the receive 

Re: [OMPI devel] openib btl header caching

2007-08-13 Thread Gleb Natapov
On Mon, Aug 13, 2007 at 09:12:33AM -0600, Galen Shipman wrote:
> Here are the items we have identified:
> 
All those things sound very promising. Is there a tmp branch where you
are going to work on this?

> 
>  
> 
> 
> 1) remove 0 byte optimization of not initializing the convertor
>   This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an  
> "if" in mca_pml_ob1_send_request_start_copy
> +++
> Measure the convertor initialization before taking any other action.
>  
> 
> 
>  
> 
> 
> 2) get rid of mca_pml_ob1_send_request_start_prepare and  
> mca_pml_ob1_send_request_start_copy by removing the  
> MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send  
> return OMPI_SUCCESS if the fragment can be marked as completed and  
> OMPI_NOT_ON_WIRE if the fragment cannot be marked as complete. This  
> solves another problem: with IB, if there are a bunch of isends  
> outstanding we end up buffering them all in the btl, marking  
> completion, and never getting them on the wire because the BTL runs  
> out of credits; we never get credits back until finalize because we  
> never call progress since the requests are complete.  There is one  
> issue here: start_prepare calls prepare_src and start_copy calls  
> alloc. I think we can work around this by just always using  
> prepare_src; the OpenIB BTL will give a fragment off the free list  
> anyway because the fragment is less than the eager limit.
> +++
> Make the BTL return different return codes for the send. If the  
> fragment is gone, then the PML is responsible for marking the MPI  
> request as completed and so on. Only the updated BTLs will get any  
> benefit from this feature. Add a flag to the descriptor indicating  
> whether the BTL is allowed to free the fragment.
> 
> Add a three-level flag:
> - BTL_HAVE_OWNERSHIP : the fragment can be released by the BTL after  
> the send, and then it reports back a special return code to the PML
> - BTL_HAVE_OWNERSHIP_AFTER_CALLBACK : the fragment will be released  
> by the BTL once the completion callback has been triggered.
> - PML_HAVE_OWNERSHIP : the BTL is not allowed to release the fragment  
> at all (the PML is responsible for this).
> 
> Return codes:
> - done and there will be no callbacks
> - not done, wait for a callback later
> - error state
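[A minimal sketch of how the PML send path could branch on the return codes
proposed in item 2. Only OMPI_SUCCESS, OMPI_NOT_ON_WIRE and the ownership
flags come from the proposal above; the value chosen for OMPI_NOT_ON_WIRE and
every other name here are invented purely for illustration.]

    /* Illustrative sketch only -- not actual Open MPI code. */
    #include <stdio.h>

    enum { OMPI_SUCCESS = 0, OMPI_NOT_ON_WIRE = 1 };  /* values hypothetical */

    typedef struct { int complete; } send_request_t;

    /* Stand-in for btl_send(): returns one of the proposed codes. */
    static int btl_send(send_request_t *req) { (void)req; return OMPI_NOT_ON_WIRE; }

    static int pml_start_send(send_request_t *req)
    {
        int rc = btl_send(req);

        if (OMPI_SUCCESS == rc) {
            /* BTL is done with the fragment (BTL_HAVE_OWNERSHIP); no
             * callback will follow, so the PML can complete the MPI
             * request immediately. */
            req->complete = 1;
        } else if (OMPI_NOT_ON_WIRE == rc) {
            /* BTL buffered the fragment (e.g. it ran out of credits);
             * completion arrives later through the descriptor callback
             * (BTL_HAVE_OWNERSHIP_AFTER_CALLBACK), so the request stays
             * pending and progress keeps being called. */
        } else {
            return rc;   /* error state */
        }
        return OMPI_SUCCESS;
    }

    int main(void)
    {
        send_request_t req = { 0 };
        int rc = pml_start_send(&req);
        printf("rc=%d complete=%d\n", rc, req.complete);
        return 0;
    }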
>  
> 
> 
>  
> 
> 
> 3) Change the remote callback function (and tag value based on what  
> data we are sending); don't use mca_pml_ob1_recv_frag_callback for  
> everything!
>   I think we need:
> 
>   mca_pml_ob1_recv_frag_match
>   mca_pml_ob1_recv_frag_rndv
>   mca_pml_ob1_recv_frag_rget
>   
>   mca_pml_ob1_recv_match_ack_copy
>   mca_pml_ob1_recv_match_ack_pipeline
>   
>   mca_pml_ob1_recv_copy_frag
>   mca_pml_ob1_recv_put_request
>   mca_pml_ob1_recv_put_fin
> +++
> Passing the callback as a parameter to the match function will save  
> us two switches. Add more registrations in the BTL in order to jump  
> directly to the correct function (the first 3 require a match while  
> the others don't). Split the tag 4 & 4 bits so each layer has 4 bits  
> of tags [i.e. the first 4 bits are for the protocol tag and the lower  
> 4 bits are up to the protocol] and the registration table will still  
> be local to each component.
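[A short standalone sketch of the 4 & 4 tag split described above; the macro
and variable names are invented for illustration, only the 4-bit/4-bit idea
comes from the text.]

    /* Illustrative sketch only. */
    #include <stdint.h>
    #include <stdio.h>

    #define TAG_LAYER_SHIFT 4
    #define TAG_LAYER_MASK  0xF0u   /* upper 4 bits: which layer/protocol */
    #define TAG_PROTO_MASK  0x0Fu   /* lower 4 bits: protocol-private tag */

    #define MAKE_TAG(layer, proto) \
        ((uint8_t)((((layer) << TAG_LAYER_SHIFT) & TAG_LAYER_MASK) | \
                   ((proto) & TAG_PROTO_MASK)))

    int main(void)
    {
        uint8_t tag = MAKE_TAG(1 /* e.g. the PML */, 3 /* e.g. a rndv frag */);
        /* The BTL can dispatch on the full 8-bit tag, while each component
         * keeps its registration table local, indexed by the lower 4 bits. */
        printf("layer=%u proto=%u\n",
               (tag & TAG_LAYER_MASK) >> TAG_LAYER_SHIFT,
               (unsigned)(tag & TAG_PROTO_MASK));
        return 0;
    }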
>  
> 
> 
>  
> 
> 
> 4) Get rid of mca_pml_ob1_recv_request_progress; this does the same  
> switch on hdr->hdr_common.hdr_type as mca_pml_ob1_recv_frag_callback!
>   I think what we can do here is modify mca_pml_ob1_recv_frag_match to  
> take a function pointer for what it should call on a successful match.
>   So based on the receive callback we can pass the correct scheduling  
> function to invoke into the generic mca_pml_ob1_recv_frag_match
> 
> Recv_request progress is called in a generic way from multiple places,  
> and we do a big switch inside. In the match function we might want to  
> pass a function pointer to the successful-match progress function.  
> This way we will be able to specialize what happens after the match,  
> in a more optimized way. Or recv_request_match can return the  
> match and then the caller will have to specialize its action.
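[A minimal sketch of the function-pointer idea in item 4: the post-match
action is passed into the generic match routine, so the big switch on
hdr->hdr_common.hdr_type disappears. All names are invented for illustration.]

    /* Illustrative sketch only. */
    #include <stdio.h>

    typedef struct { int progressed_as; } recv_request_t;
    typedef void (*match_progress_fn)(recv_request_t *req);

    /* Specialized post-match actions, one per receive callback. */
    static void progress_rndv(recv_request_t *req) { req->progressed_as = 1; }
    static void progress_rget(recv_request_t *req) { req->progressed_as = 2; }

    /* Generic match: the caller chooses what runs on a successful match,
     * so no switch on the header type is needed here. */
    static void recv_frag_match(recv_request_t *req, match_progress_fn on_match)
    {
        int matched = 1;     /* pretend the matching logic succeeded */
        if (matched) {
            on_match(req);
        }
    }

    int main(void)
    {
        recv_request_t a = { 0 }, b = { 0 };
        recv_frag_match(&a, progress_rndv);
        recv_frag_match(&b, progress_rget);
        printf("%d %d\n", a.progressed_as, b.progressed_as);
        return 0;
    }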
>  
> 
> 
>  
> 
> 
> 5) Don't initialize the entire request. We can use item 2 below (if  
> we get back OMPI_SUCCESS from btl_send) then we don't need to fully 

Re: [OMPI devel] openib btl header caching

2007-08-13 Thread Richard Graham



On 8/13/07 12:34 PM, "Galen Shipman"  wrote:

> 
 OK, here are the numbers on my machines:
 0 bytes
 mvapich with header caching: 1.56
 mvapich without header caching: 1.79
 ompi 1.2: 1.59
 
 So for zero bytes ompi is not so bad. Also we can see that header
 caching decreases the mvapich latency by 0.23.
 
 1 byte
 mvapich with header caching: 1.58
 mvapich without header caching: 1.83
 ompi 1.2: 1.73
> 
> 
> Is this just convertor initialization cost?

Last night I measured the cost of the convertor initialization in ob1 on my
dual processor mac, using ompi-tests/simple/ping/mpi-ping, and it costs 0.02
to 0.03 microseconds. To be specific, I commented out the check for 0 byte
message size, and the latency went up from about 0.59 usec (this is with
modified code in tmp/latency) to about 0.62 usec.

Rich

> 
> - Galen
> 
 
 And here ompi makes some latency jump.
 
 In mvapich the header caching decreases the header size from
 56 bytes to 12 bytes.
 What is the header size (pml + btl) in ompi?
>>> 
>>> The match header size is 16 bytes, so it looks like ours is already
>>> optimized ...
>> So for a 0 byte message we are sending only 16 bytes on the wire, is that
>> correct?
>> 
>> 
>> Pasha.
>>> 
>>>   george.
>>> 
 
 Pasha
>>> 
>>> 
>> 
> 



Re: [OMPI devel] openib btl header caching

2007-08-13 Thread Jeff Squyres

On Aug 13, 2007, at 11:12 AM, Galen Shipman wrote:


1) remove 0 byte optimization of not initializing the convertor
  This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an
"if" in mca_pml_ob1_send_request_start_copy
+++
Measure the convertor initialization before taking any other action.
------------------------------------------------------------------------


I talked with Galen and then with Pasha; Pasha will look into this.   
Specifically:


- Investigate ob1 and find all the places we're doing 0-byte  
optimizations (I don't think that there are any in the openib btl...?).


- Selectively remove each of the zero-byte optimizations and measure  
what the cost is, both in terms of time and cycles (using the RDTSC  
macro/inline function that's somewhere already in OMPI).  If  
possible, it would be best to measure these individually rather than  
removing all of them and looking at the aggregate.


- Do all of this with and without heterogeneous support enabled to  
measure what the cost of heterogeneity is.
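[A self-contained example of the kind of RDTSC-based cycle measurement
described in the list above; this is generic x86 inline assembly for
illustration, not the macro/inline function that already exists in OMPI.]

    /* Time a code region in CPU cycles on x86/x86-64 (gcc-style asm). */
    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        volatile double x = 0.0;
        uint64_t t0 = rdtsc();
        for (int i = 0; i < 1000; i++) {   /* region under test, e.g. the   */
            x += i;                        /* convertor initialization path */
        }
        uint64_t t1 = rdtsc();
        printf("%llu cycles\n", (unsigned long long)(t1 - t0));
        return 0;
    }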


This will enable us to find out where the time is being spent.   
Clearly, there are some differences between zero- and nonzero-byte  
messages, so it would be a good first step to understand exactly what  
they are.


------------------------------------------------------------------------


2) get rid of mca_pml_ob1_send_request_start_prepare and


This is also all good stuff; let's look into the zero-byte  
optimizations first and then tackle the rest of these after that.


Good?

--
Jeff Squyres
Cisco Systems




Re: [OMPI devel] openib btl header caching

2007-08-13 Thread Pavel Shamis (Pasha)

Brian Barrett wrote:

On Aug 13, 2007, at 9:33 AM, George Bosilca wrote:


On Aug 13, 2007, at 11:28 AM, Pavel Shamis (Pasha) wrote:


Jeff Squyres wrote:

I guess reading the graph that Pasha sent is difficult; Pasha -- can
you send the actual numbers?


OK, here are the numbers on my machines:
0 bytes
mvapich with header caching: 1.56
mvapich without header caching: 1.79
ompi 1.2: 1.59

So for zero bytes ompi is not so bad. Also we can see that header caching
decreases the mvapich latency by 0.23.

1 byte
mvapich with header caching: 1.58
mvapich without header caching: 1.83
ompi 1.2: 1.73

And here ompi makes some latency jump.

In mvapich the header caching decreases the header size from 56 bytes to
12 bytes.
What is the header size (pml + btl) in ompi?


The match header size is 16 bytes, so it looks like ours is already
optimized ...


Pasha -- Is your build of Open MPI built with 
--disable-heterogeneous?  If not, our headers all grow slightly to 
support heterogeneous operations.  For the heterogeneous case, a 1 
byte message includes:
I didn't build with "--disable-heterogeneous", so the heterogeneous 
support was enabled in the build.


  16 bytes for the match header
  4 bytes for the Open IB header
  1 byte for the payload
 
  21 bytes total

If you are using eager RDMA, there's an extra 4 bytes for the RDMA 
length in the footer.  Without heterogeneous support, 2 bytes get 
knocked off the size of the match header, so the whole thing will be 
19 bytes (+ 4 for the eager RDMA footer).
I used eager RDMA - it is faster than send.  So the message size on the 
wire for 1 byte in my case was 25 bytes vs. 13 bytes in mvapich. And if 
I use --disable-heterogeneous it will decrease by 2 bytes. So it sounds 
like we are pretty optimized.




There are also considerably more ifs in the code if heterogeneous is 
used, especially on x86 machines.


Brian





Re: [OMPI devel] openib btl header caching

2007-08-13 Thread Galen Shipman



OK, here are the numbers on my machines:
0 bytes
mvapich with header caching: 1.56
mvapich without header caching: 1.79
ompi 1.2: 1.59

So for zero bytes ompi is not so bad. Also we can see that header
caching decreases the mvapich latency by 0.23.

1 byte
mvapich with header caching: 1.58
mvapich without header caching: 1.83
ompi 1.2: 1.73



Is this just convertor initialization cost?

- Galen



And here ompi makes some latency jump.

In mvapich the header caching decreases the header size from
56 bytes to 12 bytes.
What is the header size (pml + btl) in ompi?


The match header size is 16 bytes, so it looks like ours is already
optimized ...

So for a 0 byte message we are sending only 16 bytes on the wire, is that
correct?


Pasha.


  george.



Pasha









Re: [OMPI devel] openib btl header caching

2007-08-13 Thread Pavel Shamis (Pasha)

George Bosilca wrote:


On Aug 13, 2007, at 11:28 AM, Pavel Shamis (Pasha) wrote:


Jeff Squyres wrote:

I guess reading the graph that Pasha sent is difficult; Pasha -- can
you send the actual numbers?


OK, here are the numbers on my machines:
0 bytes
mvapich with header caching: 1.56
mvapich without header caching: 1.79
ompi 1.2: 1.59

So for zero bytes ompi is not so bad. Also we can see that header caching
decreases the mvapich latency by 0.23.

1 byte
mvapich with header caching: 1.58
mvapich without header caching: 1.83
ompi 1.2: 1.73

And here ompi makes some latency jump.

In mvapich the header caching decreases the header size from 56 bytes to
12 bytes.
What is the header size (pml + btl) in ompi?


The match header size is 16 bytes, so it looks like ours is already 
optimized ...
So for a 0 byte message we are sending only 16 bytes on the wire, is that 
correct?



Pasha.


  george.



Pasha







Re: [OMPI devel] openib btl header caching

2007-08-13 Thread Brian Barrett

On Aug 13, 2007, at 9:33 AM, George Bosilca wrote:


On Aug 13, 2007, at 11:28 AM, Pavel Shamis (Pasha) wrote:


Jeff Squyres wrote:

I guess reading the graph that Pasha sent is difficult; Pasha -- can
you send the actual numbers?


OK, here are the numbers on my machines:
0 bytes
mvapich with header caching: 1.56
mvapich without header caching: 1.79
ompi 1.2: 1.59

So for zero bytes ompi is not so bad. Also we can see that header caching
decreases the mvapich latency by 0.23.

1 byte
mvapich with header caching: 1.58
mvapich without header caching: 1.83
ompi 1.2: 1.73

And here ompi makes some latency jump.

In mvapich the header caching decreases the header size from
56 bytes to 12 bytes.
What is the header size (pml + btl) in ompi?


The match header size is 16 bytes, so it looks like ours is already
optimized ...


Pasha -- Is your build of Open MPI built with  
--disable-heterogeneous?  If not, our headers all grow slightly to support  
heterogeneous operations.  For the heterogeneous case, a 1 byte  
message includes:


  16 bytes for the match header
  4 bytes for the Open IB header
  1 byte for the payload
 
  21 bytes total

If you are using eager RDMA, there's an extra 4 bytes for the RDMA  
length in the footer.  Without heterogeneous support, 2 bytes get  
knocked off the size of the match header, so the whole thing will be  
19 bytes (+ 4 for the eager RDMA footer).
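[The arithmetic above, written out; the sizes are the ones quoted in this
message, with the eager RDMA footer and the heterogeneous savings applied as
described.]

    /* Wire overhead for a 1-byte eager message over the openib BTL,
     * per the numbers quoted in this thread. */
    #include <stdio.h>

    int main(void)
    {
        int match_hdr_hetero = 16;  /* match header, heterogeneous build         */
        int match_hdr_homo   = 14;  /* 2 bytes less with --disable-heterogeneous */
        int openib_hdr       = 4;   /* openib BTL header                         */
        int payload          = 1;
        int rdma_footer      = 4;   /* extra footer when eager RDMA is used      */

        printf("heterogeneous, eager RDMA: %d bytes\n",
               match_hdr_hetero + openib_hdr + payload + rdma_footer);  /* 25 */
        printf("heterogeneous, send:       %d bytes\n",
               match_hdr_hetero + openib_hdr + payload);                /* 21 */
        printf("homogeneous, send:         %d bytes\n",
               match_hdr_homo + openib_hdr + payload);                  /* 19 */
        return 0;
    }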


There are also considerably more ifs in the code if heterogeneous is  
used, especially on x86 machines.


Brian


Re: [OMPI devel] openib btl header caching

2007-08-13 Thread George Bosilca


On Aug 13, 2007, at 11:28 AM, Pavel Shamis (Pasha) wrote:


Jeff Squyres wrote:

I guess reading the graph that Pasha sent is difficult; Pasha -- can
you send the actual numbers?


OK, here are the numbers on my machines:
0 bytes
mvapich with header caching: 1.56
mvapich without header caching: 1.79
ompi 1.2: 1.59

So for zero bytes ompi is not so bad. Also we can see that header caching
decreases the mvapich latency by 0.23.

1 byte
mvapich with header caching: 1.58
mvapich without header caching: 1.83
ompi 1.2: 1.73

And here ompi makes some latency jump.

In mvapich the header caching decreases the header size from 56 bytes to
12 bytes.
What is the header size (pml + btl) in ompi?


The match header size is 16 bytes, so it looks like ours is already  
optimized ...


  george.



Pasha




Re: [OMPI devel] openib btl header caching

2007-08-13 Thread George Bosilca


On Aug 13, 2007, at 11:07 AM, Jeff Squyres wrote:


Such a scheme is certainly possible, but I see even less use for it
than use cases for the existing microbenchmarks.  Specifically,
header caching *can* happen in real applications (i.e., repeatedly
send short messages with the same MPI signature), but repeatedly
sending to the same peer with exactly the same signature *and*
exactly the same "long-enough" data (i.e., more than a small number
of ints that an app could use for its own message data caching) is
indicative of a poorly-written MPI application IMHO.


If you look at the message size distribution for most of the HPC  
applications (at least the ones that get investigated in the papers) you  
will see that very small messages are only an insignificant  
percentage of the messages. As this "optimization" only addresses this  
kind of message, I doubt there is any real benefit from the applications'  
point of view (obviously there will be a few exceptions as usual). The  
header caching only makes sense for very small messages (MVAPICH only  
implements header caching for messages up to 155 bytes [that's less  
than 20 doubles] if I remember well), which makes it a real benchmark  
optimization.





But don't complain if your Linpack run fails.


I assume you're talking about bugs in the implementation; not a
problem with the approach, right?


Of course, there is no apparent problem with my approach :) It is  
called an educated guess based on repetitive human behaviors analysis.


  george.



Re: [OMPI devel] openib btl header caching

2007-08-13 Thread George Bosilca
We're working on it. Give us a few weeks to finish implementing all the  
planned optimizations/cleanups in the PML and then we can talk about  
tricks. We're expecting/hoping to slim down the PML layer by more  
than 0.5us, so this header caching optimization might not make any sense  
at that point.


  Thanks,
george.

On Aug 13, 2007, at 10:38 AM, Jeff Squyres wrote:


On Aug 13, 2007, at 10:34 AM, Jeff Squyres wrote:


All this being said -- is there another reason to lower our latency?
My main goal here is to lower the latency.  If header caching is
unattractive, then another method would be fine.


Oops: s/reason/way/.  That makes my sentence make much more  
sense.  :-)


Re: [OMPI devel] openib btl header caching

2007-08-13 Thread Jeff Squyres

On Aug 13, 2007, at 10:49 AM, George Bosilca wrote:


You want a dirtier trick for benchmarks ... Here it is ...

Implement a compression-like algorithm based on checksums. The data-
type engine can compute a checksum for each fragment and if the
checksum matches one in the peer's [limited] history (so we can claim
our communication protocol is adaptive), then we replace the actual
message content with the matched id in the common history. Checksums
are fairly cheap, lookup in a balanced tree is cheap too, so we will
end up with a lot of improvement (as instead of sending a full
fragment we will end up sending one int). Based on the way most of
the benchmarks initialize the user data (when they don't, everything
is mostly 0), this trick might work in all cases for the
benchmarks ...


Are you sure you didn't want to publish a paper about this before you  
sent it across a public list?  Now someone else is likely to "invent"  
this scheme and get credit for it.  ;-)


Such a scheme is certainly possible, but I see even less use for it  
than use cases for the existing microbenchmarks.  Specifically,  
header caching *can* happen in real applications (i.e., repeatedly  
send short messages with the same MPI signature), but repeatedly  
sending to the same peer with exactly the same signature *and*  
exactly the same "long-enough" data (i.e., more than a small number  
of ints that an app could use for its own message data caching) is  
indicative of a poorly-written MPI application IMHO.



But don't complain if your Linpack run fails.


I assume you're talking about bugs in the implementation; not a  
problem with the approach, right?


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] openib btl header caching

2007-08-13 Thread Christian Bell
On Sun, 12 Aug 2007, Gleb Natapov wrote:

> > Any objections?  We can discuss what approaches we want to take  
> > (there's going to be some complications because of the PML driver,  
> > etc.); perhaps in the Tuesday Mellanox teleconf...?
> > 
> My main objection is that the only reason you propose to do this is some
> bogus benchmark? Is there any other reason to implement header caching?
> I also hope you don't propose to break layering and somehow cache PML headers
> in BTL.

Gleb is hitting the main points I wanted to bring up.  We had
examined this header caching in the context of PSM a little while
ago.  0.5us is much more than we had observed -- at 3GHz, 0.5us would
be about 1500 cycles of code that has very few branches.
For us, with a much bigger header and more fields to fetch from
different structures, it was more like 350 cycles, which is on the
order of 0.1us and not worth the effort (in code complexity,
readability and frankly motivation for performance).  Maybe there's
more to it than just "code caching" -- like sending from pre-pinned
headers or using RDMA with immediate, etc.  But I'd be surprised
to find out that the openib btl doesn't do the best thing here.

I have pretty good evidence that for CM, the latency difference comes
from the receive-side (in particular opal_progress).  Doesn't the
openib btl receive-side do something similar with opal_progress,
i.e. register a callback function?  It probably does something
different like check a few RDMA mailboxes (or per-peer landing pads)
but anything that gets called before or after it as part of
opal_progress is cause for slowdown.

. . christian

-- 
christian.b...@qlogic.com
(QLogic Host Solutions Group, formerly Pathscale)


Re: [OMPI devel] openib btl header caching

2007-08-13 Thread George Bosilca

You want a dirtier trick for benchmarks ... Here it is ...

Implement a compression-like algorithm based on checksums. The data- 
type engine can compute a checksum for each fragment and if the  
checksum matches one in the peer's [limited] history (so we can claim  
our communication protocol is adaptive), then we replace the actual  
message content with the matched id in the common history. Checksums  
are fairly cheap, lookup in a balanced tree is cheap too, so we will  
end up with a lot of improvement (as instead of sending a full  
fragment we will end up sending one int). Based on the way most of  
the benchmarks initialize the user data (when they don't, everything  
is mostly 0), this trick might work in all cases for the  
benchmarks ... But don't complain if your Linpack run fails.
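[For what it's worth, a toy sketch of the "trick" described above: a bounded
checksum history, with the payload replaced by an index when the checksum has
been seen before. Everything here is invented for illustration -- a simple
multiplicative checksum and a linear scan stand in for the data-type engine
checksum and the balanced-tree lookup, and a real implementation would need a
history kept in sync with the receiver.]

    /* Illustrative sketch only. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define HISTORY 16

    static uint32_t history[HISTORY];
    static int hist_count;

    static uint32_t checksum(const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++) sum = sum * 31u + p[i];
        return sum;
    }

    /* Returns the history index if this fragment was seen before (send just
     * the int), or -1 if the full fragment must be sent and remembered. */
    static int lookup_or_insert(const void *frag, size_t len)
    {
        uint32_t sum = checksum(frag, len);
        for (int i = 0; i < hist_count; i++)
            if (history[i] == sum) return i;
        if (hist_count < HISTORY) history[hist_count++] = sum;
        return -1;
    }

    int main(void)
    {
        char zeros[64];
        memset(zeros, 0, sizeof(zeros));
        int first = lookup_or_insert(zeros, sizeof(zeros));  /* -1: not seen */
        int again = lookup_or_insert(zeros, sizeof(zeros));  /*  0: cache hit */
        printf("%d %d\n", first, again);
        return 0;
    }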


  george.

On Aug 13, 2007, at 10:39 AM, Gleb Natapov wrote:


On Mon, Aug 13, 2007 at 10:36:19AM -0400, Jeff Squyres wrote:

In short: it's an even dirtier trick than header caching (for
example), and we'd get beat up about it.


That was a joke :) (But 3D drivers really do such things :( )


Re: [OMPI devel] openib btl header caching

2007-08-13 Thread Jeff Squyres

On Aug 13, 2007, at 10:34 AM, Jeff Squyres wrote:


All this being said -- is there another reason to lower our latency?
My main goal here is to lower the latency.  If header caching is
unattractive, then another method would be fine.


Oops: s/reason/way/.  That makes my sentence make much more sense.  :-)

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] openib btl header caching

2007-08-13 Thread Jeff Squyres

On Aug 13, 2007, at 6:36 AM, Gleb Natapov wrote:


Pallas, Presta (as far as I know) also use a static rank. So let's start to fix
all "bogus" benchmarks :-) ?


All benchmarks are bogus. I have a better optimization: check the name of the
executable and if it is some known benchmark, send one byte instead of the
real message. 3D drivers do this, why can't we?


Because we'd end up in an arms race of benchmark argv[0] name and  
what is hard-coded in Open MPI.  Users/customers/partners would soon  
enough figure out that this is what we're doing and either use "mv"  
or "ln -s" to get around our hack and see the real numbers anyway.


In short: it's an even dirtier trick than header caching (for  
example), and we'd get beat up about it.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] openib btl header caching

2007-08-13 Thread Jeff Squyres

On Aug 12, 2007, at 3:49 PM, Gleb Natapov wrote:


- Mellanox tested MVAPICH with the header caching; latency was around
1.4us
- Mellanox tested MVAPICH without the header caching; latency was
around 1.9us


As far as I remember the Mellanox results, and according to our testing,
the difference between MVAPICH with header caching and OMPI is 0.2-0.3us,
not 0.5us. And MVAPICH without header caching is actually worse than
OMPI for small messages.


I guess reading the graph that Pasha sent is difficult; Pasha -- can  
you send the actual numbers?



Given that OMPI is the lone outlier around 1.9us, I think we have no
choice except to implement the header caching and/or examine our
header to see if we can shrink it.  Mellanox has volunteered to
implement header caching in the openib btl.
I think we have a choice: not implement header caching, but just  
change the osu_latency benchmark to send each message with a different  
tag :)


If only.  :-)

But that misses the point (and the fact that all the common ping-pong  
benchmarks use a single tag: NetPIPE, IMB, osu_latency, etc.).  *All  
other MPI's* give us latency around 1.4us, but Open MPI is around  
1.9us.  So we need to do something.


Are we optimizing for a benchmark?  Yes.  But we have to do it.  Many  
people know that these benchmarks are fairly useless, but not enough  
-- too many customers do not, and education is not enough.  "Sure  
this MPI looks slower but, really, it isn't.  Trust me; my name is  
Joe Isuzu."  That's a hard sell.



I am not against header caching per se, but if it will complicate the code
even a little bit I don't think we should implement it just to benefit one
fabricated benchmark (AFAIR, before header caching was implemented in
MVAPICH, mpi_latency actually sent messages with different tags).


That may be true and a reason for us to wail and gnash our teeth, but  
it doesn't change the current reality.


Also there is really nothing to cache in the openib BTL. The openib BTL  
header is 4 bytes long. The caching will have to be done in OB1 and there  
it will affect every other interconnect.


Surely there is *something* we can do -- what, exactly, is the  
objection to peeking inside the PML header down in the btl?  Is it  
really so horrible for a btl to look inside the upper layer's  
header?  I agree that the PML looking into a btl header would  
[obviously] be Bad.
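[For reference, a toy sketch of what wire-protocol header caching generally
means in this discussion: if the header fields for a peer match the ones sent
previously, send only a small "same as before" marker and let the receiver
reuse its cached copy. All field names and sizes here are invented for
illustration; this is not a proposal for specific Open MPI or MVAPICH code.]

    /* Illustrative sketch only. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct {        /* stand-in for the header fields that repeat   */
        uint16_t tag;       /* between consecutive small sends to one peer  */
        uint16_t ctx;
        uint32_t src;
    } hdr_t;

    static hdr_t last_sent;   /* would be per-peer in a real implementation */
    static int   cache_valid;

    /* Returns the number of header bytes that would go on the wire. */
    static size_t send_header(const hdr_t *hdr)
    {
        if (cache_valid && 0 == memcmp(hdr, &last_sent, sizeof(*hdr))) {
            return 1;                  /* tiny "cached header" marker */
        }
        last_sent = *hdr;
        cache_valid = 1;
        return sizeof(*hdr);           /* full header */
    }

    int main(void)
    {
        hdr_t h = { 7, 0, 3 };
        size_t first  = send_header(&h);   /* full header                   */
        size_t second = send_header(&h);   /* cached: 1-byte marker instead */
        printf("%zu then %zu bytes\n", first, second);
        return 0;
    }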


All this being said -- is there another reason to lower our latency?   
My main goal here is to lower the latency.  If header caching is  
unattractive, then another method would be fine.


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] openib btl header caching

2007-08-13 Thread Terry D. Dontje

Jeff Squyres wrote:

With Mellanox's new HCA (ConnectX), extremely low latencies are  
possible for short messages between two MPI processes.  Currently,  
OMPI's latency is around 1.9us while all other MPI's (HP MPI, Intel  
MPI, MVAPICH[2], etc.) are around 1.4us.  A big reason for this  
difference is that, at least with MVAPICH[2], they are doing wire  
protocol header caching where the openib BTL does not.  Specifically:


- Mellanox tested MVAPICH with the header caching; latency was around  
1.4us
- Mellanox tested MVAPICH without the header caching; latency was  
around 1.9us


Given that OMPI is the lone outlier around 1.9us, I think we have no  
choice except to implement the header caching and/or examine our  
header to see if we can shrink it.  Mellanox has volunteered to  
implement header caching in the openib btl.


Any objections?  We can discuss what approaches we want to take  
(there's going to be some complications because of the PML driver,  
etc.); perhaps in the Tuesday Mellanox teleconf...?
 


This sounds great.  Sun would like to hear how things are being done
so we can possibly port the solution to the udapl btl.

--td