Re: [OMPI devel] openib btl header caching
On Mon, Aug 13, 2007 at 03:59:28PM -0400, Richard Graham wrote:
> On 8/13/07 3:52 PM, "Gleb Natapov" wrote:
> > On Mon, Aug 13, 2007 at 09:12:33AM -0600, Galen Shipman wrote:
> > > Here are the items we have identified:
> > All those things sound very promising. Is there a tmp branch where
> > you are going to work on this?
>
> tmp/latency
>
> Some changes have already gone in - mainly trying to remove as much as
> possible from the isend/send path, before moving on to the list below.
> Do you have cycles to help with this?

I am very interested, not sure about cycles though. I'll get back from
my vacation next week and look over this list one more time to see where
I can help.

> Rich
>
> > > 1) remove 0 byte optimization of not initializing the convertor
> > > This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an
> > > "if" in mca_pml_ob1_send_request_start_copy
> > > +++
> > > Measure the convertor initialization before taking any other
> > > action.
> > >
> > > 2) get rid of mca_pml_ob1_send_request_start_prepare and
> > > mca_pml_ob1_send_request_start_copy by removing the
> > > MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have
> > > btl_send return OMPI_SUCCESS if the fragment can be marked as
> > > completed and OMPI_NOT_ON_WIRE if the fragment cannot be marked as
> > > complete. This solves another problem: with IB, if there are a
> > > bunch of isends outstanding we end up buffering them all in the
> > > btl, marking completion, and never get them on the wire because the
> > > BTL runs out of credits; we never get credits back until finalize
> > > because we never call progress, because the requests are complete.
> > > There is one issue here: start_prepare calls prepare_src and
> > > start_copy calls alloc. I think we can work around this by just
> > > always using prepare_src; the OpenIB BTL will give a fragment off
> > > the free list anyway because the fragment is less than the eager
> > > limit.
> > > +++
> > > Make the BTL return different return codes for the send. If the
> > > fragment is gone, then the PML is responsible for marking the MPI
> > > request as completed and so on. Only the updated BTLs will get any
> > > benefit from this feature. Add a flag into the descriptor to allow
> > > the BTL to free the fragment or not.
> > >
> > > Add a 3-level flag:
> > > - BTL_HAVE_OWNERSHIP: the fragment can be released by the BTL
> > >   after the send, and then it reports back a special return to the
> > >   PML
> > > - BTL_HAVE_OWNERSHIP_AFTER_CALLBACK: the fragment will be released
> > >   by the BTL once the completion callback was triggered.
> > > - PML_HAVE_OWNERSHIP: the BTL is not allowed to release the
> > >   fragment at all (the PML is responsible for this).
> > >
> > > Return codes:
> > > - done and there will be no callbacks
> > > - not done, wait for a callback later
> > > - error state
> > >
> > > 3) Change the remote callback function (and tag value based on
> > > what data we are sending); don't use
> > > mca_pml_ob1_recv_frag_callback for everything! I think we need:
> > >
> > > mca_pml_ob1_recv_frag_match
> > > mca_pml_ob1_recv_frag_rndv
> > > mca_pml_ob1_recv_frag_rget
> > >
> > > mca_pml_ob1_recv_match_ack_copy
> > > mca_pml_ob1_recv_match_ack_pipeline
> > >
> > > mca_pml_ob1_recv_copy_frag
> > > mca_pml_ob1_recv_put_request
> > > mca_pml_ob1_recv_put_fin
> > > +++
> > > Passing the callback as a parameter to the match function will
> > > save us 2 switches. Add more registrations in the BTL in order to
> > > jump directly to the correct function (the first 3 require a match
> > > while the others don't). 4 & 4 bits on the tag, so each layer will
> > > have 4 bits of tags [i.e. the first 4 bits for the protocol tag
> > > and the lower 4 bits are up to the protocol], and the registration
> > > table will still be local to each component.
> > >
> > > 4) Get rid of mca_pml_ob1_recv_request_progress; this does the
> > > same switch on hdr->hdr_common.hdr_type that
> > > mca_pml_ob1_recv_frag_callback does! I think what we can do here
> > > is modify mca_pml_ob1_recv_frag_match to take a function pointer
> > > for what it should call on a successful match. So based on the
> > > receive
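The btl_send completion contract proposed in item 2 above might look
roughly like this in C. Aside from the flag and return-code names quoted
in the thread, every type and function here is hypothetical; this is a
sketch of the proposal, not the actual OMPI API.

```c
#include <assert.h>

/* Proposed btl_send() outcomes, mirroring "done and there will be no
   callbacks / not done, wait for a callback later / error state". */
enum btl_send_status {
    BTL_SEND_DONE,        /* fragment gone, no completion callback will fire */
    BTL_SEND_IN_PROGRESS, /* completion will be signalled via a callback */
    BTL_SEND_ERROR
};

/* Proposed three-level ownership flag carried in the descriptor. */
enum frag_ownership {
    BTL_HAVE_OWNERSHIP,                /* BTL may free after the send */
    BTL_HAVE_OWNERSHIP_AFTER_CALLBACK, /* BTL frees after the callback runs */
    PML_HAVE_OWNERSHIP                 /* BTL must never free; PML owns it */
};

struct frag {
    enum frag_ownership ownership;
    int completed;
};

/* PML-side handling: mark the MPI request complete only when the BTL
   reports the fragment is done and no callback is coming. */
int pml_handle_send(struct frag *f, enum btl_send_status rc)
{
    if (rc == BTL_SEND_DONE) {
        f->completed = 1;   /* complete immediately, no callback expected */
        return 1;
    }
    if (rc == BTL_SEND_IN_PROGRESS)
        return 0;           /* completion deferred to the BTL callback */
    return -1;              /* error path */
}
```

The point of the three-way split is that only the PML decides when the
MPI request completes, while the BTL independently decides when it may
reuse the fragment.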
Re: [OMPI devel] openib btl header caching
On Mon, Aug 13, 2007 at 09:12:33AM -0600, Galen Shipman wrote:
> Here are the items we have identified:

All those things sound very promising. Is there a tmp branch where you
are going to work on this?

> 1) remove 0 byte optimization of not initializing the convertor
> This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an
> "if" in mca_pml_ob1_send_request_start_copy
> +++
> Measure the convertor initialization before taking any other action.
>
> 2) get rid of mca_pml_ob1_send_request_start_prepare and
> mca_pml_ob1_send_request_start_copy by removing the
> MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send
> return OMPI_SUCCESS if the fragment can be marked as completed and
> OMPI_NOT_ON_WIRE if the fragment cannot be marked as complete. This
> solves another problem: with IB, if there are a bunch of isends
> outstanding we end up buffering them all in the btl, marking
> completion, and never get them on the wire because the BTL runs out of
> credits; we never get credits back until finalize because we never
> call progress, because the requests are complete. There is one issue
> here: start_prepare calls prepare_src and start_copy calls alloc. I
> think we can work around this by just always using prepare_src; the
> OpenIB BTL will give a fragment off the free list anyway because the
> fragment is less than the eager limit.
> +++
> Make the BTL return different return codes for the send. If the
> fragment is gone, then the PML is responsible for marking the MPI
> request as completed and so on. Only the updated BTLs will get any
> benefit from this feature. Add a flag into the descriptor to allow the
> BTL to free the fragment or not.
>
> Add a 3-level flag:
> - BTL_HAVE_OWNERSHIP: the fragment can be released by the BTL after
>   the send, and then it reports back a special return to the PML
> - BTL_HAVE_OWNERSHIP_AFTER_CALLBACK: the fragment will be released by
>   the BTL once the completion callback was triggered.
> - PML_HAVE_OWNERSHIP: the BTL is not allowed to release the fragment
>   at all (the PML is responsible for this).
>
> Return codes:
> - done and there will be no callbacks
> - not done, wait for a callback later
> - error state
>
> 3) Change the remote callback function (and tag value based on what
> data we are sending); don't use mca_pml_ob1_recv_frag_callback for
> everything! I think we need:
>
> mca_pml_ob1_recv_frag_match
> mca_pml_ob1_recv_frag_rndv
> mca_pml_ob1_recv_frag_rget
>
> mca_pml_ob1_recv_match_ack_copy
> mca_pml_ob1_recv_match_ack_pipeline
>
> mca_pml_ob1_recv_copy_frag
> mca_pml_ob1_recv_put_request
> mca_pml_ob1_recv_put_fin
> +++
> Passing the callback as a parameter to the match function will save us
> 2 switches. Add more registrations in the BTL in order to jump
> directly to the correct function (the first 3 require a match while
> the others don't). 4 & 4 bits on the tag, so each layer will have 4
> bits of tags [i.e. the first 4 bits for the protocol tag and the lower
> 4 bits are up to the protocol], and the registration table will still
> be local to each component.
>
> 4) Get rid of mca_pml_ob1_recv_request_progress; this does the same
> switch on hdr->hdr_common.hdr_type that mca_pml_ob1_recv_frag_callback
> does! I think what we can do here is modify
> mca_pml_ob1_recv_frag_match to take a function pointer for what it
> should call on a successful match. So based on the receive callback we
> can pass the correct scheduling function to invoke into the generic
> mca_pml_ob1_recv_frag_match.
>
> Recv_request progress is called in a generic way from multiple places,
> and we do a big switch inside. In the match function we might want to
> pass a function pointer to the successful match progress function.
> This way we will be able to specialize what happens after the match in
> a more optimized way. Or the recv_request_match can return the match
> and then the caller will have to specialize its action.
>
> 5) Don't initialize the entire request. We can use item 2 below (if
> we get back OMPI_SUCCESS from btl_send) then we don't need to fully
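The "4 & 4 bits on the tag" idea from item 3 is just bit packing: the
high nibble selects the protocol/component and the low nibble is private
to it, so each component can keep a local registration table of 16
entries. A minimal sketch (macro and function names are invented for
illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 4+4 split of an 8-bit BTL tag: high 4 bits name the
   protocol (component), low 4 bits are up to that protocol. */
#define TAG_PROTO_SHIFT 4
#define TAG_LOCAL_MASK  0x0F

static inline uint8_t tag_make(uint8_t proto, uint8_t local)
{
    return (uint8_t)((proto << TAG_PROTO_SHIFT) | (local & TAG_LOCAL_MASK));
}

/* Receive side: dispatch on the protocol nibble, then let the component
   index its own 16-entry callback table with the local nibble. */
static inline uint8_t tag_proto(uint8_t tag) { return (uint8_t)(tag >> TAG_PROTO_SHIFT); }
static inline uint8_t tag_local(uint8_t tag) { return (uint8_t)(tag & TAG_LOCAL_MASK); }
```

With this split, registering more callbacks in the BTL (as item 3
proposes) amounts to filling in more slots of the per-protocol table
rather than growing a global switch.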
Re: [OMPI devel] openib btl header caching
On 8/13/07 12:34 PM, "Galen Shipman" wrote:
> > Ok here is the numbers on my machines:
> > 0 bytes
> > mvapich with header caching: 1.56
> > mvapich without header caching: 1.79
> > ompi 1.2: 1.59
> > So on zero bytes ompi is not so bad. Also we can see that header
> > caching decreases the mvapich latency by 0.23.
> > 1 byte
> > mvapich with header caching: 1.58
> > mvapich without header caching: 1.83
> > ompi 1.2: 1.73
>
> Is this just convertor initialization cost?

Last night I measured the cost of the convertor initialization in ob1
on my dual processor mac, using ompi-tests/simple/ping/mpi-ping, and it
costs 0.02 to 0.03 microseconds. To be specific, I commented out the
check for 0 byte message size, and the latency went up from about 0.59
usec (this is with modified code in tmp/latency) to about 0.62 usec.

Rich

> - Galen
>
> > And here ompi makes some latency jump. In mvapich the header caching
> > decreases the header size from 56 bytes to 12 bytes. What is the
> > header size (pml + btl) in ompi?
> > > The match header size is 16 bytes, so it looks like ours is
> > > already optimized ...
> > So for a 0 byte message we are sending only 16 bytes on the wire, is
> > it correct?
> >
> > Pasha.
> > > george.

___ devel mailing list de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 11:12 AM, Galen Shipman wrote:
> 1) remove 0 byte optimization of not initializing the convertor
> This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an
> "if" in mca_pml_ob1_send_request_start_copy
> +++
> Measure the convertor initialization before taking any other action.

I talked with Galen and then with Pasha; Pasha will look into this.
Specifically:

- Investigate ob1 and find all the places we're doing 0-byte
  optimizations (I don't think that there are any in the openib
  btl...?).
- Selectively remove each of the zero-byte optimizations and measure
  what the cost is, both in terms of time and cycles (using the RDTSC
  macro/inline function that's somewhere already in OMPI). If possible,
  it would be best to measure these individually rather than removing
  all of them and looking at the aggregate.
- Do all of this with and without heterogeneous support enabled to
  measure what the cost of heterogeneity is.

This will enable us to find out where the time is being spent. Clearly,
there are some differences between zero and nonzero byte messages, so
it would be a good first step to understand exactly what they are.

> 2) get rid of mca_pml_ob1_send_request_start_prepare and

This is also all good stuff; let's look into the zero-byte
optimizations first and then tackle the rest of these after that. Good?

-- 
Jeff Squyres
Cisco Systems
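Measuring a code path "in terms of time and cycles" with RDTSC, as
suggested above, could look roughly like this. This standalone `tick()`
wrapper is only an illustration, not the macro that ships in OMPI, and
the non-x86 fallback to clock_gettime is my addition:

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read a monotonically increasing counter: the TSC on x86, or a
   nanosecond clock elsewhere. Units differ, but deltas are comparable
   within one run on one machine. */
static inline uint64_t tick(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Delta between two timestamps (counter wrap handling omitted). */
static uint64_t ticks_elapsed(uint64_t start, uint64_t end)
{
    return end - start;
}
```

To isolate one zero-byte optimization, you would bracket just the code
it guards (e.g. the convertor init) with `tick()` calls over many
iterations and compare the per-iteration averages with the `if` present
and removed.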
Re: [OMPI devel] openib btl header caching
Brian Barrett wrote:
> On Aug 13, 2007, at 9:33 AM, George Bosilca wrote:
> > On Aug 13, 2007, at 11:28 AM, Pavel Shamis (Pasha) wrote:
> > > Jeff Squyres wrote:
> > > > I guess reading the graph that Pasha sent is difficult; Pasha --
> > > > can you send the actual numbers?
> > > Ok here is the numbers on my machines:
> > > 0 bytes
> > > mvapich with header caching: 1.56
> > > mvapich without header caching: 1.79
> > > ompi 1.2: 1.59
> > > So on zero bytes ompi not so bad. Also we can see that header
> > > caching decrease the mvapich latency on 0.23
> > > 1 bytes
> > > mvapich with header caching: 1.58
> > > mvapich without header caching: 1.83
> > > ompi 1.2: 1.73
> > > And here ompi make some latency jump. In mvapich the header
> > > caching decrease the header size from 56bytes to 12bytes. What is
> > > the header size (pml + btl) in ompi ?
> > The match header size is 16 bytes, so it looks like ours is already
> > optimized ...
> Pasha -- Is your build of Open MPI built with --disable-heterogeneous?
> If not, our headers all grow slightly to support heterogeneous
> operations. For the heterogeneous case, a 1 byte message includes:

I didn't build with "--disable-heterogeneous", so the heterogeneous
support was enabled in the build.

> 16 bytes for the match header
> 4 bytes for the Open IB header
> 1 byte for the payload
> 21 bytes total
>
> If you are using eager RDMA, there's an extra 4 bytes for the RDMA
> length in the footer. Without heterogeneous support, 2 bytes get
> knocked off the size of the match header, so the whole thing will be
> 19 bytes (+ 4 for the eager RDMA footer).

I used eager RDMA - it is faster than send. So the message size on the
wire for 1 byte in my case was 25 bytes vs. 13 bytes in mvapich. And if
I use --disable-heterogeneous it will decrease by 2 bytes. So it sounds
like we are pretty optimized.

> There are also considerably more ifs in the code if heterogeneous is
> used, especially on x86 machines.
>
> Brian
Re: [OMPI devel] openib btl header caching
> Ok here is the numbers on my machines:
> 0 bytes
> mvapich with header caching: 1.56
> mvapich without header caching: 1.79
> ompi 1.2: 1.59
> So on zero bytes ompi not so bad. Also we can see that header caching
> decrease the mvapich latency on 0.23
> 1 bytes
> mvapich with header caching: 1.58
> mvapich without header caching: 1.83
> ompi 1.2: 1.73

Is this just convertor initialization cost?

- Galen

> And here ompi make some latency jump. In mvapich the header caching
> decrease the header size from 56bytes to 12bytes. What is the header
> size (pml + btl) in ompi ?
> > The match header size is 16 bytes, so it looks like ours is already
> > optimized ...
> So for 0 bytes message we are sending only 16bytes on the wire , is it
> correct ?
>
> Pasha.
> > george.
Re: [OMPI devel] openib btl header caching
George Bosilca wrote:
> On Aug 13, 2007, at 11:28 AM, Pavel Shamis (Pasha) wrote:
> > Jeff Squyres wrote:
> > > I guess reading the graph that Pasha sent is difficult; Pasha --
> > > can you send the actual numbers?
> > Ok here is the numbers on my machines:
> > 0 bytes
> > mvapich with header caching: 1.56
> > mvapich without header caching: 1.79
> > ompi 1.2: 1.59
> > So on zero bytes ompi not so bad. Also we can see that header
> > caching decrease the mvapich latency on 0.23
> > 1 bytes
> > mvapich with header caching: 1.58
> > mvapich without header caching: 1.83
> > ompi 1.2: 1.73
> > And here ompi make some latency jump. In mvapich the header caching
> > decrease the header size from 56bytes to 12bytes. What is the header
> > size (pml + btl) in ompi ?
> The match header size is 16 bytes, so it looks like ours is already
> optimized ...

So for 0 bytes message we are sending only 16 bytes on the wire, is it
correct?

Pasha.

> george.
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 9:33 AM, George Bosilca wrote:
> On Aug 13, 2007, at 11:28 AM, Pavel Shamis (Pasha) wrote:
> > Jeff Squyres wrote:
> > > I guess reading the graph that Pasha sent is difficult; Pasha --
> > > can you send the actual numbers?
> > Ok here is the numbers on my machines:
> > 0 bytes
> > mvapich with header caching: 1.56
> > mvapich without header caching: 1.79
> > ompi 1.2: 1.59
> > So on zero bytes ompi not so bad. Also we can see that header
> > caching decrease the mvapich latency on 0.23
> > 1 bytes
> > mvapich with header caching: 1.58
> > mvapich without header caching: 1.83
> > ompi 1.2: 1.73
> > And here ompi make some latency jump. In mvapich the header caching
> > decrease the header size from 56bytes to 12bytes. What is the header
> > size (pml + btl) in ompi ?
> The match header size is 16 bytes, so it looks like ours is already
> optimized ...

Pasha -- Is your build of Open MPI built with --disable-heterogeneous?
If not, our headers all grow slightly to support heterogeneous
operations. For the heterogeneous case, a 1 byte message includes:

  16 bytes for the match header
   4 bytes for the Open IB header
   1 byte for the payload
  21 bytes total

If you are using eager RDMA, there's an extra 4 bytes for the RDMA
length in the footer. Without heterogeneous support, 2 bytes get
knocked off the size of the match header, so the whole thing will be
19 bytes (+ 4 for the eager RDMA footer).

There are also considerably more ifs in the code if heterogeneous is
used, especially on x86 machines.

Brian
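Brian's byte accounting can be captured in a tiny helper. The constants
below are taken from his breakdown in this thread (not read out of the
actual ob1/openib header definitions), and the function name is mine:

```c
#include <assert.h>

/* Wire overhead for an eager openib message, per the thread:
   16-byte match header (14 with --disable-heterogeneous), a 4-byte
   Open IB header, and an extra 4-byte footer when eager RDMA is used. */
#define MATCH_HDR_HETERO   16
#define MATCH_HDR_HOMO     14
#define OPENIB_HDR          4
#define EAGER_RDMA_FOOTER   4

static int wire_bytes(int payload, int heterogeneous, int eager_rdma)
{
    int n = (heterogeneous ? MATCH_HDR_HETERO : MATCH_HDR_HOMO)
            + OPENIB_HDR + payload;
    if (eager_rdma)
        n += EAGER_RDMA_FOOTER;   /* RDMA length word in the footer */
    return n;
}
```

This reproduces the numbers in the exchange: 21 bytes for a 1-byte
heterogeneous send-path message, 19 without heterogeneous support, and
25 for Pasha's heterogeneous eager-RDMA case.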
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 11:28 AM, Pavel Shamis (Pasha) wrote:
> Jeff Squyres wrote:
> > I guess reading the graph that Pasha sent is difficult; Pasha -- can
> > you send the actual numbers?
> Ok here is the numbers on my machines:
> 0 bytes
> mvapich with header caching: 1.56
> mvapich without header caching: 1.79
> ompi 1.2: 1.59
> So on zero bytes ompi not so bad. Also we can see that header caching
> decrease the mvapich latency on 0.23
> 1 bytes
> mvapich with header caching: 1.58
> mvapich without header caching: 1.83
> ompi 1.2: 1.73
> And here ompi make some latency jump. In mvapich the header caching
> decrease the header size from 56bytes to 12bytes. What is the header
> size (pml + btl) in ompi ?

The match header size is 16 bytes, so it looks like ours is already
optimized ...

george.

> Pasha
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 11:07 AM, Jeff Squyres wrote:
> Such a scheme is certainly possible, but I see even less use for it
> than use cases for the existing microbenchmarks. Specifically, header
> caching *can* happen in real applications (i.e., repeatedly send
> short messages with the same MPI signature), but repeatedly sending
> to the same peer with exactly the same signature *and* exactly the
> same "long-enough" data (i.e., more than a small number of ints that
> an app could use for its own message data caching) is indicative of a
> poorly-written MPI application IMHO.

If you look at the message size distribution for most HPC applications
(at least the ones that get investigated in the papers), you will see
that very small messages are only an insignificant percentage of
messages. As this "optimization" only addresses that kind of message, I
doubt there is any real benefit from the application's point of view
(obviously there will be a few exceptions, as usual). The header
caching only makes sense for very small messages (MVAPICH only
implements header caching for messages up to 155 bytes [that's less
than 20 doubles], if I remember well), which makes it a real benchmark
optimization.

> > But don't complain if your Linpack run fails.
> I assume you're talking about bugs in the implementation; not a
> problem with the approach, right?

Of course, there is no apparent problem with my approach :) It is
called an educated guess based on repetitive human behaviors analysis.

george.
Re: [OMPI devel] openib btl header caching
We're working on it. Give us a few weeks to finish implementing all the
planned optimizations/cleanups in the PML and then we can talk about
tricks. We're expecting/hoping to slim down the PML layer by more than
0.5, so this header caching optimization might not make any sense at
that point.

Thanks,
george.

On Aug 13, 2007, at 10:38 AM, Jeff Squyres wrote:
> On Aug 13, 2007, at 10:34 AM, Jeff Squyres wrote:
> > All this being said -- is there another reason to lower our
> > latency? My main goal here is to lower the latency. If header
> > caching is unattractive, then another method would be fine.
>
> Oops: s/reason/way/. That makes my sentence make much more sense. :-)
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 10:49 AM, George Bosilca wrote:
> You want a dirtier trick for benchmarks ... Here it is ... Implement
> a compression-like algorithm based on checksums. The data-type engine
> can compute a checksum for each fragment and if the checksum matches
> one in the peer's [limited] history (so we can claim our
> communication protocol is adaptive), then we replace the actual
> message content with the matched id in the common history. Checksums
> are fairly cheap, lookup in a balanced tree is cheap too, so we will
> end up with a lot of improvement (as instead of sending a full
> fragment we will end up sending one int). Based on the way most of
> the benchmarks initialize the user data (when they don't, everything
> is mostly 0), this trick might work in all cases for the benchmarks
> ...

Are you sure you didn't want to publish a paper about this before you
sent it across a public list? Now someone else is likely to "invent"
this scheme and get credit for it. ;-)

Such a scheme is certainly possible, but I see even less use for it
than use cases for the existing microbenchmarks. Specifically, header
caching *can* happen in real applications (i.e., repeatedly send short
messages with the same MPI signature), but repeatedly sending to the
same peer with exactly the same signature *and* exactly the same
"long-enough" data (i.e., more than a small number of ints that an app
could use for its own message data caching) is indicative of a
poorly-written MPI application IMHO.

> But don't complain if your Linpack run fails.

I assume you're talking about bugs in the implementation; not a
problem with the approach, right?

-- 
Jeff Squyres
Cisco Systems
Re: [OMPI devel] openib btl header caching
On Sun, 12 Aug 2007, Gleb Natapov wrote:
> > Any objections? We can discuss what approaches we want to take
> > (there's going to be some complications because of the PML driver,
> > etc.); perhaps in the Tuesday Mellanox teleconf...?
>
> My main objection is that the only reason you propose to do this is
> some bogus benchmark? Is there any other reason to implement header
> caching? I also hope you don't propose to break layering and somehow
> cache PML headers in BTL.

Gleb is hitting the main points I wanted to bring up. We had examined
this header caching in the context of PSM a little while ago. 0.5us is
much more than we had observed -- at 3GHz, 0.5us would be about 1500
cycles of code that has little amounts of branches. For us, with a much
bigger header and more fields to fetch from different structures, it
was more like 350 cycles, which is on the order of 0.1us and not worth
the effort (in code complexity, readability and frankly motivation for
performance).

Maybe there's more to it than just "code caching" -- like sending from
pre-pinned headers or using RDMA with immediate, etc. But I'd be
surprised to find out that the openib btl doesn't do the best thing
here. I have pretty good evidence that for CM, the latency difference
comes from the receive side (in particular opal_progress). Doesn't the
openib btl receive side do something similar with opal_progress, i.e.
register a callback function? It probably does something different like
check a few RDMA mailboxes (or per-peer landing pads), but anything
that gets called before or after it as part of opal_progress is cause
for slowdown.

    . . christian

-- 
christian.b...@qlogic.com
(QLogic Host Solutions Group, formerly Pathscale)
Re: [OMPI devel] openib btl header caching
You want a dirtier trick for benchmarks ... Here it is ... Implement a
compression-like algorithm based on checksums. The data-type engine can
compute a checksum for each fragment and if the checksum matches one in
the peer's [limited] history (so we can claim our communication
protocol is adaptive), then we replace the actual message content with
the matched id in the common history. Checksums are fairly cheap,
lookup in a balanced tree is cheap too, so we will end up with a lot of
improvement (as instead of sending a full fragment we will end up
sending one int). Based on the way most of the benchmarks initialize
the user data (when they don't, everything is mostly 0), this trick
might work in all cases for the benchmarks ... But don't complain if
your Linpack run fails.

george.

On Aug 13, 2007, at 10:39 AM, Gleb Natapov wrote:
> On Mon, Aug 13, 2007 at 10:36:19AM -0400, Jeff Squyres wrote:
> > In short: it's an even dirtier trick than header caching (for
> > example), and we'd get beat up about it.
>
> That was a joke :) (But 3D drivers really do such things :( )
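For the curious, George's tongue-in-cheek scheme would look something
like this toy sketch: a flat checksum history instead of the balanced
tree he mentions, and FNV-1a standing in for whatever checksum the
datatype engine would compute. Treating a checksum match as payload
equality is exactly the collision hazard that would make a real Linpack
run fail:

```c
#include <stdint.h>
#include <stddef.h>

#define HISTORY 8

static uint32_t checksums[HISTORY];
static int nhist;

/* FNV-1a: a cheap stand-in checksum for the sketch. */
static uint32_t cksum(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++)
        h = (h ^ p[i]) * 16777619u;
    return h;
}

/* Returns the history index to send instead of the data (cache hit),
   or -1 on a miss, in which case the full fragment would go on the
   wire and its checksum is recorded for next time. */
static int history_lookup(const void *buf, size_t len)
{
    uint32_t h = cksum(buf, len);
    for (int i = 0; i < nhist; i++)
        if (checksums[i] == h)
            return i;           /* "compression": send one int */
    if (nhist < HISTORY)
        checksums[nhist++] = h;
    return -1;                  /* miss: send the real fragment */
}
```

Since most benchmarks send the same (often all-zero) buffer over and
over, every send after the first would hit the history, which is the
whole joke.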
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 10:34 AM, Jeff Squyres wrote:
> All this being said -- is there another reason to lower our latency?
> My main goal here is to lower the latency. If header caching is
> unattractive, then another method would be fine.

Oops: s/reason/way/. That makes my sentence make much more sense. :-)

-- 
Jeff Squyres
Cisco Systems
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 6:36 AM, Gleb Natapov wrote:
> > Pallas, Presta (as I know) also use a static rank. So let's start
> > to fix all "bogus" benchmarks :-) ?
> All benchmarks are bogus. I have a better optimization. Check the
> name of the executable and if it is some known benchmark, send one
> byte instead of the real message. 3D drivers do this, why can't we?

Because we'd end up in an arms race of benchmark argv[0] names and what
is hard-coded in Open MPI. Users/customers/partners would soon enough
figure out that this is what we're doing and either use "mv" or "ln -s"
to get around our hack and see the real numbers anyway.

In short: it's an even dirtier trick than header caching (for example),
and we'd get beat up about it.

-- 
Jeff Squyres
Cisco Systems
Re: [OMPI devel] openib btl header caching
On Aug 12, 2007, at 3:49 PM, Gleb Natapov wrote:
> > - Mellanox tested MVAPICH with the header caching; latency was
> >   around 1.4us
> > - Mellanox tested MVAPICH without the header caching; latency was
> >   around 1.9us
> As far as I remember Mellanox results, and according to our testing,
> the difference between MVAPICH with header caching and OMPI is
> 0.2-0.3us, not 0.5us. And MVAPICH without header caching is actually
> worse than OMPI for small messages.

I guess reading the graph that Pasha sent is difficult; Pasha -- can
you send the actual numbers?

> > Given that OMPI is the lone outlier around 1.9us, I think we have
> > no choice except to implement the header caching and/or examine our
> > header to see if we can shrink it. Mellanox has volunteered to
> > implement header caching in the openib btl.
> I think we have a choice. Not implement header caching, but just
> change the osu_latency benchmark to send each message with a
> different tag :)

If only. :-) But that misses the point (and the fact that all the
common ping-pong benchmarks use a single tag: NetPIPE, IMB,
osu_latency, etc.). *All other MPI's* give us latency around 1.4us,
but Open MPI is around 1.9us. So we need to do something. Are we
optimizing for a benchmark? Yes. But we have to do it. Many people
know that these benchmarks are fairly useless, but not enough -- too
many customers do not, and education is not enough. "Sure this MPI
looks slower but, really, it isn't. Trust me; my name is Joe Isuzu."
That's a hard sell.

> I am not against header caching per se, but if it will complicate the
> code even a little bit I don't think we should implement it just to
> benefit one fabricated benchmark (AFAIR before header caching was
> implemented in MVAPICH, mpi_latency actually sent messages with
> different tags).

That may be true and a reason for us to wail and gnash our teeth, but
it doesn't change the current reality.

> Also there is really nothing to cache in the openib BTL. The openib
> BTL header is 4 bytes long. The caching will have to be done in OB1
> and there it will affect every other interconnect.

Surely there is *something* we can do -- what, exactly, is the
objection to peeking inside the PML header down in the btl? Is it
really so horrible for a btl to look inside the upper layer's header?
I agree that the PML looking into a btl header would [obviously] be
Bad.

All this being said -- is there another reason to lower our latency?
My main goal here is to lower the latency. If header caching is
unattractive, then another method would be fine.

-- 
Jeff Squyres
Cisco Systems
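For reference, the MVAPICH-style header caching being debated can be
sketched as follows. The struct layouts and field names are invented
for illustration (ob1's real match header is 16 bytes, and MVAPICH's
actual cache keys differ); the point is only the cache-hit test: if
everything but the sequence number matches the last header sent to this
peer, a much smaller header goes on the wire.

```c
#include <stdint.h>

/* Invented, simplified header layouts for the sketch. */
struct match_hdr  { uint16_t ctx, src, tag, seq; }; /* "full" header    */
struct cached_hdr { uint16_t seq; };                /* sent on a hit    */

struct peer_cache { struct match_hdr last; int valid; };

/* Returns 1 if the small cached header can be sent (cache hit), else
   0, meaning the full header must be sent and the cache is refreshed. */
static int header_cache_try(struct peer_cache *pc, const struct match_hdr *h,
                            struct cached_hdr *out)
{
    if (pc->valid &&
        pc->last.ctx == h->ctx &&
        pc->last.src == h->src &&
        pc->last.tag == h->tag) {
        out->seq = h->seq;       /* only the changing field goes out */
        pc->last.seq = h->seq;
        return 1;
    }
    pc->last  = *h;              /* miss: full header on the wire */
    pc->valid = 1;
    return 0;
}
```

This also makes the thread's two objections concrete: a same-tag
ping-pong hits the cache on every iteration after the first (the
benchmark win), and the cached fields are PML-level fields, so doing
this in the openib btl means the btl peeking into the PML header.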
Re: [OMPI devel] openib btl header caching
Jeff Squyres wrote:
> With Mellanox's new HCA (ConnectX), extremely low latencies are
> possible for short messages between two MPI processes. Currently,
> OMPI's latency is around 1.9us while all other MPI's (HP MPI, Intel
> MPI, MVAPICH[2], etc.) are around 1.4us. A big reason for this
> difference is that, at least with MVAPICH[2], they are doing wire
> protocol header caching where the openib BTL does not. Specifically:
>
> - Mellanox tested MVAPICH with the header caching; latency was
>   around 1.4us
> - Mellanox tested MVAPICH without the header caching; latency was
>   around 1.9us
>
> Given that OMPI is the lone outlier around 1.9us, I think we have no
> choice except to implement the header caching and/or examine our
> header to see if we can shrink it. Mellanox has volunteered to
> implement header caching in the openib btl.
>
> Any objections? We can discuss what approaches we want to take
> (there's going to be some complications because of the PML driver,
> etc.); perhaps in the Tuesday Mellanox teleconf...?

This sounds great. Sun would like to hear how things are being done so
we can possibly port the solution to the udapl btl.

--td