I think we need to take a step back from micro-optimizations such as
header caching.
Rich, George, Brian, and I are currently looking into latency
improvements. We have come up with several areas of performance
enhancement that can be addressed with minimal disruption. The
progress issue that Christian and others have pointed out does appear
to be a problem, but it will take a bit more work. I would like to see
progress in these areas first, as I really don't like the idea of
caching more endpoint state in OMPI for micro-benchmark latency
improvements until we are certain we have done the groundwork for
improving latency in the general case.
Here are the items we have identified:
----------------------------------------------------------------------------
1) Remove the 0-byte optimization of not initializing the convertor.
This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and another
"if" in mca_pml_ob1_send_request_start_copy.
+++
Measure the cost of the convertor initialization before taking any
other action.
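For reference, a minimal sketch of the kind of branch being discussed;
the types and function names below are placeholders, not the real OMPI
structures or macros:

/* Minimal sketch only: placeholder types stand in for the real OMPI
 * request and convertor structures; the point is the per-send branch. */
#include <stddef.h>

typedef struct { int prepared; } convertor_t;               /* placeholder */
typedef struct { convertor_t convertor; } send_request_t;   /* placeholder */

static void convertor_prepare(convertor_t *c, const void *buf, size_t count)
{
    (void)buf; (void)count;
    c->prepared = 1;
}

/* Current behavior described above: skip the convertor setup for 0-byte
 * messages, paying an "if" on every send. */
static void request_init_with_branch(send_request_t *req,
                                     const void *buf, size_t count)
{
    if (count > 0) {                 /* the "if" this item wants to remove */
        convertor_prepare(&req->convertor, buf, count);
    }
}

/* Proposed direction: initialize unconditionally, then measure whether
 * the branch was ever worth it. */
static void request_init_always(send_request_t *req,
                                const void *buf, size_t count)
{
    convertor_prepare(&req->convertor, buf, count);
}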
----------------------------------------------------------------------------
2) Get rid of mca_pml_ob1_send_request_start_prepare and
mca_pml_ob1_send_request_start_copy by removing the
MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send
return OMPI_SUCCESS if the fragment can be marked as completed and
OMPI_NOT_ON_WIRE if it cannot. This solves another problem: with IB,
if there are a bunch of isends outstanding we end up buffering them
all in the BTL and marking completion, but they never get on the wire
because the BTL runs out of credits, and we never get credits back
until finalize because we never call progress, since the requests are
already complete. There is one issue here: start_prepare calls
prepare_src while start_copy calls alloc. I think we can work around
this by always using prepare_src; the OpenIB BTL will give us a
fragment off the free list anyway because the fragment is smaller than
the eager limit.
+++
Make the BTL return different return codes from the send. If the
fragment is gone, then the PML is responsible for marking the MPI
request as completed, and so on. Only the updated BTLs will get any
benefit from this feature. Add a flag to the descriptor that controls
whether the BTL may free the fragment.
Add a 3-level flag (see the sketch below):
- BTL_HAVE_OWNERSHIP : the fragment can be released by the BTL after
the send, and the BTL then reports a special return code back to the
PML
- BTL_HAVE_OWNERSHIP_AFTER_CALLBACK : the fragment will be released
by the BTL once the completion callback has been triggered.
- PML_HAVE_OWNERSHIP : the BTL is not allowed to release the fragment
at all (the PML is responsible for this).
Return codes:
- done and there will be no callbacks
- not done, wait for a callback later
- error state
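A rough sketch of how the ownership flags and send return codes could
fit together; the constant names, values, and the toy btl_send_sketch()
are illustrative only, not a proposed ABI:

/* Illustrative sketch of the ownership flags and btl_send return codes
 * described above; names and btl_send_sketch() are placeholders. */
#include <stdio.h>

/* Who may free the fragment, and when. */
enum frag_ownership {
    BTL_HAVE_OWNERSHIP,                /* BTL may free right after the send
                                          and reports a special return code */
    BTL_HAVE_OWNERSHIP_AFTER_CALLBACK, /* BTL frees once the completion
                                          callback has been triggered */
    PML_HAVE_OWNERSHIP                 /* BTL must never free the fragment;
                                          the PML is responsible for it */
};

/* What the send reports back to the PML. */
enum btl_send_status {
    BTL_SEND_DONE,        /* fragment is gone, no callback will follow */
    BTL_SEND_NOT_ON_WIRE, /* e.g. out of credits: completion is deferred
                             to a later callback */
    BTL_SEND_ERROR        /* error state */
};

/* Stand-in for the real btl_send entry point. */
static enum btl_send_status btl_send_sketch(enum frag_ownership owner)
{
    (void)owner;
    return BTL_SEND_DONE;
}

int main(void)
{
    switch (btl_send_sketch(BTL_HAVE_OWNERSHIP)) {
    case BTL_SEND_DONE:        puts("PML marks the MPI request complete"); break;
    case BTL_SEND_NOT_ON_WIRE: puts("PML waits for the BTL callback");     break;
    default:                   puts("error state");                        break;
    }
    return 0;
}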
----------------------------------------------------------------------------
3) Change the remote callback function (and the tag value, based on
what data we are sending); don't use mca_pml_ob1_recv_frag_callback
for everything!
I think we need:
mca_pml_ob1_recv_frag_match
mca_pml_ob1_recv_frag_rndv
mca_pml_ob1_recv_frag_rget
mca_pml_ob1_recv_match_ack_copy
mca_pml_ob1_recv_match_ack_pipeline
mca_pml_ob1_recv_copy_frag
mca_pml_ob1_recv_put_request
mca_pml_ob1_recv_put_fin
+++
Passing the callback as a parameter to the match function will save us
two switches. Add more registrations in the BTL in order to jump
directly to the correct function (the first 3 require a match while
the others don't). Split the tag 4 & 4, so each layer gets 4 bits of
tag: the upper 4 bits identify the protocol and the lower 4 bits are
up to the protocol; the registration table will still be local to
each component (see the sketch below).
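A sketch of the 4 + 4 bit tag split and a per-component registration
table; the macros, type names, and table layout are hypothetical:

/* Hypothetical sketch of the 4 + 4 bit tag split: the upper 4 bits name
 * the protocol, the lower 4 bits are left to that protocol for its own
 * dispatch.  Macro and type names are placeholders. */
#include <stdint.h>
#include <stddef.h>

#define TAG_PROTOCOL(tag)    (((tag) >> 4) & 0x0f)   /* upper 4 bits */
#define TAG_SUBTAG(tag)      ((tag) & 0x0f)          /* lower 4 bits */
#define TAG_MAKE(proto, sub) ((uint8_t)((((proto) & 0x0f) << 4) | ((sub) & 0x0f)))

typedef void (*recv_frag_cb_fn)(void *frag);

/* Each component keeps its own table of up to 16 sub-tag callbacks, so
 * the BTL can jump straight to a match/rndv/rget/... handler instead of
 * funneling everything through one generic callback. */
typedef struct {
    recv_frag_cb_fn callbacks[16];
} protocol_registration_t;

static void dispatch(const protocol_registration_t *reg, uint8_t tag, void *frag)
{
    recv_frag_cb_fn cb = reg->callbacks[TAG_SUBTAG(tag)];
    if (cb != NULL) {
        cb(frag);   /* direct jump; no switch on the header type */
    }
}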
----------------------------------------------------------------------------
4) Get rid of mca_pml_ob1_recv_request_progress; it does the same
switch on hdr->hdr_common.hdr_type as mca_pml_ob1_recv_frag_callback!
I think what we can do here is modify mca_pml_ob1_recv_frag_match to
take a function pointer for what it should call on a successful match.
So, based on the receive callback, we can pass the correct scheduling
function to invoke into the generic mca_pml_ob1_recv_frag_match.
recv_request_progress is called in a generic way from multiple places,
and we do a big switch inside. In the match function we might want to
pass a function pointer to the successful-match progress function.
This way we will be able to specialize what happens after the match in
a more optimized way. Alternatively, recv_request_match can return the
match and the caller will then have to specialize its own action.
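One possible shape for passing the progress routine into the match
function; all signatures below are hypothetical, not the actual ob1
prototypes:

/* Hypothetical sketch: the caller of the match function supplies the
 * progress routine to run on a successful match, so there is no second
 * switch on hdr_type afterwards.  All names are placeholders. */
#include <stddef.h>

typedef struct recv_request recv_request_t;   /* opaque placeholders */
typedef struct frag frag_t;

/* Specialized post-match progress functions, one per protocol. */
typedef void (*match_progress_fn)(recv_request_t *req, frag_t *frag);

static void progress_rndv(recv_request_t *req, frag_t *frag) { (void)req; (void)frag; }

/* Generic match routine: performs the matching, then jumps directly to
 * the protocol-specific progress function passed in by the caller. */
static int recv_frag_match(frag_t *frag, match_progress_fn progress)
{
    recv_request_t *req = NULL;   /* real code would look up the match here */
    if (req != NULL) {
        progress(req, frag);      /* specialized, no big switch */
        return 1;
    }
    return 0;                     /* no match yet: queue as unexpected */
}

/* e.g. the rendezvous receive callback would simply do: */
static void recv_frag_callback_rndv(frag_t *frag)
{
    (void)recv_frag_match(frag, progress_rndv);
}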
----------------------------------------------------------------------------
5) Don't initialize the entire request. Using item 2 above (if we get
back OMPI_SUCCESS from btl_send), we don't need to fully initialize
the request: we need the convertor set up, but the rest we can pass
down the stack in order to set up the match header, and only set up
the full request if we get OMPI_NOT_ON_WIRE back from btl_send.
I think we need something like:
MCA_PML_BASE_SEND_REQUEST_INIT_CONV
and
MCA_PML_BASE_SEND_REQUEST_INIT_FULL
so that the first macro just sets up the convertor and the second
populates all the rest of the request state, for the case where we
will need it later because the fragment doesn't hit the wire.
+++
We all agreed.
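A sketch of the split initialization; the two macro names come from
the item above, but the fields and the convertor_prepare() helper are
placeholders:

/* Sketch only: a light macro that prepares just the convertor for the
 * fast path, and a full macro used only when the fragment did not hit
 * the wire.  Fields and convertor_prepare() are placeholders. */
#include <stddef.h>

typedef struct { int ready; } convertor_t;     /* placeholder */
typedef struct {
    convertor_t convertor;
    void *comm;
    int   dst, tag;
} send_request_t;                              /* placeholder */

static void convertor_prepare(convertor_t *c, const void *buf, size_t count)
{
    (void)buf; (void)count;
    c->ready = 1;
}

/* Convertor setup only: enough to hand the data to btl_send. */
#define MCA_PML_BASE_SEND_REQUEST_INIT_CONV(req, buf, count) \
    convertor_prepare(&(req)->convertor, (buf), (count))

/* The rest of the request state, filled in only when btl_send comes
 * back with OMPI_NOT_ON_WIRE. */
#define MCA_PML_BASE_SEND_REQUEST_INIT_FULL(req, c, d, t) \
    do {                                                  \
        (req)->comm = (c);                                \
        (req)->dst  = (d);                                \
        (req)->tag  = (t);                                \
    } while (0)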
----------------------------------------------------------------------------
On Aug 13, 2007, at 9:00 AM, Christian Bell wrote:
On Sun, 12 Aug 2007, Gleb Natapov wrote:
Any objections? We can discuss what approaches we want to take
(there's going to be some complications because of the PML driver,
etc.); perhaps in the Tuesday Mellanox teleconf...?
My main objection is that the only reason you propose to do this
is some
bogus benchmark? Is there any other reason to implement header
caching?
I also hope you don't propose to break layering and somehow cache
PML headers
in BTL.
Gleb is hitting the main points I wanted to bring up. We had
examined this header caching in the context of PSM a little while
ago. 0.5us is much more than we had observed -- at 3GHz, 0.5us would
be about 1500 cycles of code with very few branches. For us, with a
much bigger header and more fields to fetch from different
structures, it was more like 350 cycles, which is on the order of
0.1us and not worth the effort (in code complexity, readability, and,
frankly, motivation for performance). Maybe there's more to it than
just "code caching" -- like sending from pre-pinned headers or using
RDMA with immediate, etc. But I'd be surprised to find out that the
openib btl doesn't do the best thing here.
I have pretty good evidence that for CM, the latency difference comes
from the receive side (in particular opal_progress). Doesn't the
openib btl receive side do something similar with opal_progress,
i.e. register a callback function? It probably does something
different, like checking a few RDMA mailboxes (or per-peer landing
pads), but anything that gets called before or after it as part of
opal_progress is a cause for slowdown.
. . christian
--
christian.b...@qlogic.com
(QLogic Host Solutions Group, formerly Pathscale)