I think we need to take a step back from micro-optimizations such as header caching.

Rich, George, Brian, and I are currently looking into latency improvements. We came up with several areas of performance enhancement that can be addressed with minimal disruption. The progress issue that Christian and others have pointed out does appear to be a problem, but it will take a bit more work. I would like to see progress in these areas first, as I really don't like the idea of caching more endpoint state in OMPI for micro-benchmark latency improvements until we are certain we have done the groundwork for improving latency in the general case.




Here are the items we have identified:


------------------------------------------------------------------------

1) Remove the 0-byte optimization of not initializing the convertor.
This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and another "if" in mca_pml_ob1_send_request_start_copy.
+++
Measure the convertor initialization before taking any other action.
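
Something like this standalone harness (a rough sketch, not OMPI code; convertor_prepare() is just a placeholder for the real convertor setup done in MCA_PML_BASE_SEND_REQUEST_INIT) would give us a per-call number to argue about:

#include <stdio.h>
#include <time.h>

static volatile int sink;
static void convertor_prepare(void) { sink++; }  /* placeholder for the real convertor setup call */

int main(void)
{
    const int iters = 1000000;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        convertor_prepare();
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* average nanoseconds per convertor initialization */
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg convertor init: %.1f ns per call\n", ns / iters);
    return 0;
}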
------------------------------------------------------------------------

------------------------------------------------------------------------

2) Get rid of mca_pml_ob1_send_request_start_prepare and mca_pml_ob1_send_request_start_copy by removing the MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send return OMPI_SUCCESS if the fragment can be marked as completed and OMPI_NOT_ON_WIRE if it cannot. This solves another problem: with IB, if there are a bunch of isends outstanding we end up buffering them all in the BTL and marking completion, but they never get on the wire because the BTL runs out of credits, and we never get credits back until finalize because we never call progress since the requests are already complete. There is one issue here: start_prepare calls prepare_src and start_copy calls alloc. I think we can work around this by always using prepare_src; the OpenIB BTL will give a fragment off the free list anyway because the fragment is less than the eager limit.
+++
Make the BTL return different return codes for the send. If the fragment is gone, then the PML is responsible for marking the MPI request as completed and so on. Only updated BTLs will get any benefit from this feature. Add a flag to the descriptor that tells the BTL whether or not it is allowed to free the fragment.

Add a 3-level flag:
- BTL_HAVE_OWNERSHIP : the fragment can be released by the BTL after the send, and the BTL then reports a special return code back to the PML.
- BTL_HAVE_OWNERSHIP_AFTER_CALLBACK : the fragment will be released by the BTL once the completion callback has been triggered.
- PML_HAVE_OWNERSHIP : the BTL is not allowed to release the fragment at all (the PML is responsible for this).

Return codes:
- done and there will be no callbacks
- not done, wait for a callback later
- error state
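
To make the intent concrete, here is a rough standalone sketch of how the PML side could branch on the proposed return codes. All the types and the btl_send stand-in are simplified placeholders; only the flag and return-code names come from the lists above.

#include <stdio.h>

#define OMPI_SUCCESS      0
#define OMPI_NOT_ON_WIRE  1   /* proposed: handed to the BTL but not yet on the wire */
#define OMPI_ERROR       -1

enum frag_ownership {
    BTL_HAVE_OWNERSHIP,                 /* BTL may free the frag right after the send */
    BTL_HAVE_OWNERSHIP_AFTER_CALLBACK,  /* BTL frees the frag after the completion callback */
    PML_HAVE_OWNERSHIP                  /* BTL may never free the frag; the PML will */
};

struct frag { enum frag_ownership ownership; };

/* stand-in for btl_send(); the real one would return one of the codes above */
static int btl_send(struct frag *f) { (void)f; return OMPI_SUCCESS; }

int main(void)
{
    struct frag f = { .ownership = BTL_HAVE_OWNERSHIP };

    switch (btl_send(&f)) {
    case OMPI_SUCCESS:
        /* fragment is done: mark the MPI request complete, no callback will follow */
        puts("request complete, no callback expected");
        break;
    case OMPI_NOT_ON_WIRE:
        /* not done yet: keep the request pending and wait for the completion callback */
        puts("waiting for completion callback");
        break;
    default:
        /* error state */
        puts("send error");
    }
    return 0;
}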
------------------------------------------------------------------------

------------------------------------------------------------------------

3) Change the remote callback function (and the tag value, based on what data we are sending); don't use mca_pml_ob1_recv_frag_callback for everything!
        I think we need:

        mca_pml_ob1_recv_frag_match
        mca_pml_ob1_recv_frag_rndv
        mca_pml_ob1_recv_frag_rget

        mca_pml_ob1_recv_match_ack_copy
        mca_pml_ob1_recv_match_ack_pipeline

        mca_pml_ob1_recv_copy_frag
        mca_pml_ob1_recv_put_request
        mca_pml_ob1_recv_put_fin
+++
Passing the callback as a parameter to the match function will save us two switches. Add more registrations in the BTL so we can jump directly to the correct function (the first three require a match while the others don't). Split the tag 4 & 4 bits so each layer gets 4 bits of tags [i.e. the upper 4 bits are the protocol tag and the lower 4 bits are up to the protocol], and the registration table will still be local to each component.
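
For the 4 & 4 split, something like the following is all that's needed on the tag side (the macro names are illustrative only, not existing OMPI names):

#include <stdint.h>
#include <stdio.h>

#define TAG_PROTOCOL(tag)     (((tag) >> 4) & 0x0f)   /* upper 4 bits: protocol tag */
#define TAG_SUBTAG(tag)       ((tag) & 0x0f)          /* lower 4 bits: up to the protocol */
#define TAG_MAKE(proto, sub)  ((uint8_t)((((proto) & 0x0f) << 4) | ((sub) & 0x0f)))

int main(void)
{
    uint8_t tag = TAG_MAKE(3, 5);   /* protocol 3, protocol-private sub-tag 5 */
    printf("protocol=%d subtag=%d\n", TAG_PROTOCOL(tag), TAG_SUBTAG(tag));
    return 0;
}
------------------------------------------------------------------------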

------------------------------------------------------------------------

4) Get rid of mca_pml_ob1_recv_request_progress; it does the same switch on hdr->hdr_common.hdr_type as mca_pml_ob1_recv_frag_callback! I think what we can do here is modify mca_pml_ob1_recv_frag_match to take a function pointer for what it should call on a successful match. So, based on the receive callback, we can pass the correct scheduling function to invoke into the generic mca_pml_ob1_recv_frag_match.

Recv_request progress is called in a generic way from multiple places, and we do a big switch inside. In the match function we might want to pass a function pointer to the successful-match progress function. This way we will be able to specialize what happens after the match in a more optimized way. Or recv_request_match can return the match and then the caller will have to specialize its action.
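
A standalone sketch of the function-pointer variant (all types and names below are simplified placeholders, not the real ob1 structures):

#include <stdio.h>
#include <stddef.h>

/* simplified placeholders for the real fragment / receive request types */
struct frag { int have_match; };
struct recv_request { int id; };

typedef void (*match_progress_fn)(struct recv_request *, struct frag *);

static struct recv_request posted = { 42 };

/* placeholder matching logic: pretend a posted receive matches when have_match != 0 */
static struct recv_request *do_matching(struct frag *f)
{
    return f->have_match ? &posted : NULL;
}

/* one specialized progress function per protocol, chosen by the caller */
static void progress_rndv(struct recv_request *r, struct frag *f)
{
    (void)f;
    printf("rndv progress on request %d\n", r->id);
}

/* the generic match does no switch on hdr_type: the caller that received the
 * fragment already knows which progress function applies */
static void recv_frag_match(struct frag *f, match_progress_fn progress)
{
    struct recv_request *r = do_matching(f);
    if (NULL != r) {
        progress(r, f);
    }
    /* else: queue as unexpected (omitted in this sketch) */
}

int main(void)
{
    struct frag f = { .have_match = 1 };
    recv_frag_match(&f, progress_rndv);
    return 0;
}
------------------------------------------------------------------------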

------------------------------------------------------------------------

5) Don't initialize the entire request. We can use item 2 above: if we get back OMPI_SUCCESS from btl_send then we don't need to fully initialize the request. We need the convertor set up, but the rest we can pass down the stack in order to set up the match header, and set up the request only if we get OMPI_NOT_ON_WIRE back from btl_send.

I think we need something like:
MCA_PML_BASE_SEND_REQUEST_INIT_CONV

and
MCA_PML_BASE_SEND_REQUEST_INIT_FULL

so the first macro just sets up the convertor, and the second populates the rest of the request state for the case where we need it later because the fragment didn't hit the wire.
+++
We all agreed.
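
A minimal sketch of the two-stage init (the real macros would be MCA_PML_BASE_SEND_REQUEST_INIT_CONV / _FULL; the request type, fields, and macro bodies below are simplified placeholders):

#include <stdio.h>

/* simplified placeholder for the base send request */
struct send_request {
    const void *buf;
    int count, dst, tag;
};

/* stage 1: convertor-only setup -- enough when btl_send() returns OMPI_SUCCESS
 * and the fragment is already taken care of */
#define SEND_REQUEST_INIT_CONV(req, b, c) \
    do { (req)->buf = (b); (req)->count = (c); } while (0)

/* stage 2: the rest of the request state -- only paid for when btl_send()
 * returns OMPI_NOT_ON_WIRE and the request has to survive until completion */
#define SEND_REQUEST_INIT_FULL(req, d, t) \
    do { (req)->dst = (d); (req)->tag = (t); } while (0)

int main(void)
{
    struct send_request req;
    int payload = 7;

    SEND_REQUEST_INIT_CONV(&req, &payload, 1);
    /* ... btl_send() here; only on OMPI_NOT_ON_WIRE do we pay for the rest: */
    SEND_REQUEST_INIT_FULL(&req, /* dst */ 1, /* tag */ 99);

    printf("dst=%d tag=%d count=%d\n", req.dst, req.tag, req.count);
    return 0;
}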
------------------------------------------------------------------------



On Aug 13, 2007, at 9:00 AM, Christian Bell wrote:

On Sun, 12 Aug 2007, Gleb Natapov wrote:

> > Any objections?  We can discuss what approaches we want to take
> > (there's going to be some complications because of the PML driver,
> > etc.); perhaps in the Tuesday Mellanox teleconf...?

> My main objection is that the only reason you propose to do this is some
> bogus benchmark? Is there any other reason to implement header caching?
> I also hope you don't propose to break layering and somehow cache PML
> headers in BTL.

Gleb is hitting the main points I wanted to bring up.  We had
examined this header caching in the context of PSM a little while
ago.  0.5us is much more than we had observed -- at 3GHz, 0.5us would
be about 1500 cycles of code that has little amounts of branches.
For us, with a much bigger header and more fields to fetch from
different structures, it was more like 350 cycles which is on the
order of 0.1us and not worth the effort (in code complexity,
readability and frankly motivation for performance).  Maybe there's
more to it than just "code caching" -- like sending from pre-pinned
headers or using the RDMA with immediate, etc.  But I'd be surprised
to find out that openib btl doesn't do the best thing here.

I have pretty good evidence that for CM, the latency difference comes
from the receive-side (in particular opal_progress).  Doesn't the
openib btl receive-side do something similar with opal_progress,
i.e. register a callback function?  It probably does something
different like check a few RDMA mailboxes (or per-peer landing pads)
but anything that gets called before or after it as part of
opal_progress is cause for slowdown.

    . . christian

--
christian.b...@qlogic.com
(QLogic Host Solutions Group, formerly Pathscale)