On 8/13/07 3:52 PM, "Gleb Natapov" <gl...@voltaire.com> wrote:

> On Mon, Aug 13, 2007 at 09:12:33AM -0600, Galen Shipman wrote:
>> Here are the items we have identified:
>
> All those things sound very promising. Is there a tmp branch where you
> are going to work on this?

 tmp/latency

Some changes have already gone in - mainly trying to remove as much as
possible from the isend/send path, before moving on to the list below.  Do
you have cycles to help with this?

Rich

> ------------------------------------------------------------------------
> 
> 1) Remove the 0-byte optimization of not initializing the convertor.
>
> This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an "if" in
> mca_pml_ob1_send_request_start_copy.
> +++
> Measure the cost of convertor initialization before taking any other
> action.
>
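
A minimal sketch of the kind of micro-benchmark this could be.
convertor_init_stub() below is only a placeholder for the real convertor
setup path, not an actual Open MPI function:

/* Hypothetical micro-benchmark: convertor_init_stub() stands in for the
 * initialization path being measured; it is NOT a real OMPI call.  With
 * an empty stub a compiler may optimize the loop away, so put the real
 * setup code in the stub before trusting the numbers. */
#include <stdio.h>
#include <time.h>

static void convertor_init_stub(void)
{
    /* placeholder for the initialization code under test */
}

int main(void)
{
    const long iters = 1000000;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        convertor_init_stub();
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("average cost: %.1f ns per call\n", ns / iters);
    return 0;
}
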
> ------------------------------------------------------------------------
>
> ------------------------------------------------------------------------
> 
> 2) Get rid of mca_pml_ob1_send_request_start_prepare and
> mca_pml_ob1_send_request_start_copy by removing the
> MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send
> return OMPI_SUCCESS if the fragment can be marked as completed and
> OMPI_NOT_ON_WIRE if it cannot. This also solves another problem: with
> IB, if there are a bunch of isends outstanding we end up buffering them
> all in the BTL and marking completion, but they never get on the wire
> because the BTL runs out of credits; we never get credits back until
> finalize because we never call progress, since the requests are already
> complete.  There is one issue here: start_prepare calls prepare_src and
> start_copy calls alloc. I think we can work around this by always using
> prepare_src; the OpenIB BTL will give a fragment off the free list
> anyway because the fragment is less than the eager limit.
> +++
> Make the BTL return different return codes for the send. If the
> fragment is gone, then the PML is responsible for marking the MPI
> request as completed, and so on. Only the updated BTLs will get any
> benefit from this feature. Add a flag to the descriptor to allow the
> BTL to free the fragment or not.
>
> Add a 3-level flag:
> - BTL_HAVE_OWNERSHIP : the fragment can be released by the BTL after
>   the send, and it then reports a special return code back to the PML.
> - BTL_HAVE_OWNERSHIP_AFTER_CALLBACK : the fragment will be released by
>   the BTL once the completion callback has been triggered.
> - PML_HAVE_OWNERSHIP : the BTL is not allowed to release the fragment
>   at all (the PML is responsible for this).
>
> Return codes:
> - done, and there will be no callbacks
> - not done, wait for a callback later
> - error state
>
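
As a rough, illustrative sketch of the flag/return-code protocol proposed
above -- these names follow the list but are not the real Open MPI
definitions:

/* Illustrative only: the enums and handler mirror the proposal above. */
#include <stdio.h>

/* descriptor flag: who is allowed to free the fragment */
enum frag_ownership {
    BTL_HAVE_OWNERSHIP,                /* BTL may free it right after the send       */
    BTL_HAVE_OWNERSHIP_AFTER_CALLBACK, /* BTL frees it after the completion callback  */
    PML_HAVE_OWNERSHIP                 /* only the PML may free it                    */
};

/* proposed btl_send() return codes */
enum btl_send_rc {
    BTL_SEND_DONE,      /* fragment is gone, no callback will follow     */
    BTL_SEND_DEFERRED,  /* buffered/queued, wait for completion callback */
    BTL_SEND_ERROR
};

/* hypothetical PML-side reaction to the new return codes */
static void pml_handle_send_rc(enum btl_send_rc rc)
{
    switch (rc) {
    case BTL_SEND_DONE:
        printf("mark the MPI request complete immediately\n");
        break;
    case BTL_SEND_DEFERRED:
        printf("finish request setup; completion comes via the callback\n");
        break;
    case BTL_SEND_ERROR:
        printf("propagate the error\n");
        break;
    }
}

int main(void)
{
    pml_handle_send_rc(BTL_SEND_DONE);
    pml_handle_send_rc(BTL_SEND_DEFERRED);
    return 0;
}
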
> ------------------------------------------------------------------------
>
> ------------------------------------------------------------------------
> 
> 3) Change the remote callback function (and the tag value, based on
> what data we are sending); don't use mca_pml_ob1_recv_frag_callback for
> everything!  I think we need:
>
>  mca_pml_ob1_recv_frag_match
>  mca_pml_ob1_recv_frag_rndv
>  mca_pml_ob1_recv_frag_rget
>
>  mca_pml_ob1_recv_match_ack_copy
>  mca_pml_ob1_recv_match_ack_pipeline
>
>  mca_pml_ob1_recv_copy_frag
>  mca_pml_ob1_recv_put_request
>  mca_pml_ob1_recv_put_fin
> +++
> Passing the callback as a parameter to the match function will save us
> two switches. Add more registrations in the BTL in order to jump
> directly to the correct function (the first 3 require a match while the
> others don't). Use 4 & 4 bits in the tag so each layer will have 4 bits
> of tag [i.e. the first 4 bits are the protocol tag and the lower 4 bits
> are up to the protocol], and the registration table will still be local
> to each component.
>
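
Roughly, the 4 & 4 bit tag split could look like the following; the macro
and function names here are invented for the sketch:

/* Illustrative tag layout: the upper 4 bits select the protocol/layer,
 * the lower 4 bits are free for that protocol's own sub-tags. */
#include <stdint.h>
#include <stdio.h>

#define TAG_PROTO_SHIFT 4
#define TAG_LOCAL_MASK  0x0F

static inline uint8_t make_tag(uint8_t proto, uint8_t local)
{
    return (uint8_t)((proto << TAG_PROTO_SHIFT) | (local & TAG_LOCAL_MASK));
}

static inline uint8_t tag_proto(uint8_t tag) { return (uint8_t)(tag >> TAG_PROTO_SHIFT); }
static inline uint8_t tag_local(uint8_t tag) { return (uint8_t)(tag & TAG_LOCAL_MASK); }

int main(void)
{
    /* e.g. protocol 2, sub-tag 5 -- the registration table of protocol 2
     * would then dispatch locally on the lower 4 bits */
    uint8_t tag = make_tag(2, 5);
    printf("tag=0x%02x proto=%u local=%u\n", tag, tag_proto(tag), tag_local(tag));
    return 0;
}
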
> ------------------------------------------------------------------------
>
> ------------------------------------------------------------------------
> 
> 4) Get rid of mca_pml_ob1_recv_request_progress; this does the same
> switch on hdr->hdr_common.hdr_type that mca_pml_ob1_recv_frag_callback
> does!  I think what we can do here is modify mca_pml_ob1_recv_frag_match
> to take a function pointer for what it should call on a successful
> match.  So, based on the receive callback, we can pass the correct
> scheduling function to invoke into the generic
> mca_pml_ob1_recv_frag_match.
> 
> Recv_request_progress is called in a generic way from multiple places,
> and we do a big switch inside. In the match function we might want to
> pass a function pointer to the progress function to run on a successful
> match. This way we will be able to specialize what happens after the
> match in a more optimized way. Alternatively, recv_request_match can
> return the match, and then the caller will have to specialize its own
> action.
>
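
A rough sketch of the function-pointer approach, assuming a hypothetical
recv_frag_match() and an opaque request type (not the real ob1 code):

/* Pass the "on successful match" action into the match function instead
 * of doing a big switch on hdr_type afterwards.  All names are
 * placeholders. */
#include <stdio.h>

struct recv_request;  /* opaque stand-in for the real receive request */

typedef void (*match_progress_fn)(struct recv_request *req);

static void progress_rndv(struct recv_request *req) { (void)req; puts("rndv progress"); }
static void progress_rget(struct recv_request *req) { (void)req; puts("rget progress"); }

/* generic match: find the request, then jump straight to the caller's
 * specialized progress function -- no switch on the header type here */
static void recv_frag_match(struct recv_request *req, match_progress_fn on_match)
{
    /* ... matching logic would go here ... */
    on_match(req);
}

int main(void)
{
    recv_frag_match(NULL, progress_rndv);  /* rndv callback passes its own progress */
    recv_frag_match(NULL, progress_rget);  /* rget callback passes its own progress */
    return 0;
}
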
> ------------------------------------------------------------------------
>
> ------------------------------------------------------------------------
> 
> 5) Don't initialize the entire request. We can use item 2 above: if we
> get back OMPI_SUCCESS from btl_send, then we don't need to fully
> initialize the request. We need the convertor set up, but the rest we
> can pass down the stack in order to set up the match header, and only
> set up the request if we get OMPI_NOT_ON_WIRE back from btl_send.
>
> I think we need something like:
>
>  MCA_PML_BASE_SEND_REQUEST_INIT_CONV
>
> and
>
>  MCA_PML_BASE_SEND_REQUEST_INIT_FULL
>
> so the first macro just sets up the convertor, and the second populates
> all the rest of the request state in the case that we will need it
> later because the fragment doesn't hit the wire.
> +++
> We all agreed.
>
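
To illustrate the split, a hedged sketch of what the two macros could look
like; the structure fields and macro names are invented for the example,
not taken from the PML base:

/* Cheap convertor-only setup for the common fast path, and a full setup
 * done lazily only when the fragment does not hit the wire. */
#include <stdio.h>
#include <string.h>

struct send_request {
    int  convertor_ready;   /* stands in for the convertor state           */
    int  tag, dst, seq;     /* "full" request state, only needed sometimes */
    char rest[256];         /* the remainder we would like to skip          */
};

#define SEND_REQUEST_INIT_CONV(req)         \
    do {                                    \
        (req)->convertor_ready = 1;         \
    } while (0)

#define SEND_REQUEST_INIT_FULL(req, t, d, s)            \
    do {                                                \
        (req)->tag = (t);                               \
        (req)->dst = (d);                               \
        (req)->seq = (s);                               \
        memset((req)->rest, 0, sizeof((req)->rest));    \
    } while (0)

int main(void)
{
    struct send_request req;
    int not_on_wire = 1;  /* pretend btl_send returned OMPI_NOT_ON_WIRE */

    SEND_REQUEST_INIT_CONV(&req);                /* always: convertor only  */
    if (not_on_wire) {
        SEND_REQUEST_INIT_FULL(&req, 7, 3, 42);  /* lazily, only if needed  */
    }
    printf("tag=%d dst=%d seq=%d\n", req.tag, req.dst, req.seq);
    return 0;
}
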
> ------------------------------------------------------------------------
> 
> 
> 
> On Aug 13, 2007, at 9:00 AM, Christian Bell wrote:
> 
> > On Sun, 12 Aug 2007, Gleb Natapov wrote:
> >
> >>> Any objections?  We can discuss what approaches we want to take
> >>> (there's going to be some complications because of the PML driver,
> >>> etc.); perhaps in the Tuesday Mellanox teleconf...?
> >>>
> >> My main objection is that the only reason you propose to do this is
> >> some bogus benchmark? Is there any other reason to implement header
> >> caching? I also hope you don't propose to break layering and somehow
> >> cache PML headers in BTL.
> >
> > Gleb is hitting the main points I wanted to bring up.  We had
> > examined this header caching in the context of PSM a little while
> > ago.  0.5us is much more than we had observed -- at 3GHz, 0.5us would
> > be about 1500 cycles of code with few branches.  For us, with a much
> > bigger header and more fields to fetch from different structures, it
> > was more like 350 cycles, which is on the order of 0.1us and not
> > worth the effort (in code complexity, readability and, frankly,
> > motivation for performance).  Maybe there's more to it than just
> > "code caching" -- like sending from pre-pinned headers or using RDMA
> > with immediate, etc.  But I'd be surprised to find out that the
> > openib btl doesn't do the best thing here.
> >
> > I have pretty good evidence that for CM, the latency difference comes
> > from the receive side (in particular opal_progress).  Doesn't the
> > openib btl receive side do something similar with opal_progress,
> > i.e. register a callback function?  It probably does something
> > different, like checking a few RDMA mailboxes (or per-peer landing
> > pads), but anything that gets called before or after it as part of
> > opal_progress is cause for slowdown.
> >
> >     . . christian
> >
> > -- 
> > christian.b...@qlogic.com
> > (QLogic Host Solutions Group, formerly Pathscale)
> 

> --
>    Gleb.

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

