Per my other email, I would suggest #ifdef comments instead of nonblocking
replacements for the blocking calls. After that modification, I think this
patch is fine. As was mentioned previously, we will need to go back (after
things compile) and figure out a new model for this behavior.

For the existing calls to recv_buffer_nb, converting those to the versions
without return codes is completely fine and correct. So leave those changes
in there. It is just when you move from recv_buffer to recv_buffer_nb that
I would leave some kind of marker for yourself and preserve the prior code.

After that modification, I think it is ready to be committed.

Thanks!
Josh



On Wed, Dec 4, 2013 at 9:46 AM, Jeff Squyres (jsquyres)
<jsquy...@cisco.com>wrote:

> On Nov 25, 2013, at 9:59 AM, Adrian Reber <adr...@lisas.de> wrote:
>
> > @@ -5616,16 +5597,8 @@ static int
> do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
> >     /*
> >      * Recv the ACK msg
> >      */
> > -    if ( 0 > (ret = ompi_rte_recv_buffer(&peer_ref->proc_name, buffer,
> > -                                         OMPI_CRCP_COORD_BOOKMARK_TAG,
> 0) ) ) {
> > -        opal_output(mca_crcp_bkmrk_component.super.output_handle,
> > -                    "crcp:bkmrk: do_send_msg_detail: %s --> %s Failed
> to receive ACK buffer from peer. Return %d\n",
> > -                    OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),
> > -                    OMPI_NAME_PRINT(&(peer_ref->proc_name)),
> > -                    ret);
> > -        exit_status = ret;
> > -        goto cleanup;
> > -    }
> > +    ompi_rte_recv_buffer_nb(&peer_ref->proc_name,
> OMPI_CRCP_COORD_BOOKMARK_TAG, 0,
> > +                            orte_rml_recv_callback, NULL);
> >
> >     UNPACK_BUFFER(buffer, recv_response, 1, OPAL_UINT32,
> >                   "crcp:bkmrk: send_msg_details: Failed to unpack the
> ACK from peer buffer.");
>
> I see a bunch of hunks like this that I do not think are correct.
>
> They're replacing orte_rte_recv_buffer() with orte_rte_recv_buffer_nb(),
> which, by definition, may not actually complete the receive.  Hence, the
> receive buffer may not be filled by the time orte_rte_recv_buffer_nb()
> returns, and therefore you can't proceed with the FT processing.  E.g., the
> UNPACK_BUFFER() statement here may well be undefined because the buffer
> isn't full yet.
>
> I'm not sure what this means in terms of overall FT processing -- the fact
> that UNPACK_BUFFER is erroneous may be fairly obvious, but the other FT
> processing that occurs below UNPACK_BUFFER, and, indeed, by the caller of
> this function, may not be able to proceed until this receive completes,
> either.
>
> Ditto for all the other hunks like this.
>
> If I'm not mistaken, the actual receive will occur in a progress thread,
> so the cbfunc supplied to orte_rte_recv_buffer_nb() will need to do
> something in a thread safe manner -- I'm not sure if it will need to
> transfer control back up to the main thread, or if the FT processing in
> question is safe to occur in the ORTE progress thread (it would be *much*
> better if it could occur on the ORTE progress thread so that we can get
> asynchronous progress of this stuff).
>
> If I'm correct, then all of those hunks will need to be adapted.  I.e.,
> these aren't just compile errors to be fixed -- they'll also require
> architectural changes, too.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey

Reply via email to