Per my other email, I would suggest #ifdef comments instead of nonblocking replacements for the blocking calls. After that modification, I think this patch is fine. As was mentioned previously, we will need to go back (after things compile) and figure out a new model for this behavior.
For the existing calls to recv_buffer_nb, converting those to the versions without return codes is completely fine and correct. So leave those changes in there. It is just when you move from recv_buffer to recv_buffer_nb that I would leave some kind of marker for yourself and preserve the prior code. After that modification, I think it is ready to be committed. Thanks! Josh On Wed, Dec 4, 2013 at 9:46 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>wrote: > On Nov 25, 2013, at 9:59 AM, Adrian Reber <adr...@lisas.de> wrote: > > > @@ -5616,16 +5597,8 @@ static int > do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref, > > /* > > * Recv the ACK msg > > */ > > - if ( 0 > (ret = ompi_rte_recv_buffer(&peer_ref->proc_name, buffer, > > - OMPI_CRCP_COORD_BOOKMARK_TAG, > 0) ) ) { > > - opal_output(mca_crcp_bkmrk_component.super.output_handle, > > - "crcp:bkmrk: do_send_msg_detail: %s --> %s Failed > to receive ACK buffer from peer. Return %d\n", > > - OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), > > - OMPI_NAME_PRINT(&(peer_ref->proc_name)), > > - ret); > > - exit_status = ret; > > - goto cleanup; > > - } > > + ompi_rte_recv_buffer_nb(&peer_ref->proc_name, > OMPI_CRCP_COORD_BOOKMARK_TAG, 0, > > + orte_rml_recv_callback, NULL); > > > > UNPACK_BUFFER(buffer, recv_response, 1, OPAL_UINT32, > > "crcp:bkmrk: send_msg_details: Failed to unpack the > ACK from peer buffer."); > > I see a bunch of hunks like this that I do not think are correct. > > They're replacing orte_rte_recv_buffer() with orte_rte_recv_buffer_nb(), > which, by definition, may not actually complete the receive. Hence, the > receive buffer may not be filled by the time orte_rte_recv_buffer_nb() > returns, and therefore you can't proceed with the FT processing. E.g., the > UNPACK_BUFFER() statement here may well be undefined because the buffer > isn't full yet. > > I'm not sure what this means in terms of overall FT processing -- the fact > that UNPACK_BUFFER is erroneous may be fairly obvious, but the other FT > processing that occurs below UNPACK_BUFFER, and, indeed, by the caller of > this function, may not be able to proceed until this receive completes, > either. > > Ditto for all the other hunks like this. > > If I'm not mistaken, the actual receive will occur in a progress thread, > so the cbfunc supplied to orte_rte_recv_buffer_nb() will need to do > something in a thread safe manner -- I'm not sure if it will need to > transfer control back up to the main thread, or if the FT processing in > question is safe to occur in the ORTE progress thread (it would be *much* > better if it could occur on the ORTE progress thread so that we can get > asynchronous progress of this stuff). > > If I'm correct, then all of those hunks will need to be adapted. I.e., > these aren't just compile errors to be fixed -- they'll also require > architectural changes, too. > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > _______________________________________________ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel > -- Joshua Hursey Assistant Professor of Computer Science University of Wisconsin-La Crosse http://cs.uwlax.edu/~jjhursey