On Nov 25, 2013, at 9:59 AM, Adrian Reber <adr...@lisas.de> wrote:

> @@ -5616,16 +5597,8 @@ static int 
> do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
>     /*
>      * Recv the ACK msg
>      */
> -    if ( 0 > (ret = ompi_rte_recv_buffer(&peer_ref->proc_name, buffer,
> -                                         OMPI_CRCP_COORD_BOOKMARK_TAG, 0) ) 
> ) {
> -        opal_output(mca_crcp_bkmrk_component.super.output_handle,
> -                    "crcp:bkmrk: do_send_msg_detail: %s --> %s Failed to 
> receive ACK buffer from peer. Return %d\n",
> -                    OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),
> -                    OMPI_NAME_PRINT(&(peer_ref->proc_name)),
> -                    ret);
> -        exit_status = ret;
> -        goto cleanup;
> -    }
> +    ompi_rte_recv_buffer_nb(&peer_ref->proc_name, 
> OMPI_CRCP_COORD_BOOKMARK_TAG, 0,
> +                            orte_rml_recv_callback, NULL);
> 
>     UNPACK_BUFFER(buffer, recv_response, 1, OPAL_UINT32,
>                   "crcp:bkmrk: send_msg_details: Failed to unpack the ACK 
> from peer buffer.");

I see a bunch of hunks like this that I do not think are correct.

They're replacing orte_rte_recv_buffer() with orte_rte_recv_buffer_nb(), which, 
by definition, may not actually complete the receive.  Hence, the receive 
buffer may not be filled by the time orte_rte_recv_buffer_nb() returns, and 
therefore you can't proceed with the FT processing.  E.g., the UNPACK_BUFFER() 
statement here may well be undefined because the buffer isn't full yet.

I'm not sure what this means in terms of overall FT processing -- the fact that 
UNPACK_BUFFER is erroneous may be fairly obvious, but the other FT processing 
that occurs below UNPACK_BUFFER, and, indeed, by the caller of this function, 
may not be able to proceed until this receive completes, either.  

Ditto for all the other hunks like this.

If I'm not mistaken, the actual receive will occur in a progress thread, so the 
cbfunc supplied to orte_rte_recv_buffer_nb() will need to do something in a 
thread safe manner -- I'm not sure if it will need to transfer control back up 
to the main thread, or if the FT processing in question is safe to occur in the 
ORTE progress thread (it would be *much* better if it could occur on the ORTE 
progress thread so that we can get asynchronous progress of this stuff).

If I'm correct, then all of those hunks will need to be adapted.  I.e., these 
aren't just compile errors to be fixed -- they'll also require architectural 
changes, too.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

Reply via email to