Status update of C/R with Open MPI: With the last two patches applied I am now seeing communication between orte-checkpoint and orterun:
orte-checkpoint 23975:

[dcbz:23986] orte_checkpoint: Checkpointing...
[dcbz:23986] PID 23975
[dcbz:23986] Connected to Mpirun [[45520,0],0]
[dcbz:23986] orte_checkpoint: notify_hnp: Contact Head Node Process PID 23975
[dcbz:23986] [[45509,0],0] rml_send_buffer to peer [[45520,0],0] at tag 13
[dcbz:23986] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[dcbz:23986] [[45509,0],0] posting recv
[dcbz:23986] [[45509,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
[dcbz:23986] [[45509,0],0] posting recv
[dcbz:23986] [[45509,0],0] posting persistent recv on tag 13 for peer [[WILDCARD],WILDCARD]
[dcbz:23986] [[45509,0],0] rml_send_msg to peer [[45520,0],0] at tag 13
[dcbz:23986] [[45509,0],0]-[[45520,0],0] Send message complete at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:220
[dcbz:23986] [[45509,0],0] Message posted at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:519
[dcbz:23986] [[45509,0],0] message received 39 bytes from [[45520,0],0] for tag 13
[dcbz:23986] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:23986] orte_checkpoint: hnp_receiver: Status Update.
--------------------------------------------------------------------------
Error: The application (PID = 23975) failed to checkpoint properly.
       Returned -1.
--------------------------------------------------------------------------

orterun:

[dcbz:23975] [[45520,0],0] Message posted at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:519
[dcbz:23975] [[45520,0],0] message received 50 bytes from [[45509,0],0] for tag 13
[dcbz:23975] Global) Command Line: Start a checkpoint operation [Sender = [[45509,0],0]]
[dcbz:23975] Global) Command line requested a checkpoint [command 1]
[dcbz:23975] Global-Local) base:ckpt_init_cmd: Receiving commands
[dcbz:23975] Global-Local) base:ckpt_init_cmd: Received [0, 0, [INVALID]]
[dcbz:23975] Global) request_cmd(): Checkpointing currently disabled, rejecting request
[dcbz:23975] 23975: Failed to checkpoint process [45520,0].
[dcbz:23975] Global-Local) base:ckpt_update_cmd: Sending update command <status 0>
[dcbz:23975] Global-Local) base:ckpt_update_cmd: Sending update command <status 0> + <ref (null)> <seq -1>
[dcbz:23975] [[45520,0],0] rml_send_buffer to peer [[45509,0],0] at tag 13
[dcbz:23975] Global) Startup Command Line Channel
[dcbz:23975] [[45520,0],0] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 13
[dcbz:23975] [[45520,0],0] rml_send_msg to peer [[45509,0],0] at tag 13
[dcbz:23975] [[45520,0],0] posting recv
[dcbz:23975] [[45520,0],0] posting non-persistent recv on tag 13 for peer [[WILDCARD],WILDCARD]
[dcbz:23975] [[45520,0],0]-[[45509,0],0] Send message complete at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:220

It's still not working but at least both processes are talking to each other which is good.

		Adrian

On Thu, Jan 23, 2014 at 11:27:42AM -0600, Josh Hursey wrote:
> +1
>
>
> On Thu, Jan 23, 2014 at 10:16 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> > Looks correct to me - you are right in that you cannot release the buffer
> > until after the send completes. We don't copy the data underneath to save
> > memory and time.
> >
> >
> > On Jan 23, 2014, at 6:51 AM, Adrian Reber <adr...@lisas.de> wrote:
> >
> > > Following patch makes orte-checkpoint communicate with orterun again:
> > >
> > > diff --git a/orte/tools/orte-checkpoint/orte-checkpoint.c b/orte/tools/orte-checkpoint/orte-checkpoint.c
> > > index 7106342..8539f34 100644
> > > --- a/orte/tools/orte-checkpoint/orte-checkpoint.c
> > > +++ b/orte/tools/orte-checkpoint/orte-checkpoint.c
> > > @@ -834,7 +834,7 @@ static int notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
> > >      }
> > >
> > >      if (ORTE_SUCCESS != (ret = orte_rml.send_buffer_nb(&(orterun_hnp->name), buffer,
> > > -                                                       ORTE_RML_TAG_CKPT, hnp_receiver,
> > > +                                                       ORTE_RML_TAG_CKPT, orte_rml_send_callback,
> > >                                                         NULL))) {
> > >          exit_status = ret;
> > >          goto cleanup;
> > > @@ -845,11 +845,6 @@ static int notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
> > >                           ORTE_JOBID_PRINT(jobid));
> > >
> > >   cleanup:
> > > -    if( NULL != buffer) {
> > > -        OBJ_RELEASE(buffer);
> > > -        buffer = NULL;
> > > -    }
> > > -
> > >      if( ORTE_SUCCESS != exit_status ) {
> > >          opal_show_help("help-orte-checkpoint.txt", "unable_to_connect", true,
> > >                         orte_checkpoint_globals.pid);
> > >
> > >
> > > Before committing the code into the repository I wanted to make
> > > sure it is the correct way to fix it.
> > >
> > > The first change changes the callback to orte_rml_send_callback().
> > > When I initially made the code compile again I used hnp_receiver()
> > > to change the code from blocking to non-blocking and that was
> > > wrong.
> > >
> > > The second change (removal of OBJ_RELEASE(buffer)) is necessary
> > > because this seems to delete buffer during communication and then
> > > everything breaks badly.
> > >
> > > Adrian
> > > _______________________________________________
> > > devel mailing list
> > > de...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Joshua Hursey
> Assistant Professor of Computer Science
> University of Wisconsin-La Crosse
> http://cs.uwlax.edu/~jjhursey

		Adrian

-- 
Adrian Reber <adr...@lisas.de>            http://lisas.de/~adrian/
Bing's Rule:
	Don't try to stem the tide -- move the beach.
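
P.S. For anyone reading this thread later, here is a minimal sketch of why the buffer must not be released right after a non-blocking send. It does not use any Open MPI code: buf_t, send_nb(), progress(), release_on_complete() and run_demo() are hypothetical names made up for illustration. The only assumption taken from the thread is that the send path keeps a pointer to the buffer instead of copying it, so the completion callback is the first place where a release is safe.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; none of these are Open MPI APIs. */
typedef struct {
    int    refcount;
    size_t len;
    char  *data;
} buf_t;

typedef void (*send_cb_t)(int status, buf_t *buf, void *cbdata);

typedef struct {
    buf_t    *buf;     /* borrowed pointer: the payload is NOT copied */
    send_cb_t cb;
    void     *cbdata;
    int       active;
} pending_send_t;

static pending_send_t pending;   /* one in-flight send is enough for the sketch */
static size_t bytes_on_wire;

static buf_t *buf_new(const char *msg)
{
    buf_t *b = malloc(sizeof *b);
    if (NULL == b) return NULL;
    b->len = strlen(msg);
    b->data = malloc(b->len + 1);
    if (NULL == b->data) { free(b); return NULL; }
    memcpy(b->data, msg, b->len + 1);
    b->refcount = 1;
    return b;
}

static void buf_release(buf_t *b)
{
    if (0 == --b->refcount) { free(b->data); free(b); }
}

/* Queue the send and return immediately; nothing has been transmitted yet. */
static int send_nb(buf_t *buf, send_cb_t cb, void *cbdata)
{
    if (pending.active) return -1;
    pending = (pending_send_t){ buf, cb, cbdata, 1 };
    return 0;
}

/* Later, the progress loop reads the buffer and then reports completion. */
static void progress(void)
{
    if (!pending.active) return;
    assert(pending.buf->refcount > 0);   /* buffer must still be alive here */
    bytes_on_wire += pending.buf->len;
    pending.active = 0;
    pending.cb(0, pending.buf, pending.cbdata);
}

/* Completion callback: the first point at which releasing is safe. */
static void release_on_complete(int status, buf_t *buf, void *cbdata)
{
    int *done = cbdata;
    *done = (0 == status);
    buf_release(buf);
}

/* Returns the number of bytes "sent", or 0 on failure. */
static size_t run_demo(void)
{
    int done = 0;
    buf_t *buf = buf_new("checkpoint request");

    bytes_on_wire = 0;
    if (NULL == buf) return 0;
    if (0 != send_nb(buf, release_on_complete, &done)) {
        buf_release(buf);                /* never queued, so we still own it */
        return 0;
    }
    /* Do NOT call buf_release(buf) here: the send is only queued. */
    progress();
    return done ? bytes_on_wire : 0;
}
```

Calling run_demo() from a main() sends the 18-byte payload and frees the buffer exactly once, inside the callback. Releasing the buffer in the caller right after send_nb() would leave progress() reading freed memory, which matches the breakage described for the old cleanup: block.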