Status update on C/R (checkpoint/restart) with Open MPI:

With the last two patches applied, I am now seeing communication
between orte-checkpoint and orterun:

orte-checkpoint 23975:

[dcbz:23986] orte_checkpoint: Checkpointing...
[dcbz:23986]     PID 23975
[dcbz:23986]     Connected to Mpirun [[45520,0],0]
[dcbz:23986] orte_checkpoint: notify_hnp: Contact Head Node Process PID 23975
[dcbz:23986] [[45509,0],0] rml_send_buffer to peer [[45520,0],0] at tag 13
[dcbz:23986] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[dcbz:23986] [[45509,0],0] posting recv
[dcbz:23986] [[45509,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
[dcbz:23986] [[45509,0],0] posting recv
[dcbz:23986] [[45509,0],0] posting persistent recv on tag 13 for peer [[WILDCARD],WILDCARD]
[dcbz:23986] [[45509,0],0] rml_send_msg to peer [[45520,0],0] at tag 13
[dcbz:23986] [[45509,0],0]-[[45520,0],0] Send message complete at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:220
[dcbz:23986] [[45509,0],0] Message posted at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:519
[dcbz:23986] [[45509,0],0] message received 39 bytes from [[45520,0],0] for tag 13
[dcbz:23986] orte_checkpoint: hnp_receiver: Receive a command message.
[dcbz:23986] orte_checkpoint: hnp_receiver: Status Update.
--------------------------------------------------------------------------
Error: The application (PID = 23975) failed to checkpoint properly.
       Returned -1.
--------------------------------------------------------------------------

orterun:

[dcbz:23975] [[45520,0],0] Message posted at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:519
[dcbz:23975] [[45520,0],0] message received 50 bytes from [[45509,0],0] for tag 13
[dcbz:23975] Global) Command Line: Start a checkpoint operation [Sender = [[45509,0],0]]
[dcbz:23975] Global) Command line requested a checkpoint [command 1]
[dcbz:23975] Global-Local) base:ckpt_init_cmd: Receiving commands
[dcbz:23975] Global-Local) base:ckpt_init_cmd: Received [0, 0, [INVALID]]
[dcbz:23975] Global) request_cmd(): Checkpointing currently disabled, rejecting request
[dcbz:23975] 23975: Failed to checkpoint process [45520,0].
[dcbz:23975] Global-Local) base:ckpt_update_cmd: Sending update command <status 0>
[dcbz:23975] Global-Local) base:ckpt_update_cmd: Sending update command <status 0> + <ref (null)> <seq -1>
[dcbz:23975] [[45520,0],0] rml_send_buffer to peer [[45509,0],0] at tag 13
[dcbz:23975] Global) Startup Command Line Channel
[dcbz:23975] [[45520,0],0] rml_recv_buffer_nb for peer [[WILDCARD],WILDCARD] tag 13
[dcbz:23975] [[45520,0],0] rml_send_msg to peer [[45509,0],0] at tag 13
[dcbz:23975] [[45520,0],0] posting recv
[dcbz:23975] [[45520,0],0] posting non-persistent recv on tag 13 for peer [[WILDCARD],WILDCARD]
[dcbz:23975] [[45520,0],0]-[[45509,0],0] Send message complete at ../../../../../orte/mca/oob/tcp/oob_tcp_sendrecv.c:220

It's still not working: orterun rejects the request with
"Checkpointing currently disabled". But at least both processes
are talking to each other, which is good.
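
For reference, the buffer rule from the patch discussion quoted below
(do not release the buffer until the non-blocking send has completed,
because the data is not copied) can be sketched with a toy send queue.
This is only an illustration under my own assumptions, not ORTE code:
buffer_t, send_buffer_nb(), progress() and release_on_complete() are
hypothetical stand-ins and do not match the real orte_rml signatures.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a message buffer (not an opal_buffer_t). */
typedef struct {
    char  *data;
    size_t len;
    int    released;  /* set by buffer_release() so misuse can be detected */
} buffer_t;

typedef void (*send_cb_t)(int status, buffer_t *buf, void *cbdata);

/* A pending send stores the caller's pointer; the payload is NOT copied. */
typedef struct {
    buffer_t *buf;
    send_cb_t cb;
    void     *cbdata;
    int       in_use;
} pending_send_t;

static pending_send_t pending;  /* single slot keeps the sketch short */
static char wire[64];           /* what the peer would receive */

static void buffer_init(buffer_t *buf, const char *payload) {
    buf->len = strlen(payload);
    buf->data = malloc(buf->len + 1);
    assert(buf->data != NULL);
    memcpy(buf->data, payload, buf->len + 1);
    buf->released = 0;
}

static void buffer_release(buffer_t *buf) {
    free(buf->data);
    buf->data = NULL;
    buf->len = 0;
    buf->released = 1;
}

/* Non-blocking send: only queues the request and returns immediately. */
static int send_buffer_nb(buffer_t *buf, send_cb_t cb, void *cbdata) {
    if (pending.in_use) return -1;
    pending.buf = buf;
    pending.cb = cb;
    pending.cbdata = cbdata;
    pending.in_use = 1;
    return 0;
}

/* Performs the queued send later, then hands the buffer to the callback. */
static int progress(void) {
    pending_send_t p = pending;
    int status = -1;
    if (!p.in_use) return 0;
    pending.in_use = 0;
    if (!p.buf->released && p.buf->len < sizeof(wire)) {
        memcpy(wire, p.buf->data, p.buf->len);
        wire[p.buf->len] = '\0';
        status = 0;
    }
    if (p.cb) p.cb(status, p.buf, p.cbdata);
    return status;
}

/* Completion callback: the one place where the sender drops the buffer. */
static void release_on_complete(int status, buffer_t *buf, void *cbdata) {
    (void)status;
    (void)cbdata;
    buffer_release(buf);
}

/* Correct pattern: release happens in the completion callback. */
int demo_release_in_callback(void) {
    buffer_t buf;
    wire[0] = '\0';
    buffer_init(&buf, "ckpt-request");
    if (send_buffer_nb(&buf, release_on_complete, NULL) != 0) return -1;
    return progress();
}

/* Broken pattern: release right after queuing, before the send ran. */
int demo_early_release(void) {
    buffer_t buf;
    wire[0] = '\0';
    buffer_init(&buf, "ckpt-request");
    if (send_buffer_nb(&buf, NULL, NULL) != 0) return -1;
    buffer_release(&buf);  /* too early: the queue still points at buf */
    return progress();
}
```

In this sketch demo_release_in_callback() delivers the payload, while
demo_early_release() fails because the queued request finds its buffer
already gone. The toy queue detects that case and returns an error;
real code would more likely read freed memory.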

                Adrian


On Thu, Jan 23, 2014 at 11:27:42AM -0600, Josh Hursey wrote:
> +1
> 
> 
> On Thu, Jan 23, 2014 at 10:16 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> > Looks correct to me - you are right in that you cannot release the buffer
> > until after the send completes. We don't copy the data underneath to save
> > memory and time.
> >
> >
> > On Jan 23, 2014, at 6:51 AM, Adrian Reber <adr...@lisas.de> wrote:
> >
> > > Following patch makes orte-checkpoint communicate with orterun again:
> > >
> > > > diff --git a/orte/tools/orte-checkpoint/orte-checkpoint.c b/orte/tools/orte-checkpoint/orte-checkpoint.c
> > > index 7106342..8539f34 100644
> > > --- a/orte/tools/orte-checkpoint/orte-checkpoint.c
> > > +++ b/orte/tools/orte-checkpoint/orte-checkpoint.c
> > > @@ -834,7 +834,7 @@ static int
> > notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
> > >     }
> > >
> > > >     if (ORTE_SUCCESS != (ret = orte_rml.send_buffer_nb(&(orterun_hnp->name), buffer,
> > > > -                                                       ORTE_RML_TAG_CKPT, hnp_receiver,
> > > > +                                                       ORTE_RML_TAG_CKPT, orte_rml_send_callback,
> > >                                                        NULL))) {
> > >         exit_status = ret;
> > >         goto cleanup;
> > > @@ -845,11 +845,6 @@ static int
> > notify_process_for_checkpoint(opal_crs_base_ckpt_options_t *options)
> > >                         ORTE_JOBID_PRINT(jobid));
> > >
> > >  cleanup:
> > > -    if( NULL != buffer) {
> > > -        OBJ_RELEASE(buffer);
> > > -        buffer = NULL;
> > > -    }
> > > -
> > >     if( ORTE_SUCCESS != exit_status ) {
> > > >         opal_show_help("help-orte-checkpoint.txt", "unable_to_connect", true,
> > >                        orte_checkpoint_globals.pid);
> > >
> > >
> > > Before committing the code into the repository I wanted to make
> > > sure it is the correct way to fix it.
> > >
> > > The first change changes the callback to orte_rml_send_callback().
> > > When I initially made the code compile again I used hnp_receiver()
> > > to change the code from blocking to non-blocking and that was
> > > wrong.
> > >
> > > The second change (removal of OBJ_RELEASE(buffer)) is necessary
> > > because this seems to delete buffer during communication and then
> > > everything breaks badly.
> > >
> > >               Adrian
> > > _______________________________________________
> > > devel mailing list
> > > de...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> >
> 
> 
> 
> -- 
> Joshua Hursey
> Assistant Professor of Computer Science
> University of Wisconsin-La Crosse
> http://cs.uwlax.edu/~jjhursey



                Adrian

-- 
Adrian Reber <adr...@lisas.de>            http://lisas.de/~adrian/
Bing's Rule:
        Don't try to stem the tide -- move the beach.
