Thanks. Now it's working.
On Tue, Aug 12, 2008 at 8:21 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> Ralph committed a proper fix yesterday; see if that works for you.
>
> On Aug 11, 2008, at 7:44 PM, Caciano Machado wrote:
>
>> Jeff,
>>
>> Here is an ugly hack that I'm using to get this working in Linux until
>> Josh returns.
>>
>> ##########################################################
>> --- ompi-trunk/orte/util/hnp_contact.c 2008-08-12 12:10:07.000000000 +0200
>> +++ ompi-trunk-caciano/orte/util/hnp_contact.c 2008-08-12 12:08:52.000000000 +0200
>> @@ -255,7 +255,7 @@
>>           * See if a contact file exists in this directory and read it
>>           */
>>          contact_filename = opal_os_path( false, headdir,
>> -                                         dir_entry->d_name, "contact.txt", NULL );
>> +                                         dir_entry->d_name, "0/contact.txt", NULL );
>>
>>          hnp = OBJ_NEW(orte_hnp_contact_t);
>>          if (ORTE_SUCCESS == (ret = orte_read_hnp_contact_file(contact_filename, hnp))) {
>> ##########################################################
>>
>> Regards
>>
>> On Mon, Aug 11, 2008 at 8:28 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
>>>
>>> This is likely due to two things:
>>>
>>> - we just made some minor changes to the session directory stuff
>>> - the checkpoint/restart guy (Josh) is off on vacation for about 3 weeks
>>>
>>> I'll file a ticket about this so that he's aware of it and can fix it
>>> when he returns.
>>>
>>> Thanks for the heads-up!
>>>
>>>
>>> On Aug 11, 2008, at 7:16 PM, Caciano Machado wrote:
>>>
>>>> I found that Open MPI is looking for the file contact.txt in the wrong
>>>> directory. It always searches for the file in the directory
>>>> "/tmp/openmpi-sessions-root@debian_0/<MPIRUN PID>/" but this file
>>>> exists only in "/tmp/openmpi-sessions-root@debian_0/<MPIRUN PID>/0".
>>>> When I copy contact.txt to the directory where Open MPI searches, then
>>>> "ompi-ps" and "ompi-checkpoint" work.
>>>>
>>>> On Mon, Aug 11, 2008 at 4:06 PM, Caciano Machado <caci...@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm trying to run the latest checkpoint/restart (rev 19235) but ompi is
>>>>> showing the following error in "ompi-checkpoint".
>>>>>
>>>>> It seems to be something in function "orte_list_local_hnps" of the
>>>>> file orte/util/hnp_contact.c. I'm using BLCR 0.7.2 and it's working
>>>>> correctly with the example applications.
>>>>>
>>>>> ################################################################
>>>>> root@debian:~/pp# ompi-clean
>>>>> root@debian:~/pp# mpirun -machinefile machinefile -np 2 -am ft-enable-cr -v -d pp 1 2 1000000
>>>>> [debian:27936] procdir: /tmp/openmpi-sessions-root@debian_0/31810/0/0
>>>>> [debian:27936] jobdir: /tmp/openmpi-sessions-root@debian_0/31810/0
>>>>> [debian:27936] top: openmpi-sessions-root@debian_0
>>>>> [debian:27936] tmp: /tmp
>>>>> [debian:27936] [[31810,0],0] hostfile: checking hostfile machinefile for nodes
>>>>> [debian:27936] [[31810,0],0] hostfile: filtering nodes through hostfile machinefile
>>>>> [debian:27936] progressed_wait: base/plm_base_launch_support.c 436
>>>>> [debian:27940] procdir: /tmp/openmpi-sessions-root@debian_0/31810/0/1
>>>>> [debian:27940] jobdir: /tmp/openmpi-sessions-root@debian_0/31810/0
>>>>> [debian:27940] top: openmpi-sessions-root@debian_0
>>>>> [debian:27940] tmp: /tmp
>>>>> [debian:27936] defining message event: base/plm_base_launch_support.c 400
>>>>> [debian:27936] defining message event: grpcomm_bad_module.c 183
>>>>> [debian:27936] progressed_wait: base/plm_base_launch_support.c 679
>>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by [[31810,0],0] for tag 1
>>>>> [debian:27936] defining message event: orted/orted_comm.c 382
>>>>> [debian:27936] [[31810,0],0] node[0].name debian daemon 0 arch ffca0200
>>>>> [debian:27936] [[31810,0],0] node[1].name debian daemon 1 arch ffca0200
>>>>> [debian:27936] defining message event: base/odls_base_default_fns.c 1060
>>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing commands completed
>>>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay
>>>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay msg to [[31810,0],1]
>>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from [[31810,0],0]
>>>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by [[31810,0],0] for tag 1
>>>>> [debian:27940] defining message event: orted/orted_comm.c 382
>>>>> [debian:27940] [[31810,0],1] node[0].name debian daemon 0 arch ffca0200
>>>>> [debian:27940] [[31810,0],1] node[1].name debian daemon 1 arch ffca0200
>>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing commands completed
>>>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay
>>>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient list is empty!
>>>>> [debian:27936] defining message event: base/plm_base_launch_support.c 635
>>>>> [debian:27936] Info: Setting up debugger process table for applications
>>>>> MPIR_being_debugged = 0
>>>>> MPIR_debug_state = 1
>>>>> MPIR_partial_attach_ok = 1
>>>>> MPIR_i_am_starter = 0
>>>>> MPIR_proctable_size = 2
>>>>> MPIR_proctable:
>>>>> (i, host, exe, pid) = (0, debian, /root/pp/pp, 27941)
>>>>> (i, host, exe, pid) = (1, debian, /root/pp/pp, 27942)
>>>>> [debian:27942] procdir: /tmp/openmpi-sessions-root@debian_0/31810/1/1
>>>>> [debian:27941] procdir: /tmp/openmpi-sessions-root@debian_0/31810/1/0
>>>>> [debian:27941] jobdir: /tmp/openmpi-sessions-root@debian_0/31810/1
>>>>> [debian:27941] top: openmpi-sessions-root@debian_0
>>>>> [debian:27941] tmp: /tmp
>>>>> [debian:27942] jobdir: /tmp/openmpi-sessions-root@debian_0/31810/1
>>>>> [debian:27942] top: openmpi-sessions-root@debian_0
>>>>> [debian:27942] tmp: /tmp
>>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from [[31810,1],0]
>>>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by [[31810,1],0] for tag 1
>>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing commands completed
>>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from [[31810,1],1]
>>>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by [[31810,1],1] for tag 1
>>>>> [debian:27936] defining message event: base/routed_base_receive.c 153
>>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing commands completed
>>>>> [debian:27941] progressed_wait: base/routed_base_register_sync.c 104
>>>>> [debian:27942] progressed_wait: base/routed_base_register_sync.c 104
>>>>> [debian:27941] [[31810,1],0] node[0].name debian daemon 0 arch ffca0200
>>>>> [debian:27941] [[31810,1],0] node[1].name debian daemon 1 arch ffca0200
>>>>> [debian:27942] [[31810,1],1] node[0].name debian daemon 0 arch ffca0200
>>>>> [debian:27942] [[31810,1],1] node[1].name debian daemon 1 arch ffca0200
>>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from [[31810,1],0]
>>>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by [[31810,1],0] for tag 1
>>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing commands completed
>>>>> [debian:27941] progressed_wait: grpcomm_bad_module.c 394
>>>>> [debian:27936] [[31810,0],0] orted_recv_cmd: received message from [[31810,0],1]
>>>>> [debian:27936] defining message event: orted/orted_comm.c 277
>>>>> [debian:27936] [[31810,0],0] orted_recv_cmd: reissued recv
>>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by [[31810,0],1] for tag 1
>>>>> [debian:27936] defining message event: grpcomm_bad_module.c 183
>>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing commands completed
>>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by [[31810,0],0] for tag 1
>>>>> [debian:27936] defining message event: orted/orted_comm.c 382
>>>>> [debian:27936] [[31810,0],0] orted:comm:message_local_procs delivering message to job [31810,1] tag 15
>>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing commands completed
>>>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay
>>>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay msg to [[31810,0],1]
>>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from [[31810,1],1]
>>>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by [[31810,1],1] for tag 1
>>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing commands completed
>>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from [[31810,0],0]
>>>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by [[31810,0],0] for tag 1
>>>>> [debian:27940] defining message event: orted/orted_comm.c 382
>>>>> [debian:27940] [[31810,0],1] orted:comm:message_local_procs delivering message to job [31810,1] tag 15
>>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing commands completed
>>>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay
>>>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient list is empty!
>>>>> [debian:27942] progressed_wait: grpcomm_bad_module.c 394
>>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from [[31810,1],1]
>>>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by [[31810,1],1] for tag 1
>>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing commands completed
>>>>> [debian:27942] progressed_wait: grpcomm_bad_module.c 270
>>>>> [debian:27936] [[31810,0],0] orted_recv_cmd: received message from [[31810,0],1]
>>>>> [debian:27936] defining message event: orted/orted_comm.c 277
>>>>> [debian:27936] [[31810,0],0] orted_recv_cmd: reissued recv
>>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by [[31810,0],1] for tag 1
>>>>> [debian:27936] defining message event: grpcomm_bad_module.c 183
>>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing commands completed
>>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by [[31810,0],0] for tag 1
>>>>> [debian:27936] defining message event: orted/orted_comm.c 382
>>>>> [debian:27936] [[31810,0],0] orted:comm:message_local_procs delivering message to job [31810,1] tag 17
>>>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing commands completed
>>>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay
>>>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay msg to [[31810,0],1]
>>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from [[31810,1],0]
>>>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by [[31810,1],0] for tag 1
>>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing commands completed
>>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from [[31810,0],0]
>>>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by [[31810,0],0] for tag 1
>>>>> [debian:27940] defining message event: orted/orted_comm.c 382
>>>>> [debian:27940] [[31810,0],1] orted:comm:message_local_procs delivering message to job [31810,1] tag 17
>>>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing commands completed
>>>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay
>>>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient list is empty!
>>>>> [debian:27941] progressed_wait: grpcomm_bad_module.c 270
>>>>> #
>>>>> # ping-pong com MPI
>>>>> #
>>>>> # msgs from 1 to 2 bytes
>>>>> # results are the mean of 1000000 repetitions for each msg size
>>>>> # Tue Aug 12 06:26:29 2008
>>>>> #
>>>>> # size lat (us) bw (MB/s)
>>>>>
>>>>> ################################################################
>>>>> 27936 pts/1 S+ 0:00 mpirun -machinefile machinefile -np 2 -am ft-enable-cr -v -d pp 1 2 1000000
>>>>> 27937 pts/1 S+ 0:00 /usr/bin/ssh -x debian orted --debug --heartbeat 0 -mca ess env -mca orte_ess_jobid 2084700160
>>>>> 27938 ? Ss 0:00 sshd: root@notty
>>>>> 27940 ? Ss 0:00 orted --debug --heartbeat 0 -mca ess env -mca orte_ess_jobid 2084700160 -mca orte_ess_vpid 1 -mc
>>>>> 27941 ? Rl 0:21 pp 1 2 1000000
>>>>> 27942 ? Rl 0:21 pp 1 2 1000000
>>>>> 28021 pts/0 R+ 0:00 ps xa
>>>>>
>>>>> root@debian:~/pp# ompi-checkpoint 27936 -v
>>>>> [debian:28022] [[31764,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 395
>>>>> [debian:28022] HNP with PID 27936 Not found!
>>>>>
>>>>> ################################################################
>>>>>
>>>>> Regards,
>>>>> Caciano Machado
>>>>> Computer Science Graduate Student/UFRGS
>>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>
>
> --
> Jeff Squyres
> Cisco Systems
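
For anyone reading this thread later: the lookup problem discussed above (contact.txt expected directly in the mpirun session directory, but actually written one level down in "0/") can be illustrated with a small standalone C sketch. This is not the fix Ralph committed, which is not shown in this thread, and it does not use any Open MPI or OPAL calls. The helper name find_contact_file and the candidates table are hypothetical, and the only assumption about the layout is the one reported in the thread. It simply tries both locations with POSIX access().

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not part of Open MPI): look for contact.txt under
 * one mpirun session directory. It tries the location the tools searched
 * first, then the "0" subdirectory where the file was reported to live.
 * Returns 0 and leaves the matching path in `out`, or -1 if neither
 * candidate is readable. */
static int find_contact_file(const char *session_dir, char *out, size_t out_len)
{
    static const char *candidates[] = { "contact.txt", "0/contact.txt" };
    size_t i;

    for (i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
        int n = snprintf(out, out_len, "%s/%s", session_dir, candidates[i]);
        if (n < 0 || (size_t)n >= out_len) {
            continue;   /* path did not fit in the buffer; skip it */
        }
        if (access(out, R_OK) == 0) {
            return 0;   /* found a readable contact file */
        }
    }
    return -1;
}
```

A caller would pass one per-job directory under the session root (the entries that orte_list_local_hnps walks in the patch above) and, on a 0 return, hand the resulting path to whatever reads the contact file. Checking both locations avoids hard-coding "0/" the way the quick hack does, at the cost of one extra access() call.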