Jeff,

Here is an ugly hack that I'm using to get this working on Linux until
Josh returns.

##########################################################
--- ompi-trunk/orte/util/hnp_contact.c  2008-08-12 12:10:07.000000000 +0200
+++ ompi-trunk-caciano/orte/util/hnp_contact.c  2008-08-12 12:08:52.000000000 +0200
@@ -255,7 +255,7 @@
          * See if a contact file exists in this directory and read it
          */
         contact_filename = opal_os_path( false, headdir,
-                                        dir_entry->d_name, "contact.txt", NULL );
+                                        dir_entry->d_name, "0/contact.txt", NULL );
 
         hnp = OBJ_NEW(orte_hnp_contact_t);
         if (ORTE_SUCCESS == (ret = orte_read_hnp_contact_file(contact_filename, hnp))) {
##########################################################
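
A slightly less brittle variant of the same idea would be to probe both locations instead of hard-coding "0/contact.txt". The following is only a standalone sketch under my own assumptions: build_contact_path and find_contact_file are hypothetical helpers, not Open MPI functions, and they use plain snprintf() and POSIX access() rather than opal_os_path().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not part of Open MPI): writes
 * "headdir/entry/relative" into out. Returns 0 on success,
 * -1 if the joined path does not fit in outlen bytes. */
int build_contact_path(const char *headdir, const char *entry,
                       const char *relative, char *out, size_t outlen)
{
    int n;

    assert(headdir != NULL && entry != NULL);
    assert(relative != NULL && out != NULL);

    n = snprintf(out, outlen, "%s/%s/%s", headdir, entry, relative);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Hypothetical helper: tries the old location first, then the
 * "0/" subdirectory. On success returns 0 and leaves the readable
 * path in out; returns -1 if neither candidate is readable. */
int find_contact_file(const char *headdir, const char *entry,
                      char *out, size_t outlen)
{
    static const char *candidates[] = { "contact.txt", "0/contact.txt" };
    size_t i;

    for (i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
        if (0 == build_contact_path(headdir, entry, candidates[i],
                                    out, outlen)
            && 0 == access(out, R_OK)) {
            return 0;
        }
    }
    return -1;
}
```

In the patched loop, the result of find_contact_file(headdir, dir_entry->d_name, ...) would stand in for contact_filename. Whether falling back like this is the right fix, or the session directory layout should be corrected instead, is a question for Josh.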

Regards

On Mon, Aug 11, 2008 at 8:28 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
> This is likely due to two things:
>
> - we just made some minor changes to the session directory stuff
> - the checkpoint/restart guy (Josh) is off on vacation for about 3 weeks
>
> I'll file a ticket about this so that he's aware of it and can fix it when
> he returns.
>
> Thanks for the heads-up!
>
>
> On Aug 11, 2008, at 7:16 PM, Caciano Machado wrote:
>
>> I found that Open MPI is looking for the file contact.txt in the wrong
>> directory. It always searches for the file in the directory
>> "/tmp/openmpi-sessions-root@debian_0/<MPIRUN PID>/", but the file
>> exists only in "/tmp/openmpi-sessions-root@debian_0/<MPIRUN PID>/0".
>> When I copy contact.txt to the directory where Open MPI searches,
>> "ompi-ps" and "ompi-checkpoint" work.
>>
>> On Mon, Aug 11, 2008 at 4:06 PM, Caciano Machado <caci...@gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> I'm trying to run the latest checkpoint/restart code (rev 19235), but
>>> "ompi-checkpoint" fails with the error shown at the end of this message.
>>>
>>> The problem seems to be in the function "orte_list_local_hnps" in the
>>> file orte/util/hnp_contact.c. I'm using BLCR 0.7.2, and it works
>>> correctly with the example applications.
>>>
>>> ################################################################
>>> root@debian:~/pp# ompi-clean
>>> root@debian:~/pp# mpirun -machinefile machinefile -np 2 -am
>>> ft-enable-cr -v -d pp 1 2 1000000
>>> [debian:27936] procdir: /tmp/openmpi-sessions-root@debian_0/31810/0/0
>>> [debian:27936] jobdir: /tmp/openmpi-sessions-root@debian_0/31810/0
>>> [debian:27936] top: openmpi-sessions-root@debian_0
>>> [debian:27936] tmp: /tmp
>>> [debian:27936] [[31810,0],0] hostfile: checking hostfile machinefile for
>>> nodes
>>> [debian:27936] [[31810,0],0] hostfile: filtering nodes through
>>> hostfile machinefile
>>> [debian:27936] progressed_wait: base/plm_base_launch_support.c 436
>>> [debian:27940] procdir: /tmp/openmpi-sessions-root@debian_0/31810/0/1
>>> [debian:27940] jobdir: /tmp/openmpi-sessions-root@debian_0/31810/0
>>> [debian:27940] top: openmpi-sessions-root@debian_0
>>> [debian:27940] tmp: /tmp
>>> [debian:27936] defining message event: base/plm_base_launch_support.c 400
>>> [debian:27936] defining message event: grpcomm_bad_module.c 183
>>> [debian:27936] progressed_wait: base/plm_base_launch_support.c 679
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>>> [[31810,0],0] for tag 1
>>> [debian:27936] defining message event: orted/orted_comm.c 382
>>> [debian:27936] [[31810,0],0] node[0].name debian daemon 0 arch ffca0200
>>> [debian:27936] [[31810,0],0] node[1].name debian daemon 1 arch ffca0200
>>> [debian:27936] defining message event: base/odls_base_default_fns.c 1060
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay
>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay msg
>>> to [[31810,0],1]
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>> [[31810,0],0]
>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>> [[31810,0],0] for tag 1
>>> [debian:27940] defining message event: orted/orted_comm.c 382
>>> [debian:27940] [[31810,0],1] node[0].name debian daemon 0 arch ffca0200
>>> [debian:27940] [[31810,0],1] node[1].name debian daemon 1 arch ffca0200
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay
>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient list is
>>> empty!
>>> [debian:27936] defining message event: base/plm_base_launch_support.c 635
>>> [debian:27936] Info: Setting up debugger process table for applications
>>> MPIR_being_debugged = 0
>>> MPIR_debug_state = 1
>>> MPIR_partial_attach_ok = 1
>>> MPIR_i_am_starter = 0
>>> MPIR_proctable_size = 2
>>> MPIR_proctable:
>>>  (i, host, exe, pid) = (0, debian, /root/pp/pp, 27941)
>>>  (i, host, exe, pid) = (1, debian, /root/pp/pp, 27942)
>>> [debian:27942] procdir: /tmp/openmpi-sessions-root@debian_0/31810/1/1
>>> [debian:27941] procdir: /tmp/openmpi-sessions-root@debian_0/31810/1/0
>>> [debian:27941] jobdir: /tmp/openmpi-sessions-root@debian_0/31810/1
>>> [debian:27941] top: openmpi-sessions-root@debian_0
>>> [debian:27941] tmp: /tmp
>>> [debian:27942] jobdir: /tmp/openmpi-sessions-root@debian_0/31810/1
>>> [debian:27942] top: openmpi-sessions-root@debian_0
>>> [debian:27942] tmp: /tmp
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>> [[31810,1],0]
>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>> [[31810,1],0] for tag 1
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>> [[31810,1],1]
>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>> [[31810,1],1] for tag 1
>>> [debian:27936] defining message event: base/routed_base_receive.c 153
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27941] progressed_wait: base/routed_base_register_sync.c 104
>>> [debian:27942] progressed_wait: base/routed_base_register_sync.c 104
>>> [debian:27941] [[31810,1],0] node[0].name debian daemon 0 arch ffca0200
>>> [debian:27941] [[31810,1],0] node[1].name debian daemon 1 arch ffca0200
>>> [debian:27942] [[31810,1],1] node[0].name debian daemon 0 arch ffca0200
>>> [debian:27942] [[31810,1],1] node[1].name debian daemon 1 arch ffca0200
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>> [[31810,1],0]
>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>> [[31810,1],0] for tag 1
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27941] progressed_wait: grpcomm_bad_module.c 394
>>> [debian:27936] [[31810,0],0] orted_recv_cmd: received message from
>>> [[31810,0],1]
>>> [debian:27936] defining message event: orted/orted_comm.c 277
>>> [debian:27936] [[31810,0],0] orted_recv_cmd: reissued recv
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>>> [[31810,0],1] for tag 1
>>> [debian:27936] defining message event: grpcomm_bad_module.c 183
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>>> [[31810,0],0] for tag 1
>>> [debian:27936] defining message event: orted/orted_comm.c 382
>>> [debian:27936] [[31810,0],0] orted:comm:message_local_procs delivering
>>> message to job [31810,1] tag 15
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay
>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay msg
>>> to [[31810,0],1]
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>> [[31810,1],1]
>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>> [[31810,1],1] for tag 1
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>> [[31810,0],0]
>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>> [[31810,0],0] for tag 1
>>> [debian:27940] defining message event: orted/orted_comm.c 382
>>> [debian:27940] [[31810,0],1] orted:comm:message_local_procs delivering
>>> message to job [31810,1] tag 15
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay
>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient list is
>>> empty!
>>> [debian:27942] progressed_wait: grpcomm_bad_module.c 394
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>> [[31810,1],1]
>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>> [[31810,1],1] for tag 1
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27942] progressed_wait: grpcomm_bad_module.c 270
>>> [debian:27936] [[31810,0],0] orted_recv_cmd: received message from
>>> [[31810,0],1]
>>> [debian:27936] defining message event: orted/orted_comm.c 277
>>> [debian:27936] [[31810,0],0] orted_recv_cmd: reissued recv
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>>> [[31810,0],1] for tag 1
>>> [debian:27936] defining message event: grpcomm_bad_module.c 183
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
>>> [[31810,0],0] for tag 1
>>> [debian:27936] defining message event: orted/orted_comm.c 382
>>> [debian:27936] [[31810,0],0] orted:comm:message_local_procs delivering
>>> message to job [31810,1] tag 17
>>> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay
>>> [debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay msg
>>> to [[31810,0],1]
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>> [[31810,1],0]
>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>> [[31810,1],0] for tag 1
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
>>> [[31810,0],0]
>>> [debian:27940] defining message event: orted/orted_comm.c 277
>>> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
>>> [[31810,0],0] for tag 1
>>> [debian:27940] defining message event: orted/orted_comm.c 382
>>> [debian:27940] [[31810,0],1] orted:comm:message_local_procs delivering
>>> message to job [31810,1] tag 17
>>> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
>>> commands completed
>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay
>>> [debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient list is
>>> empty!
>>> [debian:27941] progressed_wait: grpcomm_bad_module.c 270
>>> #
>>> # ping-pong com MPI
>>> #
>>> # msgs from 1 to 2 bytes
>>> # results are the mean of 1000000 repetitions for each msg size
>>> # Tue Aug 12 06:26:29 2008
>>> #
>>> #   size       lat (us)      bw (MB/s)
>>>
>>> ################################################################
>>> 27936 pts/1    S+     0:00 mpirun -machinefile machinefile -np 2 -am
>>> ft-enable-cr -v -d pp 1 2 1000000
>>> 27937 pts/1    S+     0:00 /usr/bin/ssh -x debian  orted --debug
>>> --heartbeat 0 -mca ess env -mca orte_ess_jobid 2084700160
>>> 27938 ?        Ss     0:00 sshd: root@notty
>>> 27940 ?        Ss     0:00 orted --debug --heartbeat 0 -mca ess env
>>> -mca orte_ess_jobid 2084700160 -mca orte_ess_vpid 1 -mc
>>> 27941 ?        Rl     0:21 pp 1 2 1000000
>>> 27942 ?        Rl     0:21 pp 1 2 1000000
>>> 28021 pts/0    R+     0:00 ps xa
>>>
>>> root@debian:~/pp# ompi-checkpoint 27936 -v
>>> [debian:28022] [[31764,0],0] ORTE_ERROR_LOG: Not found in file
>>> orte-checkpoint.c at line 395
>>> [debian:28022] HNP with PID 27936 Not found!
>>>
>>> ################################################################
>>>
>>> Regards,
>>> Caciano Machado
>>> Computer Science Graduate Student/UFRGS
>>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Jeff Squyres
> Cisco Systems