I found that Open MPI is looking for the file contact.txt in the wrong directory. It always searches for the file in "/tmp/openmpi-sessions-root@debian_0/<MPIRUN PID>/", but the file exists only in "/tmp/openmpi-sessions-root@debian_0/<MPIRUN PID>/0". When I copy contact.txt to the directory where Open MPI searches, "ompi-ps" and "ompi-checkpoint" work.
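For anyone who wants to try the same workaround, here is a minimal sketch of the copy step. The helper name copy_contact is hypothetical (it is not part of Open MPI), and the sketch assumes the session directory layout described above, with contact.txt sitting in the "0" subdirectory. It only works around the lookup; it does not fix the search path itself.

```shell
#!/bin/sh
# Workaround sketch (copy_contact is a hypothetical helper, not an Open MPI tool).
# $1 is the per-mpirun session directory, i.e. the directory in which
# ompi-ps and ompi-checkpoint were observed to look for contact.txt.
copy_contact() {
    session_dir="$1"
    src="$session_dir/0/contact.txt"

    # Refuse to continue if the file is not where it was observed to be.
    if [ ! -f "$src" ]; then
        echo "contact.txt not found in $session_dir/0" >&2
        return 1
    fi

    # Copy the file one level up, into the directory that is searched.
    cp "$src" "$session_dir/contact.txt"
}
```

Call it with the session directory of the running mpirun as its only argument, then rerun "ompi-ps" or "ompi-checkpoint".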
On Mon, Aug 11, 2008 at 4:06 PM, Caciano Machado <caci...@gmail.com> wrote:
> Hi,
>
> I'm trying to run the last checkpoint/restart (rev 19235) but ompi is
> showing the following error in "ompi-checkpoint".
>
> It seems to be something in function "orte_list_local_hnps" of the
> file orte/util/hnp_contact.c. I'm using BLCR 0.7.2 and it's working
> correctly with the example applications.
>
> ################################################################
> root@debian:~/pp# ompi-clean
> root@debian:~/pp# mpirun -machinefile machinefile -np 2 -am
> ft-enable-cr -v -d pp 1 2 1000000
> [debian:27936] procdir: /tmp/openmpi-sessions-root@debian_0/31810/0/0
> [debian:27936] jobdir: /tmp/openmpi-sessions-root@debian_0/31810/0
> [debian:27936] top: openmpi-sessions-root@debian_0
> [debian:27936] tmp: /tmp
> [debian:27936] [[31810,0],0] hostfile: checking hostfile machinefile for nodes
> [debian:27936] [[31810,0],0] hostfile: filtering nodes through
> hostfile machinefile
> [debian:27936] progressed_wait: base/plm_base_launch_support.c 436
> [debian:27940] procdir: /tmp/openmpi-sessions-root@debian_0/31810/0/1
> [debian:27940] jobdir: /tmp/openmpi-sessions-root@debian_0/31810/0
> [debian:27940] top: openmpi-sessions-root@debian_0
> [debian:27940] tmp: /tmp
> [debian:27936] defining message event: base/plm_base_launch_support.c 400
> [debian:27936] defining message event: grpcomm_bad_module.c 183
> [debian:27936] progressed_wait: base/plm_base_launch_support.c 679
> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
> [[31810,0],0] for tag 1
> [debian:27936] defining message event: orted/orted_comm.c 382
> [debian:27936] [[31810,0],0] node[0].name debian daemon 0 arch ffca0200
> [debian:27936] [[31810,0],0] node[1].name debian daemon 1 arch ffca0200
> [debian:27936] defining message event: base/odls_base_default_fns.c 1060
> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
> commands completed
> [debian:27936] [[31810,0],0] orte:daemon:send_relay
> [debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay msg
> to [[31810,0],1]
> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
> [[31810,0],0]
> [debian:27940] defining message event: orted/orted_comm.c 277
> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
> [[31810,0],0] for tag 1
> [debian:27940] defining message event: orted/orted_comm.c 382
> [debian:27940] [[31810,0],1] node[0].name debian daemon 0 arch ffca0200
> [debian:27940] [[31810,0],1] node[1].name debian daemon 1 arch ffca0200
> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
> commands completed
> [debian:27940] [[31810,0],1] orte:daemon:send_relay
> [debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient list is empty!
> [debian:27936] defining message event: base/plm_base_launch_support.c 635
> [debian:27936] Info: Setting up debugger process table for applications
> MPIR_being_debugged = 0
> MPIR_debug_state = 1
> MPIR_partial_attach_ok = 1
> MPIR_i_am_starter = 0
> MPIR_proctable_size = 2
> MPIR_proctable:
> (i, host, exe, pid) = (0, debian, /root/pp/pp, 27941)
> (i, host, exe, pid) = (1, debian, /root/pp/pp, 27942)
> [debian:27942] procdir: /tmp/openmpi-sessions-root@debian_0/31810/1/1
> [debian:27941] procdir: /tmp/openmpi-sessions-root@debian_0/31810/1/0
> [debian:27941] jobdir: /tmp/openmpi-sessions-root@debian_0/31810/1
> [debian:27941] top: openmpi-sessions-root@debian_0
> [debian:27941] tmp: /tmp
> [debian:27942] jobdir: /tmp/openmpi-sessions-root@debian_0/31810/1
> [debian:27942] top: openmpi-sessions-root@debian_0
> [debian:27942] tmp: /tmp
> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
> [[31810,1],0]
> [debian:27940] defining message event: orted/orted_comm.c 277
> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
> [[31810,1],0] for tag 1
> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
> commands completed
> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
> [[31810,1],1]
> [debian:27940] defining message event: orted/orted_comm.c 277
> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
> [[31810,1],1] for tag 1
> [debian:27936] defining message event: base/routed_base_receive.c 153
> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
> commands completed
> [debian:27941] progressed_wait: base/routed_base_register_sync.c 104
> [debian:27942] progressed_wait: base/routed_base_register_sync.c 104
> [debian:27941] [[31810,1],0] node[0].name debian daemon 0 arch ffca0200
> [debian:27941] [[31810,1],0] node[1].name debian daemon 1 arch ffca0200
> [debian:27942] [[31810,1],1] node[0].name debian daemon 0 arch ffca0200
> [debian:27942] [[31810,1],1] node[1].name debian daemon 1 arch ffca0200
> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
> [[31810,1],0]
> [debian:27940] defining message event: orted/orted_comm.c 277
> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
> [[31810,1],0] for tag 1
> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
> commands completed
> [debian:27941] progressed_wait: grpcomm_bad_module.c 394
> [debian:27936] [[31810,0],0] orted_recv_cmd: received message from
> [[31810,0],1]
> [debian:27936] defining message event: orted/orted_comm.c 277
> [debian:27936] [[31810,0],0] orted_recv_cmd: reissued recv
> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
> [[31810,0],1] for tag 1
> [debian:27936] defining message event: grpcomm_bad_module.c 183
> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
> commands completed
> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
> [[31810,0],0] for tag 1
> [debian:27936] defining message event: orted/orted_comm.c 382
> [debian:27936] [[31810,0],0] orted:comm:message_local_procs delivering
> message to job [31810,1] tag 15
> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
> commands completed
> [debian:27936] [[31810,0],0] orte:daemon:send_relay
> [debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay msg
> to [[31810,0],1]
> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
> [[31810,1],1]
> [debian:27940] defining message event: orted/orted_comm.c 277
> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
> [[31810,1],1] for tag 1
> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
> commands completed
> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
> [[31810,0],0]
> [debian:27940] defining message event: orted/orted_comm.c 277
> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
> [[31810,0],0] for tag 1
> [debian:27940] defining message event: orted/orted_comm.c 382
> [debian:27940] [[31810,0],1] orted:comm:message_local_procs delivering
> message to job [31810,1] tag 15
> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
> commands completed
> [debian:27940] [[31810,0],1] orte:daemon:send_relay
> [debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient list is empty!
> [debian:27942] progressed_wait: grpcomm_bad_module.c 394
> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
> [[31810,1],1]
> [debian:27940] defining message event: orted/orted_comm.c 277
> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
> [[31810,1],1] for tag 1
> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
> commands completed
> [debian:27942] progressed_wait: grpcomm_bad_module.c 270
> [debian:27936] [[31810,0],0] orted_recv_cmd: received message from
> [[31810,0],1]
> [debian:27936] defining message event: orted/orted_comm.c 277
> [debian:27936] [[31810,0],0] orted_recv_cmd: reissued recv
> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
> [[31810,0],1] for tag 1
> [debian:27936] defining message event: grpcomm_bad_module.c 183
> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
> commands completed
> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor called by
> [[31810,0],0] for tag 1
> [debian:27936] defining message event: orted/orted_comm.c 382
> [debian:27936] [[31810,0],0] orted:comm:message_local_procs delivering
> message to job [31810,1] tag 17
> [debian:27936] [[31810,0],0] orte:daemon:cmd:processor: processing
> commands completed
> [debian:27936] [[31810,0],0] orte:daemon:send_relay
> [debian:27936] [[31810,0],0] orte:daemon:send_relay sending relay msg
> to [[31810,0],1]
> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
> [[31810,1],0]
> [debian:27940] defining message event: orted/orted_comm.c 277
> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
> [[31810,1],0] for tag 1
> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
> commands completed
> [debian:27940] [[31810,0],1] orted_recv_cmd: received message from
> [[31810,0],0]
> [debian:27940] defining message event: orted/orted_comm.c 277
> [debian:27940] [[31810,0],1] orted_recv_cmd: reissued recv
> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor called by
> [[31810,0],0] for tag 1
> [debian:27940] defining message event: orted/orted_comm.c 382
> [debian:27940] [[31810,0],1] orted:comm:message_local_procs delivering
> message to job [31810,1] tag 17
> [debian:27940] [[31810,0],1] orte:daemon:cmd:processor: processing
> commands completed
> [debian:27940] [[31810,0],1] orte:daemon:send_relay
> [debian:27940] [[31810,0],1] orte:daemon:send_relay - recipient list is empty!
> [debian:27941] progressed_wait: grpcomm_bad_module.c 270
> #
> # ping-pong com MPI
> #
> # msgs from 1 to 2 bytes
> # results are the mean of 1000000 repetitions for each msg size
> # Tue Aug 12 06:26:29 2008
> #
> # size lat (us) bw (MB/s)
>
> ################################################################
> 27936 pts/1 S+ 0:00 mpirun -machinefile machinefile -np 2 -am
> ft-enable-cr -v -d pp 1 2 1000000
> 27937 pts/1 S+ 0:00 /usr/bin/ssh -x debian orted --debug
> --heartbeat 0 -mca ess env -mca orte_ess_jobid 2084700160
> 27938 ? Ss 0:00 sshd: root@notty
> 27940 ? Ss 0:00 orted --debug --heartbeat 0 -mca ess env
> -mca orte_ess_jobid 2084700160 -mca orte_ess_vpid 1 -mc
> 27941 ? Rl 0:21 pp 1 2 1000000
> 27942 ? Rl 0:21 pp 1 2 1000000
> 28021 pts/0 R+ 0:00 ps xa
>
> root@debian:~/pp# ompi-checkpoint 27936 -v
> [debian:28022] [[31764,0],0] ORTE_ERROR_LOG: Not found in file
> orte-checkpoint.c at line 395
> [debian:28022] HNP with PID 27936 Not found!
>
> ################################################################
>
> Regards,
> Caciano Machado
> Computer Science Graduate Student/UFRGS
>