Hello all,
I recently checked in code for xcpu, so that xcpu can now be used as one
of the launchers within Open MPI.
It works, but I am running into one problem.
In the file trunk/orte/tools/orterun/totalview.c, on line 402,
proc->proc_node is NULL, which causes mpirun to crash with a segfault.
If I change line 402 from

    MPIR_proctable[i].host_name = proc->proc_node->node->node_name;

to

    if (proc->proc_node) {
        MPIR_proctable[i].host_name = proc->proc_node->node->node_name;
    }

it works fine.
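Note that with only that guard, MPIR_proctable[i].host_name is never set
when proc_node is NULL, so the debugger would see whatever happens to be
in that field. A slightly fuller sketch of what I have in mind (the
explicit NULL fallback is just my assumption about a sane default, not
something I have verified against the TotalView interface):

    if (proc->proc_node != NULL) {
        MPIR_proctable[i].host_name = proc->proc_node->node->node_name;
    } else {
        /* assumption: store an explicit NULL rather than leave stale
         * memory; the right fallback depends on why proc_node is unset */
        MPIR_proctable[i].host_name = NULL;
    }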
I am not sure why proc->proc_node is NULL in the first place. Any input
would be appreciated.
Thanks a lot.
-Sushant
----------------------------------------------------------
Here is the gdb output for mpirun:
(gdb) run --mca pls xcpu --hostfile /home/sushant/ompi/my-tests/hostfile -np 1 /home/sushant/ompi/my-tests/hello.o
Starting program: /home/sushant/ompi/install/bin/mpirun --mca pls xcpu --hostfile /home/sushant/ompi/my-tests/hostfile -np 1 /home/sushant/ompi/my-tests/hello.o
[Thread debugging using libthread_db enabled]
[New Thread -1210691456 (LWP 8117)]
[New Thread -1211511888 (LWP 8120)]
[New Thread -1219900496 (LWP 8126)]
[New Thread -1228289104 (LWP 8127)]
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1210691456 (LWP 8117)]
0x0804d6b9 in orte_totalview_init_after_spawn (jobid=1)
    at ../../../../trunk/orte/tools/orterun/totalview.c:402
402         MPIR_proctable[i].host_name = proc->proc_node->node->node_name;
(gdb) where
#0  0x0804d6b9 in orte_totalview_init_after_spawn (jobid=1)
    at ../../../../trunk/orte/tools/orterun/totalview.c:402
#1  0x0804af92 in job_state_callback (jobid=1, state=4)
    at ../../../../trunk/orte/tools/orterun/orterun.c:638
#2  0xb7d5b8bc in orte_rmgr_urm_callback (data=0x80c7f00, cbdata=0x804aee8)
    at ../../../../../trunk/orte/mca/rmgr/urm/rmgr_urm.c:282
#3  0xb7ced98d in orte_gpr_replica_deliver_notify_msg (msg=0x80c7ed0)
    at ../../../../../../trunk/orte/mca/gpr/replica/api_layer/gpr_replica_deliver_notify_msg_api.c:134
#4  0xb7cf68b9 in orte_gpr_replica_process_callbacks ()
    at ../../../../../../trunk/orte/mca/gpr/replica/functional_layer/gpr_replica_messaging_fn.c:80
#5  0xb7d0221f in orte_gpr_replica_recv (status=1564, sender=0x80670a0,
    buffer=0xbfbc2820, tag=2, cbdata=0x0)
    at ../../../../../../trunk/orte/mca/gpr/replica/communications/gpr_replica_recv_proxy_msgs.c:85
#6  0xb7f74b4a in mca_oob_recv_callback (status=1564, peer=0x80670a0,
    msg=0x8083ec0, count=1, tag=2, cbdata=0x8083ec0)
    at ../../../../trunk/orte/mca/oob/base/oob_base_recv_nb.c:159
#7  0xb7d2e8ec in mca_oob_tcp_msg_data (msg=0x8068460, peer=0x8067080)
    at ../../../../../trunk/orte/mca/oob/tcp/oob_tcp_msg.c:487
#8  0xb7d2e506 in mca_oob_tcp_msg_recv_complete (msg=0x8068460, peer=0x8067080)
    at ../../../../../trunk/orte/mca/oob/tcp/oob_tcp_msg.c:396
#9  0xb7d31cf2 in mca_oob_tcp_peer_recv_handler (sd=10, flags=2, user=0x8067080)
    at ../../../../../trunk/orte/mca/oob/tcp/oob_tcp_peer.c:715
#10 0xb7f0990a in opal_event_process_active ()
    at ../../../trunk/opal/event/event.c:428
#11 0xb7f09bc1 in opal_event_loop (flags=1)
    at ../../../trunk/opal/event/event.c:513
#12 0xb7f02d81 in opal_progress ()
    at ../../trunk/opal/runtime/opal_progress.c:259
#13 0x0804c976 in opal_condition_wait (c=0x804fa90, m=0x804fa64)
    at condition.h:81
#14 0x0804a660 in orterun (argc=9, argv=0xbfbc2b24)
    at ../../../../trunk/orte/tools/orterun/orterun.c:415
#15 0x08049e76 in main (argc=9, argv=0xbfbc2b24)
    at ../../../../trunk/orte/tools/orterun/main.c:13
(gdb)