Hi,

Since upgrading to Mac OS X 10.5.1, I've been having problems running MPI programs (with both Open MPI 1.2.4 and 1.2.5). The symptoms are intermittent (i.e. sometimes the application runs fine) and appear as follows:

1. One or more of the application processes die (I've seen both one and two processes die).

2. The orteds associated with these application processes then appear to spin continually.

Here is what I see when I run "mpirun -np 4 ./mpitest":

12467   ??  Rs     1:26.52 orted --bootproxy 1 --name 0.0.1 --num_procs 5 --vpid_start 0 --nodename node0 --universe greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
12468   ??  Rs     1:26.63 orted --bootproxy 1 --name 0.0.2 --num_procs 5 --vpid_start 0 --nodename node1 --universe greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
12469   ??  Ss     0:00.04 orted --bootproxy 1 --name 0.0.3 --num_procs 5 --vpid_start 0 --nodename node2 --universe greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
12470   ??  Ss     0:00.04 orted --bootproxy 1 --name 0.0.4 --num_procs 5 --vpid_start 0 --nodename node3 --universe greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
12471   ??  S      0:00.05 ./mpitest
12472   ??  S      0:00.05 ./mpitest

Killing the mpirun results in:

$ mpirun -np 4 ./mpitest
^Cmpirun: killing job...

^C--------------------------------------------------------------------------
WARNING: mpirun is in the process of killing a job, but has detected an
interruption (probably control-C).

It is dangerous to interrupt mpirun while it is killing a job (proper
termination may not be guaranteed).  Hit control-C again within 1
second if you really want to kill mpirun immediately.
--------------------------------------------------------------------------
^Cmpirun: forcibly killing job...
--------------------------------------------------------------------------
WARNING: mpirun has exited before it received notification that all
started processes had terminated.  You should double check and ensure
that there are no runaway processes still executing.
--------------------------------------------------------------------------

At this point, the two spinning orteds are left running, and the only way to kill them is with kill -9.
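
For anyone who hits the same thing, here is a rough sketch of how the leftover daemons could be picked out of a ps listing. The function name spinning_orteds is just something made up for illustration, and it assumes the listing has the same column layout as the one above (PID, TT, STAT, TIME, COMMAND) and that a spinning orted shows up with a state beginning with R:

```shell
# Hypothetical helper: read a ps listing on stdin and print the PIDs of
# orted processes whose state starts with "R" (runnable, i.e. likely spinning).
# Assumes the columns are: PID TT STAT TIME COMMAND ...
spinning_orteds() {
    # $3 is the state column; $5 is the command name, possibly with a path.
    awk '$3 ~ /^R/ && $5 ~ /(^|\/)orted$/ { print $1 }'
}
```

Something like "ps -ax | spinning_orteds" should then list the candidate PIDs (12467 and 12468 in the listing above), which can be passed to kill -9 after checking that they do not belong to a job that is still running.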

Is anyone else seeing this problem?

Greg
