Forgot this bit in my mail. With mpirun just hanging out there, I attached GDB and got the following stack trace:
(gdb) bt
#0  0x0000003d1b9bd1af in poll () from /lib64/tls/libc.so.6
#1  0x0000002a956e6389 in opal_poll_dispatch (base=0x5136d0, arg=0x513730, tv=0x7fbfffee70) at poll.c:191
#2  0x0000002a956e28b6 in opal_event_base_loop (base=0x5136d0, flags=5) at event.c:584
#3  0x0000002a956e26b7 in opal_event_loop (flags=5) at event.c:514
#4  0x0000002a956db7c7 in opal_progress () at runtime/opal_progress.c:259
#5  0x000000000040334c in opal_condition_wait (c=0x509650, m=0x509600) at ../../../opal/threads/condition.h:81
#6  0x0000000000402f52 in orterun (argc=9, argv=0x7fbffff0b8) at orterun.c:444
#7  0x00000000004028a3 in main (argc=9, argv=0x7fbffff0b8) at main.c:13

Seems that mpirun is waiting for things to complete :/
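In case it helps anyone reproduce this kind of check: a trace like the one above can be grabbed non-destructively with something along these lines (the pgrep pattern and user here are just examples, adjust for your setup):

# Attach to the hanging mpirun and dump a backtrace without killing it.
$ gdb -batch -p $(pgrep -u mpiteam -n mpirun) -ex "bt"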

On Aug 30, 2006, at 9:53 AM, Josh Hursey wrote:


On Aug 30, 2006, at 7:19 AM, Jeff Squyres wrote:

On 8/29/06 8:57 PM, "Josh Hursey" <jjhur...@open-mpi.org> wrote:

Does this apply to *all* tests, or only some of the tests (like
allgather)?

All of the tests: trivial and ibm. They all time out :(

Blah.  The trivial tests are simply "hello world", so they should take just about no time at all.

Is this running under SLURM?  I put the code in there to know how many procs to use in SLURM but have not tested it in eons.  I doubt that's the problem, but that's one thing to check.


Yep, it is in SLURM, and it seems that the 'number of procs' code is working fine, as it changes with different allocations.

Can you set a super-long timeout (e.g., a few minutes), and while one of the trivial tests is running, do some ps's on the relevant nodes and see what, if anything, is running?  E.g., mpirun, the test executable on the nodes, etc.

Without setting a long timeout, it seems that mpirun is running, but nothing else, and only on the launching node.
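A quick way to spot-check the other nodes in the allocation is something along these lines (the hostnames and the grep pattern are just examples):

$ for node in odin007 odin008 odin009; do echo "== $node =="; ssh $node 'ps -u mpiteam -o pid,stat,etime,cmd | egrep "mpirun|orted|allgather"'; done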

When a test starts, you see mpirun launching properly:
$ ps aux | grep ...
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
mpiteam  15117  0.5  0.8 113024 33680 ?      S    09:32   0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
mpiteam  15294  0.0  0.0     0    0 ?        Z    09:32   0:00 [sh] <defunct>
mpiteam  28453  0.2  0.0 38072 3536 ?        S    09:50   0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place
mpiteam  28454  0.0  0.0 41716 2040 ?        Sl   09:50   0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpit...@odin007.cs.indiana.edu:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
mpiteam  28455  0.0  0.0 23212 1804 ?        Ssl  09:50   0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpit...@odin007.cs.indiana.edu:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
mpiteam  28472  0.0  0.0 36956 2256 ?        S    09:50   0:00 /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install/bin/orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpit...@odin007.cs.indiana.edu:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
mpiteam  28482  0.1  0.0 64296 3564 ?        S    09:50   0:00 collective/allgather_in_place
mpiteam  28483  0.1  0.0 64296 3564 ?        S    09:50   0:00 collective/allgather_in_place

But once the test finishes, mpirun seems to just be hanging out.
$ ps aux | grep ...
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
mpiteam  15083  0.0  0.0 52760 1040 ?        S    09:31   0:00 /bin/bash /var/tmp/slurmd/job148126/script
root     15086  0.0  0.0 42884 3172 ?        Ss   09:31   0:00 sshd: mpiteam [priv]
mpiteam  15088  0.0  0.0 43012 3252 ?        S    09:31   0:00 sshd: mpiteam@pts/1
mpiteam  15089  0.0  0.0 56680 1912 pts/1    Ss   09:31   0:00 -tcsh
mpiteam  15117  0.5  0.8 113024 33680 ?      S    09:32   0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
mpiteam  15294  0.0  0.0     0    0 ?        Z    09:32   0:00 [sh] <defunct>
mpiteam  28453  0.0  0.0 38204 3568 ?        S    09:50   0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place

Thoughts?


--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
