I already tried that. However, I'm trying it in a couple of different ways and getting mixed results. Let me formulate the error cases and get back to you.

Cheers,
Josh

On Aug 30, 2006, at 10:17 AM, Ralph H Castain wrote:

Well, why don't you first try separating this from MTT? Just run the command manually in batch mode and see if it works. If it does, then the problem is with MTT. Otherwise, we have a problem with notification.

Or are you saying that you have already done this?
Ralph


On 8/30/06 8:03 AM, "Josh Hursey" <jjhur...@open-mpi.org> wrote:

Yet another point (sorry for the spam). This may not be an MTT issue
but a broken ORTE on the trunk :(

When I try to run in an allocation (srun -N 16 -A), things run fine.
But if I try to run in batch mode (srun -N 16 -b myscript.sh), then I
see the same hang as in MTT. It seems that mpirun is not getting
properly notified of the completion of the job. :(
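
(For concreteness, a minimal sketch of what myscript.sh presumably contains; the mpirun line is reconstructed from the ps output further down in this thread, so treat the exact arguments as an assumption rather than the literal script:)

    #!/bin/sh
    # myscript.sh -- hypothetical wrapper around the mpirun invocation MTT issues
    mpirun -mca btl tcp,self -np 32 \
        --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install \
        collective/allgather_in_place

The same mpirun invocation completes fine inside an interactive allocation; only the batch submission hangs.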

I'll try to investigate a bit further today. Any thoughts on what
might be causing this?

Cheers,
Josh

On Aug 30, 2006, at 9:54 AM, Josh Hursey wrote:

Forgot this bit in my mail. With mpirun just hanging out there, I
attached GDB and got the following stack trace:
(gdb) bt
#0  0x0000003d1b9bd1af in poll () from /lib64/tls/libc.so.6
#1  0x0000002a956e6389 in opal_poll_dispatch (base=0x5136d0, arg=0x513730, tv=0x7fbfffee70) at poll.c:191
#2  0x0000002a956e28b6 in opal_event_base_loop (base=0x5136d0, flags=5) at event.c:584
#3  0x0000002a956e26b7 in opal_event_loop (flags=5) at event.c:514
#4  0x0000002a956db7c7 in opal_progress () at runtime/opal_progress.c:259
#5  0x000000000040334c in opal_condition_wait (c=0x509650, m=0x509600) at ../../../opal/threads/condition.h:81
#6  0x0000000000402f52 in orterun (argc=9, argv=0x7fbffff0b8) at orterun.c:444
#7  0x00000000004028a3 in main (argc=9, argv=0x7fbffff0b8) at main.c:13

Seems that mpirun is waiting for things to complete :/
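
(For anyone else poking at a hung run: something along these lines grabs the same trace non-interactively; the pgrep pattern is an assumption about how to find the process on the launch node:)

    # attach to the newest mpirun owned by mpiteam and dump its stack
    gdb --batch -ex bt -p "$(pgrep -u mpiteam -n mpirun)"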

On Aug 30, 2006, at 9:53 AM, Josh Hursey wrote:


On Aug 30, 2006, at 7:19 AM, Jeff Squyres wrote:

On 8/29/06 8:57 PM, "Josh Hursey" <jjhur...@open-mpi.org> wrote:

Does this apply to *all* tests, or only some of the tests (like
allgather)?

All of the tests: Trivial and ibm. They all time out :(

Blah.  The trivial tests are simply "hello world", so they should take just about no time at all.

Is this running under SLURM?  I put the code in there to know how many procs to use in SLURM but have not tested it in eons.  I doubt that's the problem, but that's one thing to check.


Yep, it is in SLURM, and it seems that the 'number of procs' code is
working fine, as it changes with different allocations.
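
(That detection presumably keys off SLURM's environment; a quick sanity check from inside an allocation, using the standard SLURM variables rather than whatever MTT actually reads:)

    # print what SLURM advertises for this allocation
    echo "nodes=$SLURM_NNODES tasks=$SLURM_NPROCS list=$SLURM_NODELIST"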

Can you set a super-long timeout (e.g., a few minutes), and while one of the trivial tests is running, do some ps's on the relevant nodes and see what, if anything, is running?  E.g., mpirun, the test executable on the nodes, etc.

Without setting a long timeout, it seems that mpirun is running, but
nothing else, and only on the launching node.
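
(A sketch of one way to sweep the whole allocation, assuming passwordless ssh to the compute nodes named in the srun --nodelist below:)

    for i in $(seq -w 7 22); do       # odin007 .. odin022
        # bracketed first letter keeps grep from matching itself
        ssh "odin0$i" 'ps aux | grep -E "[m]pirun|[o]rted|[a]llgather"'
    done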

When a test starts, you see mpirun launching properly:
$ ps aux | grep ...
USER       PID %CPU %MEM    VSZ   RSS TTY   STAT START  TIME COMMAND
mpiteam  15117  0.5  0.8 113024 33680 ?     S    09:32  0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
mpiteam  15294  0.0  0.0      0     0 ?     Z    09:32  0:00 [sh] <defunct>
mpiteam  28453  0.2  0.0  38072  3536 ?     S    09:50  0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place
mpiteam  28454  0.0  0.0  41716  2040 ?     Sl   09:50  0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpit...@odin007.cs.indiana.edu:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
mpiteam  28455  0.0  0.0  23212  1804 ?     Ssl  09:50  0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpit...@odin007.cs.indiana.edu:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
mpiteam  28472  0.0  0.0  36956  2256 ?     S    09:50  0:00 /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install/bin/orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpit...@odin007.cs.indiana.edu:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
mpiteam  28482  0.1  0.0  64296  3564 ?     S    09:50  0:00 collective/allgather_in_place
mpiteam  28483  0.1  0.0  64296  3564 ?     S    09:50  0:00 collective/allgather_in_place

But once the test finishes, mpirun seems to just be hanging out.
$ ps aux | grep ...
USER       PID %CPU %MEM    VSZ   RSS TTY   STAT START  TIME COMMAND
mpiteam  15083  0.0  0.0  52760  1040 ?     S    09:31  0:00 /bin/bash /var/tmp/slurmd/job148126/script
root     15086  0.0  0.0  42884  3172 ?     Ss   09:31  0:00 sshd: mpiteam [priv]
mpiteam  15088  0.0  0.0  43012  3252 ?     S    09:31  0:00 sshd: mpiteam@pts/1
mpiteam  15089  0.0  0.0  56680  1912 pts/1 Ss   09:31  0:00 -tcsh
mpiteam  15117  0.5  0.8 113024 33680 ?     S    09:32  0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
mpiteam  15294  0.0  0.0      0     0 ?     Z    09:32  0:00 [sh] <defunct>
mpiteam  28453  0.0  0.0  38204  3568 ?     S    09:50  0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place

Thoughts?


--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

_______________________________________________
mtt-users mailing list
mtt-us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users
