On Aug 30, 2006, at 7:19 AM, Jeff Squyres wrote:

On 8/29/06 8:57 PM, "Josh Hursey" <jjhur...@open-mpi.org> wrote:

Does this apply to *all* tests, or only some of the tests (like
allgather)?

All of the tests: trivial and ibm. They all time out :(

Blah. The trivial tests are simply "hello world", so they should take just
about no time at all.

Is this running under SLURM? I put the code in there to figure out how many procs to use under SLURM, but I have not tested it in eons. I doubt that's the problem,
but that's one thing to check.


Yep, it is in SLURM, and it seems that the 'number of procs' code is working fine: it changes with different allocations.
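
(For anyone trying to reproduce this: a quick way to see what that code has to work with is to dump the environment from inside the allocation:

   $ env | grep ^SLURM

On our SLURM version the relevant variables appear to be SLURM_NODELIST, SLURM_NNODES, and SLURM_TASKS_PER_NODE, and they should track the allocation size.)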

Can you set a super-long timeout (e.g., a few minutes) and, while one of the trivial tests is running, do some ps's on the relevant nodes to see what, if anything, is running? E.g., mpirun, the test executable on the nodes,
etc.
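
Something along these lines from the launch node should do it, assuming the mpiteam account can ssh to the compute nodes without a password (and that your scontrol has the "show hostnames" subcommand to expand the nodelist; if not, just list the nodes by hand):

   $ for node in `scontrol show hostnames "$SLURM_NODELIST"`; do
         echo "== $node =="
         ssh $node ps -u mpiteam -o pid,stat,start,command
     done

That would tell us whether the orteds and the test executables ever start on the remote nodes.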

Without setting a long timeout, it seems that mpirun is running, but nothing else, and only on the launching node.

When a test starts, you see mpirun launching properly:
$ ps aux | grep ...
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
mpiteam  15117  0.5  0.8 113024 33680 ?  S    09:32   0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
mpiteam  15294  0.0  0.0      0     0 ?  Z    09:32   0:00 [sh] <defunct>
mpiteam  28453  0.2  0.0  38072  3536 ?  S    09:50   0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place
mpiteam  28454  0.0  0.0  41716  2040 ?  Sl   09:50   0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpit...@odin007.cs.indiana.edu:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
mpiteam  28455  0.0  0.0  23212  1804 ?  Ssl  09:50   0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpit...@odin007.cs.indiana.edu:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
mpiteam  28472  0.0  0.0  36956  2256 ?  S    09:50   0:00 /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install/bin/orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpit...@odin007.cs.indiana.edu:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
mpiteam  28482  0.1  0.0  64296  3564 ?  S    09:50   0:00 collective/allgather_in_place
mpiteam  28483  0.1  0.0  64296  3564 ?  S    09:50   0:00 collective/allgather_in_place

But once the test finishes, mpirun seems to just be hanging out:
$ ps aux | grep ...
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
mpiteam  15083  0.0  0.0  52760  1040 ?  S    09:31   0:00 /bin/bash /var/tmp/slurmd/job148126/script
root     15086  0.0  0.0  42884  3172 ?  Ss   09:31   0:00 sshd: mpiteam [priv]
mpiteam  15088  0.0  0.0  43012  3252 ?  S    09:31   0:00 sshd: mpiteam@pts/1
mpiteam  15089  0.0  0.0 56680 1912 pts/1    Ss   09:31   0:00 -tcsh
mpiteam  15117  0.5  0.8 113024 33680 ?  S    09:32   0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
mpiteam  15294  0.0  0.0      0     0 ?  Z    09:32   0:00 [sh] <defunct>
mpiteam  28453  0.0  0.0  38204  3568 ?  S    09:50   0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place
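
If it would help, the next time one of these wedges I can attach gdb to the lingering mpirun and grab a stack trace, along the lines of:

   $ gdb -p 28453
   (gdb) thread apply all bt

(PID taken from the ps output above; it will obviously differ from run to run.) That might at least show where it is blocked.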

Thoughts?


--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
