Hi, Chris

Funny you should mention this now. We identified and diagnosed this issue some 
time ago as a combination of SLURM's PMI1 implementation and the way OMPI 
exchanges its endpoint (wire-up) information at startup. In a nutshell, here is 
what happens when you launch with srun:

1. Each process pushes its endpoint data up to the PMI "cloud" via PMI puts 
(roughly five or six puts per process, i.e. O(1)).
2. Each process then executes a PMI commit and a PMI barrier to ensure that all 
other processes have finished committing their data to the "cloud".
3. After the barrier, each process executes O(N) PMI gets (where N is the 
number of processes in the job) to fetch the endpoint data for every other 
process, regardless of whether it ever communicates with that endpoint. A 
simplified sketch of this pattern follows the list.
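
To make the pattern concrete, here is a minimal sketch against the standard 
PMI1 client interface (pmi.h). The key names and value encodings are made up 
for illustration; OMPI's actual modex keys and payloads are internal and more 
involved, but the put / commit / barrier / get structure is the same:

    #include <stdio.h>
    #include <stdlib.h>
    #include <pmi.h>

    int main(void)
    {
        int spawned, rank, size;
        int max_name, max_key, max_val;

        PMI_Init(&spawned);
        PMI_Get_rank(&rank);
        PMI_Get_size(&size);

        PMI_KVS_Get_name_length_max(&max_name);
        PMI_KVS_Get_key_length_max(&max_key);
        PMI_KVS_Get_value_length_max(&max_val);

        char *kvs = malloc(max_name);
        char *key = malloc(max_key);
        char *val = malloc(max_val);
        PMI_KVS_Get_my_name(kvs, max_name);

        /* Step 1: a constant number of puts per process (key/value made up) */
        snprintf(key, max_key, "ep-%d", rank);
        snprintf(val, max_val, "addr-of-rank-%d", rank);
        PMI_KVS_Put(kvs, key, val);

        /* Step 2: commit, then barrier so everyone's data is visible */
        PMI_KVS_Commit(kvs);
        PMI_Barrier();

        /* Step 3: O(N) gets per process, O(N^2) requests across the job */
        for (int peer = 0; peer < size; peer++) {
            snprintf(key, max_key, "ep-%d", peer);
            PMI_KVS_Get(kvs, key, val, max_val);
        }

        PMI_Finalize();
        free(kvs);
        free(key);
        free(val);
        return 0;
    }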

"We" (MLNX et al.) undertook an in-depth scaling study of this and identified 
several poorly scaling pieces, the worst offenders being:

1. The PMI barrier scales worse than linearly with the number of processes.
2. At scale, the PMI get phase starts to look quadratic: each of the N 
processes issues O(N) gets, so the KVS server ends up handling O(N^2) requests.

The proposed solution that "we" (OMPI + SLURM) have come up with is to modify 
OMPI to support PMI2 and to use SLURM 2.6, which supports PMI2 and is 
(allegedly) much more scalable than PMI1. Several folks in the combined 
communities are working hard, as we speak, to get this functional and see 
whether it actually makes a difference. Stay tuned, Chris. Hopefully we will 
have some data by the end of the week.
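
For what it's worth, here is the same exchange sketched against the PMI2 
client interface (pmi2.h), to show why it has more room to scale: the separate 
commit and barrier collapse into a single collective fence, and gets are 
addressed by job ID. The keys, values, and buffer sizes are again made up for 
illustration, and this is not the code OMPI will actually use:

    #include <stdio.h>
    #include <pmi2.h>

    int main(void)
    {
        int spawned, size, rank, appnum, vallen;
        char jobid[64], key[64], val[1024];  /* sizes arbitrary for the sketch */

        PMI2_Init(&spawned, &size, &rank, &appnum);
        PMI2_Job_GetId(jobid, sizeof(jobid));

        /* Put this rank's endpoint data (key/value made up for illustration) */
        snprintf(key, sizeof(key), "ep-%d", rank);
        snprintf(val, sizeof(val), "addr-of-rank-%d", rank);
        PMI2_KVS_Put(key, val);

        /* One collective fence replaces PMI1's separate commit + barrier */
        PMI2_KVS_Fence();

        /* Gets are still O(N) per process in this naive sketch, but the
           fence/get exchange gives the implementation room to optimize */
        for (int peer = 0; peer < size; peer++) {
            snprintf(key, sizeof(key), "ep-%d", peer);
            PMI2_KVS_Get(jobid, PMI2_ID_NULL, key, val, sizeof(val), &vallen);
        }

        PMI2_Finalize();
        return 0;
    }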

Best regards,

Josh


Joshua S. Ladd, PhD
HPC Algorithms Engineer
Mellanox Technologies 

Email: josh...@mellanox.com
Cell: +1 (865) 258 - 8898





-----Original Message-----
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf 
Of Christopher Samuel
Sent: Tuesday, July 23, 2013 3:06 AM
To: slurm-dev; Open MPI Developers
Subject: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower 
than with mpirun

Hi there slurm-dev and OMPI devel lists,

Bringing up a new IBM SandyBridge cluster, I'm running a NAMD test case and 
noticed that if I run it with srun rather than mpirun it goes over 20% slower.  
All of these runs are launched from an sbatch script.

Slurm 2.6.0, RHEL 6.4 (latest kernel), FDR IB.

Here are some timings, as reported by NAMD itself as WallClock (so not 
including startup/teardown overhead from Slurm).

srun:

run1/slurm-93744.out:WallClock: 695.079773  CPUTime: 695.079773
run4/slurm-94011.out:WallClock: 723.907959  CPUTime: 723.907959
run5/slurm-94013.out:WallClock: 726.156799  CPUTime: 726.156799
run6/slurm-94017.out:WallClock: 724.828918  CPUTime: 724.828918

Average of 692 seconds

mpirun:

run2/slurm-93746.out:WallClock: 559.311035  CPUTime: 559.311035
run3/slurm-93910.out:WallClock: 544.116333  CPUTime: 544.116333
run7/slurm-94019.out:WallClock: 586.072693  CPUTime: 586.072693

Average of 563 seconds.

So that's about 23% slower.

Everything is identical (they're all symlinks to the same golden
master) *except* for the launch command, which is changed by copying the batch 
script and substituting mpirun for srun.

When they are running I can see that jobs launched with srun have processes 
that are direct children of slurmstepd, whereas when started with mpirun they 
are children of Open-MPI's orted (or of mpirun on the launch node), which 
itself is a child of slurmstepd.

Has anyone else seen anything like this, or got any ideas?

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
