Hi Gilles,
Thanks for your reply!
> by "running on the head node", shall i understand you mean
> "running mpirun command *and* all mpi tasks on the head node" ?
Precisely.
> by "running on the compute node", shall i understand you mean
> "running mpirun on the compute node *and* all mpi tasks on the *same*
> compute node" ?
What I meant to say is "running mpirun on the head node and all mpi
tasks on a single worker node", i.e.
mpirun --hostfile node1 -np 48 --byslot ./mpitest.py
where node1 contains:
node1 slots=48 max_slots=48
I then ssh'd to node1 and ran htop, and the processes are all shown in
red, i.e. kernel cycles. A curious observation is the load average of
700+, which is absurd given that there are only 48 processors.
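If it is useful, I can also try to quantify the system-time share with
something like (assuming sysstat is installed on node1):
mpstat -P ALL 2 5
and watch the %sys column per core while the job is running.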
> what i mean is that if your mpi program is very sensitive to
> latency/bandwidth, TCP over a gigabit ethernet is way slower that
> shared memory, and that can explain the 20x drop in performance.
Is there a good way to profile this? I installed an evaluation version
of vampirtrace, but I haven't yet gotten as far as producing a trace
with mpirun.
> ideally, i would recommend you first try using openmpi v1.8 or the
> latest v1.10.1rc2.
I tried 1.8.2 and needed to use --bind-to none to even get the tasks
running (with Jeff Squyres' help), most likely because of this L3 cache
BIOS bug. I'll try 1.10.1rc2 tomorrow and report back.
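For reference, the 1.8.2 run was along the lines of
mpirun --hostfile node1 -np 48 --bind-to none ./mpitest.py
i.e. the same command as above with binding disabled.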
> btw, is your head node running on similar hardware than compute nodes
> (e.g. 4 sockets 48 cores) ?
Yes, all nodes are identical, 4 sockets, 48 cores.
> are you using a batch manager to run on compute nodes ?
> if yes, did you check all tasks are running on different cores ?
> (depending on your configuration and how you submit your job, all mpi
> tasks can endup sharing the same core ...)
I do have torque/maui installed and configured, but I disabled them for
these tests and am running mpirun directly.
> you reported a lot of time is spent in the kernel, is this time spent
> for MPI communication ?
> (it could be for NFS access since you have no local disk on the
> compute nodes)
I'd *love* to be able to answer that question, but I don't know how to
find that out. If you can point me in the right direction, I'd be happy
to learn how to do it.
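The best guess I have so far is to attach to one of the ranks on node1
with something like
strace -c -f -p <pid of one mpi task>
(letting it run for a while, then interrupting it to get the syscall
summary), or
perf top -p <pid of one mpi task>
and see whether socket calls (poll/readv/writev) or plain file
reads/writes dominate, or whether the kernel symbols look like tcp_*
rather than nfs_*. I may well be off track here, though.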
> btw, how many MPI tasks are you running ?
I ran it on all 48 cores per node.
> if your application can run on a small number of tasks, here is what
> you can do from your head node :
>
> mpirun -np 1 a.out
> mpirun -np 1 -host <compute> a.out
> run a job with one task on the compute node
> if there a performance drop ? if yes, when ?
>
> try again with -np 2 (using one host)
> if there a performance drop ? if yes, when ?
>
> and then try with two nodes and one task per node
> mpirun -np 2 -host <compute1,compute2> a.out
It'll take a bit of code modification, but let me try it and report
back.
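Assuming I can get mpitest.py to run with an arbitrary number of ranks,
the sequence would look something like (with node2 standing in for a
second compute node):
mpirun -np 1 ./mpitest.py
mpirun -np 1 -host node1 ./mpitest.py
mpirun -np 2 ./mpitest.py
mpirun -np 2 -host node1 ./mpitest.py
mpirun -np 2 -host node1,node2 ./mpitest.py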
Thanks a lot,
Andrej