Hi Gilles,

Thanks for your reply!

> by "running on the head node", shall i understand you mean
> "running mpirun command *and* all mpi tasks on the head node" ?

Precisely.

> by "running on the compute node", shall i understand you mean
> "running mpirun on the compute node *and* all mpi tasks on the *same* 
> compute node" ?

What I meant to say is "running mpirun on the head node and all mpi
tasks on a single worker node", i.e.

        mpirun --hostfile node1 -np 48 --byslot ./mpitest.py

where the hostfile node1 contains:

        node1 slots=48 max_slots=48

I then ssh'd to node1 and ran htop: the processes are all colored red,
i.e. kernel cycles. Another curious observation is the load average of
700+, which is absurd given that there are only 48 processors.
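
Next time the job is in that state I can also grab the run state and wait
channel of each rank, which might show what the kernel is actually busy
with, e.g.:

        # on node1, while the job is running: show the run state (R/D/S) and
        # the kernel function each rank is currently waiting in; many ranks
        # stuck in D state would also explain the inflated load average
        ps -eo pid,stat,wchan:32,args | grep mpitest.py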

> what i mean is that if your mpi program is very sensitive to 
> latency/bandwidth, TCP over a gigabit ethernet is way slower that
> shared memory, and that can explain the 20x drop in performance.

Is there a good way to profile this? I installed an evaluation version
of vampirtrace, but I haven't yet gotten as far as producing a trace
with mpirun.

> ideally, i would recommend you first try using openmpi v1.8 or the 
> latest v1.10.1rc2.

I tried 1.8.2 and needed to use --bind-to none to even get the job
running (with Jeff Squyres' help), most likely because of the L3 BIOS
bug. I'll try 1.10.1rc2 tomorrow and report back.

> btw, is your head node running on similar hardware than compute nodes 
> (e.g. 4 sockets 48 cores) ?

Yes, all nodes are identical, 4 sockets, 48 cores.

> are you using a batch manager to run on compute nodes ?
> if yes, did you check all tasks are running on different cores ?
> (depending on your configuration and how you submit your job, all mpi 
> tasks can endup sharing the same core ...)

I do have torque/maui installed and configured, but I disabled them for
these tests and am running mpirun directly.
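
To rule out all tasks piling onto the same core, I can re-run with
--report-bindings and check the output, e.g.:

        # print where each rank gets bound before the job starts
        mpirun --hostfile node1 -np 48 --byslot --report-bindings ./mpitest.py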

> you reported a lot of time is spent in the kernel, is this time spent 
> for MPI communication ?
> (it could be for NFS access since you have no local disk on the
> compute nodes)

I'd *love* to be able to answer that question, but I don't know how to
find that out. If you can point me in the right direction, I'd be happy
to learn how.
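
The only thing I've come up with so far is attaching to one of the ranks
on node1 and looking at where the system time goes, roughly like this
(assuming strace and pidstat are available on the compute node), but
please tell me if there's a better way:

        # syscall summary for one rank (interrupt with Ctrl-C to get the
        # table); time dominated by socket calls would point at MPI over TCP,
        # time in filesystem calls would point at NFS
        strace -c -p <pid of one mpitest.py rank>

        # per-second user vs. system CPU split for the same rank
        pidstat -u -p <pid> 1

        # NFS client counters while the job is running
        nfsstat -c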

> btw, how many MPI tasks are you running ?

I ran it with 48 tasks per node, i.e. one task per core.

> if your application can run on a small number of tasks, here is what
> you can do from your head node :
> 
> mpirun -np 1 a.out
> mpirun -np 1 -host <compute> a.out
> run a job with one task on the compute node
> if there a performance drop ? if yes,  when ?
> 
> try again with -np 2 (using one host)
> if there a performance drop ? if yes,  when ?
> 
> and then try with two nodes and one task per node
> mpirun -np 2 -host <compute1,compute2> a.out

It'll take a bit of code modification, but let me try it and report
back.
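
Concretely, I'm planning on something like the following, with node2 and
node3 standing in for whichever compute nodes happen to be free:

        # one task on the head node, then one task on a compute node
        mpirun -np 1 ./mpitest.py
        mpirun -np 1 -host node2 ./mpitest.py

        # two tasks on one compute node, then one task on each of two nodes
        mpirun -np 2 -host node2 ./mpitest.py
        mpirun -np 2 -host node2,node3 ./mpitest.py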

Thanks a lot,
Andrej
