Andrej,
by "running on the head node", shall i understand you mean
"running mpirun command *and* all mpi tasks on the head node" ?
by "running on the compute node", shall i understand you mean
"running mpirun on the compute node *and* all mpi tasks on the *same*
compute node" ?
or do you mean several compute nodes could be used by the same mpi job ?
What I mean is that if your MPI program is very sensitive to
latency/bandwidth, TCP over gigabit ethernet is way slower than shared
memory, and that could explain the 20x drop in performance.
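If you want to see how much that matters for your code, one thing you could try (assuming your Open MPI build includes the default tcp and sm BTLs) is to force a given transport on a single node and compare run times, e.g.:
mpirun -np 2 --mca btl sm,self a.out
mpirun -np 2 --mca btl tcp,self a.out
The first run uses shared memory, the second forces TCP (over loopback) even though both tasks are on the same node.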
Ideally, I would recommend you first try Open MPI v1.8 or the latest
v1.10.1rc2. IIRC, process binding is enabled by default from v1.8 on,
but must be manually requested in v1.6; that could explain some of the
performance drop too.
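For example, to request binding and see where each task ends up (the option names changed between releases, so double-check against mpirun --help for your version):
mpirun -np 4 --bind-to-core --report-bindings a.out    (v1.6)
mpirun -np 4 --bind-to core --report-bindings a.out    (v1.8 and later)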
BTW, is your head node running on hardware similar to the compute
nodes (e.g. 4 sockets, 48 cores)?
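A quick way to compare is to run lstopo from hwloc on the head node and on a compute node and check that the reported sockets/cores/caches match:
lstopo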
Are you using a batch manager to run on the compute nodes?
If yes, did you check that all tasks are running on different cores?
(Depending on your configuration and how you submit your job, all MPI
tasks can end up sharing the same core ...)
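A quick sanity check from a shell on a compute node while the job is running (assuming the binary is still called a.out; adjust the name to yours):
for pid in $(pgrep a.out); do taskset -cp $pid; done
If every task reports the same single-core affinity, that is your problem.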
You reported that a lot of time is spent in the kernel; is this time
spent on MPI communication? (It could be NFS access, since you have no
local disk on the compute nodes.)
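One rough way to tell is to attach strace to one of the running tasks for a few seconds and look at the syscall summary; mostly sendmsg/recvmsg/poll points at TCP communication, mostly read/write on NFS-backed files points at I/O:
strace -c -f -p <pid of one task>
(interrupt with Ctrl-C after ~10 seconds to get the summary)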
BTW, how many MPI tasks are you running?
If your application can run on a small number of tasks, here is what
you can do from your head node:
First run one task on the head node, and then one task on a compute node:
mpirun -np 1 a.out
mpirun -np 1 -host <compute> a.out
Is there a performance drop? If yes, when?
Then try again with -np 2 (using one host).
Is there a performance drop? If yes, when?
And then try with two nodes and one task per node:
mpirun -np 2 -host <compute1>,<compute2> a.out
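To make the comparison easier, you can simply time each step; a minimal sketch, assuming your binary is a.out and your compute nodes are named node1 and node2 (adjust to your setup):
time mpirun -np 1 a.out
time mpirun -np 1 -host node1 a.out
time mpirun -np 2 a.out
time mpirun -np 2 -host node1 a.out
time mpirun -np 2 -host node1,node2 a.out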
Cheers,
Gilles
On 10/21/2015 11:09 AM, Andrej Prsa wrote:
Hi everyone,
We have a small cluster of 6 identical 48-core nodes for astrophysical
research. We are struggling to get openmpi to run efficiently on
the nodes. The head node is running ubuntu and openmpi-1.6.5 on a local
disk. All worker nodes boot from an NFS-exported root that resides
on a NAS, also with ubuntu and openmpi 1.6.5. All nodes have Gbit
ethernet and the NAS is connected to the switch with 4 NICs. The
motherboard is a Supermicro H8QG6; the processors are 2.6 GHz AMD
Opteron 6344.
When we run openmpi on the head node, everything works as expected. But
when we run it on any of the worker nodes, the execution takes ~20+
times longer, and htop shows that all processes spend the vast majority
of their time on kernel cycles (red symbols).
I have been trying to learn about the profilers and MCA optimization
and such, but it seems to me that a 20-fold hit in performance
indicates a much more serious problem. For example, it might have to do
with a buggy BIOS that doesn't report L3 cache correctly, and that
throws hwloc warnings that I reported in the past. I flashed the BIOS
to the latest version, we are running the latest kernel, and I tried
newer, manually compiled hwloc/openmpi to no avail. I am at my wits'
end on what to try next, and I would thoroughly appreciate any help and
guidance. Our cluster is idling till I resolve this, and quite a few
people are tapping on my shoulder impatiently. And yes, I'm an
astronomer, not a sys admin, so please excuse my ignorance.
Thanks a bunch,
Andrej