Hello all,

I'm trying to run my C++ Meep script in parallel. I've found little
documentation on the subject, so I'm hoping to make a record here on the
mailing list of how to do it, as well as to clear up some of my own
confusion and questions.


My original, bland, serial C++ compilation command comes straight from the
Meep C++ tutorial page:

g++ `pkg-config --cflags meep` main.cpp -o SIM `pkg-config --libs meep`
where I've used
export PKG_CONFIG_PATH=/usr/local/apps/meep/lib/pkgconfig
(note: no spaces around the =, or the shell will complain) so that
pkg-config knows where the meep.pc file is.

Then I can simply run the compiled code with:
./SIM
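
For anyone following along, a minimal main.cpp in the spirit of the C++
tutorial looks roughly like the following (this is essentially the tutorial's
2D example, not my actual simulation). Note the initialize mpi(argc, argv)
line, which the tutorial says to include even for non-MPI Meep:

#include <meep.hpp>
using namespace meep;

// simple dielectric function: a high-eps block in one corner of the cell
double eps(const vec &p) {
  if (p.x() < 2 && p.y() < 3)
    return 12.0;
  return 1.0;
}

int main(int argc, char **argv) {
  initialize mpi(argc, argv);               // do this even for non-MPI Meep
  double resolution = 20;                   // pixels per unit distance
  grid_volume v = vol2d(5, 10, resolution); // 5x10 2d cell
  structure s(v, eps, pml(1.0));            // 1-unit-thick PML all around
  fields f(&s);

  double freq = 0.3, fwidth = 0.1;
  gaussian_src_time src(freq, fwidth);      // Gaussian pulse source
  f.add_point_source(Ey, src, vec(1.1, 2.3));
  while (f.time() < f.last_source_time())
    f.step();

  f.output_hdf5(Hz, v.surroundings());      // write the final Hz field to HDF5
  return 0;
}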



In parallel, the equivalent process I've settled upon using is as follows:

First, I've changed the #include statement at the beginning of the source
file (par_main.cpp below) to point to the header from the parallel install
(I'm not sure this is necessary, but it works):
#include "/usr/local/apps/meep/1.2.1/mpi/lib/include/meep.hpp"

To compile:
mpic++ `pkg-config --cflags meep_mpi` par_main.cpp -o PAR_SIM `pkg-config --libs meep_mpi`
where I've told pkg-config to instead look for meep_mpi.pc:
export PKG_CONFIG_PATH=/usr/local/apps/meep/1.2.1/mpi/lib/pkgconfig
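
(A quick sanity check that pkg-config is resolving the right file:
pkg-config --cflags --libs meep_mpi
should print -I/-L paths under /usr/local/apps/meep/1.2.1/mpi rather than
under the serial install.)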

To run this, I send the following command to the job scheduler:
(...)/mpirun -np $N ./PAR_SIM

where I choose N depending on the kind of node(s) I'm submitting to.
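
In case it's useful, a minimal submission script along those lines might look
like the following. This is just a sketch for PBS/Torque (other schedulers
differ), and the node, core, and walltime values are placeholders:

#!/bin/bash
#PBS -l nodes=1:ppn=8        # one node, 8 cores -- placeholder values
#PBS -l walltime=02:00:00    # placeholder walltime
#PBS -N PAR_SIM

cd $PBS_O_WORKDIR            # run from the directory the job was submitted from
mpirun -np 8 ./PAR_SIM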



This runs fine. Now, on to performance:

When submitting a particular job to a single 16-core node with 72 GB of
memory and setting N=1, memory usage is 30 GB and the simulation runs at
about 8.7 sec/step; the job takes about 35 minutes.

When I instead set N=8, memory usage is 62 GB and it runs at about 2.8
sec/step; the whole simulation takes about 12 minutes.
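
In other words, roughly:
speedup:    8.7 / 2.8 ~ 3.1x on 8 processes
efficiency: 3.1 / 8   ~ 39%
memory:     62 / 30   ~ 2.1x the single-process footprint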

So! Are these numbers to be expected? A ~3x speedup going from 1 to 8 cores
is less than I'd hoped for, but perhaps reasonable. What concerns me more is
the memory: I expected some increase, but not a doubling when going from 1
to 8 cores. I want to verify that this behavior is normal and that I'm not
misusing the code or botching its compilation somehow.


Finally, a request for advice: instead of using the single 16-core, 72 GB
node I mention above, I could feasibly break the job up across, say, 9
dual-core, 8 GB nodes. My guess is that doing so would increase the overhead
from network communication between the nodes, but what about memory usage?
Does anyone have experience with this? And are there any tips or best
practices for structuring the simulation and/or configuration to maximize
throughput?

Thanks in advance!