Thanks for the responses. I've hit a couple of snags that prevent me
from having all the information, but here is the TOP output for 1 of
the 4 nodes on which I just ran the simulation. (i.e., the job was run
on 4 nodes, using 3 of the cores on each node). The other 3 nodes
showed basically the same usage.

It appears that the total memory being used according to TOP is
approximately 75-80GB (this checks out with the system's utility that
reports memory usage for a job). I have yet to figure out how to run
FREE on a node where my job is currently running.
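My current plan is to simply ssh into one of the allocated nodes while the
job is running and look there, along the lines of the command below
(assuming the scheduler here allows logging in to nodes you have been
allocated; I haven't confirmed that yet):

    ssh <nodename> free -m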


  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME  P COMMAND
14847 balljm    25   0 6430m 6.1g 6352 R 99.3 25.8  20:04 22 PARSIM
14848 balljm    25   0 8914m 8.5g 6356 R 99.3 36.1  20:03  9 PARSIM
14846 balljm    25   0 6717m 6.4g 6524 R 97.3 27.0  20:06 15 PARSIM
    1 root      15   0 10344  732  608 S  0.0  0.0   0:02  7 init

Unfortunately, I'm not able to run the job on a 72GB node right now
for comparison, because that queue is currently occupied by another
user. Previously, this same job took ~30 GB in serial mode.



>> Hello,
>>
>> Meep uses MPI parallelization, i.e. it assumes distributed memory for
>> each processor. Thus, when you increase the number of processors, each
>> processor gets its own chunk of the 3d simulation box. Of course,
>> information must be able to flow between these chunks, which means the
>> chunks are a little bit larger than just the subdivision of the space
>> (they have a "halo"), and at each time step information is exchanged
>> between the neighboring chunks.
>>
>> So how much the memory increases with the number of processors depends
>> largely on the ratio of "chunk volume" to "chunk surface" for your
>> problem, i.e. how many cells are in the simulation box and how many
>> cells you need per chunk. You should check with a back-of-the-envelope
>> calculation whether the memory scaling you see is consistent with what
>> you would expect.
>>
>> The speedup with respect to the number of processors depends on the same
>> issues.
>>
>> It is clear that parallelization always comes at a performance cost and
>> usually scales quite a bit worse than linearly. Your jobs will take the
>> least computational time if you submit many long-running single-processor
>> jobs at the same time rather than running multi-core jobs sequentially.
>> If the simulations just take a few minutes or hours like you wrote, I
>> would do that; if they take much longer, they probably involve many more
>> cells (higher resolution) and will be more efficient to parallelize.
>>
>> Best wishes,
>>   Georg


Unfortunately, the 10-30 minute simulations are only benchmarking
runs. The total run time is ~35 hours. Even a 3x speed-up is a welcome
improvement, but the memory increase introduces other challenges.

Here's my thinking: If I'm using N cores, and I normally use 100%
memory in serial mode, then the memory use I'd expect to see in
parallel would be 100% *(1 + N*halo_volume/chunk_volume). So the
memory "bloat" should be about 100%*N*halo_volume/chunk_volume.

Doing some rounding, my total simulation volume is about 500 x 1000 x 500
cells. A reasonable way to divide this with N=8 (I don't know how it's
actually done) would be eight chunks of 500 x 250 x 250 cells each.

The volume of a chunk's halo should be ~ 2 x (500*250 + 500*250 +
250*250) * halo_thickness. This is ~625k cells for each "layer" of
cells in the halo.

A chunk's volume will be about 500 x 250 x 250 = 31,250k cells. I don't
know how thick the halo really is, but with 8 cores I should get about a
16% increase for every layer of cells in the halo (8 x 625k / 31,250k =
16%).

Comparing the ~30 GB serial run with the ~62 GB I saw at N=8 (a ~107%
increase), this would imply that the halo is about 6 cells thick (107/16
is about 6.7). Of course there is rounding error all over the place here,
not to mention that I'm ignoring the fact that the halo sits outside the
normal boundaries of a chunk, so it will be slightly larger than what I've
estimated here.

Re-running this math for the job I show above instead (N=12, 12 chunks of
~333 x 250 x 250 cells), I'd expect a ~26% increase for every layer of
cells, and given the ~77 GB figure (about a 157% increase over the ~30 GB
serial run), I again get a halo thickness of about 6 cells. Consistent!
That's nice.
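In case anyone wants to redo or correct these numbers, here is the
back-of-envelope arithmetic above as a small standalone C++ program. The
chunk dimensions, the halo counting (chunk surface area times thickness),
and the 30/62/77 GB figures are just my assumptions from above; the actual
decomposition Meep uses may well be different.

  // Sanity check of the memory-bloat estimate described above.
  // Assumes equal brick-shaped chunks and counts one halo layer as the
  // chunk's surface area; both are guesses, not Meep's actual scheme.
  #include <cstdio>

  // Fractional memory increase per halo layer, per the formula above:
  // N * (cells in one halo layer) / (cells in one chunk)
  double bloat_per_layer(int N, double nx, double ny, double nz) {
      double layer = 2.0 * (nx * ny + nx * nz + ny * nz); // cells per halo layer
      double chunk = nx * ny * nz;                        // cells per chunk
      return N * layer / chunk;
  }

  int main() {
      // N = 8: guessed 500 x 250 x 250 chunks; ~30 GB serial vs ~62 GB parallel
      double b8 = bloat_per_layer(8, 500, 250, 250);
      std::printf("N=8 : %4.1f%% per layer -> implied halo ~%.1f cells\n",
                  100.0 * b8, (62.0 / 30.0 - 1.0) / b8);

      // N = 12: guessed 333 x 250 x 250 chunks; ~77 GB observed across 4 nodes
      double b12 = bloat_per_layer(12, 333, 250, 250);
      std::printf("N=12: %4.1f%% per layer -> implied halo ~%.1f cells\n",
                  100.0 * b12, (77.0 / 30.0 - 1.0) / b12);
      return 0;
  }

This prints roughly "16.0% per layer -> implied halo ~6.7 cells" for N=8 and
"26.4% per layer -> implied halo ~5.9 cells" for N=12, i.e. the ~6 cells I
quoted above.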


Am I on the right track? Does this seem reasonable? I'd love to know
where I'm missing something.


Again, thanks to everyone for the help.



On Thu, Feb 13, 2014 at 3:42 AM, Tran Quyet Thang <tranquyetthang3...@gmail.com> wrote:

> John Ball <ballman2010@...> writes:
>
> >
> > Hello all, I'm trying to run my c++ Meep script in parallel. I've found
> > little documentation on the subject, so I'm hoping to make a record of
> > how to do it here on the mailing list as well as to clear up some of my
> > own confusion and questions about the issue.
> >
> >
> > My original, bland, serial c++ compilation command comes straight from
> > the Meep c++ tutorial page:
> >
> > g++ `pkg-config --cflags meep` main.cpp -o SIM `pkg-config --libs meep`
> >
> > where I've used
> >
> > export PKG_CONFIG_PATH=/usr/local/apps/meep/lib/pkgconfig
> >
> > so that pkg-config knows where in the world the meep.pc file is.
> >
> > Then I can simply run the compiled code with: ./SIM
> >
> > In parallel, the equivalent process I've settled upon using is as
> > follows:
> >
> > First, I've changed the #include statement at the beginning of main.cpp
> > to point to the header file from the parallel install (not sure if this
> > is necessary, but it works):
> >
> > #include "/usr/local/apps/meep/1.2.1/mpi/lib/include/meep.hpp"
> >
> > To compile:
> >
> > mpic++ `pkg-config --cflags meep_mpi` par_main.cpp -o PAR_SIM `pkg-config --libs meep_mpi`
> >
> > where I've told pkg-config to instead look for meep_mpi.pc:
> >
> > export PKG_CONFIG_PATH=/usr/local/apps/meep/1.2.1/mpi/lib/pkgconfig
> >
> > To run this, I send this command to the job scheduler:
> >
> > (...)/mpirun    -np $N    ./PAR_SIM
> >
> > where I choose N depending on the kind of node(s) I'm submitting to.
> >
> > This runs fine. Now I'm going to talk about performance: When submitting
> > a particular job to a single 16-core node with 72GB of memory, if I set
> > N=1, the memory usage is 30 GB, and the simulation runs at about 8.7
> > sec/step. The job took about 35 minutes. When I instead set N=8, the
> > memory usage is 62GB, and it runs at about 2.8 sec/step. The total
> > simulation takes about 12 minutes. So! Are these numbers to be expected?
> > A ~3x speedup going from 1 to 8 cores is less than I'd hoped for, but
> > perhaps reasonable. What concerns me more, though, is that while I
> > suspected that I'd see some memory usage increase, I did not expect to
> > see a twofold increase when I went from 1 to 8 cores. I want to verify
> > that this behavior is normal and I'm not misusing the code or screwing
> > up its compilation somehow.
> >
> > Finally, just asking for some advice: I could feasibly break the job up
> > and instead of using a single 16-core, 72 GB node like I mention above,
> > I could use, for example, 9 dual-core, 8GB nodes instead. My guess is
> > that doing so would increase the overhead due to network communications
> > between the nodes. However, what about memory usage? Does anyone have
> > experience with this? Furthermore, are there any tips or best practices
> > for conditioning the simulation and/or configuration to maximize
> > throughput?
> >
> > Thanks in advance!
> >
>
> It is well known that FDTD simulation is memory-bandwidth bound - it
> scales well with increasing memory bandwidth, in contrast with processing
> (FLOPS) power.
>
> As you try to increase the number of cores in an SMP configuration, the
> total available memory bandwidth of the system does not increase - the
> memory controller(s) is merely utilized more effectively (saturated). That
> is why a cluster with fewer processing cores but more memory controllers
> would provide better speedup (versus cost) in FDTD than a single
> multi-core CPU. In other words, multi-core CPUs are bad choices for FDTD
> calculations. In fact, my current FDTD server is a dual-CPU machine with 4
> cores each, but with ~100GB/s of memory bandwidth to feed the calculation.
>
> In my experience with mpi-meep, relatively little memory overhead was
> encountered, especially in larger simulations - perhaps you (double)
> counted the shared memory space. The unix free and top commands would
> provide a good estimate of real free memory. Could you post the results of
> free and top here?
>
_______________________________________________
meep-discuss mailing list
meep-discuss@ab-initio.mit.edu
http://ab-initio.mit.edu/cgi-bin/mailman/listinfo/meep-discuss
