Thanks for the responses. I've hit a couple of snags that prevent me from having all the information, but here is the TOP output for 1 of the 4 nodes on which I just ran the simulation. (i.e., the job was run on 4 nodes, using 3 of the cores on each node). The other 3 nodes showed basically the same usage.
It appears that the total memory in use according to TOP is approximately 75-80 GB (this checks out with the system's utility that reports memory usage for a job). I have yet to figure out how to run FREE on a node where my job is currently running.

  PID   USER    PR  NI  VIRT   RES   SHR   S  %CPU  %MEM   TIME  P   COMMAND
  14847 balljm  25  0   6430m  6.1g  6352  R  99.3  25.8  20:04  22  PARSIM
  14848 balljm  25  0   8914m  8.5g  6356  R  99.3  36.1  20:03   9  PARSIM
  14846 balljm  25  0   6717m  6.4g  6524  R  97.3  27.0  20:06  15  PARSIM
  1     root    15  0   10344  732   608   S   0.0   0.0   0:02   7  init

Unfortunately, I'm not able to run the job on a 72 GB node right now for comparison, because that queue is currently occupied by another user. Previously, this same job took ~30 GB in serial mode.

>> Hello,
>>
>> Meep uses MPI parallelization, i.e., it assumes distributed memory for each processor. Thus, when you increase the number of processors, each processor gets its own chunk of the 3d simulation box. Of course, information must be able to flow between these chunks, which means the chunks are a little bit larger than just the subdivision of the space (they have a "halo"), and at each time step information is exchanged between neighboring chunks.
>>
>> So how much the memory increases with the number of processors depends largely on the ratio of "chunk volume" to "chunk surface" for your problem, i.e., how many cells are in the simulation box and how many cells you need per chunk. You should check with a back-of-the-envelope calculation whether the memory scaling you see is consistent with what you would expect. The speedup with respect to the number of processors depends on the same issues.
>>
>> It is clear that parallelization always comes at a performance cost and usually scales quite a bit worse than linearly. Your jobs will take the least computational time if you submit many long-running single-processor jobs at the same time rather than multi-core jobs sequentially. If the simulations just take a few minutes or hours like you wrote, I would do that; if they take much longer, they probably involve many more cells (resolution) and will be more efficient to parallelize.
>>
>> Best wishes,
>> Georg

Unfortunately, the 10-30 minute simulations are only benchmarking runs; the full run time is ~35 hours. Even a 3x speed-up is a welcome improvement, but the memory increase introduces other challenges.

Here's my thinking: if I'm using N cores, and I normally use 100% memory in serial mode, then the memory use I'd expect to see in parallel is 100% * (1 + N * halo_volume / chunk_volume), so the memory "bloat" should be about 100% * N * halo_volume / chunk_volume.

Doing some rounding, my total simulation volume is about 500 x 1000 x 500 cells. A reasonable way to divide this with N=8 (I don't know how it's actually done) would be eight chunks of 500 x 250 x 250. The volume of a chunk's halo should be ~ 2 x (500*250 + 500*250 + 250*250) x halo_thickness, i.e., ~625k cells for each "layer" of cells in the halo, while a chunk's volume is about 500 x 250 x 250 = 31,250k cells. I don't know how thick the halo really is, but with 8 cores I should see about a 16% memory increase for every layer of cells in the halo. Given the jump from 30 GB in serial to 62 GB at N=8 (~107%), this would imply that the halo is about 6 cells thick. Of course there is rounding error all over the place here, not to mention that I'm ignoring the fact that the halo lies outside the normal boundaries of a chunk, so it will be slightly larger than what I've estimated here.
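For the record, here is that arithmetic as a small stand-alone program. This is purely my own back-of-the-envelope sketch: the even 500 x 250 x 250 decomposition and the bloat formula itself are my assumptions, not anything I've confirmed about Meep's actual chunking.

// Back-of-the-envelope estimate of parallel memory "bloat" from chunk halos.
// ASSUMPTIONS: an even decomposition into equal boxes, and my formula
// bloat = N * halo_volume / chunk_volume; Meep's real chunk layout and
// halo thickness are unknown to me.
#include <cstdio>

// Fractional memory increase over serial for a given halo thickness.
double bloat(double nx, double ny, double nz, int n_cores, int halo_cells) {
    double chunk_volume  = nx * ny * nz;
    double chunk_surface = 2.0 * (nx * ny + ny * nz + nz * nx);
    return n_cores * chunk_surface * halo_cells / chunk_volume;
}

int main() {
    // N=8: 500 x 1000 x 500 cells split into eight 500 x 250 x 250 chunks.
    for (int t = 1; t <= 8; ++t)
        std::printf("halo %d cells thick -> +%3.0f%% memory\n",
                    t, 100.0 * bloat(500, 250, 250, 8, t));
    // 62 GB vs. 30 GB serial is ~ +107%, landing between t = 6 and t = 7.
    // bloat(333, 250, 250, 12, t) gives the N=12 case re-run below.
    return 0;
}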
Re-running this math using the job I show above instead (N=12, 12 chunks of ~333 x 250 x 250 cells), I'd expect a 26% increase for every layer of cells, and given the ~77 GB figure I again get a halo thickness of about 6 cells. Consistent! That's nice. Am I on the right track? Does this seem reasonable? I'd love to know where I'm missing something. Again, thanks to everyone for the help.

On Thu, Feb 13, 2014 at 3:42 AM, Tran Quyet Thang <tranquyetthang3...@gmail.com> wrote:

> John Ball <ballman2010@...> writes:
>
> > Hello all, I'm trying to run my C++ Meep script in parallel. I've found
> > little documentation on the subject, so I'm hoping to make a record of
> > how to do it here on the mailing list, as well as to clear up some of my
> > own confusion and questions about the issue.
> >
> > My original, bland, serial C++ compilation command comes straight from
> > the Meep C++ tutorial page:
> >
> > g++ `pkg-config --cflags meep` main.cpp -o SIM `pkg-config --libs meep`
> >
> > where I've used
> >
> > export PKG_CONFIG_PATH=/usr/local/apps/meep/lib/pkgconfig
> >
> > so that pkg-config knows where in the world the meep.pc file is. Then I
> > can simply run the compiled code with: ./SIM
> >
> > In parallel, the equivalent process I've settled on is as follows.
> > First, I've changed the #include statement at the beginning of main.cpp
> > to point to the header file from the parallel install (not sure if this
> > is necessary, but it works):
> >
> > #include "/usr/local/apps/meep/1.2.1/mpi/lib/include/meep.hpp"
> >
> > To compile:
> >
> > mpic++ `pkg-config --cflags meep_mpi` par_main.cpp -o PAR_SIM `pkg-config --libs meep_mpi`
> >
> > where I've told pkg-config to instead look for meep_mpi.pc:
> >
> > export PKG_CONFIG_PATH=/usr/local/apps/meep/1.2.1/mpi/lib/pkgconfig
> >
> > To run this, I send this command to the job scheduler:
> >
> > (...)/mpirun -np $N ./PAR_SIM
> >
> > where I choose N depending on the kind of node(s) I'm submitting to.
> >
> > This runs fine. Now I'm going to talk about performance. When submitting
> > a particular job to a single 16-core node with 72 GB of memory, if I set
> > N=1, the memory usage is 30 GB and the simulation runs at about 8.7
> > sec/step; the job took about 35 minutes. When I instead set N=8, the
> > memory usage is 62 GB and it runs at about 2.8 sec/step; the total
> > simulation takes about 12 minutes. So! Are these numbers to be expected?
> > A ~3x speedup going from 1 to 8 cores is less than I'd hoped for, but
> > perhaps reasonable. What concerns me more, though, is that while I
> > suspected I'd see some memory usage increase, I did not expect a twofold
> > increase when I went from 1 to 8 cores. I want to verify that this
> > behavior is normal and that I'm not misusing the code or screwing up its
> > compilation somehow.
> >
> > Finally, just asking for some advice: I could feasibly break the job up
> > and, instead of using a single 16-core, 72 GB node as mentioned above,
> > use, for example, 9 dual-core, 8 GB nodes. My guess is that doing so
> > would increase the overhead due to network communications between the
> > nodes. However, what about memory usage? Does anyone have experience
> > with this? Furthermore, are there any tips or best practices for
> > conditioning the simulation and/or configuration to maximize throughput?
> >
> > Thanks in advance!
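For anyone finding this thread later: the main.cpp behind those compile commands looks essentially like the skeleton below, which I've adapted from the Meep C++ tutorial. The 2d geometry and source here are the tutorial's placeholders, not my actual simulation.

#include <meep.hpp> // or the full path to the MPI build's meep.hpp, as above
using namespace meep;

// Placeholder dielectric function from the Meep C++ tutorial.
double eps(const vec &p) {
    if (p.x() < 2 && p.y() < 3) return 12.0;
    return 1.0;
}

int main(int argc, char **argv) {
    initialize mpi(argc, argv); // needed even for the serial build; with the
                                // meep_mpi build this also sets up MPI, so
                                // the same source serves both SIM and PAR_SIM
    double resolution = 20;                   // pixels per unit distance
    grid_volume v = vol2d(5, 10, resolution); // 5x10 2d simulation cell
    structure s(v, eps, pml(1.0));            // 1-unit-thick absorbing layer
    fields f(&s);

    gaussian_src_time src(0.3, 0.1); // center frequency 0.3, width 0.1
    f.add_point_source(Ey, src, vec(1.1, 2.3));
    while (f.time() < f.last_source_time())
        f.step(); // under MPI, each step exchanges halo data between chunks
    return 0;
}

Under mpirun, Meep divides the grid_volume into chunks on its own; nothing in the source changes between the serial and parallel builds except the header/library paths.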
> It is well known that FDTD simulation is memory-bandwidth bound: it scales
> well with increasing memory bandwidth, in contrast with processing (FLOPS)
> power.
>
> As you try to increase the number of cores in an SMP configuration, the
> total available memory bandwidth of the system does not increase; the
> memory controller(s) are merely utilized more effectively (saturated).
> That is why a cluster with fewer processing cores but more memory
> controllers provides better speedup (versus cost) in FDTD than a single
> multi-core CPU. In other words, multi-core CPUs are bad choices for FDTD
> calculations. In fact, my current FDTD server is a dual-CPU machine with
> 4 cores each, but with ~100 GB/s of memory bandwidth to feed the
> calculation.
>
> In my experience with mpi-meep, relatively little memory overhead was
> encountered, especially in larger simulations; perhaps you double-counted
> the shared memory space. The unix free and top commands provide a good
> estimate of real free memory. Could you post the results of free and top
> here?
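To make the bandwidth argument concrete, here is a crude STREAM-triad-style probe (my own sketch, unrelated to Meep): if the reported GB/s stops growing as you add threads on one node, the memory controller rather than the cores is the ceiling, which is the same wall an FDTD update loop hits.

// Crude STREAM-triad-style bandwidth probe (my own sketch, not from Meep).
// Build with: g++ -O2 -fopenmp bw.cpp (the pragma is ignored without OpenMP);
// vary OMP_NUM_THREADS and watch where the GB/s figure flattens out.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const long n = 1L << 26; // 64M doubles per array, ~512 MB each
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 3.0);
    const int reps = 10;

    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        #pragma omp parallel for
        for (long i = 0; i < n; ++i)
            a[i] = b[i] + 0.5 * c[i]; // triad: 2 loads + 1 store per element
    }
    auto t1 = std::chrono::steady_clock::now();

    double secs   = std::chrono::duration<double>(t1 - t0).count();
    double gbytes = 3.0 * sizeof(double) * n * reps / 1e9; // bytes moved
    std::printf("~%.1f GB/s effective bandwidth\n", gbytes / secs);
    return 0;
}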
_______________________________________________
meep-discuss mailing list
meep-discuss@ab-initio.mit.edu
http://ab-initio.mit.edu/cgi-bin/mailman/listinfo/meep-discuss