Hello again, I'm posting to update anyone who might be interested in the future about what I did and found out. In general, the memory usage was my own fault: I had not properly accounted for the memory associated with files I was loading prior to execution. However, I still ran into some strangeness with memory bloat, which for now I'm willing to attribute to memory bandwidth issues...though I can't really say for sure. Details are below. Thanks again for everyone's help!
John

---------(quoting an earlier response from Georg Wachter)-------------
Hello,

Sounds reasonable! My gut feeling is that I would not parallelize a 35h job further for a factor 3 wallclock time decrease. For checking the size of the halo and the partitioning of the simulation space, the best reference is likely the source code itself.

Best, Georg
-------------------------------------------------------------------------------------------

Trawling the source code leads me to believe that the "thickness" of the halo is somewhat of a misnomer. If I understand correctly, the boundary class and its related step method handle communication at the interfaces between field chunks (derived from the structure chunks), so that the math comes out the same as if there were no split. Pairs of voxels that span the divide are stored in these boundary objects, and no other voxels are explicitly duplicated. So as far as the halo is concerned, I'm not certain how much memory it would take compared to an added layer of ghost voxels on each chunk.
However, this is all somewhat of a moot point, due in part to the next section:

------------(quoting an earlier response from Tran Quyet Thang)---------
This is my trial run for a comparable grid size (500x1000x500). This is the source code for test2.cpp:

#include <meep.hpp>
using namespace meep;

#ifdef _OPENMP
extern int omp_get_num_threads();
#endif

double eps(const vec &p) {
  if (p.x() < 2 && p.y() < 3 && p.z() < 2) return 12.0;
  return 1.0;
}

int main(int argc, char **argv) {
  // eps = 2.0f
  initialize mpi(argc, argv); // do this even for non-MPI Meep
  double resolution = 100;    // pixels per distance unit
  grid_volume v = vol3d(5, 10, 5, resolution); // 500x1000x500 3d cell
  structure s(v, eps, pml(1.0));
  fields f(&s);
  // f.output_hdf5(Dielectric, v.surroundings());
  double freq = 0.3, fwidth = 0.1;
  gaussian_src_time src(freq, fwidth);
  f.add_point_source(Ez, 0.8, 0.6, 0.0, 4.0, vec(0.751, 0.5, 0.601), 1.0);
  while (f.time() < 1.1) {
    f.step();
  }
  // f.output_hdf5(Hz, v.surroundings());
  return 0;
}

np=1 run (single core)
# mpirun -np 1 ./test2.dac

  PID USER     PR NI VIRT  RES SHR  S %CPU  %MEM TIME+   COMMAND
21124 tranthan 20 0  57.9g 57g  4208 R 100.0 45.8 5:31.83 test2.dac

on time step 1 (time=0.005), 247.346 s/step
on time step 2 (time=0.01), 19.0855 s/step
on time step 3 (time=0.015), 18.9015 s/step
on time step 4 (time=0.02), 18.8894 s/step
on time step 5 (time=0.025), 18.8865 s/step
on time step 6 (time=0.03), 18.8948 s/step
on time step 7 (time=0.035), 18.8861 s/step
on time step 8 (time=0.04), 18.8895 s/step

np=8 run
# mpirun -np 8 ./test2.dac

  PID USER     PR NI VIRT  RES  SHR S %CPU  %MEM TIME+   COMMAND
20948 tranthan 20 0  7940m 7.6g 12m R 100.0 6.1  3:54.26 test2.dac
20947 tranthan 20 0  7828m 7.5g 10m R 100.0 6.0  3:57.76 test2.dac
20950 tranthan 20 0  7323m 7.0g 12m R 97.8  5.6  3:58.27 test2.dac
20953 tranthan 20 0  7323m 7.0g 12m R 100.0 5.6  3:57.90 test2.dac
20951 tranthan 20 0  7311m 7.0g 12m R 100.0 5.6  3:57.72 test2.dac
20954 tranthan 20 0  7266m 7.0g 10m R 100.0 5.5  3:58.39 test2.dac
20952 tranthan 20 0  7256m 7.0g 10m R 100.0 5.5  3:58.38 test2.dac
20949 tranthan 20 0  7255m 7.0g 10m R 100.0 5.5  3:58.19 test2.dac

on time step 1 (time=0.005), 148.926 s/step
on time step 3 (time=0.015), 2.68053 s/step
on time step 5 (time=0.025), 2.67348 s/step
on time step 7 (time=0.035), 2.67746 s/step
on time step 9 (time=0.045), 2.67517 s/step

Total memory of the MPI version: 6*7.0 + 7.5 + 7.6 = 57.1 GB => ~0.1 GB overhead.
Speedup = 18.88/2.67 ~= 7.1

Server configuration: two Xeon(R) CPU E5-2637 v2, 128 GB DDR3 1866 MHz, theoretical bandwidth ~100 GB/s.

For such a large problem, in my experience, the speedup is quite nice and the overhead is negligible. Can you try running my code and see if there are any differences?

PS: note that I built Meep with the Intel compiler and MPI libraries, and with the lowest optimization level (-O).
----------------------------------------------------------

Thank you, this was very helpful. I don't know why I hadn't considered stripping the simulation down to its barest elements like this...I think I was still focused too much on getting the MPI implementation working correctly rather than on what the code itself was doing.

Without going into too much detail: I'm loading a large (> 1 GB) file whose contents are used by the eps() function in the structure declaration. I create a 3D array from that data so it can be indexed more easily. You can probably see where this is going. In MPI mode, each process loads its own copy of the file, AND each process stores its own copy of the 3D array. The resulting memory usage borders on the absurd. Unfortunately, each process does need to access this data, and yes, there are probably better ways to load and index it, but for now I've settled on freeing all of this memory after the structure is initialized.
Doing this keeps running memory usage between 18 and 22 GB regardless of the number of cores used*** (for anyone wondering why this is lower than the usage predicted above: I made a mistake earlier, and the actual grid size is about 400 x 650 x 350).

*** Correction: the memory usage is not, in fact, always ~20 GB regardless of the number of cores used, as the tests below show.

I am content with the performance gain and memory usage that I'm seeing now, but I've noticed a curious phenomenon that I can't account for. The system I'm accessing has a pretty diverse set of hardware: some really fast/many-core/large-memory nodes, some smaller and slower, and a handful of medium-speed nodes with massive physical memory. The following is a summary of what I found using one particular hardware configuration (I've long since deleted the stdout files, so the numbers are rough estimates):

----------
Test 1: 4 cores on 2 nodes with 24 cores and 24 GB each, for a total of 4/48 cores active and 48 GB of memory available (core usage is split evenly between the two nodes, 2 active on each).
Result: speed is ~2.0 s/step, memory use is ~20 GB.

Test 2: 8 cores on the same 2 nodes, for a total of 8/48 cores active, 48 GB of memory available.
Result: speed is ~1.7 s/step, memory use is ~20 GB.

Test 3: 12 cores on the same 2 nodes, for a total of 12/48 cores active, 48 GB of memory available.
Result: speed is ~3-4 s/step, memory use is ~30 GB.

Test 4: 16 cores on the same 2 nodes, for a total of 16/48 cores active, 48 GB of memory available.
Result: speed is ~5-6 s/step, memory use is ~45 GB.
----------

Running similar tests on different sets of hardware, I get qualitatively similar results. The number of requested cores that produces this bloating effect differs in each case, but the trend is the same: memory use stays roughly constant up to some threshold number of cores, after which it increases drastically.
Like I said before, this is not necessarily a problem I need to solve, as I'm satisfied with the performance I see with configurations such as Test 2. My first thought was that it's a memory bandwidth issue, as mentioned by Tran Quyet Thang, but to me that explains only the performance drain, not the memory bloat. Full disclosure: I don't know what OpenMPI is doing under the hood. If the available bandwidth passes some critical point, does it cease to properly exchange data between physical memory and swap space, or something like that? It really is a mystery to me. Again, any insight is appreciated, but not critical. Thanks again, and happy FDTD simulations to all.

Best,
John
_______________________________________________ meep-discuss mailing list meep-discuss@ab-initio.mit.edu http://ab-initio.mit.edu/cgi-bin/mailman/listinfo/meep-discuss