Hello again, I'm posting to update anyone who might be interested in the future about what I did and found out. In general, the memory usage was my own fault: I had not properly accounted for the memory associated with files I was loading prior to execution. However, I still ran into some strangeness with memory bloat, which for now I'm willing to attribute to memory bandwidth issues...though I can't really say for sure. Details are below. Thanks again for everyone's help!
John

---------(quoting an earlier response from Georg Wachter)-------------
Hello,

Sounds reasonable! My gut feeling is that I would not parallelize a 35h job further for a factor 3 wallclock time decrease. For checking the size of the halo and the partitioning of the simulation space, the best reference is likely the source code itself.

Best, Georg
-------------------------------------------------------------------------------------------

Trawling the source code leads me to believe that the "thickness" of the halo is somewhat of a misnomer. If I understand correctly, the boundary class and its related step method handle communication at the interfaces between field chunks (derived from the structure chunks), so that the math comes out the same as if there were no split. Pairs of voxels that span the divide are stored in these boundary objects, and no other voxels are explicitly duplicated. So as far as the halo is concerned, I'm not certain how much memory it would take compared to an added layer of ghost voxels on each chunk.
However, this is all somewhat of a moot point, due in part to the next section:

------------(quoting an earlier response from Tran Quyet Thang)---------
This is my trial run for a comparable grid size (500x1000x500). This is the source code for test2.cpp:

#include <meep.hpp>
using namespace meep;

#ifdef _OPENMP
extern int omp_get_num_threads();
#endif

double eps(const vec &p) {
  if (p.x() < 2 && p.y() < 3 && p.z() < 2) return 12.0;
  return 1.0;
}

int main(int argc, char **argv) {
  // eps = 2.0f
  initialize mpi(argc, argv); // do this even for non-MPI Meep
  double resolution = 100;    // pixels per distance unit
  grid_volume v = vol3d(5, 10, 5, resolution); // 500x1000x500 3d cell
  structure s(v, eps, pml(1.0));
  fields f(&s);
  // f.output_hdf5(Dielectric, v.surroundings());
  double freq = 0.3, fwidth = 0.1;
  gaussian_src_time src(freq, fwidth);
  f.add_point_source(Ez, 0.8, 0.6, 0.0, 4.0, vec(0.751, 0.5, 0.601), 1.0);
  while (f.time() < 1.1) {
    f.step();
  }
  // f.output_hdf5(Hz, v.surroundings());
  return 0;
}

np=1 run (single core)
# mpirun -np 1 ./test2.dac

  PID USER     PR NI VIRT  RES SHR  S %CPU  %MEM TIME+   COMMAND
21124 tranthan 20 0  57.9g 57g  4208 R 100.0 45.8 5:31.83 test2.dac

on time step 1 (time=0.005), 247.346 s/step
on time step 2 (time=0.01), 19.0855 s/step
on time step 3 (time=0.015), 18.9015 s/step
on time step 4 (time=0.02), 18.8894 s/step
on time step 5 (time=0.025), 18.8865 s/step
on time step 6 (time=0.03), 18.8948 s/step
on time step 7 (time=0.035), 18.8861 s/step
on time step 8 (time=0.04), 18.8895 s/step

np=8 run
# mpirun -np 8 ./test2.dac

  PID USER     PR NI VIRT  RES  SHR S %CPU  %MEM TIME+   COMMAND
20948 tranthan 20 0  7940m 7.6g 12m R 100.0 6.1  3:54.26 test2.dac
20947 tranthan 20 0  7828m 7.5g 10m R 100.0 6.0  3:57.76 test2.dac
20950 tranthan 20 0  7323m 7.0g 12m R 97.8  5.6  3:58.27 test2.dac
20953 tranthan 20 0  7323m 7.0g 12m R 100.0 5.6  3:57.90 test2.dac
20951 tranthan 20 0  7311m 7.0g 12m R 100.0 5.6  3:57.72 test2.dac
20954 tranthan 20 0  7266m 7.0g 10m R 100.0 5.5  3:58.39 test2.dac
20952 tranthan 20 0  7256m 7.0g 10m R 100.0 5.5  3:58.38 test2.dac
20949 tranthan 20 0  7255m 7.0g 10m R 100.0 5.5  3:58.19 test2.dac

on time step 1 (time=0.005), 148.926 s/step
on time step 3 (time=0.015), 2.68053 s/step
on time step 5 (time=0.025), 2.67348 s/step
on time step 7 (time=0.035), 2.67746 s/step
on time step 9 (time=0.045), 2.67517 s/step

Total memory of the MPI version: 6*7.0 + 7.5 + 7.6 = 57.1 GB => ~0.1 GB overhead.
Speedup = 18.88/2.67 ~= 7.1

Server configuration: two Xeon(R) CPU E5-2637 v2, 128 GB DDR3 1866 MHz, theoretical bandwidth ~100 GB/s.

For such a large problem, in my experience, the speedup is quite nice and the overhead is negligible. Can you try running my code and see if there are any differences?

PS: note that I built Meep with the Intel compiler and MPI libraries, and with the lowest optimization level (-O).
----------------------------------------------------------

Thank you, this was very helpful. I don't know why I hadn't considered stripping the simulation down to its barest elements like this...I think I was still focused too much on getting the MPI implementation working correctly rather than on what the code itself was doing.

Without going into too much detail: I'm loading a large (> 1 GB) file whose contents are used by the eps() function in the structure declaration. I create a 3D array from that data so it can be indexed more easily. You can probably see where this is going. In MPI mode, each process loads its own copy of the file, AND each process stores its own copy of the 3D array. The resulting memory usage borders on the absurd. Unfortunately, each process does need to access this data, and yes, there are probably better ways to load and index it, but for now I've settled on freeing all of this memory after the structure is initialized.
Doing this keeps running memory usage between 18 and 22 GB regardless of the number of cores used*** (for anyone wondering why this is lower than the usage predicted above: I made a mistake earlier, and the actual grid size is about 400 x 650 x 350).

*** Correction: the memory usage is not, in fact, always ~20 GB regardless of the number of cores used, as the tests below show.

I am content with the performance gain and memory usage that I'm seeing now, but I've noticed a curious phenomenon that I can't account for. The system I'm accessing has a pretty diverse set of hardware: some really fast/many-core/large-memory nodes, some smaller and slower, and a handful of medium-speed nodes with massive physical memory. The following is a summary of what I found using one particular hardware configuration (I've long since deleted the stdout files, so the numbers are rough estimates):

----------
Test 1: 4 cores on 2 nodes with 24 cores and 24 GB each, for a total of 4/48 cores active and 48 GB of memory available (core usage is split evenly between the two nodes, 2 active on each).
Result: speed is ~2.0 s/step, memory use is ~20 GB.

Test 2: 8 cores on the same 2 nodes, for a total of 8/48 cores active, 48 GB of memory available.
Result: speed is ~1.7 s/step, memory use is ~20 GB.

Test 3: 12 cores on the same 2 nodes, for a total of 12/48 cores active, 48 GB of memory available.
Result: speed is ~3-4 s/step, memory use is ~30 GB.

Test 4: 16 cores on the same 2 nodes, for a total of 16/48 cores active, 48 GB of memory available.
Result: speed is ~5-6 s/step, memory use is ~45 GB.
----------

Running similar tests on different sets of hardware, I get qualitatively similar results. The number of requested cores that produces this bloating effect differs in each case, but the trend is the same: memory use stays roughly constant up to some threshold number of cores, after which it increases drastically.
Like I said before, this is not necessarily a problem I need to solve, as I'm satisfied with the performance I see with configurations such as Test 2. My first thought was that it's a memory bandwidth issue, as mentioned by Tran Quyet Thang, but to me that explains only the performance drain, not the memory bloat. Full disclosure: I don't know what OpenMPI is doing under the hood. If the available bandwidth passes some critical point, does it cease to properly exchange data between physical memory and swap space, or something like that? It really is a mystery to me. Again, any insight is appreciated, but not critical. Thanks again, and happy FDTD simulations to all.

Best,
John
_______________________________________________ meep-discuss mailing list meep-discuss@ab-initio.mit.edu http://ab-initio.mit.edu/cgi-bin/mailman/listinfo/meep-discuss