1) You didn't provide any details about the layout of your cluster. It is hard to guess whether you have several dual-core nodes with 16 GB of memory each, or 6-core nodes and you are running on only one of them.

2) I recall that with a 4-core (or was it 8?) Athlon in a single PC there was no acceleration beyond 3 cores. In my case the limitation was most probably memory bandwidth, which does not seem to be your case; in any event, I would still see all cores running at almost 100%.

3) Are you sure that you are timing the actual simulation and not the initialization? Populating the memory with the epsilon values is not necessarily balanced across processes: each core only fills its own part of the simulation space. If that part is uniformly filled, it completes quickly; if it contains structure, especially with subpixel averaging turned on, it may take much longer, during which the other cores simply wait. Print some debug information such as "structure initialization started", "simulation started", etc., and compare the timing distribution (a sketch follows below). From "runtime0=2" I conclude that your simulation time is actually rather short.
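For point 3, here is a minimal timing sketch in the Scheme/ctl interface. It assumes your geometry, sources and a runtime0 parameter are already defined in slab3D.ctl (the names here are only placeholders), that your Meep build provides (init-fields) to build the structure separately from the time stepping, and that Guile's get-internal-real-time is available; please adapt it to your own control file.

    ; Timing sketch: separate structure initialization from time stepping.
    ; Assumes geometry/sources and (define-param runtime0 2) appear above.
    (define (wall-time)
      ; wall-clock seconds as a real number, using Guile's timer
      (/ (get-internal-real-time) internal-time-units-per-second))

    (define t0 (wall-time))
    (print "structure initialization started\n")

    (init-fields)   ; forces the structure (epsilon) and fields to be built

    (define t1 (wall-time))
    (print "simulation started; initialization took " (- t1 t0) " s\n")

    (run-until runtime0)   ; the actual FDTD time stepping

    (define t2 (wall-time))
    (print "simulation finished; time stepping took " (- t2 t1) " s\n")

If the initialization figure dominates and barely shrinks as you add processors, the poor scaling is in building the structure (subpixel averaging) rather than in the time stepping itself.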
With best regards,
Shawkat Nizamov

2010/7/15, [email protected] <[email protected]>:
> Dear Meep users and developers,
>
> I'm getting strange scaling performance using meep-mpi compiled with
> IntelMPI on our cluster. When I go from 1 to 2 processors, I get
> almost ideal scaling (i.e. the runtime is divided by almost 2, as shown
> below for various problem sizes), but the scaling becomes very weak
> when using more than 2 processors. I should say that the meep-mpi
> results agree with the ones I get on my PC with meep-serial (in other
> words, our compilation seems all right).
>
> nb_proc  runtime-res=20  runtime-res=40  runtime-res=60  runtime-res=80
> 1        20.5            135             449             1086
> 2        11.47           73              230             551
> 4        11.52           68              219             530
> 8        12.9            67              222             528
>
> Let's go into some more detail with a job size of ~3 GB (a 3D problem).
> I am showing below the stats obtained when requesting 4 processors:
>
> mpirun -np 4 meep-mpi res=100 runtime0=2 norm-run?=true slab3D.ctl
>
> -------------------------------------------------------------------------
> Mem:  16411088k total, 4015216k used, 12395872k free,    256k buffers
> Swap:        0k total,       0k used,        0k free, 283692k cached
>
>   PID PR NI  VIRT  RES  SHR S  %CPU %MEM   TIME+ P COMMAND
> 18175 25  0  353m 221m 6080 R  99.8  1.4 1:10.41 1 meep-mpi
> 18174 25  0  354m 222m 6388 R 100.2  1.4 1:10.41 6 meep-mpi
> 18172 25  0 1140m 1.0g 7016 R  99.8  6.3 1:10.41 2 meep-mpi
> 18173 25  0 1140m 1.0g 6804 R  99.5  6.3 1:10.40 4 meep-mpi
>
> Tasks: 228 total, 5 running, 222 sleeping, 0 stopped, 1 zombie
> Cpu1 : 23.9%us, 76.1%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu6 : 23.3%us, 76.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu2 : 99.7%us,  0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu4 : 99.7%us,  0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> [...]
> -------------------------------------------------------------------------
>
> So what we see here is that, while the processors are all running flat
> out, for CPUs 1 and 6 (the two processes that are light on memory)
> only about 1/4 of the time is spent in user code and 3/4 in system
> time -- normally I/O, but here probably MPI communication. That
> explains why I don't get shorter runtimes with more than 2 processors.
>
> So we have a fairly clear load-balance issue. Have you experienced
> this kind of situation? I was wondering whether there are meep-mpi
> parameters I can set to affect the domain decomposition into chunks
> in a helpful way.
>
> I can send more details if needed.
>
> Thanks in advance!
>
> Best regards,
>
> Guillaume Demésy
>
> ----------------------------------------------------------------
> This message was sent using IMP, the Internet Messaging Program.
>
> _______________________________________________
> meep-discuss mailing list
> [email protected]
> http://ab-initio.mit.edu/cgi-bin/mailman/listinfo/meep-discuss

