Hi Nizamov,
Thanks for your comments. I should mention that the previous job corresponds to the normalization run, where there is only free space and PMLs. Here I have a single source plane plus a set of Bloch conditions. I don't really know how Meep splits the domain into chunks, but I was assuming the split was done along the propagation direction.
You are right, I may have to look at the source :\
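For reference, here is roughly what the job looks like, sketched with Meep's current Python interface rather than my actual input file; the cell size, resolution, frequency, and source placement below are all placeholder values:

    import meep as mp

    # Empty cell with PML along the propagation axis (z here), one
    # source plane, and Bloch-periodic boundaries in x/y via a k_point.
    cell = mp.Vector3(4, 4, 8)                      # placeholder cell size
    pml_layers = [mp.PML(1.0, direction=mp.Z)]      # PML only along z

    sources = [mp.Source(mp.ContinuousSource(frequency=0.5),  # placeholder
                         component=mp.Ex,
                         center=mp.Vector3(0, 0, -3),
                         size=mp.Vector3(4, 4, 0))]           # one source plane

    sim = mp.Simulation(cell_size=cell,
                        resolution=20,
                        boundary_layers=pml_layers,
                        sources=sources,
                        k_point=mp.Vector3())       # k = 0 Bloch conditions

    sim.init_sim()  # forces the chunk division to happen
    print(f"process {mp.my_rank()} of {mp.count_processors()} initialized")

Run under MPI (e.g. mpirun -np 8 python normalization.py), each process prints its rank, and top then shows how the memory ends up distributed, as below.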
Below are the results for 8 procs.
Tasks: 237 total, 9 running, 227 sleeping, 0 stopped, 1 zombie
Cpu1 : 55.3%us, 44.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 54.4%us, 45.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 52.8%us, 47.2%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 52.6%us, 47.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 40.4%us, 59.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 54.0%us, 46.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 99.1%us, 0.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 99.1%us, 0.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16411088k total, 11423856k used, 4987232k free, 256k buffers
Swap: 0k total, 0k used, 0k free, 275088k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5907 gdemesy 25 0 946m 751m 6840 R 104.9 4.7 2:06.09 meep-mpi
5909 gdemesy 25 0 949m 755m 7000 R 104.9 4.7 2:06.11 meep-mpi
5908 gdemesy 25 0 946m 751m 6856 R 104.6 4.7 2:06.12 meep-mpi
5906 gdemesy 25 0 946m 751m 7068 R 102.1 4.7 2:06.02 meep-mpi
5902 gdemesy 25 0 949m 755m 7036 R 101.8 4.7 2:06.02 meep-mpi
5903 gdemesy 25 0 946m 751m 6892 R 101.8 4.7 2:06.02 meep-mpi
5905 gdemesy 25 0 2798m 2.5g 6992 R 101.8 16.3 2:06.02 meep-mpi
5904 gdemesy 25 0 2794m 2.5g 7096 R 101.5 16.2 2:06.02 meep-mpi
Again, my 10 GB memory load is not evenly split... and the run is even slower than with 4 processes.
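To see where the extra time goes, the Python interface at least has a built-in timing breakdown; a minimal sketch, reusing the same placeholder cell as above:

    import meep as mp

    sim = mp.Simulation(cell_size=mp.Vector3(4, 4, 8),
                        resolution=20,
                        boundary_layers=[mp.PML(1.0, direction=mp.Z)],
                        k_point=mp.Vector3())
    sim.run(until=50)   # short run, just to accumulate statistics
    sim.print_times()   # per-category breakdown: stepping, communication, ...

With unbalanced chunks, the share of time spent in MPI communication should stand out.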
If I modify the structure, say by removing the Bloch conditions, the load is again unevenly distributed:
Cpu3 : 99.4%us, 0.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 99.4%us, 0.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu0 : 47.3%us, 52.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 47.9%us, 52.1%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 47.6%us, 52.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 47.9%us, 52.1%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 48.1%us, 51.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 47.1%us, 52.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16411088k total, 8093036k used, 8318052k free, 256k buffers
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9116 gdemesy 25 0 2330m 2.1g 6224 R 100.9 13.3 0:35.67 meep-mpi
9117 gdemesy 25 0 2332m 2.1g 6152 R 100.9 13.3 0:35.66 meep-mpi
9119 gdemesy 25 0 561m 366m 6088 R 101.3 2.3 0:35.67 meep-mpi
9118 gdemesy 25 0 561m 366m 6204 R 100.9 2.3 0:35.66 meep-mpi
9120 gdemesy 25 0 558m 363m 5788 R 100.9 2.3 0:35.66 meep-mpi
9114 gdemesy 25 0 563m 368m 6088 R 100.6 2.3 0:35.66 meep-mpi
9115 gdemesy 25 0 561m 366m 6164 R 100.6 2.3 0:35.66 meep-mpi
9121 gdemesy 25 0 560m 365m 5776 R 100.6 2.3 0:35.65 meep-mpi
Now let's add the slab to this dummy job... It doesn't change much:
Cpu11 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu0 : 46.8%us, 53.2%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 47.3%us, 52.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 43.9%us, 56.1%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 47.1%us, 52.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 47.4%us, 52.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 48.2%us, 51.8%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16411088k total, 8586420k used, 7824668k free, 256k buffers
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9732 gdemesy 25 0 561m 367m 6884 R 100.3 2.3 1:19.13 meep-mpi
9733 gdemesy 25 0 2571m 2.3g 7020 R 100.3 14.8 1:19.12 meep-mpi
9731 gdemesy 25 0 563m 368m 6636 R 100.0 2.3 1:19.12 meep-mpi
9734 gdemesy 25 0 2573m 2.3g 6832 R 100.0 14.8 1:19.12 meep-mpi
9735 gdemesy 25 0 561m 367m 6900 R 100.0 2.3 1:19.12 meep-mpi
9736 gdemesy 25 0 561m 367m 6624 R 100.0 2.3 1:19.11 meep-mpi
9737 gdemesy 25 0 560m 365m 6152 R 100.0 2.3 1:19.12 meep-mpi
9738 gdemesy 25 0 558m 363m 6116 R 100.0 2.3 1:19.13 meep-mpi
Thanks for your help anyway... I will keep you posted if I manage to get better performance.
Best,
Guillaume
Nizamov Shawkat <[email protected]> wrote:
In your case, have you witnessed this kind of unbalanced behavior (unbalanced memory, I mean)?
Sorry, I do not remember the exact details.
Let's see once again:
18175 25 0 353m 221m 6080 R 99.8 1.4 1:10.41 1 meep-mpi
18174 25 0 354m 222m 6388 R 100.2 1.4 1:10.41 6 meep-mpi
18172 25 0 1140m 1.0g 7016 R 99.8 6.3 1:10.41 2 meep-mpi
18173 25 0 1140m 1.0g 6804 R 99.5 6.3 1:10.40 4 meep-mpi
Tasks: 228 total, 5 running, 222 sleeping, 0 stopped, 1 zombie
Cpu1 : 23.9%us, 76.1%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 23.3%us, 76.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Well, it is possible that the simulation space is divided unevenly. In that case the results seem quite natural: the larger simulation volumes (cpu2 and cpu4) run at their full speed, while the 3-4 times smaller volumes (cpu1 and cpu6) complete their simulation steps roughly 3 times faster and then waste the remaining time waiting for the other two cores.
If this interpretation is correct, then there is nothing wrong with your setup and:
1) it would mean that the splitting of the overall simulation volume into separate per-core volumes was not performed optimally by Meep. Any Meep developer care to comment? I remember that the splitting algorithm takes the structure into account and optimizes the chunk volumes correspondingly. E.g., cores 1 and 6 may actually be simulating the slab volume, while cores 2 and 4 are calculating the free space/PML. Try it without the slab to see whether the distribution is even in that case (see the sketch after this list).
2) scaling might be much better when you further increase the number of cores, because the simulation volume may then be divided more evenly. Can you try it?
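Something along these lines would make the comparison systematic; a rough sketch in Meep's Python interface, with the slab size, permittivity, and cell dimensions all made up:

    import meep as mp

    def make_sim(with_slab):
        # Made-up geometry: a dielectric slab across the cell center.
        geometry = []
        if with_slab:
            geometry.append(mp.Block(center=mp.Vector3(),
                                     size=mp.Vector3(mp.inf, mp.inf, 1),
                                     material=mp.Medium(epsilon=12)))
        return mp.Simulation(cell_size=mp.Vector3(4, 4, 8),
                             resolution=20,
                             boundary_layers=[mp.PML(1.0, direction=mp.Z)],
                             geometry=geometry,
                             k_point=mp.Vector3())

    for with_slab in (False, True):
        sim = make_sim(with_slab)
        sim.init_sim()      # triggers the chunk division
        if mp.am_master():
            print(f"with_slab={with_slab}, {mp.count_processors()} processes")
        sim.reset_meep()    # free the fields before the next pass

Running the same script under mpirun -np 4 and mpirun -np 8 and watching top in each case would show whether more processes lead to a more even division.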
Actually, it would be interesting to compare how the simulation volume is divided for different numbers of processor cores, with and without the slab; that may give a clue as to how the splitting works. Another option is to look at the sources :)
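For the record, the chunk-division logic lives in Meep's C++ sources (in current source trees, around choose_chunkdivision in src/structure.cpp). Current releases of the Python interface also provide a helper that draws the chunk layout directly; a sketch, assuming matplotlib is installed:

    import meep as mp
    import matplotlib.pyplot as plt

    sim = mp.Simulation(cell_size=mp.Vector3(4, 4, 8),
                        resolution=20,
                        boundary_layers=[mp.PML(1.0, direction=mp.Z)],
                        k_point=mp.Vector3())
    sim.init_sim()

    # Draws one colored region per chunk, showing how the cell was
    # carved up among the MPI processes.
    sim.visualize_chunks()
    if mp.am_master():
        plt.savefig("chunks.png")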
With best regards
Shawkat Nizamov