Hi Nizamov,
Thanks for your comments. I should mention that the previous job corresponds to the normalization run, where there is only free space and PMLs. Here I have a single source plane plus a set of Bloch conditions. I don't really know how Meep splits the domain into chunks, but I was assuming the split was done along the propagation direction.
You are right, I may have to look at the source :\
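For reference, here is roughly what the job looks like, sketched with Meep's current Python interface rather than my actual input file; the cell size, resolution, frequency, and source placement below are all placeholder values:

    import meep as mp

    # Empty cell with PML along the propagation axis (z here), one
    # source plane, and Bloch-periodic boundaries in x/y via a k_point.
    cell = mp.Vector3(4, 4, 8)                      # placeholder cell size
    pml_layers = [mp.PML(1.0, direction=mp.Z)]      # PML only along z

    sources = [mp.Source(mp.ContinuousSource(frequency=0.5),  # placeholder
                         component=mp.Ex,
                         center=mp.Vector3(0, 0, -3),
                         size=mp.Vector3(4, 4, 0))]           # one source plane

    sim = mp.Simulation(cell_size=cell,
                        resolution=20,
                        boundary_layers=pml_layers,
                        sources=sources,
                        k_point=mp.Vector3())       # k = 0 Bloch conditions

    sim.init_sim()  # forces the chunk division to happen
    print(f"process {mp.my_rank()} of {mp.count_processors()} initialized")

Run under MPI (e.g. mpirun -np 8 python normalization.py), each process prints its rank, and top then shows how the memory ends up distributed, as below.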
Below are the results for 8 procs.
Tasks: 237 total, 9 running, 227 sleeping, 0 stopped, 1 zombie
Cpu1 : 55.3%us, 44.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 54.4%us, 45.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 52.8%us, 47.2%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 52.6%us, 47.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 40.4%us, 59.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 54.0%us, 46.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 99.1%us, 0.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 99.1%us, 0.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16411088k total, 11423856k used, 4987232k free, 256k buffers
Swap: 0k total, 0k used, 0k free, 275088k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5907 gdemesy 25 0 946m 751m 6840 R 104.9 4.7 2:06.09 meep-mpi
5909 gdemesy 25 0 949m 755m 7000 R 104.9 4.7 2:06.11 meep-mpi
5908 gdemesy 25 0 946m 751m 6856 R 104.6 4.7 2:06.12 meep-mpi
5906 gdemesy 25 0 946m 751m 7068 R 102.1 4.7 2:06.02 meep-mpi
5902 gdemesy 25 0 949m 755m 7036 R 101.8 4.7 2:06.02 meep-mpi
5903 gdemesy 25 0 946m 751m 6892 R 101.8 4.7 2:06.02 meep-mpi
5905 gdemesy 25 0 2798m 2.5g 6992 R 101.8 16.3 2:06.02 meep-mpi
5904 gdemesy 25 0 2794m 2.5g 7096 R 101.5 16.2 2:06.02 meep-mpi
Again, my 10 GB memory load is not evenly split... and the run is even slower than with 4 processes.
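To see where the extra time goes, the Python interface at least has a built-in timing breakdown; a minimal sketch, reusing the same placeholder cell as above:

    import meep as mp

    sim = mp.Simulation(cell_size=mp.Vector3(4, 4, 8),
                        resolution=20,
                        boundary_layers=[mp.PML(1.0, direction=mp.Z)],
                        k_point=mp.Vector3())
    sim.run(until=50)   # short run, just to accumulate statistics
    sim.print_times()   # per-category breakdown: stepping, communication, ...

With unbalanced chunks, the share of time spent in MPI communication should stand out.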
If I modify the structure, say by removing the Bloch conditions, the load is again unevenly distributed:
Cpu3 : 99.4%us, 0.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 99.4%us, 0.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu0 : 47.3%us, 52.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 47.9%us, 52.1%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 47.6%us, 52.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 47.9%us, 52.1%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 48.1%us, 51.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 47.1%us, 52.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16411088k total, 8093036k used, 8318052k free, 256k buffers
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9116 gdemesy 25 0 2330m 2.1g 6224 R 100.9 13.3 0:35.67 meep-mpi
9117 gdemesy 25 0 2332m 2.1g 6152 R 100.9 13.3 0:35.66 meep-mpi
9119 gdemesy 25 0 561m 366m 6088 R 101.3 2.3 0:35.67 meep-mpi
9118 gdemesy 25 0 561m 366m 6204 R 100.9 2.3 0:35.66 meep-mpi
9120 gdemesy 25 0 558m 363m 5788 R 100.9 2.3 0:35.66 meep-mpi
9114 gdemesy 25 0 563m 368m 6088 R 100.6 2.3 0:35.66 meep-mpi
9115 gdemesy 25 0 561m 366m 6164 R 100.6 2.3 0:35.66 meep-mpi
9121 gdemesy 25 0 560m 365m 5776 R 100.6 2.3 0:35.65 meep-mpi
Now let's add the slab to this dummy job... It doesn't change much:
Cpu11 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu0 : 46.8%us, 53.2%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 47.3%us, 52.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 43.9%us, 56.1%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 47.1%us, 52.9%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 47.4%us, 52.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 48.2%us, 51.8%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16411088k total, 8586420k used, 7824668k free, 256k buffers
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9732 gdemesy 25 0 561m 367m 6884 R 100.3 2.3 1:19.13 meep-mpi
9733 gdemesy 25 0 2571m 2.3g 7020 R 100.3 14.8 1:19.12 meep-mpi
9731 gdemesy 25 0 563m 368m 6636 R 100.0 2.3 1:19.12 meep-mpi
9734 gdemesy 25 0 2573m 2.3g 6832 R 100.0 14.8 1:19.12 meep-mpi
9735 gdemesy 25 0 561m 367m 6900 R 100.0 2.3 1:19.12 meep-mpi
9736 gdemesy 25 0 561m 367m 6624 R 100.0 2.3 1:19.11 meep-mpi
9737 gdemesy 25 0 560m 365m 6152 R 100.0 2.3 1:19.12 meep-mpi
9738 gdemesy 25 0 558m 363m 6116 R 100.0 2.3 1:19.13 meep-mpi
Thanks for your help anyway... I will keep you posted if I manage to get better performance.
Best,
Guillaume
Nizamov Shawkat <[email protected]> wrote:
In your case, have you witnessed this kind of unbalanced behavior (unbalanced memory, I mean)?
Sorry, I do not remember the exact details.
Let's see once again:
18175 25 0 353m 221m 6080 R 99.8 1.4 1:10.41 1 meep-mpi
18174 25 0 354m 222m 6388 R 100.2 1.4 1:10.41 6 meep-mpi
18172 25 0 1140m 1.0g 7016 R 99.8 6.3 1:10.41 2 meep-mpi
18173 25 0 1140m 1.0g 6804 R 99.5 6.3 1:10.40 4 meep-mpi
Tasks: 228 total, 5 running, 222 sleeping, 0 stopped, 1 zombie
Cpu1 : 23.9%us, 76.1%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 23.3%us, 76.7%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Well, it is possible that the simulation space is divided unevenly. In that case the results seem quite natural: the larger simulation volumes (cpu2 and cpu4) run at their full speed, while the 3-4 times smaller volumes (cpu1 and cpu6) complete their simulation steps roughly 3 times faster and then waste the remaining time waiting for the other two cores.
If this interpretation is correct, then there is nothing wrong with your setup and:
1) it would mean that the splitting of the overall simulation volume into separate per-core volumes was not performed optimally by Meep. Any Meep developer care to comment? I remember that the splitting algorithm takes the structure into account and optimizes the chunk volumes correspondingly. E.g., cores 1 and 6 may actually be simulating the slab volume, while cores 2 and 4 are calculating the free space/PML. Try it without the slab to see whether the distribution is even in that case (see the sketch after this list).
2) scaling might be much better when you further increase the number of cores, because the simulation volume may then be divided more evenly. Can you try it?
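Something along these lines would make the comparison systematic; a rough sketch in Meep's Python interface, with the slab size, permittivity, and cell dimensions all made up:

    import meep as mp

    def make_sim(with_slab):
        # Made-up geometry: a dielectric slab across the cell center.
        geometry = []
        if with_slab:
            geometry.append(mp.Block(center=mp.Vector3(),
                                     size=mp.Vector3(mp.inf, mp.inf, 1),
                                     material=mp.Medium(epsilon=12)))
        return mp.Simulation(cell_size=mp.Vector3(4, 4, 8),
                             resolution=20,
                             boundary_layers=[mp.PML(1.0, direction=mp.Z)],
                             geometry=geometry,
                             k_point=mp.Vector3())

    for with_slab in (False, True):
        sim = make_sim(with_slab)
        sim.init_sim()      # triggers the chunk division
        if mp.am_master():
            print(f"with_slab={with_slab}, {mp.count_processors()} processes")
        sim.reset_meep()    # free the fields before the next pass

Running the same script under mpirun -np 4 and mpirun -np 8 and watching top in each case would show whether more processes lead to a more even division.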
Actually, it would be interesting to compare how the simulation volume is divided for different numbers of processor cores, with and without the slab; that may give a clue as to how the splitting works. Another option is to look at the sources :)
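For the record, the chunk-division logic lives in Meep's C++ sources (in current source trees, around choose_chunkdivision in src/structure.cpp). Current releases of the Python interface also provide a helper that draws the chunk layout directly; a sketch, assuming matplotlib is installed:

    import meep as mp
    import matplotlib.pyplot as plt

    sim = mp.Simulation(cell_size=mp.Vector3(4, 4, 8),
                        resolution=20,
                        boundary_layers=[mp.PML(1.0, direction=mp.Z)],
                        k_point=mp.Vector3())
    sim.init_sim()

    # Draws one colored region per chunk, showing how the cell was
    # carved up among the MPI processes.
    sim.visualize_chunks()
    if mp.am_master():
        plt.savefig("chunks.png")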
With best regards
Shawkat Nizamov