Dear Prof. Blaha,

Thank you for the comments.

At the moment I have 56 k-points for a big slab of one of the ternary magnetic 2D materials. Perhaps I can reduce the number of k-points; something to test. Also, I now see that my 56 k-points are compatible with my 8 1:localhost lines (8 x 7 = 56) :-)

Also, for now it does not want to converge after 40 iterations with TEMP 0.002; for a while I was trying TEMP 0.004, and now I am trying TEMP 0.01. Maybe I should start with a smaller slab...

Some info you asked for:

The i7-13700K CPU has 8 P-cores (fast) and 8 E-cores (slow), so 16 physical cores in total. Each P-core runs 2 threads, so there are 24 threads in total. Many other new Intel CPUs are built the same way. I don't think there is an easy way to pin a given task to a given core, and it probably makes no sense anyway, because the CPU certainly does its own thermal management of the individual cores etc.

It seems this new Intel CPU is quite good at balancing the load over the different cores. For this slab one lapw1 cycle takes approx. an hour; here is the most recent example with 8x 1:localhost and OMP=3 (i.e. a slight overload; it is a bit faster than 16x 1:localhost):

Here is a part of the case.dayfile:
lapw1 -dn -p (22:15:21) starting parallel lapw1 at Tue Feb 14 10:15:21 PM CET 2023
->  starting parallel LAPW1 jobs at Tue Feb 14 10:15:21 PM CET 2023
running LAPW1 in parallel mode (using .machines.help)
8 number_of_parallel_jobs
localhost(7) 7846.370u 123.112s 57:21.99 231.54% 0+0k 0+0io 0pf+0w
localhost(7) 8073.008u 126.002s 56:16.88 242.80% 0+0k 0+0io 0pf+0w
localhost(7) 7859.701u 110.324s 54:47.53 242.43% 0+0k 0+0io 0pf+0w
localhost(7) 8073.152u 95.375s 56:33.84 240.69% 0+0k 0+0io 0pf+0w
localhost(7) 7531.787u 90.177s 57:48.78 219.73% 0+0k 0+0io 0pf+0w
localhost(7) 7883.831u 100.913s 55:39.61 239.09% 0+0k 0+0io 0pf+0w
localhost(7) 7980.689u 114.522s 56:04.84 240.58% 0+0k 0+0io 0pf+0w
localhost(7) 8113.984u 98.149s 56:10.74 243.63% 0+0k 0+0io 0pf+0w
   Summary of lapw1para:
   localhost     k=56    user=63362.5    wallclock=27044.2
17.563u 48.090s 57:50.56 1.8%   0+0k 0+1520io 5pf+0w
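
A quick cross-check of the load balance from these numbers: the eight jobs finish within about 3 minutes of each other (between 54:48 and 57:49 wallclock), and user/wallclock = 63362.5/27044.2 = approx. 2.34, i.e. on average each job keeps about 2.3 of the requested OMP=3 threads busy (consistent with the 220-244% CPU entries above).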


Here is a part of :log:
Tue Feb 14 09:17:38 PM CET 2023> (x) mixer
Tue Feb 14 09:17:41 PM CET 2023> (x) lapw0 -p
Tue Feb 14 09:18:05 PM CET 2023> (x) lapw1 -up -p
Tue Feb 14 10:15:21 PM CET 2023> (x) lapw1 -dn -p
Tue Feb 14 11:13:11 PM CET 2023> (x) lapw2 -up -p
Tue Feb 14 11:18:32 PM CET 2023> (x) sumpara -up -d
Tue Feb 14 11:18:32 PM CET 2023> (x) lapw2 -dn -p
Tue Feb 14 11:23:42 PM CET 2023> (x) sumpara -dn -d
Tue Feb 14 11:23:42 PM CET 2023> (x) lcore -up
Tue Feb 14 11:23:42 PM CET 2023> (x) lcore -dn
Tue Feb 14 11:23:43 PM CET 2023> (x) mixer
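
From these timestamps one full spin-polarized cycle takes about 2 h 6 min: roughly 57 min for lapw1 -up and 58 min for lapw1 -dn, about 5 min each for lapw2 -up and lapw2 -dn, and well under a minute for lapw0, lcore and mixer together.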

.machines file:
omp_global:12
omp_lapw1:3
omp_lapw2:3
1:localhost
1:localhost
1:localhost
1:localhost
1:localhost
1:localhost
1:localhost
1:localhost
granularity:1
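
With these settings the 8 k-parallel lapw1 jobs request 8 x 3 = 24 OMP threads, i.e. exactly the 24 hardware threads but more than the 16 physical cores, which is the "slight overload" mentioned above. For the 16x 1:localhost with OMP=1 comparison the file looks essentially the same, just along these lines (a rough sketch):

omp_global:1
1:localhost
... (16 such lines in total)
granularity:1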


I think in my case mpi won't make much difference. With "so many" cores on a single CPU and seemingly quite decent automatic load balancing, I should always be able to find a good balance between the number of 1:localhost lines and the OMP setting. I don't think I will ever be calculating really big cases with very few k-points; such things should in any case be done on a cluster.

Best,
Lukasz



On 2023-02-14 12:00, Peter Blaha wrote:
How many k-points do you have? (And how many cores in total?)

The number of lines (8 or 16) needs to be "compatible" with the number
of k-points. I have no experience with the memory bus of this CPU or
how "equally" the load gets distributed. You need to check the dayfile
and see whether e.g. all 16 parallel lapw1 jobs finished at about the
same time, or whether 8 of them run much longer than the other set.
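
(The per-job lines can be extracted e.g. with something like
grep 'localhost(' case.dayfile
and then the wallclock columns of the individual lapw1 jobs can be
compared.)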

The mpi code can be quite efficient and, for medium-sized cases, of
similar speed, but for this it is mandatory to install the ELPA
library.
For large cases you usually have only a few k-points, and clearly only
with mpi can you use many cores/cpus. For a 36-atom slab I would
probably not run the regular scf cycle with more than 16 k-points in
the IBZ (at least if it is insulating), and thus mpi gives a chance to
speed things up.

Again, I do not know what 16 mpi jobs will do if 8 cores are fast and 8 are slow.


On 14.02.2023 at 11:32, pluto via Wien wrote:
Dear Profs. Blaha, Marks,

Thank you for the information!

Could you give an estimate of the possible speed-up when I use mpi parallelization?

My tests on the 36-inequivalent-atom slab so far indicate that there is nearly no difference between the various k-parallel and OMP settings. So far I tried:

8x 1:localhost with OMP=2
16x 1:localhost with OMP=1
16x 1:localhost with OMP=2 (means slight overloading)

and the time per SCF cycle (runsp without SO) is practically the same in all of these. Later I will also try higher OMP with fewer 1:localhost lines, but I doubt this can be faster.

I have an i7-13700K with 64 GB of RAM and an NVMe SSD. During the 36-atom-slab parallel calculation around 35 GB is used.

Best,
Lukasz

PS: Now omp_lapwso also works for me in .machines. I think it was a SOC issue with my test case (which was bulk Au). I am sorry for this confusion.




On 2023-02-14 10:23, Peter Blaha wrote:
I have no experience for such a CPU with fast and slow cores.

Simply test out how you get the fastest turnaround for a fixed number
of k-points, different numbers of processes (which should be
compatible with your k-points), and OMP=1-2 (4).

Previously, overloading (requesting more processes/threads than there
are physical cores) was NOT a good idea, but I don't know how this
"fused" CPU behaves. Maybe some "small" overloading is ok. This all
depends on the number of k-points and the available cores.

PS:

I cannot reproduce your omp_lapwso:2 failure. My tests run fine and
the omp setting is taken over properly.




I am now using a machine with an i7-13700K. This CPU has 8 performance cores (P-cores) and 8 efficient cores (E-cores). In addition, each P-core runs 2 threads, so there are 24 threads altogether. It is hard to find reliable info online, but a P-core is probably approx. 2x faster than an E-core: https://www.anandtech.com/show/17047/the-intel-12th-gen-core-i912900k-review-hybrid-performance-brings-hybrid-complexity/10 This will of course depend on what is being calculated...

Do you have suggestions on how to optimize the .machines file for the parallel execution of an scf cycle?

On my machine, using a large OMP_NUM_THREADS leads to oscillations of the CPU usage (for a large slab maybe 40% of the time is spent in a single thread), suggesting that a large OMP is not the optimal strategy.

Some examples of strategies:

One strategy would be to repeat the line
1:localhost
24 times, to have all the threads busy, and set OMP_NUM_THREADS=1.

Another would be to repeat the line
1:localhost
8 times and set OMP_NUM_THREADS=2; this would mean using all 16 physical cores (see the sketch below).

Or perhaps one should rather "overload" the CPU, e.g. by repeating 1:localhost 16 times with OMP=2?
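
For the second option, as far as I understand the .machines syntax, the file would look roughly like this (my sketch, not yet tested):

omp_global:2
1:localhost
1:localhost
1:localhost
1:localhost
1:localhost
1:localhost
1:localhost
1:localhost
granularity:1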

Over time I will try to benchmark some of the different options, but perhaps there is some logic to how one should think about this.

In addition I have a comment on the .machines file. It seems that for FM+SOC (runsp -so) calculations the

omp_global

setting in .machines is ignored. The

omp_lapw1
omp_lapw2

settings seem to work fine. So I tried to set OMP for lapwso separately, by including a line like:

omp_lapwso:2

but this gives an error when executing parallel scf.

Best,
Lukasz
_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html