Shihao wrote:
> In which I simulated a system of 256*256 atoms and averaged for 500
> samples. It will take about 20000 seconds with a dual E5 2696 v3
> server with 50 threads working parallely at about 2.7 GHz. I know that
> part of the developers contributed to the termed Topological Anderson
> Insulator, and I was wondering if there's some way to improve the
> efficiency of my code?

Hi,

Happy to hear that Kwant is useful for you.

I only looked briefly at your script, but it seems to me that each MPI rank is computing 500 S-matrices. When you say that you are using 50 threads in parallel, do you mean that you launched a single MPI job of size 50 (with 50 processes)? Or do you mean multi-threaded BLAS/LAPACK computations?

Averaging over disorder realizations is generally a task that lends itself very well to parallelization: it's a so-called embarrassingly parallel workload. If you need to compute 500 realizations on a machine with 50 cores, each core only has to compute 10 realizations. That should not take 5 1/2 hours (= 20000 seconds).

Here are some points that might help you:

• Make sure that you do not oversubscribe the machine. For example, by default OpenBLAS will use all the available cores, so if N processes are launched on an N-core machine, N*N threads will execute in total, which is very bad for performance. Check for oversubscription by monitoring the system load. OpenBLAS can be forced to use only a single thread by setting the OPENBLAS_NUM_THREADS environment variable.

• Avoid recalculating the lead modes when only the disorder changes. Kwant provides a way to precalculate modes:
  https://kwant-project.org/doc/1/reference/generated/kwant.system.FiniteSystem#kwant.system.FiniteSystem.precalculate

Hope this helps,
Christoph
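
P.S. The first point and the "10 realizations per core" arithmetic can be sketched roughly as follows. This is only an illustration, not taken from your script: the helper name split_realizations is made up, and it assumes the job runs as 50 separate processes (for example MPI ranks), each of which would work only on its own chunk. The environment variable has to be set before NumPy/SciPy (and therefore Kwant) are imported, otherwise OpenBLAS has already started its thread pool.

```python
import os

# Limit OpenBLAS to one thread per process. This must happen before
# NumPy/SciPy/Kwant are imported in the process.
os.environ["OPENBLAS_NUM_THREADS"] = "1"


def split_realizations(n_realizations, n_workers):
    """Split realization indices into one contiguous range per worker.

    Hypothetical helper for illustration: worker number `rank` would
    compute only the disorder realizations listed in chunks[rank].
    """
    base, extra = divmod(n_realizations, n_workers)
    chunks = []
    start = 0
    for rank in range(n_workers):
        # The first `extra` workers take one additional realization each.
        size = base + (1 if rank < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


chunks = split_realizations(500, 50)
print(len(chunks), len(chunks[0]))  # prints: 50 10
```

With such a split, each process computes 10 S-matrices instead of 500, and the results are averaged once at the end.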