Shihao wrote:

> In which I simulated a system of 256*256 atoms and averaged for 500
> samples. It will take about 20000 seconds with a dual E5 2696 v3
> server with 50 threads working parallely at about 2.7 GHz. I know that
> part of the developers contributed to the termed Topological Anderson
> Insulator, and I was wondering if there's some way to improve the
> efficiency of my code?

Hi, happy to hear that Kwant is useful for you.

I only looked briefly at your script, but it seems to me that each MPI
rank is computing 500 S-matrices.  When you say that you are using 50
threads in parallel, do you mean that you launched a single MPI job of
size 50 (with 50 processes)?  Or do you refer to multi-threaded
BLAS/LAPACK computations?

Averaging over disorder realizations is generally a task that lends
itself very well to parallelization: it’s a so-called embarrassingly
parallel workload.  If you need to compute 500 realizations on a machine
with 50 cores, that means that each core has to compute 10 realizations.
That should not take 5 1/2 hours (=20000 seconds).
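
As a rough illustration of that arithmetic, here is a minimal sketch of how the realizations could be divided among workers. The helper name `realizations_for_rank` is made up for this example. In an MPI job the values of `rank` and `n_ranks` would typically come from mpi4py (`comm.Get_rank()` and `comm.Get_size()`), which I am assuming here rather than reading off your script.

```python
def realizations_for_rank(n_total, n_ranks, rank):
    """Return the realization indices handled by one worker (round-robin)."""
    if not 0 <= rank < n_ranks:
        raise ValueError("rank must satisfy 0 <= rank < n_ranks")
    # Worker `rank` takes indices rank, rank + n_ranks, rank + 2*n_ranks, ...
    return range(rank, n_total, n_ranks)


# With 500 realizations on 50 workers, every worker gets exactly 10.
counts = [len(realizations_for_rank(500, 50, r)) for r in range(50)]
print(counts[0], sum(counts))
```

Each worker then loops only over its own indices, and the partial sums are combined once at the end (for example with an MPI reduce).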

Here are some points that might help you:

• Make sure that you do not oversubscribe the machine.  For example, by
  default OpenBLAS will utilize all the available cores, so if
  N processes are launched on an N-core machine, N*N threads will execute
  in total, which is very bad for performance.  Check for
  oversubscription by monitoring the system load: it should not exceed
  the number of cores.  OpenBLAS can be forced to use a single thread
  only by setting the OPENBLAS_NUM_THREADS environment variable to 1.

• Avoid recalculating the modes when only the disorder changes.  Kwant
  provides a way to precalculate modes:
  
https://kwant-project.org/doc/1/reference/generated/kwant.system.FiniteSystem#kwant.system.FiniteSystem.precalculate
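
The two points above could be combined roughly as follows. This is only a sketch under assumptions, not a drop-in replacement for your script: the function name `average_transmission` and the parameter names `U0` and `salt` are made up, and I assume that the onsite function of the scattering region takes exactly these two parameters while the leads take none.

```python
import os

# Must be set before NumPy/SciPy (and thus OpenBLAS) are first imported.
os.environ["OPENBLAS_NUM_THREADS"] = "1"


def average_transmission(fsyst, energy, salts, U0):
    """Average the lead 0 -> lead 1 transmission over disorder realizations.

    Assumes (hypothetically) that `fsyst` is a finalized Kwant system whose
    onsite function takes the parameters `U0` and `salt`, and that the
    leads do not depend on any parameters.
    """
    # Imported here so that the thread limit above is already in effect.
    import kwant

    # Compute the lead modes once; they do not change with the disorder.
    # The precalculated modes are only valid for this particular energy.
    fsyst_pre = fsyst.precalculate(energy=energy)

    total = 0.0
    for salt in salts:
        # Only the scattering region is re-evaluated for each realization.
        smat = kwant.smatrix(fsyst_pre, energy=energy,
                             params=dict(U0=U0, salt=salt))
        total += smat.transmission(1, 0)
    return total / len(salts)
```

Each process would call this with its own share of the salts.  Note that the precalculated system must only be used at the energy for which the modes were computed.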

Hope this helps
Christoph
