Dear Christoph,

Thanks for your detailed explanation. Since my model is built from Wannier90 output, the calculation is fairly large. I have confirmed that setting OPENBLAS_NUM_THREADS=1 does improve the efficiency, and I have added this setting to my .bashrc.

Here I share a data point on parallelization using the concurrent package: if a single-CPU calculation takes time t, the same calculation on 12 CPUs takes about t/5.5 (a speedup of roughly 5.5), which is acceptable on a cluster.
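For readers who want to try something similar, here is a minimal sketch of this kind of parallelization. It is not my production script: it assumes Python's standard concurrent.futures module, and bands_at_k, solve_all and parallel_efficiency are hypothetical names, with a toy two-band formula standing in for the real Hamiltonian diagonalization at each k-point.

```python
import os

# Assumption: this is set before NumPy/SciPy are imported, so that OpenBLAS
# stays single-threaded inside each worker. It only matters if the work
# function uses OpenBLAS-backed linear algebra.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")

import cmath
from concurrent.futures import ProcessPoolExecutor


def bands_at_k(k, v=1.0, w=0.5):
    """Hypothetical stand-in for one k-point diagonalization.

    Returns the two bands of a two-site chain, E = -/+ |v + w*exp(ik)|.
    """
    e = abs(v + w * cmath.exp(1j * k))
    return (-e, e)


def solve_all(k_points, max_workers=12, executor_cls=ProcessPoolExecutor):
    """Distribute independent k-points over a pool of workers."""
    with executor_cls(max_workers=max_workers) as pool:
        # map() returns results in the same order as k_points.
        return list(pool.map(bands_at_k, k_points, chunksize=8))


def parallel_efficiency(speedup, n_workers):
    """Fraction of ideal linear scaling that was actually achieved."""
    return speedup / n_workers
```

With the numbers above, parallel_efficiency(5.5, 12) is about 0.46, so roughly half of the ideal scaling is reached. When solve_all is called from a script with the default process pool, the call should sit under an if __name__ == "__main__": guard so that worker processes can import the module safely.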
Thanks to all of you. This discussion is important for DFT-TB calculations.

Sincerely,
Jiaqi