As you all know, Kwant does not provide any support for parallelism. There are several reasons for this:
• Parallelism is not very natural in Python, and it was even less so when Kwant was started.
• There are different kinds of parallelism (shared-memory, distributed-memory) and multiple frameworks for each approach. When Kwant was started, it therefore seemed a good idea to remain agnostic.
• Often there is a large amount of trivial parallelism (e.g. disorder averaging) that can easily be exploited even without any support from Kwant (see the sketch after this list).
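To illustrate the last bullet, here is a minimal sketch (modelled on the attached benchmark script) of disorder averaging parallelized entirely outside of Kwant with the standard library; the system size and the number of realizations are arbitrary placeholders:

# Disorder averaging with one worker process per core -- no Kwant support
# needed, since the realizations are completely independent.
from concurrent.futures import ProcessPoolExecutor
from statistics import mean

import kwant
from kwant.digest import gauss


def transmission(salt, energy=1.01, W=30, L=30):
    """Build one disorder realization (labeled by `salt`) and return T."""
    lat = kwant.lattice.square()
    syst = kwant.Builder()
    syst[(lat(x, y) for x in range(L) for y in range(W))] = \
        lambda site: 0.05 * gauss(repr(site), salt) + 4
    syst[lat.neighbors()] = -1
    lead = kwant.Builder(kwant.TranslationalSymmetry((-1, 0)))
    lead[(lat(0, j) for j in range(W))] = 4
    lead[lat.neighbors()] = -1
    syst.attach_lead(lead)
    syst.attach_lead(lead.reversed())
    return kwant.smatrix(syst.finalized(), energy).transmission(1, 0)


if __name__ == '__main__':
    # One worker process per core; each disorder realization is independent.
    with ProcessPoolExecutor() as pool:
        print(mean(pool.map(transmission, (str(i) for i in range(16)))))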
Now that Kwant has been around for some time, and given that most (all?) computers have multiple cores nowadays, I think it would be very useful if Kwant were able to use them, preferably by default and without any extra effort from the user. At a somewhat lower priority, it would certainly also be nice to have a solver that can take advantage of clusters.
----------------
Let's quickly review the current situation, specifically Kwant's MUMPS solver. I've added some timings to it and written a quick benchmark script that computes the transmission through a system of 500x500 sites with some disorder (patch attached, to be applied with "git am").
Note that Kwant uses the non-MPI version of MUMPS, but MUMPS itself may be using a multithreaded BLAS. This is actually the case on my machine (two cores, running Debian testing), where OpenBLAS is used. MUMPS comes from the official Debian package.
This is the output of running the script first with a single BLAS thread, and then with two:
$ OPENBLAS_NUM_THREADS=1 python3 examples/benchmark.py
Hamiltonian evaluation: 4.264016151428223
Lead processing: 4.115847826004028
Factorization: 6.535533905029297
Solving 334 RHS: 20.2034330368042
---
Total: 35.25790596008301

$ OPENBLAS_NUM_THREADS=2 python3 examples/benchmark.py
Hamiltonian evaluation: 4.19313383102417
Lead processing: 4.2185447216033936
Factorization: 6.271895885467529
Solving 334 RHS: 20.82533359527588
---
Total: 35.65378427505493

When using two threads, CPU usage actually does go up to 200% during "lead processing" and "factorization", but as we can see this is counterproductive for "lead processing", and barely helpful for "factorization". Only a single core is used during "Hamiltonian evaluation" (that's Kwant code) and "solving" (that's MUMPS). Does anyone know (or could someone test) whether other BLASes (e.g. MKL) perform better?
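In case someone wants to redo this comparison with another BLAS (MKL in particular), here is a hedged sketch of the knobs involved. Note that the thread count must be fixed before the BLAS library is loaded, and that the BLAS reported by numpy is not necessarily the one MUMPS was linked against:

# Pin the BLAS thread count from inside the benchmark script.  These are the
# usual environment variables of OpenBLAS, MKL and OpenMP-based BLASes; they
# must be set before the BLAS is loaded, i.e. before numpy/kwant are imported.
import os
for var in ('OPENBLAS_NUM_THREADS', 'MKL_NUM_THREADS', 'OMP_NUM_THREADS'):
    os.environ.setdefault(var, '1')

import numpy as np
np.show_config()    # shows which BLAS/LAPACK numpy was built against
                    # (not necessarily the one used by MUMPS)

import kwant        # imported only after the thread count has been pinned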
----------------
Starting from the above state of affairs, how can we speed things up?
First of all, let's not worry too much about the time it takes to evaluate the Hamiltonian: this operation should become around 100 times faster once Kwant has been vectorized.
The "lead processing" (mostly calculation of modes) is of somewhat lesser importance, since parameters often affect only the scattering region and leave the modes unchanged. (Kwant supports precalculation of modes.) Still, usually there are at least two leads, and as we see from the above timings, it would be better to distribute the mode calculations onto separate cores instead of using parallel BLAS. Are there any libraries that could provide more parallelization? (It’s a separate issue that both leads are identical and hence the same calculation is done twice.)
The biggest potential for speed gains lies in the last two steps, the factorization of the sparse matrix and the solution of the linear system. Unfortunately, there seems to be no way to parallelize either without using MPI:
• The factorization consists of two black-box operations that, I believe, can only be parallelized inside MUMPS using MPI (for example using PT-SCOTCH – has anyone ever done any tests?).
• The (typically many) right-hand sides of the solution step are independent, but as far as I can tell MUMPS as a library is not thread-safe, i.e. a single "MUMPS context" object cannot be shared across multiple threads to solve different RHS simultaneously. (This remains to be verified.) One could probably use a separate "context" for each thread, but this would be wasteful in terms of memory (a rough sketch of this idea follows below).
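To make the second bullet concrete, here is a minimal sketch of the "separate context per worker" idea, built on the same kwant.linalg.mumps wrapper that the solver uses in the attached patch. It uses processes rather than threads (which sidesteps both the GIL and the thread-safety question) and assumes a dense right-hand-side matrix; the duplicated factorization in every worker is precisely the memory waste mentioned above:

# Solve independent blocks of right-hand sides in separate worker processes,
# each with its own MUMPS context and hence its own (redundant) factorization.
from concurrent.futures import ProcessPoolExecutor
import numpy as np
from kwant.linalg import mumps


def solve_block(a, b_block):
    ctx = mumps.MUMPSContext()      # one context per worker, never shared
    ctx.factor(a)                   # duplicated work and memory
    return ctx.solve(b_block)


def parallel_solve(a, b, nworkers=2):
    blocks = np.array_split(b, nworkers, axis=1)
    with ProcessPoolExecutor(max_workers=nworkers) as pool:
        sols = list(pool.map(solve_block, [a] * nworkers, blocks))
    return np.concatenate(sols, axis=1)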
----------------
What could certainly be done to parallelize factorization & solving is to use the MPI version of MUMPS. Note that not only MUMPS itself, but also all of its dependencies (SCOTCH, METIS) rely on MPI for parallelism.
As far as I can tell, MPI MUMPS works by spreading different parts of the matrix among the different MPI worker processes. I do not know whether (and to what extent) there’s a penalty for doing this on shared-memory machines. (The MUMPS people seem to be interested in better support for shared-memory machines, see the talks from their last user group meeting [1].)
Even if MPI MUMPS is efficient (has anyone done any tests?), MPI itself poses a problem to which I do not know a satisfactory solution: programs that use MPI cannot simply be run, they must be started with a launcher like "mpirun" that is part of the MPI runtime. Thus, a Python package like Kwant cannot simply use MPI internally in a way that is invisible to its users. Still, I think it would be nice if users could keep using Kwant just as before. The best solution that I can think of is a function kwant.enable_mpi() that must be called on each MPI rank and returns only on MPI rank 0. On the other ranks it would block and act as a worker. Converting a Kwant script to MPI would thus only require putting "kwant.enable_mpi()" after "import kwant". Instead of being started with the usual "python script.py", the script would have to be run as "mpirun python script.py".
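To make the proposal a bit more concrete, here is a rough sketch of what kwant.enable_mpi() could look like on top of mpi4py. Everything here is hypothetical (the function does not exist and the task protocol is made up); the point is only the control flow: rank 0 returns to the user's script, all other ranks block and serve requests:

# Hypothetical sketch of the proposed kwant.enable_mpi() (not existing API).
from mpi4py import MPI


def enable_mpi():
    comm = MPI.COMM_WORLD
    if comm.Get_rank() == 0:
        return comm                # the user's script continues only on rank 0
    while True:                    # every other rank blocks here as a worker
        task = comm.bcast(None, root=0)
        if task is None:           # rank 0 broadcasts None when it shuts down
            raise SystemExit(0)    # mpi4py finalizes MPI at interpreter exit
        func, args = task          # e.g. this rank's share of an MPI-MUMPS job
        func(*args)

A script converted this way would then be launched as "mpirun python script.py", with the call to kwant.enable_mpi() placed right after "import kwant", exactly as described above.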
Does anyone see a better way to use MPI in Kwant?
----------------
We could also use some other library, but are there any free (as in free speech) libraries that can compete with MUMPS and that do not use MPI?
[1] http://mumps.enseeiht.fr/ud_2013.php
From 9ee34b1b55167f8df974546b91d435c34444bf35 Mon Sep 17 00:00:00 2001
From: Christoph Groth <christoph.gr...@cea.fr>
Date: Wed, 3 Aug 2016 16:23:23 +0200
Subject: [PATCH] simple MUMPS benchmark

---
 examples/benchmark.py   | 70 +++++++++++++++++++++++++++++++++++++++++++++++++
 kwant/solvers/common.py |  5 ++++
 kwant/solvers/mumps.py  |  5 ++++
 3 files changed, 80 insertions(+)
 create mode 100644 examples/benchmark.py

diff --git a/examples/benchmark.py b/examples/benchmark.py
new file mode 100644
index 0000000..3668944
--- /dev/null
+++ b/examples/benchmark.py
@@ -0,0 +1,70 @@
+# Tutorial 2.2.3. Building the same system with less code
+# =======================================================
+#
+# Physics background
+# ------------------
+#  Conductance of a quantum wire; subbands
+#
+# Kwant features highlighted
+# --------------------------
+#  - Using iterables and builder.HoppingKind for making systems
+#  - introducing `reversed()` for the leads
+#
+# Note: Does the same as tutorial1a.py, but using other features of Kwant.
+
+import time
+import kwant
+from kwant.digest import gauss
+
+# For plotting
+from matplotlib import pyplot
+
+
+def onsite(site):
+    return 0.05 * gauss(repr(site)) + 4
+
+
+def make_system(a=1, t=1.0, W=100, L=100):
+    # Start with an empty tight-binding system and a single square lattice.
+    # `a` is the lattice constant (by default set to 1 for simplicity.
+    lat = kwant.lattice.square(a)
+
+    syst = kwant.Builder()
+
+    #### Define the scattering region. ####
+    syst[(lat(x, y) for x in range(L) for y in range(W))] = onsite
+    syst[lat.neighbors()] = -t
+
+    #### Define and attach the leads. ####
+    # Construct the left lead.
+    lead = kwant.Builder(kwant.TranslationalSymmetry((-a, 0)))
+    lead[(lat(0, j) for j in range(W))] = 4 * t
+    lead[lat.neighbors()] = -t
+
+    # Attach the left lead and its reversed copy.
+    syst.attach_lead(lead)
+    syst.attach_lead(lead.reversed())
+
+    return syst
+
+
+def main():
+    syst = make_system(W=500, L=500)
+
+    # Finalize the system.
+    syst = syst.finalized()
+
+    # kwant.plotter.bands(syst.leads[0])
+
+    # Compute conductance
+    for energy in [1.01]:
+        t = time.time()
+        smatrix = kwant.smatrix(syst, energy)
+        print("---\nTotal: ", time.time() - t)
+        print("\nResult:", smatrix.transmission(1, 0))
+
+
+# Call the main function if the script gets executed (as opposed to imported).
+# See <http://docs.python.org/library/__main__.html>.
+if __name__ == '__main__':
+    main()
diff --git a/kwant/solvers/common.py b/kwant/solvers/common.py
index e1a19f8..e63b5a7 100644
--- a/kwant/solvers/common.py
+++ b/kwant/solvers/common.py
@@ -8,6 +8,7 @@
 
 __all__ = ['SparseSolver', 'SMatrix', 'GreensFunction']
 
+import time
 from collections import namedtuple
 from itertools import product
 import abc
@@ -150,8 +151,10 @@ class SparseSolver(metaclass=abc.ABCMeta):
         if not syst.lead_interfaces:
             raise ValueError('System contains no leads.')
 
+        t = time.time()
         lhs, norb = syst.hamiltonian_submatrix(args, sparse=True,
                                                return_norb=True)[:2]
+        print("Hamiltonian evaluation: ", time.time() - t)
         lhs = getattr(lhs, 'to' + self.lhsformat)()
         lhs = lhs - energy * sp.identity(lhs.shape[0], format=self.lhsformat)
         num_orb = lhs.shape[0]
@@ -172,6 +175,7 @@ class SparseSolver(metaclass=abc.ABCMeta):
         # Process the leads, generate the eigenvector matrices and lambda
         # vectors. Then create blocks of the linear system and add them
         # step by step.
+        t = time.time()
         indices = []
         rhs = []
         lead_info = []
@@ -257,6 +261,7 @@ class SparseSolver(metaclass=abc.ABCMeta):
                 # defer formation of true rhs until the proper system
                 # size is known
                 rhs.append((coords,))
+        print("Lead processing: ", time.time() - t)
 
         # Resize the right-hand sides to be compatible with the full lhs
         for i, mats in enumerate(rhs):
diff --git a/kwant/solvers/mumps.py b/kwant/solvers/mumps.py
index f0b9ac0..5b18ff4 100644
--- a/kwant/solvers/mumps.py
+++ b/kwant/solvers/mumps.py
@@ -13,6 +13,7 @@ import numpy as np
 
 from . import common
 from ..linalg import mumps
+import time
 
 class Solver(common.SparseSolver):
     """Sparse Solver class based on the sparse direct solver MUMPS."""
@@ -100,8 +101,10 @@ class Solver(common.SparseSolver):
         return old_opts
 
     def _factorized(self, a):
+        t = time.time()
         inst = mumps.MUMPSContext()
         inst.factor(a, ordering=self.ordering)
+        print("Factorization: ", time.time() - t)
         return inst
 
     def _solve_linear_sys(self, factorized_a, b, kept_vars):
@@ -111,12 +114,14 @@ class Solver(common.SparseSolver):
 
         solve = factorized_a.solve
         sols = []
 
+        t = time.time()
         for j in range(0, b.shape[1], self.nrhs):
             tmprhs = b[:, j:min(j + self.nrhs, b.shape[1])]
 
             if not self.sparse_rhs:
                 tmprhs = tmprhs.todense()
             sols.append(solve(tmprhs)[kept_vars, :])
+        print("Solving", b.shape[1], "RHS: ", time.time() - t)
 
         return np.concatenate(sols, axis=1)
-- 
2.8.1