Mani Chandra wrote:

> We want to examine quantities on scales much larger than the lattice
> scale, and average over lattice scale features. See attached plot of
> the density in our simulation of transverse magnetic focussing. The
> left is the raw density data and the right is the interpolated data
> with the relwidth parameter set to 0.05. Evidently, the focussing
> peaks really stand out in the interpolated plot. I'm using the
> interpolator as follows:
>
> density = sum(density_operator(mode) for mode in modes)
> density_interpolated, boundaries =
> kwant.plotter.interpolate_density(sys, density, relwidth=0.05)

Smoothing out fine detail is a legitimate application of the
interpolation algorithm.  However, you use it at quite a massive scale.
Your input array being 5000^2 in size and relwidth=0.05 means that the
convolution kernel is 250^2 in size.

The way the interpolation algorithm works means that for each hopping
the 250^2 elements of the convolution kernel

  f(r) = (1-r^2)^2 Θ(1-r^2)

are computed.  (Remember that it’s a generic algorithm and works equally
well for any amorphous “lattice”.)  You seem to have 5000^2 horizontal
hoppings and as many vertical ones.  As a rough estimate (neglecting any
border effects) this boils down to

  2 * 5000^2 * 250^2 = 3.1e12

evaluations of the kernel!  Each involves multiple mathematical
operations and memory accesses, but they’re vectorized with NumPy in
batches of 250^2.
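For concreteness, here is a minimal NumPy sketch of one such vectorized batch of kernel evaluations, together with the total count estimated above.  All names are mine, not Kwant’s, and the distance grid is a made-up stand-in for what the real routine computes per hopping:

```python
import numpy as np

def bump(r2):
    """Polynomial bump kernel: (1 - r^2)^2 for r^2 < 1, else 0."""
    return np.where(r2 < 1, (1 - r2)**2, 0.0)

# Hypothetical 250x250 grid of squared distances around one hopping,
# normalized so the kernel support is the unit disc.
n = 250
x = np.linspace(-1, 1, n)
r2 = x[None, :]**2 + x[:, None]**2
kernel = bump(r2)        # one vectorized batch of n**2 kernel evaluations

# Rough total count for the scenario discussed above:
total = 2 * 5000**2 * 250**2
print(total)             # -> 3125000000000, i.e. 3.1e12
```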

You said that this computation takes 6 h.  Let’s assume that this is on
a 3 GHz CPU core.  Then, each evaluation takes some 20 CPU clock cycles,
which is actually not that bad - given that this is pure Python.  The
NumPy vectorization is apparently doing its job.
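The arithmetic behind that estimate, as a quick sanity check:

```python
# Back-of-the-envelope check of the "20 clock cycles" figure.
seconds = 6 * 3600                     # 6 hours of runtime
cycles = seconds * 3e9                 # assuming a single 3 GHz core
evaluations = 2 * 5000**2 * 250**2     # the estimate from above
cycles_per_eval = cycles / evaluations
print(round(cycles_per_eval, 1))       # -> 20.7
```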

An optimal CPU implementation of exactly the same generic algorithm
would still take a few CPU cycles per evaluation, so it won’t help you
much.

A GPU would be well suited for such regular work, but we both agree that
it would be overkill, and as we’ll see later, there’s probably a more
intelligent way to do long-range smoothing.

> > A complementary approach would be to rewrite the inner loop in
> > a fast compiled language.  This would help especially in the case
> > where the interpolation bump does not cover many points of the
> > output array and as such the existing vectorization is not very good
> > (the involved subarrays are small).
>
> Do you think this can be achieved using Cython?

It can, but your application won’t benefit from it.  Cython would help
for use cases where the convolution kernel is quite small.  In your case
of a huge kernel, thanks to NumPy, the CPU already spends most of its
time in highly optimized BLAS.

> > Yet another approach would be to create a special implementation for the
> > common case of a rectangular lattice that is commensurate with the
> > output array of the interpolator.
>
> We're indeed using a rectangular lattice!

Yes, so this might really help in your case.  One would split up all the
hoppings into classes where the hoppings are geometrically equivalent
modulo the output lattice.  In your case there would be only two such
classes: one for horizontal hoppings and one for vertical ones.

Then, for each hopping class, one would precompute the kernel and apply
it to all the hoppings in the class.  I guess that this would provide
a significant speedup (hopefully 5 times or more) without even
abandoning pure Python for the implementation.  And this could be
combined with parallelization among CPU cores.
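A rough sketch of what that could look like in pure Python/NumPy.  This is my guess at an implementation, not Kwant code; border clipping is ignored and all names are placeholders:

```python
import numpy as np

def smooth_one_class(values, positions, kernel, out_shape):
    """Accumulate value * kernel at each hopping of one geometric class.

    values:    per-hopping weights, shape (n_hoppings,)
    positions: integer (row, col) of each kernel's top-left corner
    kernel:    precomputed once, identical for the whole class
    """
    out = np.zeros(out_shape)
    kh, kw = kernel.shape
    for v, (i, j) in zip(values, positions):
        # Kernel evaluation is amortized: only a scaled add per hopping.
        out[i:i + kh, j:j + kw] += v * kernel
    return out

# Toy example: 3 "horizontal" hoppings on a 10x10 output grid.
kernel = np.ones((3, 3)) / 9.0          # normalized placeholder kernel
values = np.array([1.0, 2.0, 3.0])
positions = [(0, 0), (2, 2), (5, 5)]
field = smooth_one_class(values, positions, kernel, (10, 10))
```

The point is that the kernel is evaluated once per class instead of once per hopping; the per-hopping work shrinks to a multiply-add of precomputed arrays.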

> More generally, what do you think about simply using a filtering method
> (say gaussian filter) in scipy.ndimage (
> https://docs.scipy.org/doc/scipy/reference/reference/generated/scipy.ndimage.gaussian_filter.html#scipy.ndimage.gaussian_filter)
> or one of the many in scipy.signal (
> https://docs.scipy.org/doc/scipy/reference/reference/signal.html)? Are you
> using a special kind of filter?

Kwant’s interpolation scheme was designed such that

• it works for arbitrary irregular hopping graphs,
• and that it is physically correct (current is conserved).

Visually, the interpolation scheme can be imagined as follows:
Initially, a current field is established that is only non-zero along
straight lines that correspond to hoppings.  These lines are infinitely
thin (they have a delta-function-like cross section).  Then, in a second
step, the current field is smoothed by convolving it with a polynomial
bump function (polynomial for efficiency).

However, performing convolution in real space is inefficient, and as we
have seen it gets especially bad when smoothing is done over long
distances.

I think that a hybrid approach could help speed up the algorithm a lot
while keeping it fully generic:

• In a first step one would run the current routine but with a kernel
  that is only a few times larger than the longest hopping.  The result
  would be a regularly interpolated current field that is smoothed out
  only a little (only enough not to lose any local currents).

• In a second step, the output current field of the first step (which
  lives on a regular rectangular grid) is smoothed further by using an
  efficient FFT-based convolution routine.
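As a sketch of that second step, using SciPy’s FFT-based convolution on a made-up field (in practice `field` would be the mildly smoothed output of the first step; all names here are placeholders):

```python
import numpy as np
from scipy.signal import fftconvolve

# Stand-in for the output of the first, short-range interpolation step.
rng = np.random.default_rng(0)
field = rng.random((500, 500))

# Long-range smoothing kernel: the same polynomial bump as before, but
# now applied once to the whole regular grid via FFT instead of once
# per hopping.
n = 51
x = np.linspace(-1, 1, n)
r2 = x[None, :]**2 + x[:, None]**2
kernel = np.where(r2 < 1, (1 - r2)**2, 0.0)
kernel /= kernel.sum()        # normalize so the total density is preserved

smoothed = fftconvolve(field, kernel, mode='same')
```

The FFT-based pass costs roughly O(N log N) in the output size, independent of the smoothing length, which is exactly where the real-space routine becomes expensive.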

> > If you or someone else would like to work on this I would be happy
> > to guide you.
>
> Thanks! I can try Cythonizing or vectorizing the loop.

I suggest the following approach:

• You test whether the hybrid approach sketched above works for you.
  This could already speed up things quite a lot.

• However, the vectorization will be less efficient now due to much
  smaller smoothing kernels.  It will be interesting to see how much
  slower it gets per kernel evaluation compared to the current “20 clock
  cycles”.  We could then try to speed up the current routine by using
  the classification approach and/or some Cython.

Hope this helps - please do keep us updated about your findings.

Christoph
