Author: Antonio Cuni <anto.c...@gmail.com>
Branch: extradoc
Changeset: r5844:fc3e9ca8c4df
Date: 2017-10-26 18:10 +0200
http://bitbucket.org/pypy/extradoc/changeset/fc3e9ca8c4df/

Log:    draft of a new blog post

diff --git a/blog/draft/2017-10-how-to-make-50x-faster.rst 
b/blog/draft/2017-10-how-to-make-50x-faster.rst
new file mode 100644
--- /dev/null
+++ b/blog/draft/2017-10-how-to-make-50x-faster.rst
@@ -0,0 +1,238 @@
+How to make your code 80 times faster
+======================================
+
+I often hear people who are happy because PyPy makes their code 2 times faster
+or so. Here is a short personal story which shows that PyPy can go well beyond
+that.
+
+**DISCLAIMER**: this is not a silver bullet or a general recipe: it worked in
+this particular case, but it might not work so well in other cases. Still, I
+think it is an interesting technique. Moreover, the various steps and
+implementations are shown in the same order as I tried them during the
+development, so it is a real-life example of how to proceed when optimizing
+for PyPy.
+
+Some months ago I `played a bit`_ with evolutionary algorithms: the ambitious
+plan was to automatically evolve logic which could control a (simulated)
+quadcopter, i.e. a `PID controller`_ (**spoiler**: it doesn't fly).
+
+.. _`played a bit`: https://github.com/antocuni/evolvingcopter
+.. _`PID controller`: https://en.wikipedia.org/wiki/PID_controller
+
+The idea is to have an initial population of random creatures: at each
+generation, the ones with the best fitness survive and reproduce with small,
+random variations.
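+
+The skeleton of such an algorithm looks more or less like this (a toy,
+self-contained sketch with a made-up fitness function; not the actual code of
+the project)::
+
+    import random
+
+    POPULATION_SIZE = 500
+
+    def fitness(creature):
+        # toy fitness: distance from an arbitrary target value (lower is better)
+        return abs(creature - 42.0)
+
+    def mutate(creature):
+        # small, random variation
+        return creature + random.gauss(0, 1)
+
+    # initial population of random "creatures" (here, just plain floats)
+    population = [random.uniform(0, 100) for _ in range(POPULATION_SIZE)]
+    for generation in range(6):
+        population.sort(key=fitness)                   # best fitness first
+        survivors = population[:POPULATION_SIZE // 2]  # the best half survives
+        population = survivors + [mutate(c) for c in survivors]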
+
+However, for the scope of this post, the actual task at hand is not so
+important, so let's jump straight to the code. To drive the quadcopter, a
+``Creature`` has a ``run_step`` method which runs at each delta-t (`full
+code`_)::
+
+    class Creature(object):
+        INPUTS = 2  # z_setpoint, current z position
+        OUTPUTS = 1 # PWM for all 4 motors
+        STATE_VARS = 1
+        ...
+
+        def run_step(self, inputs):
+            # state: [state_vars ... inputs]
+            # out_values: [state_vars, ... outputs]
+            self.state[self.STATE_VARS:] = inputs
+            out_values = np.dot(self.matrix, self.state) + self.constant
+            self.state[:self.STATE_VARS] = out_values[:self.STATE_VARS]
+            outputs = out_values[self.STATE_VARS:]
+            return outputs
+      
+- ``inputs`` is a numpy array containing the desired setpoint and the current
+  position on the Z axis;
+
+- ``outputs`` is a numpy array containing the thrust to give to the motors. To
+  keep things simple at first, all 4 motors are constrained to have the same
+  thrust, so that the quadcopter only travels up and down the Z axis;
+
+- ``self.state`` contains an arbitrary number of values which are passed from
+  one step to the next;
+
+- ``self.matrix`` and ``self.constant`` contain the actual logic: by putting
+  the "right" values there, in theory we could get a perfectly tuned PID
+  controller. These are randomly mutated between generations; the toy example
+  below shows the concrete shapes involved.
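+
+Here is a toy instantiation of the various arrays (made-up values, just to
+show the layout; the real ones are randomly initialized)::
+
+    import numpy as np
+
+    STATE_VARS, INPUTS, OUTPUTS = 1, 2, 1
+    # one row per output value (state vars + outputs),
+    # one column per state/input value (state vars + inputs)
+    matrix = np.zeros((STATE_VARS + OUTPUTS, STATE_VARS + INPUTS))  # shape (2, 3)
+    constant = np.zeros(STATE_VARS + OUTPUTS)                       # shape (2,)
+    state = np.zeros(STATE_VARS + INPUTS)   # [s0, z_setpoint, z]
+
+    state[STATE_VARS:] = [1.5, 0.3]                 # fill in the two inputs
+    out_values = np.dot(matrix, state) + constant   # [new s0, thrust]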
+
+.. _`full code`: 
https://github.com/antocuni/evolvingcopter/blob/master/ev/creature.py
+
+``run_step`` is called at 100Hz (in the virtual time of the simulation). At
+each generation, we test 500 creatures for a total of 12 virtual seconds each.
+So, we have 100 steps/second × 12 seconds × 500 creatures = 600,000 executions
+of ``run_step`` at each generation.
+
+At first, I simply tried to run this code on CPython; here is the result::
+
+    $ python -m ev.main
+    Generation   1: ... [population = 500]  [12.06 secs]
+    Generation   2: ... [population = 500]  [6.13 secs]
+    Generation   3: ... [population = 500]  [6.11 secs]
+    Generation   4: ... [population = 500]  [6.09 secs]
+    Generation   5: ... [population = 500]  [6.18 secs]
+    Generation   6: ... [population = 500]  [6.26 secs]
+
+Which means ~6.15 seconds/generation, excluding the first.
+
+Then I tried with PyPy 5.9::
+
+    $ pypy -m ev.main
+    Generation   1: ... [population = 500]  [63.90 secs]
+    Generation   2: ... [population = 500]  [33.92 secs]
+    Generation   3: ... [population = 500]  [34.21 secs]
+    Generation   4: ... [population = 500]  [33.75 secs]
+
+Ouch! We are ~5.5x slower than CPython. This was kind of expected: numpy is
+based on cpyext, which is infamously slow (although `we are working on that`_).
+
+So, let's try to avoid cpyext. The first obvious step is to use numpypy
+instead of numpy; for more info about numpypy, see the next section. In the
+meantime, here is a link to the hack_ which enables it on PyPy. Conceptually,
+the hack boils down to a conditional import; something along these lines (a
+sketch with a made-up module name, not the actual content of the linked
+file)::
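+
+    # sketch only: pick PyPy's built-in numpy replacement when running on
+    # PyPy, and the regular numpy otherwise. The module name ``numpypy`` is
+    # an assumption for illustration; see the linked hack for the real code.
+    import platform
+    if platform.python_implementation() == 'PyPy':
+        import numpypy as np
+    else:
+        import numpy as np
+
+Let's see if the speed improves::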
+
+    $ pypy -m ev.main   # using numpypy
+    Generation   1: ... [population = 500]  [5.60 secs]
+    Generation   2: ... [population = 500]  [2.90 secs]
+    Generation   3: ... [population = 500]  [2.78 secs]
+    Generation   4: ... [population = 500]  [2.69 secs]
+    Generation   5: ... [population = 500]  [2.72 secs]
+    Generation   6: ... [population = 500]  [2.73 secs]
+
+So, ~2.7 seconds on average: this is 12x faster than PyPy+numpy, and more than
+2x faster than the original CPython. At this point, most people would be happy
+and go tweeting about how good PyPy is.
+
+.. _`we are working on that`: 
https://morepypy.blogspot.it/2017/10/cape-of-good-hope-for-pypy-hello-from.html
+.. _hack: 
https://github.com/antocuni/evolvingcopter/blob/master/ev/pypycompat.py
+
+In general, when talking about CPython vs PyPy, I am rarely satisfied with a
+2x speedup: I know that PyPy can do much better than this, especially if you
+write code which is specifically optimized for the JIT. For a real-life
+example, have a look at the `capnpy benchmarks`_, in which the PyPy version is
+~15x faster than the heavily optimized CPython+Cython version (both have been
+written by me, and I tried hard to write the fastest code for both
+implementations).
+
+.. _`capnpy benchmarks`: http://capnpy.readthedocs.io/en/latest/benchmarks.html
+
+So, let's try to do better. As usual, the first thing to do is to profile and
+see where we spend most of the time. Here is the `vmprof profile`_: we spend a
+lot of time inside the internals of numpypy, and in allocating tons of
+temporary arrays to store the results of the various operations.
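+
+To collect a similar profile for your own code, vmprof can be run directly
+from the command line, with the ``--web`` option uploading the result to
+vmprof.com; the invocation is roughly like this (modulo the exact flags of
+your vmprof version)::
+
+    $ pypy -m vmprof --web yourscript.py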
+
+Also, let's look at the `jit traces`_ and search for the function ``run``:
+this is the loop in which we spend most of the time, and it is composed of a
+total of 1796 operations.  The operations emitted for the line ``np.dot(...) +
+self.constant`` are listed between lines 1217 and 1456; 239 low-level
+operations are a lot: if we look at them, we can see for example that there is
+a call to the RPython function `descr_dot`_ at line 1232. But there are also
+calls to ``raw_malloc`` at line 1295, which allocate the space to store the
+result of ``... + self.constant``.
+
+.. _`vmprof profile`: http://vmprof.com/#/449ca8ee-3ab2-49d4-b6f0-9099987e9000
+.. _`jit traces`: 
http://vmprof.com/#/28fd6e8f-f103-4bf4-a76a-4b65dbd637f4/traces
+.. _`descr_dot`: 
https://bitbucket.org/pypy/pypy/src/89d1f31fabc86778cfaa1034b1102887c063de66/pypy/module/micronumpy/ndarray.py?at=default&fileviewer=file-view-default#ndarray.py-1168
+
+All of this is very suboptimal: in this particular case, we know that the
+shape of ``self.matrix`` is always ``(2, 3)``: so, we are doing an incredible
+amount of work and ``malloc()``ing a temporary array just to call an RPython
+function which ultimately does a total of 6 multiplications and 6 additions.
+
+One possible solution to this nonsense is a well-known compiler optimization:
+loop unrolling.  From the compiler's point of view, unrolling the loop is
+always risky because if the matrix is too big you might end up emitting a huge
+blob of code, possibly useless if the shape of the matrices changes
+frequently: this is the main reason why the PyPy JIT does not even try to do
+it in this case.
+
+However, we **know** that the matrix is small, and always of the same
+shape. So, let's unroll the loop manually::
+
+    class SpecializedCreature(Creature):
+
+        def __init__(self, *args, **kwargs):
+            Creature.__init__(self, *args, **kwargs)
+            # store the data in a plain Python list, which pypy is able to
+            # optimize as a float array
+            self.data = list(self.matrix.ravel()) + list(self.constant)
+            self.data_state = [0.0]
+            assert self.matrix.shape == (2, 3)
+            assert len(self.data) == 8
+
+        def run_step(self, inputs):
+            # state: [state_vars ... inputs]
+            # out_values: [state_vars, ... outputs]
+            k0, k1, k2, q0, q1, q2, c0, c1 = self.data
+            s0 = self.data_state[0]
+            z_sp, z = inputs
+            #
+            # compute the output
+            out0 = s0*k0 + z_sp*k1 + z*k2 + c0
+            out1 = s0*q0 + z_sp*q1 + z*q2 + c1
+            #
+            self.data_state[0] = out0
+            outputs = [out1]
+            return outputs
+
+In the `actual code`_ there is also a sanity check which asserts that the
+computed output is the very same as the one returned by ``Creature.run_step``.
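+
+Conceptually, the check looks more or less like this (a sketch, not the real
+code; the linked implementation also has to keep ``self.state`` and
+``self.data_state`` consistent for the comparison to make sense)::
+
+    # sketch: at the end of SpecializedCreature.run_step, before returning
+    expected = Creature.run_step(self, inputs)  # the generic numpy version
+    assert np.allclose(outputs, expected)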
+
+Note that this code is particularly PyPy-friendly, because ``self.data`` is a
+simple list of floats: thanks to `list strategies`_, it is internally
+represented as a flat array of C doubles, i.e. very fast and compact.
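+
+You can check the strategy of a list yourself at a PyPy prompt, using the
+PyPy-specific ``__pypy__`` module (the exact strategy names may vary between
+versions)::
+
+    >>>> from __pypy__ import strategy
+    >>>> strategy([1.0, 2.0, 3.0])
+    'FloatListStrategy'
+    >>>> strategy([1.0, 2.0, 'hello'])
+    'ObjectListStrategy'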
+
+.. _`actual code`: 
https://github.com/antocuni/evolvingcopter/blob/master/ev/creature.py#L100
+.. _`list strategies`: 
https://morepypy.blogspot.it/2011/10/more-compact-lists-with-list-strategies.html
+
+So, let's try to see how it performs. First, with CPython::
+
+    $ python -m ev.main
+    Generation   1: ... [population = 500]  [7.61 secs]
+    Generation   2: ... [population = 500]  [3.96 secs]
+    Generation   3: ... [population = 500]  [3.79 secs]
+    Generation   4: ... [population = 500]  [3.74 secs]
+    Generation   5: ... [population = 500]  [3.84 secs]
+    Generation   6: ... [population = 500]  [3.69 secs]
+
+This looks good: 60% faster than the original CPython+numpy
+implementation. Let's try it on PyPy::
+
+    Generation   1: ... [population = 500]  [0.39 secs]
+    Generation   2: ... [population = 500]  [0.10 secs]
+    Generation   3: ... [population = 500]  [0.11 secs]
+    Generation   4: ... [population = 500]  [0.09 secs]
+    Generation   5: ... [population = 500]  [0.08 secs]
+    Generation   6: ... [population = 500]  [0.12 secs]
+    Generation   7: ... [population = 500]  [0.09 secs]
+    Generation   8: ... [population = 500]  [0.08 secs]
+    Generation   9: ... [population = 500]  [0.08 secs]
+    Generation  10: ... [population = 500]  [0.08 secs]
+    Generation  11: ... [population = 500]  [0.08 secs]
+    Generation  12: ... [population = 500]  [0.07 secs]
+    Generation  13: ... [population = 500]  [0.07 secs]
+    Generation  14: ... [population = 500]  [0.08 secs]
+    Generation  15: ... [population = 500]  [0.07 secs]
+
+No, it's not an error: after a couple of generations, the time stabilizes at
+around ~0.07-0.08 seconds per generation. This is around **80 (eighty) times faster**
+than the original CPython+numpy implementation, and around 35-40x faster than
+the naive PyPy+numpypy one.
+
+Let's look at the trace_ again: it no longer contains expensive calls or
+temporary ``malloc()``s: the core of the logic is between lines XX-YY, where
+we can see that it does fast C-level multiplications and additions.
+
+.. _trace: http://vmprof.com/#/402af746-2966-4403-a61d-93015abac033/traces
+
+As I said before, this is a very particular example, and the techniques
+described here do not always apply: it is not realistic to expect an 80x
+speedup, unfortunately. However, it clearly shows the potential of PyPy when
+it comes to high-speed computing. And most importantly, it's not a toy
+benchmark which was designed specifically to have good performance on PyPy:
+it's a real-world example, albeit a small one.
+
+Numpy vs numpypy
+-----------------
+
+``numpypy`` is PyPy's own partial reimplementation of the numpy API, written
+in RPython: since it is integrated with the interpreter, the JIT can see
+through it and optimize it well, but it does not implement the full numpy
+API. The numpy used in the first attempt above is the real one, running on
+top of cpyext, PyPy's compatibility layer for the CPython C API: it is
+complete, but the cpyext overhead makes it slow, as we saw at the beginning.