----- Original Message -----
> From: "David Edelsohn" <dje....@gmail.com>
> To: "PIERRE AUGIER" <pierre.aug...@univ-grenoble-alpes.fr>
> Cc: "pypy-dev" <pypy-dev@python.org>
> Sent: Monday, December 21, 2020 23:47:22
> Subject: Re: [pypy-dev] Differences performance Julia / PyPy on very similar codes

> You did not state on exactly what system you are conducting the
> experiment, but "a factor of 4" seems very close to the
> auto-vectorization speedup of a vector of floats.

The problem is described in detail in the repository
https://github.com/paugier/nbabel and in the related issue
https://foss.heptapod.net/pypy/pypy/-/issues/3349

>> I think it would be very interesting to understand why PyPy is much slower
>> than Julia in this case (a factor of 4 slower than very simple Julia). I'm
>> wondering if it is an issue of the language or a limitation of the
>> implementation.
> 
> If the performance gap is caused by auto-vectorization, I would
> recommend that you consider NumPy with Numba's LLVM-based JIT.  Or,
> for a "pure" Python solution, you can experiment with an older release
> of PyPy and NumPyPy.

There is already an implementation based on Numba (which is slower and, in my
point of view, less elegant than what can be done with Transonic-Pythran).

Here, it is really about what can be done with PyPy, nowadays and in the future.

About NumPyPy, I'm sorry about this story, but I'm not interested in playing with
an unsupported project.

> If the problem is the abstraction penalty, then the suggestion from
> Anto should help.

I tried to use a list to store the data, but unfortunately it's slower (1.5
times slower than with attributes and 6 times slower than Julia on my slow
laptop):

Measurements with Julia 
(https://github.com/paugier/nbabel/blob/master/py/microbench_ju4.jl):

pierre@voyage ~/Dev/nbabel/py master $ julia microbench_ju4.jl                  
Main.NB.MutablePoint3D  17.833 ms (1048576 allocations: 32.00 MiB)
Main.NB.Point3D  5.737 ms (0 allocations: 0 bytes)
Main.NB.Point4D  4.984 ms (0 allocations: 0 bytes)

Measurements with PyPy objects with x, y, z attributes (like Julia, 
https://github.com/paugier/nbabel/blob/master/py/microbench_pypy4.py):

pierre@voyage ~/Dev/nbabel/py master $ pypy microbench_pypy4.py                 
Point3D: 22.503 ms
Point4D: 45.127 ms

Measurements with PyPy, lists and @property 
(https://github.com/paugier/nbabel/blob/master/py/microbench_pypy_list.py):

pierre@voyage ~/Dev/nbabel/py master $ pypy microbench_pypy_list.py             
Point3D: 34.115 ms
Point4D: 59.646 ms
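
For context, here is a minimal sketch of the two Python layouts being compared.
The actual benchmark code is in the linked microbench files; the classes and
names below are only an illustration of the idea, not the exact code used:

class Point3DAttrs:
    """One object per point; the three floats are plain instance attributes."""
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

    def norm_square(self):
        return self.x**2 + self.y**2 + self.z**2


class Point3DListBacked:
    """Same interface, but the floats live in a list and are exposed
    through @property accessors (the variant that turned out slower)."""
    def __init__(self, x, y, z):
        self._data = [x, y, z]

    @property
    def x(self):
        return self._data[0]

    @property
    def y(self):
        return self._data[1]

    @property
    def z(self):
        return self._data[2]

    def norm_square(self):
        x, y, z = self._data
        return x**2 + y**2 + z**2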

> But, for the question of why, you can examine the code for the inner
> loop generated by Julia and the code for the inner loop generated by
> PyPy and analyze the reason for the performance gap.  It should be
> evident if the difference is abstraction or SIMD.

Sorry for this naive question, but how can I examine the code for the inner loop
generated by PyPy?

Pierre

> 
> On Mon, Dec 21, 2020 at 5:20 PM PIERRE AUGIER
> <pierre.aug...@univ-grenoble-alpes.fr> wrote:
>>
>>
>> ----- Original Message -----
>> > From: "David Edelsohn" <dje....@gmail.com>
>> > To: "PIERRE AUGIER" <pierre.aug...@univ-grenoble-alpes.fr>
>> > Cc: "pypy-dev" <pypy-dev@python.org>
>> > Sent: Friday, December 18, 2020 21:00:42
>> > Subject: Re: [pypy-dev] Differences performance Julia / PyPy on very similar codes
>>
>> > Does Julia based on LLVM auto-vectorize the code?  I assume yes
>> > because you specifically mention SIMD design of the data structure.
>>
>> Yes, Julia auto-vectorizes the code. Can't PyPy do the same in some cases?
>>
>> > Have you tried NumPyPy?  Development on NumPyPy has not continued, but
>> > it probably would be a better comparison of what PyPy with
>> > auto-vectorization could accomplish to compare with Julia.
>>
>> I haven't tried NumPyPy because I can't import _numpypy with PyPy3.6.
>>
>> Anyway, for this experiment, the goal was to stay in pure Python and to
>> compare with what is done in pure Julia.
>>
>> I think it would be very interesting to understand why PyPy is much slower
>> than Julia in this case (a factor of 4 slower than very simple Julia). I'm
>> wondering if it is an issue of the language or a limitation of the
>> implementation.
>>
>> Moreover, I would really be interested to know if an extension compatible
>> with PyPy (better, not only compatible with PyPy) could be written to make
>> such code faster (code involving an array of instances of a very simple
>> class). Could we gain anything compared to using a Python list?
>>
>> Are there some tools to understand what is done by PyPy to speed up some
>> code? Or to know more about the data structures used under the hood by PyPy?
>>
>> For example,
>>
>> class Point3D:
>>     def __init__(self, x, y, z):
>>         self.x = x
>>         self.y = y
>>         self.z = z
>>
>>     def norm_square(self):
>>         return self.x**2 + self.y**2 + self.z**2
>>
>> I guess it would be good for efficiency to store the 3 floats as native
>> floats aligned in memory and to vectorize the power computation. How can one
>> know what is done by PyPy for a particular code?
>>
>> Pierre
>>
>> >
>> > Thanks, David
>> >
>> > On Fri, Dec 18, 2020 at 2:56 PM PIERRE AUGIER
>> > <pierre.aug...@univ-grenoble-alpes.fr> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I post on this list a message written in the PyPy issue tracker
>> >> (https://foss.heptapod.net/pypy/pypy/-/issues/3349#note_150255). It is
>> >> about some experiments I did on writing efficient implementations of the
>> >> N-body problem (https://github.com/paugier/nbabel), potentially to answer
>> >> this article: https://arxiv.org/pdf/2009.11295.pdf.
>> >>
>> >> I got from a PR an [interesting optimized implementation in
>> >> Julia](https://github.com/paugier/nbabel/blob/master/julia/nbabel4_serial.jl).
>> >> It is very fast (even slightly faster than the Pythran version). One idea
>> >> is to store the 3 floats of a 3D physical vector, (x, y, z), in a struct
>> >> `Point4D` containing 4 floats to better use SIMD instructions.
>> >>
>> >> I added a pure Python implementation inspired by this new Julia
>> >> implementation (but with a simple `Point3D` with 3 floats, because with
>> >> PyPy the `Point4D` does not make the code faster) and, good news, with
>> >> PyPy it is a bit faster than our previous PyPy implementations (only 3
>> >> times slower than the old C++ implementation).
>> >>
>> >> However, it is much slower than with Julia (while the code is very
>> >> similar). I coded a simplified version in Julia with nearly nothing else
>> >> than what can be written in pure Python (in particular, no `@inbounds`
>> >> and `@simd` macros). It seems to me that the comparison of these 2
>> >> versions could be interesting. So I again simplified these 2 versions to
>> >> keep only what is important for performance, which gives
>> >>
>> >> - https://github.com/paugier/nbabel/blob/master/py/microbench_pypy4.py
>> >> - https://github.com/paugier/nbabel/blob/master/py/microbench_ju4.jl
>> >>
>> >> The results are summarized in
>> >> https://github.com/paugier/nbabel/blob/master/py/microbench.md
>> >>
>> >> An important point is that with `Point3D` (a mutable class in Python and
>> >> an immutable struct in Julia), Julia is 3.6 times faster than PyPy. It is
>> >> the same code and nothing really fancy in Julia, so I guess that PyPy
>> >> might be missing some optimization opportunities. At least it would be
>> >> interesting to understand what is slower in PyPy (and why). I have to
>> >> admit that I don't know how to get useful information on timing and on
>> >> what is happening with the PyPy JIT in a particular case. I only used
>> >> cProfile and it is of course clearly not enough. I can run vmprof but I'm
>> >> not able to visualize the data because the website http://vmprof.com/ is
>> >> down. I don't know if I can trust the values given by IPython `%timeit`
>> >> for particular instructions since I don't know if the PyPy JIT does the
>> >> same thing in `%timeit` and in the function `compute_accelerations`.
>> >>
>> >> I also feel that I really miss in pure Python an efficient fixed-size
>> >> homogeneous mutable sequence (a "Vector" in Julia words) that can contain
>> >> basic numerical types (like Python's `array.array`) but also instances of
>> >> user-defined classes and instances of Vectors. The Python code uses a
>> >> [pure Python implementation based on a
>> >> list](https://github.com/paugier/nbabel/blob/master/py/vector.py). I
>> >> think it would be reasonable to have a good implementation, highly
>> >> compatible with PyPy (and potentially other Python implementations), in a
>> >> package on PyPI. It would really help to write PyPy-compatible numerical
>> >> codes. What would be the right tool to implement such a package? HPy? I
>> >> wonder whether we can get some speedup compared to the pure Python
>> >> version with lists. For very simple classes like `Point3D` and `Point4D`,
>> >> I wonder if the data could be stored contiguously in memory and if some
>> >> operations could be done without boxing/unboxing.
>> >>
>> >> However, I really don't know what is slower in PyPy / faster in Julia.
>> >>
>> >> I would be very interested to get the points of view of people who know
>> >> PyPy well.
>> >>
>> >> Pierre
_______________________________________________
pypy-dev mailing list
pypy-dev@python.org
https://mail.python.org/mailman/listinfo/pypy-dev
