You did not state exactly which system you are running the experiment on, but "a factor of 4" seems very close to the speedup expected from auto-vectorization over a vector of floats.
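
As a rough illustration of where a factor of 4 could come from: the number of SIMD lanes is the register width divided by the element width. The sketch below assumes two hypothetical configurations (128-bit registers with 32-bit floats, 256-bit registers with 64-bit floats), since the actual hardware and float width were not stated. The helper name `simd_lanes` is purely illustrative and not part of any library.

```python
def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one SIMD register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide register width")
    return register_bits // element_bits


# Assumed configurations; the thread does not say which one applies.
lanes_128_f32 = simd_lanes(128, 32)  # 128-bit register, single precision
lanes_256_f64 = simd_lanes(256, 64)  # 256-bit register, double precision
print(lanes_128_f32, lanes_256_f64)
```

Both assumed configurations give 4 lanes, which is only an upper bound on the vectorization speedup; memory traffic and non-vectorized work usually reduce it.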

> I think it would be very interesting to understand why PyPy is much slower
> than Julia in this case (a factor 4 slower than very simple Julia). I'm
> wondering if it is an issue of the language or a limitation of the
> implementation.

If the performance gap is caused by auto-vectorization, I would
recommend that you consider NumPy with Numba's LLVM-based JIT. Or,
for a "pure" Python solution, you can experiment with an older release
of PyPy and NumPyPy.

If the problem is the abstraction penalty, then the suggestion from
Anto should help.

But, for the question of why, you can examine the code for the inner
loop generated by Julia and the code for the inner loop generated by
PyPy and analyze the reason for the performance gap. It should be
evident whether the difference is abstraction or SIMD.

Thanks, David

On Mon, Dec 21, 2020 at 5:20 PM PIERRE AUGIER
<pierre.aug...@univ-grenoble-alpes.fr> wrote:
>
> ----- Original message -----
> > From: "David Edelsohn" <dje....@gmail.com>
> > To: "PIERRE AUGIER" <pierre.aug...@univ-grenoble-alpes.fr>
> > Cc: "pypy-dev" <pypy-dev@python.org>
> > Sent: Friday, 18 December 2020 21:00:42
> > Subject: Re: [pypy-dev] Differences performance Julia / PyPy on very similar codes
>
> > Does Julia based on LLVM auto-vectorize the code? I assume yes
> > because you specifically mention SIMD design of the data structure.
>
> Yes, Julia auto-vectorizes the code. Can't PyPy do the same in some cases?
>
> > Have you tried NumPyPy? Development on NumPyPy has not continued, but
> > it probably would be a better comparison of what PyPy with
> > auto-vectorization could accomplish to compare with Julia.
>
> I haven't tried NumPyPy because I can't import _numpypy with PyPy3.6.
>
> Anyway, for this experiment, my aim was to stay in pure Python and to
> compare with what is done in pure Julia.
>
> I think it would be very interesting to understand why PyPy is much slower
> than Julia in this case (a factor 4 slower than very simple Julia).
> I'm wondering if it is an issue of the language or a limitation of the
> implementation.
>
> Moreover, I would really be interested to know if an extension compatible
> with PyPy (better, not only compatible with PyPy) could be written to make
> such code faster (a code involving an array of instances of a very simple
> class). Could we gain anything compared to using a Python list?
>
> Are there some tools to understand what is done by PyPy to speed up some
> code? Or to know more about the data structures used under the hood by PyPy?
>
> For example,
>
> class Point3D:
>     def __init__(self, x, y, z):
>         self.x = x
>         self.y = y
>         self.z = z
>
>     def norm_square(self):
>         return self.x**2 + self.y**2 + self.z**2
>
> I guess it would be good for efficiency to store the 3 floats as native
> floats aligned in memory and to vectorize the power computation. How can one
> know what is done by PyPy for a particular code?
>
> Pierre
>
> > Thanks, David
> >
> > On Fri, Dec 18, 2020 at 2:56 PM PIERRE AUGIER
> > <pierre.aug...@univ-grenoble-alpes.fr> wrote:
> >>
> >> Hi,
> >>
> >> I am posting to this list a message I wrote in the PyPy issue tracker
> >> (https://foss.heptapod.net/pypy/pypy/-/issues/3349#note_150255). It is
> >> about some experiments I did on writing efficient implementations of the
> >> NBody problem https://github.com/paugier/nbabel to potentially respond to
> >> this article https://arxiv.org/pdf/2009.11295.pdf.
> >>
> >> I got from a PR an [interesting optimized implementation in
> >> Julia](https://github.com/paugier/nbabel/blob/master/julia/nbabel4_serial.jl).
> >> It is very fast (even slightly faster than in Pythran). One idea is to
> >> store the 3 floats of a 3d physical vector, (x, y, z), in a struct
> >> `Point4D` containing 4 floats to better use simd instructions.
> >>
> >> I added a pure Python implementation inspired by this new Julia
> >> implementation (but with a simple `Point3D` with 3 floats because with
> >> PyPy, the `Point4D` does not make the code faster) and, good news, with
> >> PyPy it is a bit faster than our previous PyPy implementations (only 3
> >> times slower than the old C++ implementation).
> >>
> >> However, it is much slower than with Julia (while the code is very
> >> similar). I coded a simplified version in Julia with nearly nothing else
> >> than what can be written in pure Python (in particular, no `@inbounds`
> >> and `@simd` macros). It seems to me that the comparison of these 2
> >> versions could be interesting. So I again simplified these 2 versions to
> >> keep only what is important for performance, which gives
> >>
> >> - https://github.com/paugier/nbabel/blob/master/py/microbench_pypy4.py
> >> - https://github.com/paugier/nbabel/blob/master/py/microbench_ju4.jl
> >>
> >> The results are summarized in
> >> https://github.com/paugier/nbabel/blob/master/py/microbench.md
> >>
> >> An important point is that with `Point3D` (a mutable class in Python and
> >> an immutable struct in Julia), Julia is 3.6 times faster than PyPy. Same
> >> code and nothing really fancy in Julia so I guess that PyPy might be
> >> missing some optimization opportunities. At least it would be interesting
> >> to understand what is slower in PyPy (and why). I have to admit that I
> >> don't know how to get interesting information on timing and what is
> >> happening with PyPy JIT in a particular case. I only used cProfile and
> >> it's of course clearly not enough. I can run vmprof but I'm not able to
> >> visualize the data because the website http://vmprof.com/ is down.
> >> I don't know if I can trust values given by IPython `%timeit` for
> >> particular instructions since I don't know if PyPy JIT does the same
> >> thing in `%timeit` and in the function `compute_accelerations`.
> >>
> >> I also feel that I really miss in pure Python an efficient fixed size
> >> homogeneous mutable sequence (a "Vector" in Julia words) that can contain
> >> basic numerical types (as Python `array.array`) but also instances of
> >> user-defined classes and instances of Vectors. The Python code uses a
> >> [pure Python implementation using a
> >> list](https://github.com/paugier/nbabel/blob/master/py/vector.py). I
> >> think it would be reasonable to have a good implementation highly
> >> compatible with PyPy (and potentially other Python implementations) in a
> >> package on PyPI. It would really help to write PyPy compatible numerical
> >> codes. What would be the right tool to implement such a package? HPy? I
> >> wonder whether we can get some speedup compared to the pure Python
> >> version with lists. For very simple classes like `Point3D` and `Point4D`,
> >> I wonder if the data could be stored contiguously in memory and if some
> >> operations could be done without boxing/unboxing.
> >>
> >> However, I really don't know what is slower in PyPy / faster in Julia.
> >>
> >> I would be very interested to get the points of view of people who know
> >> PyPy well.
> >>
> >> Pierre
> >> _______________________________________________
> >> pypy-dev mailing list
> >> pypy-dev@python.org
> >> https://mail.python.org/mailman/listinfo/pypy-dev
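
For readers who want to time the `Point3D` pattern from the thread in isolation, here is a minimal standard-library sketch. It is not the `microbench_pypy4.py` script linked above; the function names `sum_norm_square` and `bench`, the problem size, and the repeat count are all assumptions chosen for illustration. Repeating the pass many times and keeping the best time is one simple way to let a JIT warm up before measuring.

```python
import time


class Point3D:
    """Same shape as the class quoted in the thread."""

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

    def norm_square(self):
        return self.x**2 + self.y**2 + self.z**2


def sum_norm_square(points):
    """Inner loop to inspect: one method call per list element."""
    total = 0.0
    for p in points:
        total += p.norm_square()
    return total


def bench(n_points=1000, n_repeats=200):
    """Return (result, best seconds per pass) for the loop above."""
    points = [Point3D(float(i), 1.0, 2.0) for i in range(n_points)]
    best = float("inf")
    result = 0.0
    for _ in range(n_repeats):  # repeated passes give a JIT time to warm up
        start = time.perf_counter()
        result = sum_norm_square(points)
        best = min(best, time.perf_counter() - start)
    return result, best


if __name__ == "__main__":
    result, seconds = bench()
    print(f"sum = {result}, best pass = {seconds:.3e} s")
```

To look at what the JIT produced for this loop, PyPy's `PYPYLOG` environment variable can be used, for example `PYPYLOG=jit-log-opt:log.txt pypy3 bench.py`, where `bench.py` is a hypothetical file name for the script above; check the PyPy documentation for the exact log categories supported by your version.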