----- Original Message -----
> From: "David Edelsohn" <dje....@gmail.com>
> To: "PIERRE AUGIER" <pierre.aug...@univ-grenoble-alpes.fr>
> Cc: "pypy-dev" <pypy-dev@python.org>
> Sent: Monday, 21 December 2020 23:47:22
> Subject: Re: [pypy-dev] Differences performance Julia / PyPy on very similar
> codes

> You did not state on exactly what system you are conducting the
> experiment, but "a factor of 4" seems very close to the
> auto-vectorization speedup of a vector of floats.

The problem is described in detail in the repository
https://github.com/paugier/nbabel and in the related issue
https://foss.heptapod.net/pypy/pypy/-/issues/3349

>> I think it would be very interesting to understand why PyPy is much slower
>> than Julia in this case (a factor 4 slower than very simple Julia). I'm
>> wondering if it is an issue of the language or a limitation of the
>> implementation.
>
> If the performance gap is caused by auto-vectorization, I would
> recommend that you consider using Numpy with Numba LLVM-based JIT. Or,
> for a "pure" Python solution, you can experiment with an older release
> of PyPy and NumPyPy.

There is already an implementation based on Numba (which is slower and,
from my point of view, less elegant than what can be done with
Transonic-Pythran). Here, it is really about what can be done with PyPy,
now and in the future.

About NumPyPy, I'm sorry about this story, but I'm not interested in
playing with an unsupported project.

> If the problem is the abstraction penalty, then the suggestion from
> Anto should help.

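For readers who do not want to open the scripts, here is a minimal sketch
of the two data layouts compared below. It is simplified and only
illustrative: `Point3DList` and its `_data` attribute are names assumed
here, not copied from microbench_pypy_list.py.

```python
class Point3D:
    """Point storing its coordinates as plain instance attributes."""

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

    def norm_square(self):
        return self.x**2 + self.y**2 + self.z**2


class Point3DList:
    """Hypothetical variant: coordinates kept in a list, read via @property."""

    def __init__(self, x, y, z):
        # Assumed storage layout; the real script may differ in details.
        self._data = [x, y, z]

    @property
    def x(self):
        return self._data[0]

    @property
    def y(self):
        return self._data[1]

    @property
    def z(self):
        return self._data[2]

    def norm_square(self):
        return self.x**2 + self.y**2 + self.z**2


if __name__ == "__main__":
    # Both layouts give the same result; only the storage differs.
    print(Point3D(1.0, 2.0, 2.0).norm_square())
    print(Point3DList(1.0, 2.0, 2.0).norm_square())
```

Both classes expose the same interface, so the same inner loop can be
timed against either one.
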
I tried to use a list to store the data but unfortunately, it's slower
(1.5 times slower than with attributes and 6 times slower than Julia on
my slow laptop):

Measurements with Julia
(https://github.com/paugier/nbabel/blob/master/py/microbench_ju4.jl):

pierre@voyage ~/Dev/nbabel/py master $ julia microbench_ju4.jl
Main.NB.MutablePoint3D 17.833 ms (1048576 allocations: 32.00 MiB)
Main.NB.Point3D 5.737 ms (0 allocations: 0 bytes)
Main.NB.Point4D 4.984 ms (0 allocations: 0 bytes)

Measurements with PyPy, objects with x, y, z attributes (like Julia,
https://github.com/paugier/nbabel/blob/master/py/microbench_pypy4.py):

pierre@voyage ~/Dev/nbabel/py master $ pypy microbench_pypy4.py
Point3D: 22.503 ms
Point4D: 45.127 ms

Measurements with PyPy, lists and @property
(https://github.com/paugier/nbabel/blob/master/py/microbench_pypy_list.py):

pierre@voyage ~/Dev/nbabel/py master $ pypy microbench_pypy_list.py
Point3D: 34.115 ms
Point4D: 59.646 ms

> But, for the question of why, you can examine the code for the inner
> loop generated by Julia and the code for the inner loop generated by
> PyPy and analyze the reason for the performance gap. It should be
> evident if the difference is abstraction or SIMD.

Sorry for this naive question, but how can I examine the code for the
inner loop generated by PyPy?

Pierre

>
> On Mon, Dec 21, 2020 at 5:20 PM PIERRE AUGIER
> <pierre.aug...@univ-grenoble-alpes.fr> wrote:
>>
>>
>> ----- Original Message -----
>> > From: "David Edelsohn" <dje....@gmail.com>
>> > To: "PIERRE AUGIER" <pierre.aug...@univ-grenoble-alpes.fr>
>> > Cc: "pypy-dev" <pypy-dev@python.org>
>> > Sent: Friday, 18 December 2020 21:00:42
>> > Subject: Re: [pypy-dev] Differences performance Julia / PyPy on very similar
>> > codes
>>
>> > Does Julia based on LLVM auto-vectorize the code? I assume yes
>> > because you specifically mention SIMD design of the data structure.
>>
>> Yes, Julia auto-vectorizes the code. Can't PyPy do the same in some cases?
>>
>> > Have you tried NumPyPy?
>> > Development on NumPyPy has not continued, but
>> > it probably would be a better comparison of what PyPy with
>> > auto-vectorization could accomplish to compare with Julia.
>>
>> I haven't tried NumPyPy because I can't import _numpypy with PyPy3.6.
>>
>> Anyway, for this experiment, my attempt was to stay in pure Python and to
>> compare with what is done in pure Julia.
>>
>> I think it would be very interesting to understand why PyPy is much slower
>> than Julia in this case (a factor 4 slower than very simple Julia). I'm
>> wondering if it is an issue of the language or a limitation of the
>> implementation.
>>
>> Moreover, I would really be interested to know if an extension compatible
>> with PyPy (better, not only compatible with PyPy) could be written to make
>> such code faster (a code involving an array of instances of a very simple
>> class). Could we gain anything compared to using a Python list?
>>
>> Are there some tools to understand what is done by PyPy to speed up some
>> code? Or to know more about the data structures used under the hood by
>> PyPy?
>>
>> For example,
>>
>> class Point3D:
>>     def __init__(self, x, y, z):
>>         self.x = x
>>         self.y = y
>>         self.z = z
>>
>>     def norm_square(self):
>>         return self.x**2 + self.y**2 + self.z**2
>>
>> I guess it would be good for efficiency to store the 3 floats as native
>> floats aligned in memory and to vectorize the power computation. How can
>> one know what is done by PyPy for a particular code?
>>
>> Pierre
>>
>> >
>> > Thanks, David
>> >
>> > On Fri, Dec 18, 2020 at 2:56 PM PIERRE AUGIER
>> > <pierre.aug...@univ-grenoble-alpes.fr> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I am posting on this list a message written in the PyPy issue tracker
>> >> (https://foss.heptapod.net/pypy/pypy/-/issues/3349#note_150255).
>> >> It is about some experiments I did on writing efficient implementations
>> >> of the NBody problem https://github.com/paugier/nbabel to potentially
>> >> answer this article https://arxiv.org/pdf/2009.11295.pdf.
>> >>
>> >> I got from a PR an [interesting optimized implementation in
>> >> Julia](https://github.com/paugier/nbabel/blob/master/julia/nbabel4_serial.jl).
>> >> It is very fast (even slightly faster than in Pythran). One idea is to
>> >> store the 3 floats of a 3d physical vector, (x, y, z), in a struct
>> >> `Point4D` containing 4 floats to better use SIMD instructions.
>> >>
>> >> I added a pure Python implementation inspired by this new Julia
>> >> implementation (but with a simple `Point3D` with 3 floats because with
>> >> PyPy, the `Point4D` does not make the code faster) and, good news, with
>> >> PyPy it is a bit faster than our previous PyPy implementations (only 3
>> >> times slower than the old C++ implementation).
>> >>
>> >> However, it is much slower than with Julia (while the code is very
>> >> similar). I coded a simplified version in Julia with nearly nothing else
>> >> than what can be written in pure Python (in particular, no `@inbounds`
>> >> and `@simd` macros). It seems to me that the comparison of these 2
>> >> versions could be interesting. So I again simplified these 2 versions to
>> >> keep only what is important for performance, which gives
>> >>
>> >> - https://github.com/paugier/nbabel/blob/master/py/microbench_pypy4.py
>> >> - https://github.com/paugier/nbabel/blob/master/py/microbench_ju4.jl
>> >>
>> >> The results are summarized in
>> >> https://github.com/paugier/nbabel/blob/master/py/microbench.md
>> >>
>> >> An important point is that with `Point3D` (a mutable class in Python and
>> >> an immutable struct in Julia), Julia is 3.6 times faster than PyPy.
>> >> Same code and nothing really fancy in Julia, so I guess that PyPy might
>> >> be missing some optimization opportunities. At least it would be
>> >> interesting to understand what is slower in PyPy (and why). I have to
>> >> admit that I don't know how to get interesting information on timing and
>> >> on what is happening with the PyPy JIT in a particular case. I only used
>> >> cProfile and it's of course clearly not enough. I can run vmprof but I'm
>> >> not able to visualize the data because the website http://vmprof.com/ is
>> >> down. I don't know if I can trust values given by IPython `%timeit` for
>> >> particular instructions since I don't know if the PyPy JIT does the same
>> >> thing in `%timeit` and in the function `compute_accelerations`.
>> >>
>> >> I also feel that I really miss in pure Python an efficient fixed-size
>> >> homogeneous mutable sequence (a "Vector" in Julia words) that can contain
>> >> basic numerical types (as Python `array.array`) but also instances of
>> >> user-defined classes and instances of Vectors. The Python code uses a
>> >> [pure Python implementation using a
>> >> list](https://github.com/paugier/nbabel/blob/master/py/vector.py). I
>> >> think it would be reasonable to have a good implementation highly
>> >> compatible with PyPy (and potentially other Python implementations) in a
>> >> package on PyPI. It would really help to write PyPy-compatible numerical
>> >> codes. What would be a good tool to implement such a package? HPy? I
>> >> wonder whether we can get some speedup compared to the pure Python
>> >> version with lists. For very simple classes like `Point3D` and
>> >> `Point4D`, I wonder if the data could be stored contiguously in memory
>> >> and if some operations could be done without boxing/unboxing.
>> >>
>> >> However, I really don't know what is slower in PyPy / faster in Julia.
>> >>
>> >> I would be very interested to get the points of view of people who know
>> >> PyPy well.
>> >>
>> >> Pierre
_______________________________________________
pypy-dev mailing list
pypy-dev@python.org
https://mail.python.org/mailman/listinfo/pypy-dev