Hi, tl;dr: I found a way to make CPython 3.6 faster, and I validated that there is no performance regression. I'm requesting approval from core developers to start pushing changes.
In 2014, during a lunch at PyCon, Larry Hastings told me that he would like to get rid of the temporary tuples used to call functions in Python. In Python, positional arguments are passed to C functions as a tuple: "PyObject *args". Larry wrote Argument Clinic, which gives more control over how C functions are called, but I guess he didn't have time to finish his implementation, since he never published a patch.

While trying to optimize CPython 3.6, I wrote a proof-of-concept patch, and the results were promising:

https://bugs.python.org/issue26814#msg264003
https://bugs.python.org/issue26814#msg266359

With my patch, C functions get a C array "PyObject **args, int nargs", so getting the nth argument becomes "arg = args[n];" at the C level. This format is not new: it is already used internally in Python/ceval.c. A Python function call made from another Python function already avoids a temporary tuple in most cases: we pass a slice of the first function's stack as the list of arguments to the second function. My patch generalizes this idea to C functions, and it works in all directions (C=>Python, Python=>C, C=>C, etc.).

With my full patch, many function calls become not only faster than Python 3.5, but even faster than Python 2.7! For multiple reasons (not interesting here), the tested functions are slower in Python 3.4 than in Python 2.7. Python 3.5 is better than Python 3.4, but still slower than Python 2.7 in a few cases. Using my "FASTCALL" patch, all tested function calls become faster than, or as fast as, Python 2.7!

But when I ran the CPython benchmark suite, I found some major performance regressions. In fact, it took me 3 months to understand that I wasn't running benchmarks correctly, and that most benchmarks in the CPython benchmark suite are very unstable. I wrote articles explaining how benchmarks should be run (to be stable), and I patched all benchmarks to use my new perf module, which runs benchmarks in multiple processes and computes the average (to make results more stable).
In the end, my minimal FASTCALL patch (issue #27128) doesn't show any major performance regression if you run the benchmarks "correctly" :-)

https://bugs.python.org/issue27128#msg272197

Most benchmarks are not significant; 14 are faster, and only 4 are slower. According to benchmarks of the "full" FASTCALL patch, the slowdowns are temporary and should quickly turn into speedups (with further changes).

My question is now: can I push fastcall-2.patch of issue #27128? This patch only adds the infrastructure to start working on more useful optimizations; more patches will come, and I expect more exciting benchmark results. For an overview of the initial FASTCALL patch, see my first message on the issue:

https://bugs.python.org/issue27128#msg266422

--

Note: My full FASTCALL patch changes the C API; this is out of the scope of my first simple FASTCALL patch. I will open a separate discussion to decide whether it's worth it and, if so, how it should be done.

Victor