tl;dr Python 3.7 is going to be faster without breaking backward
compatibility: say hello to the new "tp_fastcall" slot!
Python 3.6 got a new "FASTCALL" calling convention which avoids
creating a temporary tuple to pass positional arguments and a
temporary dictionary to pass keyword arguments. But callable objects
having a __call__() method implemented in Python don't benefit from it.
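To make the cost difference concrete, here is a minimal sketch in
plain C. It is a toy model, not CPython code: toy_tuple,
sum_from_tuple(), sum_fastcall() and call_old_style() are hypothetical
names invented for this illustration, and arguments are plain long
values rather than Python objects.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a Python tuple: a heap-allocated, counted array. */
typedef struct {
    size_t size;
    long *items;
} toy_tuple;

/* Old-style callee: receives its positional arguments in a tuple. */
static long sum_from_tuple(const toy_tuple *args)
{
    long total = 0;
    for (size_t i = 0; i < args->size; i++)
        total += args->items[i];
    return total;
}

/* FASTCALL-style callee: receives a plain C array plus a count. */
static long sum_fastcall(const long *args, size_t nargs)
{
    long total = 0;
    for (size_t i = 0; i < nargs; i++)
        total += args[i];
    return total;
}

/* Old-style call site: allocate a temporary tuple, copy the arguments
 * into it, call, then free it. Returns 0 on success, -1 if the
 * allocation fails. */
static int call_old_style(const long *args, size_t nargs, long *result)
{
    toy_tuple tmp;
    tmp.size = nargs;
    tmp.items = malloc(nargs * sizeof *tmp.items);
    if (tmp.items == NULL && nargs > 0)
        return -1;
    if (nargs > 0)
        memcpy(tmp.items, args, nargs * sizeof *tmp.items);
    *result = sum_from_tuple(&tmp);
    free(tmp.items);
    return 0;
}

/* A FASTCALL-style call site is just: sum_fastcall(args, nargs).
 * No allocation, no copy, no failure path. */
```

The old-style call site pays for an allocation, a copy and a free on
every call, while the FASTCALL-style call site passes the caller's
array through untouched.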
I tried to reuse the tp_call slot with a new flag in tp_flags, but I
ran into two major blockers:
* It deeply breaks the backward compatibility of the C API: calling
tp_call directly (with a tuple and a dict) would crash immediately if
the object uses FASTCALL.
* Each "tp_call" function has to be duplicated to get a new
"tp_fastcall" flavor. It wasn't easy to share the function body.
Good news: I found a new design which doesn't have any of these issues!
I chose to add a new tp_fastcall field to PyTypeObject and use a tiny
wrapper calling tp_fastcall for tp_call, to keep backward
compatibility.
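The shape of this design can be sketched with a toy type object in
plain C. This is only an illustration under stated assumptions, not
the actual patch: toy_type, toy_object, toy_tuple, toy_call_wrapper()
and adder_fastcall() are hypothetical names, arguments are long values
instead of Python objects, and keyword arguments are left out.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for PyTypeObject and PyObject. */
typedef struct toy_type toy_type;

typedef struct {
    const toy_type *type;
    long value;
} toy_object;

/* Toy tuple: a counted array of arguments. */
typedef struct {
    size_t size;
    const long *items;
} toy_tuple;

typedef long (*toy_call_func)(toy_object *self, const toy_tuple *args);
typedef long (*toy_fastcall_func)(toy_object *self,
                                  const long *args, size_t nargs);

struct toy_type {
    toy_call_func tp_call;          /* legacy slot: takes a tuple */
    toy_fastcall_func tp_fastcall;  /* new slot: takes array + count */
};

/* Generic wrapper stored in tp_call. It forwards the tuple's item
 * array to tp_fastcall, so legacy callers keep working. */
static long toy_call_wrapper(toy_object *self, const toy_tuple *args)
{
    return self->type->tp_fastcall(self, args->items, args->size);
}

/* A callable implemented once, in the FASTCALL flavor only:
 * returns self->value plus the sum of the arguments. */
static long adder_fastcall(toy_object *self,
                           const long *args, size_t nargs)
{
    long total = self->value;
    for (size_t i = 0; i < nargs; i++)
        total += args[i];
    return total;
}

/* The type fills both slots, but only one function body exists. */
static const toy_type adder_type = { toy_call_wrapper, adder_fastcall };
```

In this sketch both blockers go away: code that calls tp_call with a
tuple still gets a valid function, and the callable's body is written
only once. The wrapper does not copy anything because the tuple's
items are already a contiguous array.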
The goal is to get optimizations "for free" when calling functions.
The best expected speedup on a microbenchmark is around 1.56x faster
(-36%) when calling an object supporting FASTCALL. Example with
property_descr_get() without its "cached args" hack, result without
fastcall ("py34") compared to fastcall ("fastcall_wrapper"):
Median +- std dev: [py34] 75.0 ns +- 1.7 ns -> [fastcall_wrapper] 48.2
ns +- 1.5 ns: 1.56x faster (-36%)
But please don't expect such a large speedup on macro-benchmarks.
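For readers who want to check how the two figures in the benchmark
line relate, here is a small sketch of the arithmetic. The helper
names speedup_factor() and percent_change() are made up for this
example; they are not taken from any benchmark tool.

```c
/* Speedup factor: how many times faster the new timing is. */
static double speedup_factor(double old_ns, double new_ns)
{
    return old_ns / new_ns;
}

/* Relative change of the timing in percent (negative means faster). */
static double percent_change(double old_ns, double new_ns)
{
    return (new_ns - old_ns) / old_ns * 100.0;
}
```

With the medians above, speedup_factor(75.0, 48.2) is about 1.56 and
percent_change(75.0, 48.2) is about -36, which matches the reported
"1.56x faster (-36%)".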
tp_fastcall makes it possible to remove the "cached args" optimization
used in various parts of the Python core, an old optimization used in
performance-critical code. This hack causes various kinds of complex
bugs in corner cases, which can lead to a crash in the worst case.
The patch to support tp_fastcall is tiny, but you should expect a long
list of tiny changes to replace tp_call with tp_fastcall in various
places.
Final bonus point: existing code (calling functions) doesn't need to
be modified (nor recompiled) to get a speedup. Even if tp_call is
called directly, fastcall will provide a speedup, but only if it is
called with positional arguments only.
About the tp_call wrapper: keyword arguments require converting a
Python dictionary to a C array, which might be more expensive. I didn't
try to measure the performance, since this case is very rare. Almost
no C code calls functions with keyword arguments, simply because
passing keyword arguments is much more complex: it requires too much C
code (and it's not simpler with fastcall, sorry).
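To show why the keyword case costs more in the wrapper, here is a toy
sketch of the dictionary-to-array conversion in plain C. Everything in
it is an assumption made for illustration: toy_kwarg, toy_flat_args,
toy_flatten() and toy_flat_free() are hypothetical names, and the
layout (keyword values appended after the positional ones, with the
names in a parallel array) may differ in detail from what CPython does.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for one entry of a keyword dictionary. */
typedef struct {
    const char *name;
    long value;
} toy_kwarg;

/* Flattened arguments: positional values followed by keyword values
 * in one array, plus a parallel array of keyword names. */
typedef struct {
    long *stack;           /* nargs positional, then nkw keyword values */
    const char **kwnames;  /* kwnames[i] names stack[nargs + i] */
    size_t nargs;
    size_t nkw;
} toy_flat_args;

/* Build the flat layout. Returns 0 on success, -1 on allocation
 * failure. The caller releases the result with toy_flat_free(). */
static int toy_flatten(const long *args, size_t nargs,
                       const toy_kwarg *kwargs, size_t nkw,
                       toy_flat_args *out)
{
    size_t total = nargs + nkw;

    /* Allocate at least one element so a NULL result always means
     * failure, even when there are no arguments. */
    out->stack = malloc((total ? total : 1) * sizeof *out->stack);
    out->kwnames = malloc((nkw ? nkw : 1) * sizeof *out->kwnames);
    if (out->stack == NULL || out->kwnames == NULL) {
        free(out->stack);
        free(out->kwnames);
        return -1;
    }
    for (size_t i = 0; i < nargs; i++)
        out->stack[i] = args[i];
    for (size_t i = 0; i < nkw; i++) {
        out->stack[nargs + i] = kwargs[i].value;
        out->kwnames[i] = kwargs[i].name;
    }
    out->nargs = nargs;
    out->nkw = nkw;
    return 0;
}

static void toy_flat_free(toy_flat_args *flat)
{
    free(flat->stack);
    free(flat->kwnames);
}
```

Unlike the positional-only path, this conversion has to allocate and
copy, which is the extra cost mentioned above. It only runs when the
caller actually passes keyword arguments.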
Python-Dev mailing list