You missed the point of the PEP: "It becomes possible to experiment
with more advanced optimizations in CPython than just
micro-optimizations, like tagged pointers."

IMHO it's time to stop wasting our limited developer resources on
micro-optimizations and micro-benchmarks, and instead think about
overall Python performance and a major redesign of Python internals,
to find a way to make Python 2x faster overall rather than making a
specific function 10% faster.

I don't think that accessing namedtuple attributes is a known
bottleneck in overall Python performance.


On Mon, Jun 29, 2020 at 11:37 PM Raymond Hettinger
<raymond.hettin...@gmail.com> wrote:
> $ python3.8 -m timeit -s 'from collections import namedtuple' -s 
> 'Point=namedtuple("Point", "x y")'  -s 'p=Point(10,20)' 'p.x; p.y; p.x; p.y; 
> p.x; p.y'
> 2000000 loops, best of 5: 119 nsec per loop
>
> $ python3.9 -m timeit -s 'from collections import namedtuple' -s 
> 'Point=namedtuple("Point", "x y")'  -s 'p=Point(10,20)' 'p.x; p.y; p.x; p.y; 
> p.x; p.y'
> 2000000 loops, best of 5: 152 nsec per loop

Measuring benchmarks that take less than 1 second requires great
care. For a microbenchmark that runs in around 100 ns like this one,
you are very close to the CPU limit and "everything" becomes
important.
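To make that spread concrete, here is a small sketch (not from the thread) using timeit.repeat on the same statement; the repeat and iteration counts are arbitrary choices for illustration:

```python
# Sketch: repeating a ~100 ns microbenchmark shows run-to-run spread
# that can easily rival the 3.8-vs-3.9 difference being discussed.
import timeit

setup = (
    "from collections import namedtuple\n"
    'Point = namedtuple("Point", "x y")\n'
    "p = Point(10, 20)"
)
stmt = "p.x; p.y; p.x; p.y; p.x; p.y"

# Five independent runs of 200_000 iterations each (arbitrary counts).
runs = timeit.repeat(stmt, setup=setup, repeat=5, number=200_000)
per_loop_ns = [r / 200_000 * 1e9 for r in runs]

spread = max(per_loop_ns) - min(per_loop_ns)
print(f"best: {min(per_loop_ns):.0f} ns, spread: {spread:.1f} ns")
```

Tools like pyperf do this kind of repetition (and more, such as spawning fresh processes and reporting the standard deviation) for you.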

Python performance depends on the C compiler, on compiler options, on
how you run the microbenchmark, on whether --enable-shared is used,
etc. Giving microbenchmark results without this information isn't
helpful.

On Fedora 32, Python binaries are built by GCC with Link Time
Optimization (LTO) and Profile Guided Optimization (PGO). I get
essentially the same performance between Python 3.8.3 and Python
3.9.0b3:

$ python3.9 -m pyperf timeit --compare-to=python3.8 -s 'from
collections import namedtuple' -s 'Point=namedtuple("Point", "x y")'
-s 'p=Point(10,20)' 'p.x; p.y; p.x; p.y; p.x; p.y'
python3.8: ..................... 138 ns +- 2 ns
python3.9: ..................... 136 ns +- 3 ns

Mean +- std dev: [python3.8] 138 ns +- 2 ns -> [python3.9] 136 ns +- 3
ns: 1.01x faster (-1%)

(A difference smaller than 10% on a microbenchmark is not significant.)

The compiler decides whether or not to inline a static inline
function based on many complex factors. I don't think that there is
any need to elaborate here.

The idea of forcing inlining was discussed but rejected when the
first C API macros were converted to static inline functions:
https://bugs.python.org/issue35059

C compilers are now really good at emitting efficient machine code.

By the way, if you configure Python with --enable-shared, function
calls from libpython to libpython have to go through a procedure
linkage table (PLT) indirection. Python 3.8 and 3.9 on Fedora 32, and
Python 3.8 on RHEL 8, are built with -fno-semantic-interposition to
avoid this indirection and so make Python faster. More about this
linker flag:
https://developers.redhat.com/blog/2020/06/25/red-hat-enterprise-linux-8-2-brings-faster-python-3-8-run-speeds/
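For reference, a build combining these options might look like the
following sketch (the actual Fedora/RHEL build flags are more
extensive than this):

```shell
# Hypothetical configure invocation combining the options discussed:
# shared libpython, PGO (--enable-optimizations) plus LTO, and
# -fno-semantic-interposition to avoid PLT indirection for
# intra-libpython calls.
./configure \
    --enable-shared \
    --enable-optimizations \
    --with-lto \
    CFLAGS="-fno-semantic-interposition" \
    LDFLAGS="-fno-semantic-interposition"
make -j"$(nproc)"
```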

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/5AAO45Y276AS5EZDDKTRP6QZ6K5SOOO6/
Code of Conduct: http://python.org/psf/codeofconduct/