On 17/05/2015 13:25, Jonas Wielicki wrote:
> On 16.05.2015 02:55, Gregory Ewing wrote:
>> BartC wrote:
>>> For example, there is a /specific/ byte-code called BINARY_ADD, which
>>> then proceeds to call a /generic/ binary-op handler! This throws away
>>> the advantage of knowing at byte-code generation time exactly which
>>> operation is needed.
>>
>> While inlining the binary-op handling might give you a
>> slightly shorter code path, it wouldn't necessarily speed
>> anything up. It's possible, for example, that the shared
>> binary-op handler fits in the instruction cache, but the
>> various inlined copies of it don't, leading to a slowdown.
>>
>> The only way to be sure about things like that is to try
>> them and measure. The days when you could predict the speed
>> of a program just by counting the number of instructions
>> executed are long gone.
> That, and the days when you could guess the number of instructions
> executed just by looking at the code are gone too. Compilers,
> especially C and C++ compilers, are huge beasts with an enormous
> number of different optimizations which yield pretty impressive
> results. Not to mention that they may know the architecture you're
> targeting and can optimize each build for a different architecture;
> which is not really possible by hand if your optimizations rely on,
> e.g., cache characteristics, instruction timings or interactions.
>
> I changed my habits a few years ago to just trust my compiler, and
> got more readable code in exchange. The compiler does a fairly
> great job, although gcc still outruns clang for *my* use cases.
>
> YMMV.
It does. For my interpreter projects, gcc -O3 does a pretty good job.
For running a suite of standard benchmarks ('spectral', 'fannkuch',
'binary-tree', all that lot) in the bytecode language under test, gcc
is 30% faster than my own language/compiler (and 25% faster than
clang).
(In that project, gcc can do a lot of inlining, which doesn't seem to be
practical in CPython as functions are all over the place.)
However, when I plug in an ASM dispatcher to my version (which tries to
deal with simple bytecodes and some common object types before passing
control to the HLL code), I can get /twice as fast/ as gcc -O3. (For
real programs the difference is narrower, but usually still faster than
gcc.)
(I don't think this approach will work with CPython, because there
don't appear to be any simple cases for ASM to deal with! The ASM
dispatcher keeps essential globals such as the stack pointer and
program counter in registers, and uses chained 'threaded' code rather
than function calls. A sufficient proportion of byte-codes needs to be
handled in this environment, otherwise it can actually slow things
down, as the switch to/from HLL code is expensive.)
--
Bartc