New submission from Dave Malcolm <[email protected]>:
In msg#120541 of issue#1346238 Raymond suggested to "aim high", so here goes...
I'm opening this as a separate bug as it's a very different approach to the
patches in that bug; adding those on the nosy list for that bug. Sorry in
advance about the epic length of this comment.
I've been experimenting with AST optimizations, and have a (very) primitive
implementation of function-call inlining. As is, it's deeply flawed, but in a
spirit of "Release Early, Release Often" I though I'd post what I have so far,
and try to enumerate the rest of the work that would need doing to get this
into, say Python 3.3
The attached patch adds an AST optimization pass to Python/compiler.c. It does
this by adding an __optimizer__.py module: the compiler converts the AST to a
Python representation using ast2mod, and calls out to
__optimizer__.optimize_ast(). This can then (potentially) apply a series of
manipulations to the tree. The result is then converted back from python to
ast objects, and compilation proceeds as normal on the modified AST tree.
I initially was experimenting with a .c implementation, but moving to .py makes
things _much_ easier to develop and debug. In particular, I'm using graphviz's
"dot" tool to generate before/after visualizations of the AST.
The optimizer doesn't try to optimize itself (or anything that it imports), to
avoid infinite recursion when we have a cold .pyo cache.
Currently I'm doing the AST optimization before symbol table generation. This
means that the inlining is deeply flawed, as it has no knowledge of the scope
of names. A robust implementation would compare the scopes of the callsite and
that of the function body, and remap locals accordingly. The current
implementation renames all name references in the function body, which is
clearly wrong for e.g. references to globals. See below for notes on that.
Here's my test code::
def function_to_be_inlined(x, y, z):
return (2 * x * y) + z
print(function_to_be_inlined(3, 4, 5))
Here's what it compiles to after going through the inliner (clearly, line
numbering needs some work). Note the removal of the CALL_FUNCTION of our
target call site; the remaining CALL_FUNCTION is to "print":
2 0 LOAD_CONST 0 (<code object function_to_be_inlined
at 0x1e9b998, file "<dis>", line 2>)
3 MAKE_FUNCTION 0
6 STORE_NAME 0 (function_to_be_inlined)
4 9 LOAD_CONST 1 (3)
12 STORE_NAME 1 (__inline1f22840__x)
15 LOAD_CONST 2 (4)
18 STORE_NAME 2 (__inline1f22840__y)
21 LOAD_CONST 3 (5)
24 STORE_NAME 3 (__inline1f22840__z)
259 27 LOAD_CONST 4 (2)
30 LOAD_NAME 1 (__inline1f22840__x)
33 BINARY_MULTIPLY
34 LOAD_NAME 2 (__inline1f22840__y)
37 BINARY_MULTIPLY
38 LOAD_NAME 3 (__inline1f22840__z)
41 BINARY_ADD
42 STORE_NAME 4 (__inline1f22840____returnval__)
260 45 LOAD_NAME 5 (print)
48 LOAD_NAME 4 (__inline1f22840____returnval__)
51 CALL_FUNCTION 1
54 POP_TOP
55 LOAD_CONST 5 (None)
58 RETURN_VALUE
The idea is that a further optimization pass would go through and ellide the
unnecessary store-to-local/load-from-local instructions, followed by const
folding, getting this down to:
LOAD_CONST (<code object>)
MAKE_FUNCTION
STORE_NAME (function_to_be_inlined)
; inlined callsite:
LOAD_NAME (print)
LOAD_CONST (29)
CALL_FUNCTION 1
The biggest issue here is dealing with runtime differences between which
function was inlined, and which function is being called, either via
monkeypatching, or in method calls - we can inline intra-method calls within
one class, but we have to cope with overriding of those methods in subclasses.
Thinking aloud of a way of solving this (this isn't implemented yet):
add to AST:
Specialize(expr name,
stmt* specialized,
stmt* generalized)
where you have some interpretation of name (e.g. "self.do_something"), and
carry two different implementations.
so that e.g.
class Something:
def foo(self, x, y, z):
... # some code
self.bar(x, y, z)
... # more code
the:
Call(Attribute(Name(id='self'), id='bar'), args=[etc])
becomes
Specialize(name=Attribute(Name(id='self'), id='bar'),
specialized=[inlined implementation of Something.bar for
"self"],
generalized=[Call(Attribute(Name(id='self'), id='bar'),
args=[etc])])
Similarly, would need a new bytecode, say "JUMP_IF_SPECIALIZABLE"
LOAD_NAME (self)
GET_ATTR ('bar')
; method self.bar is now on top of stack
LOAD_CONST (<code object for the regular implementation of "bar">)
JUMP_IF_SPECIALIZABLE -> label_inline_body
; Start of non-inlined call
eval and push args
CALL_FUNCTION
JUMP_ABSOLUTE -> label_after
; ...end of non-inlined call; return value is top-of-stack
label_inline_body:
eval args (potentially optimizing with knowledge of what's to come)
inlined call
push "return value"
; ...end of inlined call; "return value" is top-of-stack
label_after:
; etc
JUMP_IF_SPECIALIZABLE
Inputs:
TOS : A: code object we inlined for (via a LOAD_CONST, injected by the
inliner)
TOS -1 : B: function object via runtime lookup (LOAD_GLOBAL, or e.g. LOAD_LOCAL
"self"; GET_ATTR "do_something")
Action:
if B's __code__ is A, we can specialize:
PC += oparg
POP
else:
PC += 1
POP
POP
I'm not sure if this covers all cases (e.g. specializing a CFunction such as
the builtins), or if this a sane approach that actually works; need to have a
standard interpretation of what a name lookup means, and to inject LOAD_CONST
of that value into the bytecode, for use by the JUMP_IF_SPECIALIZABLE operation.
As I understand it, functions are "name-resolved" before the arguments are
evaluated, so if argument evaluation leads to the name being bound to a
different function (as a side effect), the status quo in CPython currently is
that the old function is used for that call; JUMP_IF_SPECIALIZABLE would
preserve that behavior.
It could also slightly slow down overridden method calls, as it would add a
test for "am I overridden" that would generally take the "overridden" branch;
perhaps we could detect classes with subclasses in the same AST and suppress
inlining for this case. Similarly, it could spot the
def override_me(self):
raise NotImplementedError
pattern and not try to inline.
Other issues:
- Somehow need to tie together symbol tables with ast2mod. One approach: do
the symtable generation first, then adjust ast2mod so that it adds
PySymtableEntry references to the generated python tree as appopriate. This
would expose the name scope structure. But after the tree changes, do we need
to regenerate the symbol table? Or we could annotate the python ast.Name
objects with scope information, and the ast optimizer would be responsible for
keeping this correct. The compiler could then rerun the symtable analysis on
the returned AST, or could trust the optimizer's info.
- What to do with the "locals()" builtin? Inlined functions could save a
copy of locals with duplicate names, keeping their existing names for the
locals in the inlined copy. This would lead to "locals()" having additional
items when called from an inlined function.
- Many optimizations would have to assume that locals don't change value.
Can we assume that in an optimized build that it's acceptable that code that
use the inspect module and manipulates the f_locals of a frame may have
surprising behavior in the face of optimization? (e.g. I'd like to be able to
optimize away locals if we can prove that absence of side-effects).
- PyAST_obj2mod enforces checking for a particular top-level element. To
cope with this, I needed to pass this around in a few places, so I introduced
"enum PyCompilationMode" to avoid having the magic 0, 1, or 2 to designate the
grammar variant. This better distinguishes these values from other places
which use the ID of the start token: semantically equivalent, but with
different constants. I haven't yet fixed this up in Modules/parsermodule.c
- Plenty of FIXMEs in the code, many of which are in themselves major issues
- I'm not yet checking the optimization flag; that should probably be
required to be on for this to be called.
- Is "__optimizer__" a sane name? perhaps just "optimizer"
- Too unstable to benchmark as yet.
- There are probably other issues I haven't thought of yet.
Notes:
- the patch also contains a fix for issue 10391 (error handling is important
when manipulating the AST)
- potentially could add an optimize_cfg() hook, and expose the CFG struct
basicblock and struct instr as Python objects in a similar way, as a later
optimization pass
Thanks to Red Hat for giving me time to experiment on this
----------
components: Interpreter Core
files: py3k-ast-optimization-inlining-2010-11-12-001.patch
keywords: patch
messages: 121068
nosy: alex, belopolsky, benjamin.peterson, dmalcolm, jhylton, nnorwitz,
rhettinger, sdahlbac, thomas.lee, titanstar
priority: normal
severity: normal
status: open
title: AST Optimization: inlining of function calls
type: performance
versions: Python 3.3
Added file:
http://bugs.python.org/file19583/py3k-ast-optimization-inlining-2010-11-12-001.patch
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue10399>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe:
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com