For the last few days, ideas, Ahas, and progress have come so fast that I
have not had time to document them properly. Now is the time. A reminder:
all work is being done in the fstringify branch.
*Short summary*: There have been big Aha's, the code keeps getting simpler,
and further collapses in complexity beckon. The testing framework continues
to evolve. I won't speculate on schedule. Any "delays" have been worthwhile.
This is a long post. Feel free to skim or skip entirely.
The rest of this post discusses status and documents important design and
coding details. It will be pre-writing for a revised Theory of Operation,
now in LeoDocs.leo.
*Status, visualized*
I have spent a lot of time on testing and visualization tools. The
following are the untouched results from a unit test in unitTest.leo. This
test is a proof of concept for a tree-based fstringify.
The 'contents' dump shows the incoming source string, the 'tree' dump shows
the resulting (annotated) tree.
Contents...
1 """DS."""
2 print('test %s=%s'%(a, 2))
3 print(f"test {a}={2}")
Patched tree...
parent lines node tokens
====== ===== ==== ======
0 Module
0 Module 1 Expr
1 Expr 0 2 Str: s='DS.' string(DS.)
0 Module 3 Expr
3 Expr 4 Call
4 Call 1..2 5 Name: id='print' string("""DS.""") newline(1:10) name(print)
4 Call 2 6 BinOp: % op( string('test %s=%s') op%
6 BinOp 0 7 Str: s='test %s=%s' string(test %s=%s)
6 BinOp 8 Tuple
8 Tuple 2 9 Name: id='a' op( name(a)
8 Tuple 2 10 Num: n=2 op, number(2)
0 Module 11 Expr
11 Expr 12 Call
12 Call 2..3 13 Name: id='print' op) op) newline(2:27) name(print)
12 Call 14 JoinedStr
*Note*: parentheses are in the wrong place. Fixing should be
straightforward, if not pretty ;-)
*About the results array*
An earlier post was ambivalent about the results array. Would it ever be
useful on its own? The answer is now completely clear:
*The results array is for internal use only. It exists only to transfer
data between the tree and the incoming tokens list.*
Indeed, the name 'results' is a bit of a misnomer, but I can't think of a
better name.
The immediate consequence:
*Visitors should not add non-syncing (insignificant) tokens to the
results array.*
This greatly simplifies the visitors. In particular, visitors never add
comma tokens. Visitors only add required parentheses, such as the parens
that appear in def statements.
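A purely hypothetical sketch of this rule (the kind/value pairs and the
helper name are mine, not the actual TOG code — the token kinds just
mirror those in the dumps above):

```python
# Hypothetical helper: which tokens do visitors add to the results
# array ("syncing" tokens) vs. which are insignificant and omitted?
def is_syncing(kind, value):
    # Names, numbers and strings sync the token list to the tree.
    if kind in ('name', 'number', 'string'):
        return True
    # Visitors add only *required* operators, such as the parens
    # that appear in def statements. They never add commas.
    if kind == 'op':
        return value in ('(', ')')
    return False

assert is_syncing('name', 'print')
assert is_syncing('string', 'test %s=%s')
assert not is_syncing('op', ',')
```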
*About the asttokens tool*
Early yesterday morning I suddenly wondered whether all the work I have
done so far might have been a waste. Could the asttokens tool be a better
foundation? The short answer is "no", but I looked again at what asttokens
does, and how it does it.
Again, let's start with a dump. Here is the exact same unit test as above,
with the 'asttokens' switch enabled. So the test uses asttokens to
annotate the tree, not the TokenOrderGenerator class:
Contents...
1 """DS."""
2 print('test %s=%s'%(a, 2))
3 print(f"test {a}={2}")
Patched tree...
Module 0..16
Expr 0..0
Str 0..0
Expr 2..11 'print' '(' "'test %s=%s'" '%' '(' 'a' ',' '2' ')'
Call 2..11 'print' '(' "'test %s=%s'" '%' '(' 'a' ',' '2' ')'
Name 2..2
BinOp 4..10 "'test %s=%s'" '%' '(' 'a' ',' '2'
Str 4..4
Tuple 6..10 '(' 'a' ',' '2'
Name 7..7
Num 9..9
Expr 13..16 'print' '(' 'f"test {a}={2}"'
Call 13..16 'print' '(' 'f"test {a}={2}"'
Name 13..13
JoinedStr 15..15
The problems are apparent: the annotations aren't very useful. However,
parentheses are in better places.
*Pros and cons of the asttokens tool*
Pros:
- Thoroughly debugged.
- Uses generators everywhere.
- The code is concise.
- Arguably it is elegant, though arguments are possible ;-)
- Works with trees built by ast *and* with astroid, but this is a nit for
this project.
Cons:
- Imo, the present TOG generators are clearer. They are certainly more
flexible.
- asttokens.MarkTokens class doesn't do what is needed, nor can it easily
be made to do so.
- It's still not clear whether asttokens traverses the tree in token order.
Neither the asttokens sources nor the asttokens docs are clear on this last
point. I have created a unit test to investigate this question, but results
are not yet conclusive. Further (easy) unit tests will eventually answer
this question.
I continue to study the asttokens sources. After all the work I have done,
I know what to look for :-) Some parts are clever, perhaps too clever.
Other parts are essential hacks that the TOG will have to emulate, as
discussed below.
To summarize: Imo, the TOG classes are clearer and more flexible than the
asttokens code. YMMV.
*A new code pattern for visitors*
Yesterday I spent several happy hours revising the visitors. They all now
use a common pattern. For example:
def do_Call(self, node):
    yield from self.gen(node.func)
    yield from self.gen_op('(')
    yield from self.gen(node.args)
    # The visitor puts the '**' if there is no name field.
    yield from self.gen(node.keywords)
    if hasattr(node, 'starargs'):
        # The visitor puts the '*'.
        yield from self.gen(node.starargs)
    if hasattr(node, 'kwargs'):
        # The visitor puts the '**'.
        yield from self.gen(node.kwargs)
    yield from self.gen_op(')')
Visitors must follow three simple rules:
Rule 1. Visitors always use 'yield from', never 'yield'.
This allows subclasses to change visitors or other members at will. I'll
give an important example later.
Rule 2. Visitors call self.gen_op, self.gen_name, etc. to add tokens to the
results list.
Rule 3. Visitors call self.gen to generate results from subtrees of the
parse trees.
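To make the three rules concrete, here is a minimal, self-contained toy.
MiniTOG and its (kind, value) token tuples are illustrative inventions,
not the real TOG class:

```python
import ast

class MiniTOG:
    """A toy stand-in for TokenOrderGenerator, just to show the rules."""

    def gen(self, node):
        # Rule 3: delegate subtrees to self.gen.
        if node is None:
            return
        method = getattr(self, 'do_' + node.__class__.__name__)
        yield from method(node)

    def gen_op(self, value):
        # Rule 2: gen_* helpers add tokens to the results.
        yield ('op', value)

    def do_Name(self, node):
        yield ('name', node.id)

    def do_Constant(self, node):
        yield ('number', repr(node.value))

    def do_BinOp(self, node):
        # Rule 1: only 'yield from', never a bare 'yield'.
        yield from self.gen(node.left)
        yield from self.gen_op('+')  # Simplified: assume ast.Add.
        yield from self.gen(node.right)

tree = ast.parse('a + 2', mode='eval')
tokens = list(MiniTOG().gen(tree.body))
# tokens == [('name', 'a'), ('op', '+'), ('number', '2')]
```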
*Making the pattern work*
The code has been generalized. This eliminates a lot of cruft.
The gen* methods are new. They wrap calls to self.visitor, and usually
also self.put* methods...
def gen(self, z):
    yield from self.visitor(z)

def gen_blank(self):
    yield from self.visitor(self.put_blank())
...
The all-important visitor method has been generalized:
def visitor(self, node):
    """Given an ast node, return a *generator* from its visitor."""
    # This saves a lot of tests.
    if node is None:
        return
    # More general, more convenient.
    if isinstance(node, (list, tuple)):
        for z in node or []:
            if isinstance(z, ast.AST):
                yield from self.visitor(z)
            else:
                # Some fields contain ints or strings.
                assert isinstance(z, (int, str)), z.__class__.__name__
        return
    # We *do* want to crash if the visitor doesn't exist.
    method = getattr(self, 'do_' + node.__class__.__name__)
    # Allow begin/end visitors to be generators.
    val = self.begin_visitor(node)
    if isinstance(val, types.GeneratorType):
        yield from val
    # method(node) is a generator, not a recursive call!
    val = method(node)
    if isinstance(val, types.GeneratorType):
        yield from val
    else:
        raise ValueError(f"Visitor is not a generator: {method!r}")
    val = self.end_visitor(node)
    if isinstance(val, types.GeneratorType):
        yield from val
As you can see, all *visitors* must actually be generators. But the special
tests at the end mean that the begin/end_visitor methods can be either
regular methods or generators. This is an important generalization. For
example...
*The TokenOrderNodeGenerator class*
This class is the foundation for a unit test investigating whether
asttokens actually traverses the tree in token order. Here it is:
class TokenOrderNodeGenerator(TokenOrderGenerator):
    """A class that yields a stream of nodes."""

    def generate_nodes(self, tree):
        """Entry: yield a stream of nodes."""
        yield from self.visitor(tree)

    # Overrides...

    def begin_visitor(self, node):
        if node:
            yield node

    def end_visitor(self, node):
        pass

    def put_token(self, kind, val):
        pass
Nothing could possibly be more elegant. The begin_visitor method has
become a generator(!).
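A self-contained sketch of the same idea, using only the standard ast
module. NodeGen is my toy analogue, not the actual class; it walks
children in field order rather than strict token order:

```python
import ast
import types

class NodeGen:
    """A toy analogue of TokenOrderNodeGenerator: yield nodes in visit order."""

    def visitor(self, node):
        # begin_visitor is a generator: it yields the node itself.
        val = self.begin_visitor(node)
        if isinstance(val, types.GeneratorType):
            yield from val
        # Visit children in field order.
        for child in ast.iter_child_nodes(node):
            yield from self.visitor(child)

    def begin_visitor(self, node):
        if node:
            yield node

tree = ast.parse('print(a)')
names = [z.__class__.__name__ for z in NodeGen().visitor(tree)]
# names starts with ['Module', 'Expr', 'Call', 'Name', ...]
```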
*A reminder about speed*
A note to myself, and any other interested party ;-) *Please* remember that
speed is of tertiary importance. Moreover, the TOG classes are about 30
percent faster than the concise, supposedly elegant, classes in the
asttokens tool.
Simplicity and correctness of code in the visitors are *infinitely* more
important than speed. The simplified visitor pattern has *no* special
cases. The new tests in the visitor method allow the
TokenOrderNodeGenerator class to be dead simple.
*Remaining problems*
I'll just mention the problems. Solutions should be fairly
straightforward. I thought I had found a brilliant solution to some of
them, but serious doubts arose while writing this post, so I'll investigate
further...
1. Allocating non-syncing tokens, especially parentheses.
asttokens has a method that replaces tokens in a (tree of) ast nodes. This
will only work if all tokens are properly allocated to the correct node.
Clearly this can be done, perhaps in a post pass.
2. The If visitor can't distinguish between "else if" and "elif" (!!).
It appears that in some cases these two constructs result in *exactly the
same parse tree.* If that's true (I'm still investigating, but I think it
is true), then Linker.check will have to do a late fixup. It should be no
big deal. It's certainly not a show stopper.
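A quick check with the standard ast module supports the suspicion: without
position attributes, the dumps of the two forms are identical, so any late
fixup will have to rely on line/column information:

```python
import ast

elif_src = "if a:\n    pass\nelif b:\n    pass\n"
else_src = "if a:\n    pass\nelse:\n    if b:\n        pass\n"

# 'elif' is represented as an If node in the parent's orelse list,
# exactly like a nested 'else: if ...' block.
assert ast.dump(ast.parse(elif_src)) == ast.dump(ast.parse(else_src))
```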
3. 'comment' and 'string' tokens are difficult special cases.
Comments are not easily accessible in the tree, and neither are strings
arising from docstrings. Furthermore, only tokens contain reliable
"spellings" of comments and strings. Finally, the Str visitor relies on
special case code in Linker.set_links.
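The comment problem is easy to demonstrate with the standard library: the
parse tree contains no trace of a comment, while the token stream
preserves its exact spelling:

```python
import ast
import io
import tokenize

src = "x = 1  # a comment\n"

# No node in the parse tree corresponds to the comment.
node_names = {z.__class__.__name__ for z in ast.walk(ast.parse(src))}
assert 'Comment' not in node_names

# The token stream preserves the comment's exact spelling.
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))
comments = [t.string for t in tokens if t.type == tokenize.COMMENT]
assert comments == ['# a comment']
```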
*Summary*
The project is going well. The code continues to improve. Some questions
and problems remain, but I see no gotchas.
Edward