For the last few days, ideas, Ahas, and progress have come so fast that I
have not had time to document them properly. Now is the time. A reminder:
all work is being done in the fstringify branch.
*Short summary*: There have been big Aha's, the code keeps getting simpler,
and further collapses in complexity beckon. The testing framework continues
to evolve. I won't speculate on schedule. Any "delays" have been worthwhile.
This is a long post. Feel free to skim or skip entirely.
The rest of this post discusses status and documents important design and
coding details. It will be pre-writing for a revised Theory of Operation,
now in LeoDocs.leo.
*Status, visualized*
I have spent a lot of time on testing and visualization tools. The
following are the untouched results from a unit test in unitTest.leo. This
test is a proof of concept for a tree-based fstringify.
The 'contents' dump shows the incoming source string, the 'tree' dump shows
the resulting (annotated) tree.
Contents...
1 """DS."""
2 print('test %s=%s'%(a, 2))
3 print(f"test {a}={2}")
Patched tree...
parent lines node tokens
====== ===== ==== ======
0 Module
0 Module 1 Expr
1 Expr 0 2 Str: s='DS.' string(DS.)
0 Module 3 Expr
3 Expr 4 Call
4 Call 1..2 5 Name: id='print' string("""DS.""") newline(1:10) name(print)
4 Call 2 6 BinOp: % op( string('test %s=%s') op%
6 BinOp 0 7 Str: s='test %s=%s' string(test %s=%s)
6 BinOp 8 Tuple
8 Tuple 2 9 Name: id='a' op( name(a)
8 Tuple 2 10 Num: n=2 op, number(2)
0 Module 11 Expr
11 Expr 12 Call
12 Call 2..3 13 Name: id='print' op) op) newline(2:27) name(print)
12 Call 14 JoinedStr
*Note*: parentheses are in the wrong place. Fixing should be
straightforward, if not pretty ;-)
*About the results array*
An earlier post was ambivalent about the results array. Would it ever be
useful on its own? The answer is now completely clear:
*The results array is for internal use only. It exists only to transfer
data between the tree and the incoming tokens list.*
Indeed, the name 'results' is a bit of a misnomer, but I can't think of a
better name.
The immediate consequence:
*Visitors should not add non-syncing (insignificant) tokens to the
results array.*
This greatly simplifies the visitors. In particular, visitors never add
comma tokens. Visitors only add required parentheses, such as the parens
that appear in def statements.
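A purely hypothetical sketch of this rule (the kind/value pairs and the
helper name are mine, not the actual TOG code — the token kinds just
mirror those in the dumps above):

```python
# Hypothetical helper: which tokens do visitors add to the results
# array ("syncing" tokens) vs. which are insignificant and omitted?
def is_syncing(kind, value):
    # Names, numbers and strings sync the token list to the tree.
    if kind in ('name', 'number', 'string'):
        return True
    # Visitors add only *required* operators, such as the parens
    # that appear in def statements. They never add commas.
    if kind == 'op':
        return value in ('(', ')')
    return False

assert is_syncing('name', 'print')
assert is_syncing('string', 'test %s=%s')
assert not is_syncing('op', ',')
```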
*About the asttokens tool*
Early yesterday morning I suddenly wondered whether all the work I have
done so far might have been a waste. Could the asttokens tool be a better
foundation? The short answer is "no", but I looked again at what asttokens
does, and how it does it.
Again, let's start with a dump. Here is the exact same unit test as above,
with the 'asttokens' switch enabled. So the test uses asttokens to
annotate the tree, not the TokenOrderGenerator class:
Contents...
1 """DS."""
2 print('test %s=%s'%(a, 2))
3 print(f"test {a}={2}")
Patched tree...
Module 0..16
Expr 0..0
Str 0..0
Expr 2..11 'print' '(' "'test %s=%s'" '%' '(' 'a' ',' '2' ')'
Call 2..11 'print' '(' "'test %s=%s'" '%' '(' 'a' ',' '2' ')'
Name 2..2
BinOp 4..10 "'test %s=%s'" '%' '(' 'a' ',' '2'
Str 4..4
Tuple 6..10 '(' 'a' ',' '2'
Name 7..7
Num 9..9
Expr 13..16 'print' '(' 'f"test {a}={2}"'
Call 13..16 'print' '(' 'f"test {a}={2}"'
Name 13..13
JoinedStr 15..15
The problems are apparent: the annotations aren't very useful. However,
parentheses are in better places.
*Pros and cons of the asttokens tool*
Pros:
- Thoroughly debugged.
- Uses generators everywhere.
- The code is concise.
- Arguably it is elegant, though arguments are possible ;-)
- Works with trees built by ast *and* with astroid, but this is a nit for
this project.
Cons:
- Imo, the present TOG generators are clearer. They are certainly more
flexible.
- asttokens.MarkTokens class doesn't do what is needed, nor can it easily
be made to do so.
- It's still not clear whether asttokens traverses the tree in token order.
Neither the asttokens sources nor the asttokens docs are clear on this last
point. I have created a unit test to investigate this question, but results
are not yet conclusive. Further (easy) unit tests will eventually answer
this question.
I continue to study the asttokens sources. After all the work I have done,
I know what to look for :-) Some parts are clever, perhaps too clever.
Other parts are essential hacks that the TOG will have to emulate, as
discussed below.
To summarize: Imo, the TOG classes are clearer and more flexible than the
asttokens code. YMMV.
*A new code pattern for visitors*
Yesterday I spent several happy hours revising the visitors. They all now
use a common pattern. For example:
def do_Call(self, node):
    yield from self.gen(node.func)
    yield from self.gen_op('(')
    yield from self.gen(node.args)
    # The visitor puts the '**' if there is no name field.
    yield from self.gen(node.keywords)
    if hasattr(node, 'starargs'):
        # The visitor puts the '*'.
        yield from self.gen(node.starargs)
    if hasattr(node, 'kwargs'):
        # The visitor puts the '**'.
        yield from self.gen(node.kwargs)
    yield from self.gen_op(')')
Visitors must follow three simple rules:
Rule 1. Visitors always use 'yield from', never 'yield'.
This allows subclasses to change visitors or other members at will. I'll
give an important example later.
Rule 2. Visitors call self.gen_op, self.gen_name, etc. to add tokens to the
results list.
Rule 3. Visitors call self.gen to generate results from subtrees of the
parse trees.
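To make the three rules concrete, here is a minimal, self-contained toy.
MiniTOG and its (kind, value) token tuples are illustrative inventions,
not the real TOG class:

```python
import ast

class MiniTOG:
    """A toy stand-in for TokenOrderGenerator, just to show the rules."""

    def gen(self, node):
        # Rule 3: delegate subtrees to self.gen.
        if node is None:
            return
        method = getattr(self, 'do_' + node.__class__.__name__)
        yield from method(node)

    def gen_op(self, value):
        # Rule 2: gen_* helpers add tokens to the results.
        yield ('op', value)

    def do_Name(self, node):
        yield ('name', node.id)

    def do_Constant(self, node):
        yield ('number', repr(node.value))

    def do_BinOp(self, node):
        # Rule 1: only 'yield from', never a bare 'yield'.
        yield from self.gen(node.left)
        yield from self.gen_op('+')  # Simplified: assume ast.Add.
        yield from self.gen(node.right)

tree = ast.parse('a + 2', mode='eval')
tokens = list(MiniTOG().gen(tree.body))
# tokens == [('name', 'a'), ('op', '+'), ('number', '2')]
```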
*Making the pattern work*
The code has been generalized. This eliminates a lot of cruft.
The gen* methods are new. They wrap calls to self.visitor, and usually
also self.put* methods...
def gen(self, z):
    yield from self.visitor(z)

def gen_blank(self):
    yield from self.visitor(self.put_blank())
...
The all-important visitor method has been generalized:
def visitor(self, node):
    """Given an ast node, return a *generator* from its visitor."""
    # This saves a lot of tests.
    if node is None:
        return
    # More general, more convenient.
    if isinstance(node, (list, tuple)):
        for z in node or []:
            if isinstance(z, ast.AST):
                yield from self.visitor(z)
            else:
                # Some fields contain ints or strings.
                assert isinstance(z, (int, str)), z.__class__.__name__
        return
    # We *do* want to crash if the visitor doesn't exist.
    method = getattr(self, 'do_' + node.__class__.__name__)
    # Allow begin/end visitors to be generators.
    val = self.begin_visitor(node)
    if isinstance(val, types.GeneratorType):
        yield from val
    # method(node) is a generator, not a recursive call!
    val = method(node)
    if isinstance(val, types.GeneratorType):
        yield from val
    else:
        raise ValueError(f"Visitor is not a generator: {method!r}")
    val = self.end_visitor(node)
    if isinstance(val, types.GeneratorType):
        yield from val
As you can see, all *visitors* must actually be generators. But the special
tests at the end mean that the begin/end_visitor methods can be either
regular methods or generators. This is an important generalization. For
example...
*The TokenOrderNodeGenerator class*
This class is the foundation for a unit test investigating whether
asttokens actually traverses the tree in token order. Here it is:
class TokenOrderNodeGenerator(TokenOrderGenerator):
    """A class that yields a stream of nodes."""

    def generate_nodes(self, tree):
        """Entry: yield a stream of nodes."""
        yield from self.visitor(tree)

    # Overrides...

    def begin_visitor(self, node):
        if node:
            yield node

    def end_visitor(self, node):
        pass

    def put_token(self, kind, val):
        pass
Nothing could possibly be more elegant. The begin_visitor method has
become a generator(!).
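A self-contained sketch of the same idea, using only the standard ast
module. NodeGen is my toy analogue, not the actual class; it walks
children in field order rather than strict token order:

```python
import ast
import types

class NodeGen:
    """A toy analogue of TokenOrderNodeGenerator: yield nodes in visit order."""

    def visitor(self, node):
        # begin_visitor is a generator: it yields the node itself.
        val = self.begin_visitor(node)
        if isinstance(val, types.GeneratorType):
            yield from val
        # Visit children in field order.
        for child in ast.iter_child_nodes(node):
            yield from self.visitor(child)

    def begin_visitor(self, node):
        if node:
            yield node

tree = ast.parse('print(a)')
names = [z.__class__.__name__ for z in NodeGen().visitor(tree)]
# names starts with ['Module', 'Expr', 'Call', 'Name', ...]
```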
*A reminder about speed*
A note to myself, and any other interested party ;-) *Please* remember that
speed is of tertiary importance. Moreover, the TOG classes are about 30
percent faster than the concise, supposedly elegant, classes in the
asttokens tool.
Simplicity and correctness of code in the visitors are *infinitely* more
important than speed. The simplified visitor pattern has *no* special
cases. The new tests in the visitor method allow the
TokenOrderNodeGenerator class to be dead simple.
*Remaining problems*
I'll just mention the problems. Solutions should be fairly
straightforward. I thought I had found a brilliant solution to some of
them, but serious doubts arose while writing this post, so I'll investigate
further...
1. Allocating non-syncing tokens, especially parentheses.
asttokens has a method that replaces tokens in a (tree of) ast nodes. This
will only work if all tokens are properly allocated to the correct node.
Clearly this can be done, perhaps in a post pass.
2. The If visitor can't distinguish between "else if" and "elif" (!!).
It appears that in some cases these two constructs result in *exactly the
same parse tree.* If that's true (I'm still investigating, but I think it
is true), then Linker.check will have to do a late fixup. It should be no
big deal. It's certainly not a show stopper.
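A quick check with the standard ast module supports the suspicion: without
position attributes, the dumps of the two forms are identical, so any late
fixup will have to rely on line/column information:

```python
import ast

elif_src = "if a:\n    pass\nelif b:\n    pass\n"
else_src = "if a:\n    pass\nelse:\n    if b:\n        pass\n"

# 'elif' is represented as an If node in the parent's orelse list,
# exactly like a nested 'else: if ...' block.
assert ast.dump(ast.parse(elif_src)) == ast.dump(ast.parse(else_src))
```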
3. 'comment' and 'string' tokens are difficult special cases.
Comments are not easily accessible in the tree, and neither are strings
arising from docstrings. Furthermore, only tokens contain reliable
"spellings" of comments and strings. Finally, the Str visitor relies on
special case code in Linker.set_links.
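The comment problem is easy to demonstrate with the standard library: the
parse tree contains no trace of a comment, while the token stream
preserves its exact spelling:

```python
import ast
import io
import tokenize

src = "x = 1  # a comment\n"

# No node in the parse tree corresponds to the comment.
node_names = {z.__class__.__name__ for z in ast.walk(ast.parse(src))}
assert 'Comment' not in node_names

# The token stream preserves the comment's exact spelling.
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))
comments = [t.string for t in tokens if t.type == tokenize.COMMENT]
assert comments == ['# a comment']
```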
*Summary*
The project is going well. The code continues to improve. Some questions
and problems remain, but I see no gotchas.
Edward