On Monday, November 11, 2019 at 4:10:39 PM UTC-6, Edward K. Ream wrote:

> After about 10 hours of work, starting very early this morning, I 
> realized that my initial approach to "syncing" tokens with ast nodes 
> needed a rethink. The initial idea was to *verify* that tokens matched 
> ast nodes.  But that's too late.
>

*Note*:  this continues an Engineering Notebook post.  Feel free to ignore.

This post is a milestone.  It records notes to myself.  It also celebrates 
a clearing of my mental fog.  I want to write this as an important 
historical note.  I also record it so I can go back to sleep ;-)

When I awoke very early this morning I realized that I had been making 
things much more complicated than they need to be, because I had forgotten 
what I was trying to do :-)

The present work is not supposed to generate *new* tokens, it is supposed 
to *annotate the existing* tokens created by make_all_tokens.  This 
collapses the complexity of the crucial code.  Here are some principles 
that I awoke with:

1. Add fields to Token class: index, level, node.

The purpose of the TokenOrderTraverser is to add these links.  Nothing more!
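A minimal sketch of the idea (the Token class below is a stand-in of my own; only the three new fields come from the notes above, everything else is assumed):

```python
import ast

class Token:
    """Stand-in for a token produced by make_all_tokens."""
    def __init__(self, kind, value):
        self.kind = kind    # e.g. 'name', 'ws', 'newline', 'op'
        self.value = value
        # The three fields the TokenOrderTraverser will inject...
        self.index = None   # position in the token list
        self.level = None   # indentation level when the token is eaten
        self.node = None    # the ast node that "ate" this token

# The traverser's only job is to fill in these links:
t = Token('name', 'spam')
t.index, t.level, t.node = 0, 0, ast.parse('spam').body[0]
```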
  
2. Improve the visit method

- assert isinstance(node, ast.AST): None, list, and tuple are not valid 
arguments.

Unlike all "elegant" traversal classes, the TokenOrderTraverser class *must* 
explicitly handle all fields that are lists or tuples, and must check for 
empty fields.  There is absolutely no choice about this.  It's the only way 
to retain the correct traversal order.

- Inject parent, ordered_children fields in ast nodes.

These are not needed to annotate the tokens, but they will be of great 
value for the clients of the TokenOrderTraverser class.

- compute max_level, max_stack_level.

These are important data for development.  max_level is the max indentation 
level of python blocks.  max_stack_level is the max recursion level in 
tot.visit.  The visit method can easily update these data.

The asttokens tool supposedly uses generators to avoid overflowing python's 
runtime stack.  I want to make sure we never come close to this.
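Item 2 could be sketched as follows.  This is only an illustration with assumed names: a real visit method would dispatch to per-node visitors rather than use generic ast.iter_fields.  But the explicit, item-by-item handling of list fields, the parent/ordered_children injection, and the max_stack_level bookkeeping are the points above:

```python
import ast

class TokenOrderTraverser:
    """Sketch only: the real class dispatches to per-node visitors."""
    def __init__(self):
        self.stack = []             # stack of enclosing ast nodes
        self.max_stack_level = 0    # max recursion level seen in visit

    def visit(self, node):
        # None, list and tuple are not valid arguments.
        assert isinstance(node, ast.AST), repr(node)
        # Inject parent and ordered_children for clients of this class.
        node.parent = self.stack[-1] if self.stack else None
        node.ordered_children = []
        self.stack.append(node)
        self.max_stack_level = max(self.max_stack_level, len(self.stack))
        for name, field in ast.iter_fields(node):
            # Handle list/tuple fields explicitly, item by item,
            # to retain the correct traversal order.
            items = field if isinstance(field, (list, tuple)) else [field]
            for item in items:
                if isinstance(item, ast.AST):
                    node.ordered_children.append(item)
                    self.visit(item)
        self.stack.pop()

tree = ast.parse("a = 1\nb = 2")
tot = TokenOrderTraverser()
tot.visit(tree)
```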

3. The following three items look innocuous.  In fact, they are supremely 
important. They arise because the task is now clearer:

- Remove all calls to put_indent and put_dedent.  Replace them with 
self.level += 1 and self.level -= 1.
- Remove all "speculative" calls to do_newline.
- Remove conditional_newline.

You could say that all of the code above is a brain spike :-)  Again, the 
code is *not* creating new tokens, it is annotating existing tokens.  This 
is actually a huge Aha:

   The put* methods simply "eat" zero or more tokens in the token *list*, 
adding fields to those tokens in the process.

Most put methods will eat "ws" tokens if they are next, and then eat the 
"matching" token.  The put_newline method will *also* eat any following 
"indent" token.  It's totally simple!  There should be no such thing as a 
conditional_newline!

I'm not sure how "dedent" tokens will be eaten, but it shouldn't be a big 
deal to do so.
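Concretely, eating looks something like this.  The names and the dict-based token representation are my own stand-ins, not Leo's actual API:

```python
class TokenEater:
    """Sketch of 'eating' tokens: advance a pointer into an existing
    token list, injecting fields into each eaten token as we go."""
    def __init__(self, tokens):
        self.tokens = tokens   # list of dicts: {'kind': ..., 'value': ...}
        self.i = 0             # pointer into the token list
        self.node = None       # current ast node, set by the traverser
        self.level = 0         # current indentation level

    def eat(self, kind):
        """Eat any leading 'ws' tokens, then the matching token."""
        while self.i < len(self.tokens) and self.tokens[self.i]['kind'] == 'ws':
            self.annotate()
        assert self.tokens[self.i]['kind'] == kind, (kind, self.tokens[self.i])
        self.annotate()

    def annotate(self):
        """Inject index, level and node into the next token; advance."""
        self.tokens[self.i].update(index=self.i, level=self.level, node=self.node)
        self.i += 1

    def put_newline(self):
        """Eat a 'newline' or 'nl' token and any following 'indent' token."""
        assert self.tokens[self.i]['kind'] in ('newline', 'nl')
        self.annotate()
        if self.i < len(self.tokens) and self.tokens[self.i]['kind'] == 'indent':
            self.annotate()

# Tokens for the line:  x = 1
tokens = [
    {'kind': 'name', 'value': 'x'}, {'kind': 'ws', 'value': ' '},
    {'kind': 'op', 'value': '='}, {'kind': 'ws', 'value': ' '},
    {'kind': 'number', 'value': '1'}, {'kind': 'newline', 'value': '\n'},
]
e = TokenEater(tokens)
e.eat('name'); e.eat('op'); e.eat('number'); e.put_newline()
```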

And now there is a second huge Aha:

    Eating a token naturally associates exactly one ast node with the token.

Indeed, the self.node (carefully set and restored in tot.visit, using a 
stack) will be injected into the token's node field.  That's all there is 
to it!

And one last Aha:

    Newlines are associated with *statements*, not blocks.

This ends some massive confusion, and will simplify the code considerably.

*Summary*

The task of the TokenOrderTraverser class is merely to annotate already 
existing tokens.

The put methods will "eat" zero or more tokens by advancing a pointer 
into the token list and by injecting data into the eaten tokens.  There 
is no need for complex synchronization!

put_newline will eat a "newline" or "nl" token and any following "indent" 
token, and probably any preceding "dedent" token.  The 
put_conditional_comma method is still required.  It will eat a comma if it 
exists, but will issue no complaint if it does not.
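For example, put_conditional_comma might look something like this.  A hypothetical sketch with stand-in names and a dict-based token representation, not the actual implementation:

```python
def put_conditional_comma(tokens, i, node):
    """Eat a ',' token if it is next, annotating it with its node;
    otherwise return the pointer unchanged, with no complaint."""
    if i < len(tokens) and tokens[i]['kind'] == 'op' and tokens[i]['value'] == ',':
        tokens[i].update(index=i, node=node)
        return i + 1
    return i

toks = [{'kind': 'op', 'value': ','}, {'kind': 'name', 'value': 'y'}]
i = put_conditional_comma(toks, 0, None)   # eats the comma
i2 = put_conditional_comma(toks, i, None)  # 'y' is next: a silent no-op
```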

Eating a token naturally associates a token with the correct ast node.  At 
last I clearly and fully understand the two-way correspondence.

The self.level ivar represents indentation level, and will be injected into 
all tokens. That's all that needs to be done regarding indentation! There 
is no need to generate "indent" and "dedent" tokens!

It is rare for a to-do list to have such import, but these are wonderful 
times :-)

Edward

P. S. Hehe. The TokenOrderFormatter is trivial because it doesn't do 
anything. True, a proper code beautifier or fstringifier would be a 
subclass of TokenOrderTraverser, but those tools would be anything but 
trivial.

EKR

-- 
You received this message because you are subscribed to the Google Groups 
"leo-editor" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/leo-editor/cb73cb38-b404-4218-a18a-8605f70bce53%40googlegroups.com.