ENB: The unification of the token and ast worlds

Edward K. Ream Sun, 10 Nov 2019 01:37:42 -0800

A few days ago a new goal appeared: to define and create a *token-order tree 
traversal*. This would be yet another kind of tree traversal 
<https://www.geeksforgeeks.org/tree-traversals-inorder-preorder-and-postorder/>, 
whose purpose would be to unify the token and ast worlds. This *goal* 
started a chain of discovery and experimentation.  IIRC, this is the first 
time I have stated this goal.


This post 
<https://groups.google.com/d/msg/leo-editor/ZSo8_fW_cmA/Pfvq5KV0AQAJ> is a 
primer on the token-based and ast-based worlds. Please study that post if 
you have no idea what "unifying the token and ast worlds" might mean :-)

Yesterday I spent many happy hours exploring the excellent, elegant 
asttokens <https://github.com/gristlabs/asttokens> package. At first I 
thought asttokens might be an elegant way to define token-order 
traversals.  But then problems arose...

Last night, in conversation with Rebecca, I had two Ahas, each of which 
will forever change my view of the world:

*Aha 1*. Any token-order tree traversal *must* be isomorphic to Leo's 
AstFormatter class, in leoAst.py.

At last, token-order tree traversals are well defined.

*Aha 2*. Tree-based code could be an alternative front-end to Leo's 
token-based beautifier.

I am writing this post now, before the old world view becomes completely 
inaccessible. This post is an important part of Leo's history.  It will 
also be pre-writing for a new theory of operation.

*Background*

Three previous threads record the background of the Ahas:

*Thread 1*: October 25: A small pause for a better fstringify 
<https://groups.google.com/forum/#!topic/leo-editor/ZSo8_fW_cmA> states:

"It is my strong (informed) opinion that parse trees are inappropriate for 
text-based manipulations such as black <https://github.com/psf/black> and 
fstringify <https://github.com/jacktasia/fstringify>".

The two new Ahas alter that opinion in large and small ways.

*Thread 2*: November 1: ENB about tokens and related commands 
<https://groups.google.com/d/msg/leo-editor/aivhFnXW85Q/b2a8GHvEDwAJ> 
reiterates my contention that python programmers often downplay the 
significance of token-based code.  That's still my opinion.

*Thread 3*: November 3: ENB A much better untokenizer 
<https://groups.google.com/d/msg/leo-editor/DpZ2cMS03WE/VPqtB9lTEAAJ> 
discusses a spectacular replacement for the untokenize function in 
tokenize.py, part of Python's standard library.  The new code is the 
foundation of Leo's fstringify commands, and all other token-based code in 
Leo.

The python devs responsible for tokenize.py were underwhelmed by the new 
untokenize, but no matter :-)  It was part of the background for the new 
Ahas. And the new untokenize is the foundation of Leo's new fstringify 
commands.

*About asttokens*

The asttokens package embeds new data into the 5-tuples created by Python's 
tokenize function.  They become 8-tuples. The new data contain links to the 
ast nodes "responsible" for the tokens.
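
For readers who have not looked at raw tokens, here is a minimal sketch of 
the 5-tuples that the standard library's tokenize module produces.  It uses 
only the stdlib and does not show asttokens itself; the helper name 
tokens_of is made up for this illustration.

```python
import io
import tokenize

def tokens_of(source):
    """Return the list of TokenInfo 5-tuples for a source string."""
    readline = io.StringIO(source).readline
    return list(tokenize.generate_tokens(readline))

toks = tokens_of("a = b + 1\n")
for tok in toks:
    # Each TokenInfo has five fields: type, string, start, end, line.
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
```

Nothing in these five fields refers to a tree node, which is the gap that 
any token-to-node link has to fill.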

Alas, the new data does not suffice to create a two-way mapping between 
tokens and ast nodes.  

Rebecca asked whether such a two-way mapping was possible, and both Ahas 
appeared immediately!

Any proper *tree-to-token mapping* must have two parts:

1. Links from each token to *exactly one* tree node, the node that 
"generates" the token.

2. Links from tree nodes to zero or more tokens, in the proper token order.
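
The two parts above can be sketched as a small data structure.  This is 
only an illustration under my own assumptions: the class name TokenLinks 
and its attributes are hypothetical and are not taken from leoAst.py.  For 
brevity the demo links only three tokens by hand and leaves the NEWLINE and 
ENDMARKER tokens unlinked.

```python
import ast
import io
import tokenize

class TokenLinks:
    """Hypothetical container for a two-way tree-to-token mapping."""

    def __init__(self):
        self.node_of = {}    # token index -> the one node that generates it
        self.tokens_of = {}  # id(node) -> token indices, in token order

    def link(self, index, node):
        # Part 1: a token may be linked to exactly one node.
        if index in self.node_of:
            raise ValueError(f"token {index} already has a node")
        self.node_of[index] = node
        # Part 2: a node collects zero or more tokens, in token order.
        self.tokens_of.setdefault(id(node), []).append(index)

source = "x = 1\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
tree = ast.parse(source)
assign = tree.body[0]

links = TokenLinks()
links.link(0, assign.targets[0])  # 'x' is generated by the Name node.
links.link(1, assign)             # '=' is generated by the Assign node.
links.link(2, assign.value)       # '1' is generated by the value node.
```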

*Aha 1: clever code has no chance of working*

I saw that I was trying to "cheat" yesterday.  That is, I was trying to 
make asttokens do more than it possibly could.  This was becoming clearer 
as I wrestled with the new TokenOrderTraverser class in leoAst.py.  The 
present version of this class will move to the attic.

There is no real need to discuss the problems in detail.  Instead, let's 
just consider the AstFormatter class in leoAst.py.  Aha 1 is simply the 
realization that the AstFormatter class *already* defines token-order!

If I want a two-way mapping between tokens and tree nodes, something that 
works exactly like the visitors in AstFormatter is not only the *simplest* 
thing that could possibly work, it is the *only* thing that could possibly 
work.  Indeed, there is *exactly one* traversal that will format text 
properly. But (Aha!) the formatted text must be (except for easily handled 
special cases) isomorphic to the stream of tokens!

For the first time, token-order is well defined.
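
To make the idea concrete, here is a toy formatter-style traversal for a 
tiny subset of Python.  It is a sketch, not the AstFormatter code: the 
class name MiniTokenOrder and the helper significant_strings are invented 
for this example, only five node types are handled, and it assumes Python 
3.8 or later (where literals parse to ast.Constant).

```python
import ast
import io
import tokenize

class MiniTokenOrder:
    """Yield (node, token_string) pairs in token order (tiny subset only)."""

    ops = {ast.Add: '+', ast.Sub: '-', ast.Mult: '*'}

    def visit(self, node):
        # Dispatch to do_<NodeName>, as formatter-style visitors do.
        yield from getattr(self, 'do_' + type(node).__name__)(node)

    def do_Module(self, node):
        for stmt in node.body:
            yield from self.visit(stmt)

    def do_Assign(self, node):
        for target in node.targets:
            yield from self.visit(target)
            yield node, '='  # The '=' belongs to the Assign node.
        yield from self.visit(node.value)

    def do_BinOp(self, node):
        # The order left, operator, right *is* the token order.
        yield from self.visit(node.left)
        yield node, self.ops[type(node.op)]
        yield from self.visit(node.right)

    def do_Name(self, node):
        yield node, node.id

    def do_Constant(self, node):
        yield node, repr(node.value)

def significant_strings(source):
    """Return the strings of the NAME, NUMBER, OP and STRING tokens."""
    keep = (tokenize.NAME, tokenize.NUMBER, tokenize.OP, tokenize.STRING)
    readline = io.StringIO(source).readline
    return [t.string for t in tokenize.generate_tokens(readline)
            if t.type in keep]

source = "a = b + 1\nc = a * 2\n"
pairs = list(MiniTokenOrder().visit(ast.parse(source)))
generated = [s for node, s in pairs]
print(generated == significant_strings(source))
```

For this input the traversal and tokenize agree.  Add parentheses, as in 
"a = (b + 1) * 2", and they no longer agree, because the tree has no node 
for the parentheses.  Cases like that are what the explicit visitor code 
has to handle.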

*Aha 2: AstFormatter could be the front end to Leo's beautify commands*

This is a corollary to the first Aha.  If a tree traversal can produce 
tokens in the same order that tokenize does, then it could create the 
so-called input tokens used by Leo's beautify and fstringify commands.

*Strategy*

I'll modify the AstFormatter class so that it injects two-way links between 
tokens and nodes.  Aha 1 proves that this is possible!

The AstFormatter class contains one "visitor" function for every single ast 
node that could possibly generate output text.  Crucially, these visitors 
*must* call other visitors in the correct order.  That order *defines* 
token order.

Instead of creating output text, the rewritten TokenOrderTraverser class 
will insert links.  So simple.  The answer was staring at me all this time.

There are a few complications that are easily handled.  For example, the 
trailing comma in a tuple with two or more elements is optional.  
Therefore, the do_Tuple visitor *must test the token stream*.
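
Here is a rough sketch of what "testing the token stream" could look like 
for the optional trailing comma.  It is deliberately simplified: it assumes 
every tuple element is a single token, and the function names 
significant_tokens and tuple_commas are hypothetical, not part of 
leoAst.py.

```python
import io
import tokenize

def significant_tokens(source):
    """Return the NAME, NUMBER, OP and STRING tokens of source."""
    keep = (tokenize.NAME, tokenize.NUMBER, tokenize.OP, tokenize.STRING)
    readline = io.StringIO(source).readline
    return [t for t in tokenize.generate_tokens(readline) if t.type in keep]

def tuple_commas(tokens, start, n_elts):
    """
    Return the indices of the comma tokens of a flat tuple.

    tokens[start] is the first element.  Assumption: each element is
    exactly one token.
    """
    i, commas = start, []
    for k in range(n_elts):
        i += 1  # Skip the one-token element.
        if i < len(tokens) and tokens[i].string == ',':
            commas.append(i)  # Only the token stream tells us it is there.
            i += 1
        elif k < n_elts - 1:
            raise ValueError("expected a comma between tuple elements")
    return commas

plain = significant_tokens("t = (1, 2)\n")
trailing = significant_tokens("t = (1, 2,)\n")
print(tuple_commas(plain, 3, 2), tuple_commas(trailing, 3, 2))
```

Both sources parse to the same tree, so only the tokens reveal whether the 
second comma exists.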

The old TokenOrderTraverser class tried to use a dict to specify token 
order.  But *cruft tokens* doomed that approach.  Examples:

- The trailing colon in class and def statements.
- The commas in lists and tuples.

Only explicit code, *exactly* as in the visitors in AstFormatter, could 
possibly inject links into cruft tokens.
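
A quick way to see cruft tokens is to list the operator tokens of a small 
function next to the node types in its parse tree.  This is only a 
stdlib illustration; the variable names are my own.

```python
import ast
import io
import tokenize

source = "def f(a, b):\n    return [a, b]\n"

# The punctuation that tokenize reports...
ops = [t.string
       for t in tokenize.generate_tokens(io.StringIO(source).readline)
       if t.type == tokenize.OP]

# ...and the kinds of nodes that the parse tree contains.
node_types = sorted({type(n).__name__ for n in ast.walk(ast.parse(source))})

print(ops)
print(node_types)
```

The colon, commas, parentheses and brackets all appear in the token list, 
but no node type corresponds to any of them.  A visitor has to assign each 
one to a node explicitly.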

*Summary*

It's getting very late, so I'll be brief here.

I don't remember exactly how the notion of a token-order traversal 
appeared.  In retrospect, this was a crucial "invention".

Initially, all details were fuzzy.  It wasn't entirely clear whether the 
notion *could* be well defined, though I strongly suspected that it could 
be.

Now, everything is clear. The code for a token-order traversal class must 
be isomorphic to the code in the AstFormatter class.  All the messy, picky 
details of that class *define* token-order traversals!

Lest anyone doubt the importance of these Ahas, consider the status quo 
ante:

- The old (deprecated) TokenSync class in leoAst.py.
- The horrendous code in the "real" black and fstringify tools.
- The token-level parsing in Leo's fstringify commands.

A proper tree-to-token mapping would be of great value to any tool that 
munges text. It allows tools to use both tree and token representations 
interchangeably *in the same program.*

That's all for now.  I've covered enough here to make sure the crucial 
details behind the Ahas don't fade away.

Edward

