Success! High-level overview and status report

Edward K. Ream Mon, 18 Nov 2019 08:10:38 -0800

The most risky phase of this project is complete. Significant tests pass.

The crucial Linker class is unexpectedly simple and elegant. It is a linear 
time diff of two lists of tokens.


The rest of this post is a high-level overview of the code and algorithms. 
It will be pre-writing for a Theory of Operation chapter, to become part of 
LeoDocs.leo.


*High-level overview*

The *contents *are a string, a snippet of python code.

Given a contents string, *TokenOrderGenerator *(tog) class does the 
following:

1. Tokenize:

Use python's tokenize module to create the *tokens list*, a list of *Token *
objects.

2. Create the parse tree:

Use python's ast module to create the *parse tree*, a tree of ast.AST nodes.

3. Generate the results list:

The tog traverses the parse tree, generating the *results list*, again a 
list of Token objects.

Crucially, this traversal happens in *token order*. This is the *one and 
only *order that guarantees the token and results lists will match *exactly*, 
except for *optional tokens*. Optional tokens may appear in either the 
tokens list or results list.

To generate result tokens in the proper order, the tog class *must *have a 
separate *visitor* for every kind of parse tree node.

All visitors are generators, ensuring that the traversal can't overflow 
python's run-time stack. Visitors use "yield" or "yield from" to generate 
results from sub-trees. These yields must be done in token order, which is 
why a separate visitor is required for each kind of parse tree node.

All visitors end with *begin_visitor* and end with the *end_visitor* 
method. Among other things, these methods save, set and restore the 
*tog.node* ivar. *This ivar is the tree node corresponding to each visitor.*

Visitors call *tog.put* to create new result tokens. Crucially, the* 
tog.node* ivar tells tog.put which parse-tree node is responsible for the 
token. The *node ivar* of each result token remembers that node.

4. Assigns two-way links between input tokens and parse-tree nodes...

*Syncing and linking*

The *Linker *class is the heart of the entire enterprise. It must be bullet 
proof.

To review, the tree traversal generates results list, a list of Token 
objects. All result tokens contain a link (token.node) to the parse tree 
node that generated the token.

The linker runs in a post pass. The linker *syncs *both lists, looking *ahead 
*in either list as necessary. 

Syncing is easy because *significant *(non-optional) tokens appear in the 
same order in the two lists! The linker need only ignore optional tokens. 
When syncing two nodes, the linker transfers the token.node data from the 
result list to the corresponding node in the tokens list.

I am thrilled to report that the linker is a complete success:

- It runs in time proportional to the number of tokens.

- It is a gem. It is unexpectedly simple, elegant and flexible.

- It passes significant tests. It is now revealing bugs elsewhere.

*The tokens and results lists are both required*

The *tokens *list, *not *the results list, must be the primary "product" of 
the tog class. 

Indeed, the tokens list is the "ground truth", containing details that are 
not (at present) present in the parse trees. These details include comment 
tokens (missing entirely from the parse tree) and the spelling of 
whitespace tokens (more reliable in the tokens list).

Note: Despite being essential for the algorithm, the results list is 
unlikely to be otherwise useful. For most apps, filtering the tokens list 
should almost always be simpler and better. Or so I say now.


*Executive summary*

At last we can view what happens from the highest vantage point:

- The tokens list contains the ground truth. Its data are more reliable 
than data in the results list.

- The results list exists *only* to hold data to be transferred to the 
tokens list.

- Syncing the tokens and results lists is the *only* way make this transfer!

- Syncing is feasible and fast because significant (non-optional) tokens 
appear in exactly the same order in the two lists. The linker can look 
ahead in either list because the linker runs in a post pass, after all 
tokens have been generated.

That's all anyone needs to know.


*Testing*

Much testing remains. Many tree visitors are buggy. Happily, testing is 
easy...

The linker issues detailed error messages if the token and results lists 
can't be properly synced. 

Using difflib, a *diff reporter* shows the diffs between the token and 
results list. These diffs usually highlight which visitor is the culprit.

*Summary*

The notion of token-order tree traversals got this project started.

The TokenOrderGenerator class is nothing but straightforward 
infrastructure. Almost all of it is already elegant and simple. I'll 
improve the remaining ugly bits next.

Everything depends on the Linker class. I am thrilled to report that this 
class is dead simple.

Testing has now easy. Errors in visitors cause the Linker class to fail, as 
it should. The diff reporter then clearly shows why the token and results 
list can't be synced.

The remaining phases of the project look straightforward:

- Create unit tests covering all visitors.
- Tweak the code to handle weird special cases involving tokens and parse 
trees.
- Verify that the code works on all of Leo's source files.
- The acid test: simplify Leo's fstringify commands using this project's 
classes.
- Document everything and package the code for use outside of Leo.

Edward

-- 
You received this message because you are subscribed to the Google Groups 
"leo-editor" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/leo-editor/6c3f762f-1caa-45e2-aa5e-7099a063967e%40googlegroups.com.

Success! High-level overview and status report

Reply via email to