The most risky phase of this project is complete. Significant tests pass. The crucial Linker class is unexpectedly simple and elegant. It is a linear time diff of two lists of tokens.
The rest of this post is a high-level overview of the code and algorithms. It will be pre-writing for a Theory of Operation chapter, to become part of LeoDocs.leo. *High-level overview* The *contents *are a string, a snippet of python code. Given a contents string, *TokenOrderGenerator *(tog) class does the following: 1. Tokenize: Use python's tokenize module to create the *tokens list*, a list of *Token * objects. 2. Create the parse tree: Use python's ast module to create the *parse tree*, a tree of ast.AST nodes. 3. Generate the results list: The tog traverses the parse tree, generating the *results list*, again a list of Token objects. Crucially, this traversal happens in *token order*. This is the *one and only *order that guarantees the token and results lists will match *exactly*, except for *optional tokens*. Optional tokens may appear in either the tokens list or results list. To generate result tokens in the proper order, the tog class *must *have a separate *visitor* for every kind of parse tree node. All visitors are generators, ensuring that the traversal can't overflow python's run-time stack. Visitors use "yield" or "yield from" to generate results from sub-trees. These yields must be done in token order, which is why a separate visitor is required for each kind of parse tree node. All visitors end with *begin_visitor* and end with the *end_visitor* method. Among other things, these methods save, set and restore the *tog.node* ivar. *This ivar is the tree node corresponding to each visitor.* Visitors call *tog.put* to create new result tokens. Crucially, the* tog.node* ivar tells tog.put which parse-tree node is responsible for the token. The *node ivar* of each result token remembers that node. 4. Assigns two-way links between input tokens and parse-tree nodes... *Syncing and linking* The *Linker *class is the heart of the entire enterprise. It must be bullet proof. To review, the tree traversal generates results list, a list of Token objects. All result tokens contain a link (token.node) to the parse tree node that generated the token. The linker runs in a post pass. The linker *syncs *both lists, looking *ahead *in either list as necessary. Syncing is easy because *significant *(non-optional) tokens appear in the same order in the two lists! The linker need only ignore optional tokens. When syncing two nodes, the linker transfers the token.node data from the result list to the corresponding node in the tokens list. I am thrilled to report that the linker is a complete success: - It runs in time proportional to the number of tokens. - It is a gem. It is unexpectedly simple, elegant and flexible. - It passes significant tests. It is now revealing bugs elsewhere. *The tokens and results lists are both required* The *tokens *list, *not *the results list, must be the primary "product" of the tog class. Indeed, the tokens list is the "ground truth", containing details that are not (at present) present in the parse trees. These details include comment tokens (missing entirely from the parse tree) and the spelling of whitespace tokens (more reliable in the tokens list). Note: Despite being essential for the algorithm, the results list is unlikely to be otherwise useful. For most apps, filtering the tokens list should almost always be simpler and better. Or so I say now. *Executive summary* At last we can view what happens from the highest vantage point: - The tokens list contains the ground truth. Its data are more reliable than data in the results list. - The results list exists *only* to hold data to be transferred to the tokens list. - Syncing the tokens and results lists is the *only* way make this transfer! - Syncing is feasible and fast because significant (non-optional) tokens appear in exactly the same order in the two lists. The linker can look ahead in either list because the linker runs in a post pass, after all tokens have been generated. That's all anyone needs to know. *Testing* Much testing remains. Many tree visitors are buggy. Happily, testing is easy... The linker issues detailed error messages if the token and results lists can't be properly synced. Using difflib, a *diff reporter* shows the diffs between the token and results list. These diffs usually highlight which visitor is the culprit. *Summary* The notion of token-order tree traversals got this project started. The TokenOrderGenerator class is nothing but straightforward infrastructure. Almost all of it is already elegant and simple. I'll improve the remaining ugly bits next. Everything depends on the Linker class. I am thrilled to report that this class is dead simple. Testing has now easy. Errors in visitors cause the Linker class to fail, as it should. The diff reporter then clearly shows why the token and results list can't be synced. The remaining phases of the project look straightforward: - Create unit tests covering all visitors. - Tweak the code to handle weird special cases involving tokens and parse trees. - Verify that the code works on all of Leo's source files. - The acid test: simplify Leo's fstringify commands using this project's classes. - Document everything and package the code for use outside of Leo. Edward -- You received this message because you are subscribed to the Google Groups "leo-editor" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/leo-editor/6c3f762f-1caa-45e2-aa5e-7099a063967e%40googlegroups.com.
