ENB: Using parse tree data in the Orange class

Edward K. Ream Thu, 09 Jan 2020 07:42:54 -0800

This Engineering Notebook post discusses how the Orange class might use the 
two-way links that tog.init_from_file creates between tokens (the list of 
input tokens) and tree (the tree of parse nodes).


This post is a bit short on explanation. It is primarily for my own use. 
Feel free to ignore.

*Background*

The Orange class implements Leo's new beautifier. (Orange is the new 
black.) The Orange class is based on the now-retired PythonTokenBeautifier 
class. At present, the code uses *no* tree-related data.

The Orange class is a stand-alone class. It does, however, *use* the TOG 
class as follows:

tog = TokenOrderGenerator()
contents, encoding, tokens, tree = tog.init_from_file(filename)

At present, the code uses only the encoding and tokens values.  

*The legacy code almost suffices*

Just as with the Fstringify class, the present code could be used as it is. 
The present code already does a good-to-excellent job of regularizing 
whitespace.  

The only significant improvement would be to use the parse tree to analyze 
the context of tokens. At present, the code uses *token-based state vars*.

This looks like a dubious scheme. In fact, it is surprisingly sound. In 
particular, "name" tokens for keywords are guaranteed to *be* keywords. 
Ditto for op tokens representing parens and curly and square brackets. 
Tokens "hide" the contents of strings and comments, so there is no 
possibility of confusion.
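
The claim above can be checked with a minimal sketch. It uses Python's standard tokenize module rather than Leo's own token classes, and the classify helper is hypothetical. The string and the comment each arrive as a single token, so the keywords and brackets inside them are never seen as separate tokens:

```python
import io
import keyword
import tokenize

BRACKETS = {"(", ")", "[", "]", "{", "}"}

def classify(source):
    """Return (kind, text) pairs, using only the type and text of each token."""
    results = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and keyword.iskeyword(tok.string):
            results.append(("keyword", tok.string))
        elif tok.type == tokenize.OP and tok.string in BRACKETS:
            results.append(("bracket", tok.string))
        elif tok.type == tokenize.STRING:
            results.append(("string", tok.string))
        elif tok.type == tokenize.COMMENT:
            results.append(("comment", tok.string))
    return results

# The string and comment both contain text that looks like keywords and brackets.
source = 's = "if (x): [y]"  # while {z}\nif s:\n    pass\n'
kinds = classify(source)
```

Only the real `if` and `pass` are reported as keywords, and no bracket tokens are reported at all.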

*The big questions*

1. To what extent would using the parse tree simplify state analysis?

Four "input token handlers" contain "if" statements that depend on 
lexical/parse state. About 10 "output token handlers" contain similar 
tests. I'll investigate what the code would look like if the token-based 
state vars were replaced by an analysis of the parse tree corresponding to 
recent tokens.

2. Will token pointers be useful when analyzing the list of output tokens?

Unlike the TOG class, the Orange class uses *two* lists of tokens. 
TOG.init_from_file creates the *input token list*. The input node handlers 
then create a separate *output token list*.

Having two token lists is convenient, because the output token handlers may 
delete or change output tokens after they are first created. In essence, 
output token handlers form a very fast peephole optimizer. This peephole 
only looks backward, never forward.
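
A minimal sketch of such a backward-looking peephole, under stated assumptions: the OutputToken and OutputList classes and the method names below are hypothetical, not the Orange class's actual API. Each handler inspects or edits only the tail of the output list:

```python
from dataclasses import dataclass

@dataclass
class OutputToken:
    kind: str   # Hypothetical kinds: "name", "op", "blank", "line-end".
    value: str

class OutputList:
    """A hypothetical output token list whose handlers look only backward."""

    def __init__(self):
        self.tokens = []

    def add(self, kind, value):
        self.tokens.append(OutputToken(kind, value))

    def blank(self):
        """Add a blank unless the last output token is already whitespace."""
        if self.tokens and self.tokens[-1].kind in ("blank", "line-end"):
            return
        self.add("blank", " ")

    def clean_blank(self):
        """Delete a trailing blank that was emitted earlier."""
        if self.tokens and self.tokens[-1].kind == "blank":
            self.tokens.pop()

    def rparen(self):
        """Emit ')' after removing any blank just before it."""
        self.clean_blank()
        self.add("op", ")")

out = OutputList()
out.add("name", "f")
out.add("op", "(")
out.add("name", "x")
out.blank()
out.blank()   # Suppressed: the previous token is already a blank.
out.rparen()  # Removes the blank, then adds ')'.
text = "".join(t.value for t in out.tokens)
```

Here text is "f(x)": the second blank is never added, and the first is deleted after the fact.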

Alas, pointers exist only between the tree and the *input* token list, 
so some new invention is needed.

1. Orange.add_token creates output tokens. It could copy the token.node 
field from input tokens to the output tokens.

2. Links from the tree to tokens might not be needed. If they are needed, 
it will probably be easy enough to get the required data either from the 
input token list (as at present) or in some other fairly straightforward 
way.
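
The first idea can be sketched as follows. The Token class and this add_token function are hypothetical stand-ins, not Leo's actual classes or signatures. Output tokens that have no corresponding input token, such as inserted blanks, simply get no node:

```python
import ast
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    kind: str
    value: str
    node: Optional[ast.AST] = None  # Link from the token to its parse node.

def add_token(output, kind, value, source_token=None):
    """Append an output token, copying the node link from source_token, if any."""
    node = source_token.node if source_token is not None else None
    tok = Token(kind, value, node=node)
    output.append(tok)
    return tok

tree = ast.parse("a = 1")
target = tree.body[0].targets[0]  # The ast.Name node for 'a'.
input_token = Token("name", "a", node=target)

out = []
add_token(out, "name", "a", source_token=input_token)
add_token(out, "blank", " ")  # Synthesized token: no input token, no node.
```

Note that this gives links only from output tokens to the tree, not from the tree back to output tokens.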

*Splitting and joining lines*

The only remaining task is to split and join lines, as black does. I plan 
to do this in a separate post pass on the output token list. This will 
simplify the code, provided that all needed data are available.

Black uses a horribly complex scheme to determine the length of lines. 
Instead, it will be much easier to call the global function 
tokens_to_string for the tokens comprising one output line. This will be 
straightforward.
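
A minimal sketch of the post pass's length check, under stated assumptions: the Token class, the "line-end" token kind, and this version of tokens_to_string are simplified stand-ins for the real ones, and output_lines and long_lines are hypothetical helpers:

```python
from dataclasses import dataclass

@dataclass
class Token:
    kind: str
    value: str

def tokens_to_string(tokens):
    """Stand-in for the global function: join the text of the tokens."""
    return "".join(t.value for t in tokens)

def output_lines(tokens):
    """Yield the list of tokens making up each output line."""
    line = []
    for tok in tokens:
        line.append(tok)
        if tok.kind == "line-end":
            yield line
            line = []
    if line:
        yield line

def long_lines(tokens, max_len=88):
    """Return (line_index, length) for every output line longer than max_len."""
    result = []
    for i, line in enumerate(output_lines(tokens)):
        text = tokens_to_string(line).rstrip("\n")
        if len(text) > max_len:
            result.append((i, len(text)))
    return result

tokens = [
    Token("name", "a"), Token("blank", " "), Token("op", "="),
    Token("blank", " "), Token("number", "1"), Token("line-end", "\n"),
    Token("name", "b" * 12), Token("line-end", "\n"),
]
too_long = long_lines(tokens, max_len=10)
```

With max_len=10, only the second line (12 characters) is reported.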

The present code contains a prototype of splitting and joining tokens. It 
is probably necessary to split tokens based on data from the parse tree. 
Indeed, the old code will fail if the to-be-split lines do not lie 
between parens.  A parse-tree-based version could look up the tree, looking 
for top-level statements. Parens could then be inserted based on the type 
of statement.  For example:

a = << very long RHS >>

could be split into:

a = ( << lines split by meaning >> )

A similar analysis could be used for other kinds of statements.
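
A minimal sketch of the assignment case, under stated assumptions: it uses Python's standard ast and tokenize modules rather than the TOG links, the split_long_assignment function is hypothetical, and "split by meaning" is reduced to breaking before top-level and/or operators. The parse tree identifies the statement type. The tokens supply the split points. Inside the added parens, every line break is harmless:

```python
import ast
import io
import tokenize

def split_long_assignment(line, max_len=40, indent="    "):
    """Wrap the RHS of a long, top-level, one-line assignment in parens."""
    if len(line) <= max_len or "\n" in line:
        return line
    try:
        stmt = ast.parse(line).body[0]
    except (SyntaxError, IndexError):
        return line
    # The sketch handles only simple assignments with a single target.
    if not isinstance(stmt, ast.Assign) or len(stmt.targets) != 1:
        return line
    depth, rhs_start, breaks = 0, None, []
    for tok in tokenize.generate_tokens(io.StringIO(line).readline):
        is_op = tok.type == tokenize.OP
        if is_op and tok.string in ("(", "[", "{"):
            depth += 1
        elif is_op and tok.string in (")", "]", "}"):
            depth -= 1
        elif depth == 0 and rhs_start is None and is_op and tok.string == "=":
            rhs_start = tok.end[1]  # The RHS starts after the first '='.
        elif (depth == 0 and rhs_start is not None
              and tok.type == tokenize.NAME and tok.string in ("and", "or")):
            breaks.append(tok.start[1])  # Break before each top-level and/or.
    if rhs_start is None:
        return line
    lhs = line[:rhs_start].rstrip()
    cuts = [rhs_start] + breaks + [len(line)]
    pieces = [line[i:j].strip() for i, j in zip(cuts, cuts[1:])]
    body = "".join(f"{indent}{piece}\n" for piece in pieces)
    return f"{lhs} (\n{body})"

line = "result = first_condition and second_condition or third_condition"
new = split_long_assignment(line)
```

Other statement types, and lines that are not top-level statements, are returned unchanged in this sketch.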

*Summary*

Using two-way links in the Orange class presents new challenges because the 
Orange class creates *two* token lists.

Splitting lines properly requires a parse-tree-based analysis of the 
to-be-split lines. Joining lines is easier, but it probably also requires a 
proper analysis of the parse tree.

Completing the Orange class will provide the last necessary "road test" of 
the TOG class.

Edward
