ENB: Two Ahas re unifying the token and parse-tree worlds

Edward K. Ream Fri, 15 Nov 2019 03:31:01 -0800

This Engineering Notebook post records what may be the last Ahas required 
to complete #1440 <https://github.com/leo-editor/leo-editor/issues/1440>.  
This will be pre-writing for a how-to guide.


*Aha 1: Use difflib*

Yesterday I saw that using difflib could make testing easier. This was the 
first big Aha.

I played around with diffing the results of the tree traversal against the 
incoming tokens. This immediately revealed some problems.  More 
importantly, it showed that I had misunderstood what the "eat" method must 
do:

- Eat *must* use comment tokens from the token list.  Comments do not exist 
in any easy-to-find form in parse trees!
- Eat probably should take the "spellings" of whitespace from the token 
list.  Those spellings are unreliable/different in parse trees.
- Eat might *optionally* use conditional results from the parse tree.
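
The point about comments is easy to check with the standard library alone. What follows is only an illustrative sketch; the sample source line and the variable names are made up for the demonstration. The tokenize module reports the comment as a COMMENT token, while a dump of the ast parse tree contains no trace of it.

```python
import ast
import io
import tokenize

# A made-up one-line program with a trailing comment.
source = "a = 1  # a comment\n"

# The token list keeps the comment, with its exact spelling.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
comments = [t.string for t in tokens if t.type == tokenize.COMMENT]

# The parse tree has no node for the comment at all.
tree_dump = ast.dump(ast.parse(source))

print(comments)
print('# a comment' in tree_dump)
```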

It all seemed complicated, so I took a break.

*Aha 2: Replace eat with a post pass*

When I awoke this morning I saw how to eliminate tot.eat using difflib. 
This is likely the last Aha needed to complete this project.

A new *post_pass* method will use difflib to check the results and perform 
any other "late" adjustments.  Something like this:

def post_pass(self):
    """
    Use difflib to test self.results, adjusting the parse tree and creating
    output tokens as required.
    
    Subclasses should override this method.
    """
    tokens = [(z.kind, z.value) for z in self.tokens]
    for z in difflib.ndiff(tokens, self.results):
        print(z)

I'll override this method in the TokenOrder*Injector* (TOI) class, the 
class I use for testing. TOI injects parent/child links into each node of 
the parse tree. Note that children appear *in token-traversal order*, 
something that no code based on ast.walk can possibly do.

The TOI class will be the base class for a roster of *example classes*.  
Each example will tailor the TOI class for a particular real-world 
application.

*Important*: As its name implies, the post pass happens "late", after 
everything has been generated. Unlike the ill-fated "get" method, the 
post pass can look *ahead* in *both* the token and results arrays. This is a 
big deal.

You can think of the post pass as a simple peephole optimizer, made even 
simpler by the ability to look ahead as well as behind.
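
Here is a minimal, standalone sketch of the comparison such a post pass might make. It is not the real class: diff_token_lists is a hypothetical helper, and both lists are assumed to hold (kind, value) pairs. Because difflib.ndiff is documented as comparing lists of strings, the sketch formats each pair as a string first.

```python
import difflib

def diff_token_lists(tokens, results):
    """
    Hypothetical helper: return only the ndiff lines that report a
    difference between the incoming tokens and the generated results.
    """
    # ndiff compares lists of strings, so format each (kind, value) pair.
    a = [f"{kind}:{value!r}" for kind, value in tokens]
    b = [f"{kind}:{value!r}" for kind, value in results]
    # Keep deletions ('- ') and insertions ('+ '); drop unchanged lines
    # and the '? ' hint lines.
    return [z for z in difflib.ndiff(a, b) if z.startswith(('- ', '+ '))]

# Made-up data: the traversal produced everything except the comment.
tokens = [('name', 'a'), ('op', '='), ('number', '1'), ('comment', '# hi')]
results = [('name', 'a'), ('op', '='), ('number', '1')]
for z in diff_token_lists(tokens, results):
    print(z)
```

Since the whole diff is available at once, a subclass could inspect any neighboring entry, before or after, when deciding what to adjust.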


*The put method will remain*

The put method no longer calls eat.  Instead, it simply appends values to 
self.results:

def put(self, kind, val):
    """Handle a token whose kind & value are given."""
    val2 = val if isinstance(val, str) else str(val)
    self.results.append((kind, val2))

The computation of val2 ensures that self.results will match self.tokens as 
much as possible.

We could even eliminate the put method entirely.  Tree visitors would call 
`yield (x,y)` instead of `yield self.put(x,y)`. But this would be a *big 
mistake*.  Subclasses should be free to override the put method!
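
To illustrate why an overridable put is worth keeping, here is a small sketch. Both classes are hypothetical stand-ins: BaseTraverser only mimics the put method shown above, and CountingTraverser is an invented subclass that tallies token kinds before deferring to the base class.

```python
from collections import Counter

class BaseTraverser:
    """Hypothetical stand-in for the real traverser class."""

    def __init__(self):
        self.results = []

    def put(self, kind, val):
        """Handle a token whose kind & value are given."""
        val2 = val if isinstance(val, str) else str(val)
        self.results.append((kind, val2))

class CountingTraverser(BaseTraverser):
    """Invented subclass: count each kind, then do the usual work."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def put(self, kind, val):
        self.counts[kind] += 1
        super().put(kind, val)

t = CountingTraverser()
t.put('name', 'a')
t.put('op', '=')
t.put('number', 1)  # A non-string value is converted to '1'.
print(t.results)
print(t.counts['name'])
```

If visitors yielded bare tuples, a subclass would have no single place to hook in this kind of behavior.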

*About conditional results*

The tree node visitors make a "generalized best guess" about calling 
self.put.  Some examples:

- The visitors call/yield put_blank() as needed to "ensure" whitespace 
appears around 'name' tokens.

I put "ensure" in quotes, because subclasses may eliminate whitespace later.

- The do_Tuple visitor calls put_conditional_comma() to put the optional 
comma after tuples with more than one element.

The post pass makes it easy to handle such details:

- put_blank could append ('blank', ' ') to the results list instead of 
('ws', ' '). The pseudo "blank" kind is a flag for the post pass.

Similarly, put_conditional_comma could append ('conditional-op', ', ') to 
the results list. Again, the 'conditional-op' kind is a flag for the post 
pass.
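
As a rough sketch of how a post pass might act on such flags, the function below resolves 'blank' entries by looking both behind and ahead. The name resolve_blanks and the policy it applies (drop a blank next to real whitespace, otherwise promote it to 'ws') are assumptions made for illustration, not a description of the actual code.

```python
def resolve_blanks(results):
    """
    Hypothetical peephole pass over a list of (kind, value) pairs.
    Drop a 'blank' when real whitespace is adjacent; otherwise
    convert it to an ordinary 'ws' entry.
    """
    out = []
    for i, (kind, value) in enumerate(results):
        if kind == 'blank':
            # Look behind in the output and ahead in the input.
            prev_kind = out[-1][0] if out else None
            next_kind = results[i + 1][0] if i + 1 < len(results) else None
            if prev_kind == 'ws' or next_kind == 'ws':
                continue  # Whitespace already separates the tokens.
            kind = 'ws'  # Promote the flag to real whitespace.
        out.append((kind, value))
    return out

# Made-up results list: the second blank duplicates a real 'ws'.
sample = [
    ('name', 'a'), ('blank', ' '), ('op', '='),
    ('blank', ' '), ('ws', ' '), ('number', '1'),
]
print(resolve_blanks(sample))
```

A 'conditional-op' entry could be handled the same way, with whatever policy a subclass prefers.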

*Important*: the calls to put_blank() may be ignored later.  The 'blank' op 
is only a *potentially* useful optional feature.  Subclasses can define 
do-nothing versions of put_blank if they like.  Furthermore, subclasses 
may define a post pass that ignores all whitespace:

def post_pass(self):
    tokens = [(z.kind, z.value) for z in self.tokens
        if z.kind != 'ws']
    results = [(kind, val) for (kind, val) in self.results
        if kind not in ('ws', 'blank')]
    for z in difflib.ndiff(tokens, results):
        print(z)

*Non issues*

Generators are required *only* to ensure that python's run-time stack 
doesn't overflow.  There is no harm whatever in having the results array be 
a true array.

The code will be extremely fast, but that's a tertiary issue. GC issues are 
likewise of no great concern.  Being able to use difflib is crucial.

*Summary*

The way forward is now completely clear.  No difficult parts remain.

Using difflib has already accelerated development, and will continue to do 
so. difflib has highlighted details that would otherwise have been 
difficult to spot.

A *post pass*, based on difflib, will replace the infamous "eat" method. The 
post pass is, in effect, a simple peephole optimizer that can look both 
behind *and* ahead. Most importantly, the post pass can easily be seen to 
be correct.

The post pass will allow subclasses to:

- Verify that the parse tree is in reasonable accord with the list of 
incoming tokens.
- Make arbitrary adjustments (specialization) to the "generalized" results 
in self.results.
- Make any needed adjustments to the parse tree. (There are two-way links 
between tokens and tree nodes.)
- Create a list of output tokens, if desired.

I'll create several *example classes* showing how to subclass 
TokenOrderInjector for real-world applications.  These examples will contain 
nothing but simple overrides of base class methods such as post_pass and 
put.  Example classes will form the bulk of the "marketing" for this 
project.
  
Edward
