ENB: Leo's line-oriented importers: the last collapse in complexity

Edward K. Ream Fri, 04 Nov 2016 01:03:59 -0700

I am writing this in the middle of the night, too excited to sleep.

This is an Engineering Notebook post. Feel free to ignore.


Here, I'll write down my initial plan for a remarkable collapse in the 
complexity of the code.  I'll also record the genesis of this idea. I know 
from experience that big Aha's warp my memory, making it difficult to 
remember just how the Aha came about.

*Summary*

Aha 1. There is no need for a rescanning the incoming lines!  BLS.gen_lines 
can do the job in one pass, *without recursion*, using only simple helpers.

Aha 2. The StateScan class can be simplified greatly.  No need for push/pop 
methods.  No need to remember previous states.

*History*

Yesterday I futzed with the code without much success.  That code is now 
stashed, and will likely disappear forever.

I took two baths yesterday, noodling about two seemingly small details.  
The Aha was the product of considering those details.

*Detail 1*: It was clumsy to switch parser states.  How, exactly, does one 
indicate that we have just entered or left a block (class, method or 
function)?  Previously, the scanners used "base" counts.  Yesterday, I say 
that temporary "delta" counts simplified things a bit.  But scanning was 
still way too confusing.

*Detail 2*: As I rewrote parts of ScanState and PythonScanState, I became 
aware that there is a problem with the continues_state and begin_state 
predicates for the javascript language.  Indeed, the js scanner will begin 
a block when either the curly bracket count *or* the paren counts 
increase.  All well and good.  But what to do when only one of them 
decreases??? In the usual round-about way, this lead to the Aha's.

*The algorithm*

The rewritten gen_lines (lines, parent) method will generate the body text 
of the parent node and all created descendant nodes *line by line*.  
Something like this:

<< define Block class, containing various *permanent* data >>

def gen_lines(self, lines, parent):
    '''Parse all lines, adding to parent.b, creating child nodes as 
necessary.'''
    state = self.state # The subclass of ScanState for this importer
    prev_state = state.initial_state()
    stack = [Block(parent, prev_state)]
    target = parent
    for s in lines:
        new_state = state.scan_line(s)
        if new_state > prev_state:
            << create @others or section ref in target >>
            target = << new child of the present target node >>
            stack.append(Block(target, new_state))
        elif new_state < prev_state:
            << cut back stack until stack[-1] matches new_state >>
     # Common code
     target = stack[-1].target
     prev_stat = new_state
     target.b = target.b + s # Or an optimized version of this.
       
That's all!!! The following will discuss details.

*Handling tail lines*

Tail lines are lines following the lines of a class, method or function.  
When using section references, they can be written immediately to the 
target.  When using @others, they must be written to the defining node.

The code above doesn't handle tail lines properly when using @others.  A 
relatively minor hack will be required.

*Cutting back the stack*

The line: << cut back stack until stack[-1] matches new_state >> handles 
the complication mentioned in detail 2.

In particular, for javascript, there are *two* counts involved, for curly 
brackets and for parens.  This is why an explicit stack is required. There 
is *no way* to know what the next target node is going to be after a block 
ends!  It could be the previous block.  Or the stack may have to be cut 
back to the original parent.

*Comparing states*

The ScanState and all subclasses will define the standard 2.7/3.x rich 
comparison operators. __eq__, __lt__, etc. These will replace the 
starts/ends/continues_block methods. No more push/pop methods.  

__eq__ will return True if the scanner is in any multi-line construct. 
Otherwise, it returns true if all the counts match. And similarly for the 
other comparison.

What happens if one count increases and the other decreases?  I think the 
answer has to be that the count of curly brackets must have priority.  
Believe it or not, this was the question that lead to the Ahas.



*Injecting Block's into targets*The post_pass method may want data 
contained in the Blocks. For instance, indentation info. This can be done 
by injecting a reference to a block into the target vnode.  Something like 
this:

    target.v._importer_block = stack[-1]

*Whitespace*

As in the present code, the code above writes lines to targets, without 
changing the leading whitespace of the line.  The post_pass will eliminate 
as much common whitespace as possible, *but no more*.

There is a bug in the present code.  The undent method removes the *maximal* 
amount of leading whitespace from each node.  But this could remove too 
much, depending on the cumulative indentation in all parent nodes.

In the new scheme, this cumulative indentation will either exist in the 
injected block of a node, or can can be computed from the injected blocks 
of a nodes and all its parent nodes.

Indentation of real javascript code is a nightmare.  We have an unpleasant 
choice.  We can leave imported code unchanged, which may result in many 
underindented lines, or we can regularize leading whitespace, thereby 
changing the original file if the user rewrites it.

*Summary*

Nothing simpler than the new code can be imagined. A big help for 
maintainers, including me.

The code shown above doesn't handle tail lines properly when using 
@others.  A hack will be needed.

The ScanState classes no longer remembers "base" counts. They simply 
maintain one or more counts, and the context. This is a surprisingly 
important simplification.  It resolves Detail 1. A new initial_state() 
method will return a state with empty context and zero counts.

The new code explicitly handles edge cases discussed in Detail 2. Nothing 
more flexible is possible.

Enough for now.  Time for some sleep :-)

Edward

-- 
You received this message because you are subscribed to the Google Groups 
"leo-editor" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/leo-editor.
For more options, visit https://groups.google.com/d/optout.

ENB: Leo's line-oriented importers: the last collapse in complexity

Reply via email to