Engineering notebook: code generation for new importers

Edward K. Ream Sun, 30 Oct 2016 09:25:37 -0700

Here, I'll be thinking out loud about how to make code generation work when 
using the new-style importers.


This is an engineering notebook post.  Feel free to ignore.

importers/python.py contains both the old and new python importers. The 
new_scanner switch enables the new importer. The new python importer fails 
miserably when new_scanner is True. Everything works as before (including 
all unit tests) when new_scanner is False.

*Background*

In another thread 
<https://groups.google.com/forum/#!topic/leo-editor/RDi2jffWjzI> I wrote

> If we were to convert the Python importer to use the new scheme, the entire 
> ScanState class would have to be rewritten.  The reason should be 
> clear--Python uses indentation levels to indicate structure, not curly 
> brackets.

Rev 9755cf introduces the PythonScanState class.  It also moves the 
scan_block method out of the ScanState class and into the BaseLineScanner 
(BLS) class where it belongs.

The PythonScanState class is surprisingly simple. In particular, it handles 
backslash-newlines more simply than does the old-style importer. Handling 
them is tricky to get exactly right.

> Happily, rewriting the ScanState class is *all* that would be required.  
> The BLS class would remain completely unchanged, and the importer would be 
> just as simple as the perl and javascript importers.

This statement was wildly optimistic. It has gradually dawned on me that 
there are serious problems with the code generation in the BLS class.

*Code Generation*

Code generation for javascript is easier than for python because nodes may 
contain multiple section references.  For the python (and perl) importers, 
only one @others directive is allowed per node.  This has important 
implications. The entire algorithm for breaking the input file into nodes 
may have to be revised.

As a practical matter, I have found the block scanning and rescanning code 
to be almost impossible to understand.  This is surprising, but not 
distressing.  The algorithm was always going to be complex.

I have derided the old-style importers as way too complicated.  I may have 
to revise that assessment :-) 

The great advantage of the old-style code generators is that they handle 
indentation correctly in *all* situations.  In particular, they handle 
underindented python *comment* lines properly.  Such comments do *not* 
terminate defs or classes.  I am willing to add extra indentation for such 
lines (with a warning), but even doing that has repercussions throughout 
the code.
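
To make the "add extra indentation, with a warning" idea concrete, here is 
a minimal sketch. It is not code from Leo: the function name, its 
arguments and the warning text are all hypothetical, and it assumes that 
lines are indented with spaces only.

```python
def indent_underindented_comments(lines, body_indent):
    """Return (new_lines, warnings) for the body lines of one def or class.

    Hypothetical helper, not part of Leo. body_indent is the number of
    leading spaces expected on the body lines of the block.
    """
    result, warnings = [], []
    for i, line in enumerate(lines):
        stripped = line.lstrip(' ')
        lws = len(line) - len(stripped)
        if stripped.startswith('#') and lws < body_indent:
            # An underindented comment does not end the block.
            # Re-indent it to the body's indentation and record a warning.
            warnings.append('line %d: underindented comment' % (i + 1))
            line = ' ' * body_indent + stripped
        result.append(line)
    return result, warnings


body = [
    "    a = 1\n",
    "# note\n",
    "    return a\n",
]
fixed, warnings = indent_underindented_comments(body, 4)
```

Here the comment on the second line is moved to the body's indentation and 
one warning is recorded. The other lines are returned unchanged.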

I plan to study the old code generators today, to remind myself how they 
work. But before doing that, let's see what the code generators *must* do.  
In fact, the answer is relatively straightforward.  Each generated node, 
including the top-level node, will look like this:

    One or more *leading lines*
    @others, indented as discussed below
    zero or more *trailing lines*

The top-level node will be

    @language python
    @others

Nodes that have *no* children will consist only of the *properly indented* 
body of the class or def.  This indentation depends on the *cumulative* 
indentation of all @others nodes in the node's parents.
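
As a rough illustration of the cumulative indentation, the sketch below 
removes the sum of the ancestors' @others indentation from the lines of a 
leaf node. The function and argument names are hypothetical, not Leo's, 
and the sketch assumes space-only indentation.

```python
def leaf_body(lines, others_indents):
    """Remove the cumulative @others indentation from a leaf node's lines.

    Hypothetical helper, not part of Leo. others_indents holds the
    indentation (in spaces) of the @others directive in each ancestor,
    outermost first.
    """
    cumulative = sum(others_indents)
    result = []
    for line in lines:
        if not line.strip():
            # Blank lines carry no meaningful indentation.
            result.append('\n' if line.endswith('\n') else '')
            continue
        lws = len(line) - len(line.lstrip(' '))
        # Never remove more whitespace than the line actually has.
        result.append(line[min(lws, cumulative):])
    return result


# A method of a class: @others is at column 0 in the top-level node
# and at column 4 in the class node.
method = [
    "    def f(self):\n",
    "        return 1\n",
]
body = leaf_body(method, [0, 4])
```

With these inputs the method's lines lose four leading spaces, so the def 
line starts at column 0 in the leaf node.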

Nodes that *do* have children are the hard case.  To repeat, they will look 
like:

    One or more leading lines
    @others, *properly* indented
    zero or more trailing lines

There are three problems that must be solved completely:

1. Determining the leading lines.
2. Determining the indentation of the @others directive.
3. Determining the trailing lines.

None of these tasks is trivial.  Furthermore, the post pass may move lines 
around from the end of one block to the start of the next. Alas, this could 
affect the proper indentation of the @others directive!
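
The three tasks can be sketched as follows, under strong simplifying 
assumptions: the children are already known as index ranges, every line 
between the first and last child belongs to some child, and indentation 
uses spaces only. The names are hypothetical. This is not the BLS 
algorithm, and it ignores the post pass entirely.

```python
def parent_body(lines, child_spans):
    """Build the body of a node that has children.

    Hypothetical helper, not part of Leo.
    lines: all the lines of the block, for example a whole class.
    child_spans: sorted, non-overlapping (start, end) index pairs,
    end exclusive, one pair per child.
    """
    first = child_spans[0][0]
    last = child_spans[-1][1]
    # 1. The leading lines: everything before the first child.
    leading = lines[:first]
    # 2. The @others directive, indented like the first child's first line.
    first_line = lines[first]
    indent = len(first_line) - len(first_line.lstrip(' '))
    others = ' ' * indent + '@others\n'
    # 3. The trailing lines: everything after the last child.
    trailing = lines[last:]
    return leading + [others] + trailing


source = [
    "class A:\n",
    "    x = 1\n",
    "    def f(self):\n",
    "        pass\n",
    "    def g(self):\n",
    "        pass\n",
    "# end of A\n",
]
body = parent_body(source, [(2, 4), (4, 6)])
```

In this example the class line and the assignment become the leading 
lines, @others is indented by four spaces, and the final comment is the 
only trailing line. If a post pass later moved lines between blocks, the 
indentation computed in step 2 would have to be recomputed.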

*The way forward*

Clearly, the new-style code generators can do as well as the old code 
generators. In fact, the task of the new-style generators is *easier* than 
for the old-style code generators because the new code generators work on 
whole lines.

In the worst case, the new importers can simply mirror the old code 
generators. Having said that, doing code generation the "old" way may 
require a complete rewrite of the code that allocates lines to nodes. 
Happily, adapting the old code generators to a line-oriented scheme must 
surely simplify them. 

*Summary*

Code generation is much more challenging than I first imagined.

The ScanState class is *not* the problem.  It is a brilliant invention, if 
I do say so myself. It completely eliminates the need to parse the imported 
language. It will remain a foundation of the BLS class.

Much of the BLS class may have to be rewritten, including BLS.scan and many 
of its helpers.

The new code generators may be based on the old. No changes *whatever* will 
be tolerated in the old code generators.  Instead, I'll copy any needed 
code from the BaseScanner class to the BLS class.

Rewriting the old generators to work with the line-by-line scanner will 
simplify them. I relish such tasks.

The BLS class is a fundamentally important part of Leo. It should be used 
for *all* of Leo's importers.  It is worth *any* amount of work to make the 
new importers as beautiful and accurate as possible.

Edward
