Breakthrough importing javascript

Edward K. Ream Wed, 05 Oct 2016 04:21:35 -0700

The new code hasn't been upped yet. There are a few bugs remaining, but 
already the new base is a more robust and flexible way of handling the 
myriad complexities of javascript.


It's obvious to me that none of this would have happened if Tom had not 
suggested that the importers be re-imagined 
<https://groups.google.com/d/msg/leo-editor/Ct3ZKqTo_KE/78R-I4-UAwAJ>. 
The specifics weren't quite on target, but they must have primed my 
subconscious.

There are two main parts of the new code: simpler scanning and regex 
pattern matching. The new code leaves everything else unchanged (or 
unused), except the all-important scan method.



*Scanning*

The new code is based on scanning text, not parsing it.  You 
could call it the most important breakthrough.  It came to me in the shower.

Imo, there would be no way to handle the myriad possible javascript 
patterns in a parser, even if one had a complete parse tree. Instead, the 
new scanner breaks the code into *blocks* based on counts of parens and 
curly brackets.  A block starts with a line that *ends* with an 
unbalanced parenthesis or curly bracket and continues up to and including 
the first line at whose end both parens and curly brackets are balanced 
again.
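
To make the block rule concrete, here is a minimal sketch. It is not the importer's actual code: the name split_blocks is made up for illustration, a balanced line on its own becomes a one-line block (an assumption), and brackets inside strings and comments are counted like any others.

```python
def split_blocks(lines):
    """Group lines into blocks using running counts of parens and curlies.

    Simplification: brackets inside strings and comments are counted too.
    """
    blocks, current = [], []
    parens = curlies = 0
    for line in lines:
        parens += line.count('(') - line.count(')')
        curlies += line.count('{') - line.count('}')
        current.append(line)
        if parens == 0 and curlies == 0:
            # Both counts are balanced at the end of this line,
            # so the current block is complete.
            blocks.append(current)
            current = []
    if current:
        # Unbalanced tail: keep it as a final block rather than drop it.
        blocks.append(current)
    return blocks
```

Because the test is made only at the end of each line, nested parens and brackets inside a function body never end the block early.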

At a single stroke, all parsing difficulties disappear.  This is *exactly* 
the kind of line-oriented approach that the javascript importer must have.



*The ScanState class*

All my previous scanners have used collections of 
variables/ivars to keep track of scan state.  But this is the hard way.  
Instead, a new ScanState class handles all the details.  The main methods:

- scan_line: scans a line, updating the internal state, including whether 
the line is in a string or block comment.
- at_top_level: returns True if parens/brackets are matched and not in a 
string/comment.

These helper methods greatly simplify the process of breaking lines into 
blocks.
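
A rough sketch of what such a class might look like follows. Only the names ScanState, scan_line and at_top_level come from the description above. The ivars and the character-by-character loop are assumptions, and the sketch ignores regex literals and template strings.

```python
class ScanState:
    """Tracks bracket counts and string/comment context across lines."""

    def __init__(self):
        self.parens = 0
        self.curlies = 0
        self.in_comment = False  # True inside a /* ... */ block comment.
        self.quote = ''          # The open quote character, or '' if none.

    def at_top_level(self):
        """True if brackets are balanced and we are not in a string/comment."""
        return (self.parens == 0 and self.curlies == 0
                and not self.in_comment and not self.quote)

    def scan_line(self, line):
        """Scan one line, updating the counts and the string/comment state."""
        i = 0
        while i < len(line):
            ch, pair = line[i], line[i:i + 2]
            if self.in_comment:
                if pair == '*/':
                    self.in_comment = False
                    i += 1  # Skip the second character of the delimiter.
            elif self.quote:
                if ch == '\\':
                    i += 1  # Skip the escaped character.
                elif ch == self.quote:
                    self.quote = ''
            elif pair == '/*':
                self.in_comment = True
                i += 1
            elif pair == '//':
                break  # The rest of the line is a comment.
            elif ch in ('"', "'"):
                self.quote = ch
            elif ch == '(':
                self.parens += 1
            elif ch == ')':
                self.parens -= 1
            elif ch == '{':
                self.curlies += 1
            elif ch == '}':
                self.curlies -= 1
            i += 1
```

With a class like this, the block loop reduces to calling scan_line on each line and closing the block whenever at_top_level returns True.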

*Regex pattern matching*

Each block naturally becomes the body text of a new outline node.  But what 
should the headline be?

Just as in the coffeescript importer, the new javascript importer scans 
the *start* of the block's text, trying to match a regex pattern from a 
table of such patterns.  The first pattern that matches determines the 
headline in a straightforward way.

Again, this is *exactly* what is needed.  It is simple and extensible, and 
completely replaces parsing or other language-specific information.  Here 
is the heart of the code:

    import re

    # Excerpt from the body of the method that names a block.
    # block is the list of the block's lines; n is the running count used
    # to number blocks that match no pattern.
    proto1 = re.compile(
        r'(\s*)Object\.create(\s*)=(\s*)function(.*)\n' +
        r'(\s*)var(\s+)(\w+)(\s*)=(\s*)function',
        re.MULTILINE)
    # Each entry: (index of the group holding the name, or 0 for no name,
    # headline prefix, pattern).
    table = (
        (7, 'proto', proto1),
        (0, 'proto', r'(\s*)Object\.create(\s*)=(\s*)function(\s*)\('),
        (0, 'proto', r'Function\.prototype\.method(\s*)=(\s*)function'),
        (3, 'func',  r'(\s*)function(\s+)(\w+)'),
            # function x
        (3, 'func',  r'(\s*)var(\s+)(\w[\w\.]*)(\s*)=(\s*)function\('),
        (3, 'var',   r'(\s*)var(\s+)(\w[\w\.]*)(\s*)=(\s*)new(\s+)(\w+)'),
        (3, 'var',   r'(\s*)var(\s+)(\w[\w\.]*)(\s*)=(\s*){'),
        (2, 'func',  r'(\s*)(\w[\w\.]*)(\s*)=(\s*)function(\s*)\('),
        (6, 'class', r'(\s*)define(\s*)\((\s+)function(\s*)\((\s*)(\w+)'),
        (0, 'class', r'(\s*)define(\s*)\((.*),(\s*)function\('),
    )
    s = ''.join(block)
    for i, prefix, pattern in table:
        m = re.match(pattern, s)
        if m:
            name = prefix + ' ' + (m.group(i) if i else '')
            return n, name.strip()
    return n+1, 'block %s' % (n)

The great thing about this is that I can surf the web looking for 
javascript patterns.  When I find a new one, I can add it to the table.
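
As an illustration of how cheap a new pattern is, here is a self-contained sketch with a shortened table. The helper name headline and the ES6 class row are hypothetical additions for this example; they are not in the importer.

```python
import re

# A shortened table in the same (group index, prefix, pattern) shape.
TABLE = [
    (3, 'func', r'(\s*)function(\s+)(\w+)'),
    (3, 'func', r'(\s*)var(\s+)(\w[\w\.]*)(\s*)=(\s*)function\('),
]

def headline(block, table=TABLE):
    """Return a headline for a block, or None if no pattern matches."""
    s = ''.join(block)
    for i, prefix, pattern in table:
        m = re.match(pattern, s)
        if m:
            # Group i holds the name; i == 0 means "prefix only".
            return (prefix + ' ' + (m.group(i) if i else '')).strip()
    return None

# Hypothetical new row for ES6 classes, e.g. "class Foo extends Bar {".
es6_class = (3, 'class', r'(\s*)class(\s+)(\w+)')
extended_table = TABLE + [es6_class]
```

Calling headline on a block that starts with "class Foo extends Bar {" returns None with TABLE and 'class Foo' with extended_table. The first matching row wins, so more specific patterns belong earlier in the table.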

*Rescanning*

The scan method creates child nodes.  The rescan method rescans the body 
text of the children, looking for new blocks that can be turned into 
grandchild nodes, great-grandchild nodes, etc.

This code is not quite ready.  In fact, there are subtle issues about when 
to rescan.  The present code sets a limit, say 50 lines.  There is not much 
point in rescanning a node with fewer lines.

Rescanning also sometimes splits if/else statements into blocks.  We might 
want to avoid this when the *created* blocks would be shorter than some 
second threshold, which may or may not also be 50 lines.  At present, 
this test is not done, and maybe it won't ever be.
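
The recursion and the two thresholds can be sketched as follows. Everything here is hypothetical: Node stands in for an outline node, split_blocks is a naive bracket counter, and min_child_lines illustrates the second threshold, which the present code does not implement. The sketch also leaves the parent's lines untouched instead of moving them into the children.

```python
MIN_LINES = 50  # The rescan limit; the exact value is a guess.

class Node:
    """Hypothetical stand-in for an outline node: body lines plus children."""
    def __init__(self, lines):
        self.lines = lines
        self.children = []

def split_blocks(lines):
    """Naive splitter: a block ends when parens and curlies balance."""
    blocks, current, parens, curlies = [], [], 0, 0
    for line in lines:
        parens += line.count('(') - line.count(')')
        curlies += line.count('{') - line.count('}')
        current.append(line)
        if parens == 0 and curlies == 0:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

def rescan(node, min_lines=MIN_LINES, min_child_lines=2):
    """Recursively turn the inner blocks of large nodes into child nodes."""
    if len(node.lines) < min_lines:
        return  # Not much point in rescanning a small node.
    # Skip the first and last lines, which open and close the node's own
    # block. Otherwise the scan would just find the same block again.
    inner = node.lines[1:-1]
    for block in split_blocks(inner):
        if len(block) >= min_child_lines:
            child = Node(block)
            node.children.append(child)
            rescan(child, min_lines, min_child_lines)
```

Raising min_child_lines is one way to keep short if/else arms from becoming nodes of their own.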

*Summary*

The new scheme is already a huge success.  Decisions never involve scan 
state--they are made at a much higher level.  Even with bugs, the new code 
already handles javascript much more robustly than the old code.  I'll be 
upping the new code later today.

Edward
