ENB: About the JS importer and #1481

Edward K. Ream Sat, 01 Feb 2020 03:01:04 -0800

Recent revs in the js-importer branch provide a basic fix for #1481 
<https://github.com/leo-editor/leo-editor/issues/1481>. This long 
Engineering Notebook post will discuss this fix, its limitations, further 
problems, and possible fixes for those problems.


As always, feel free to ignore. However, this post describes important 
aspects of the code and its design. It may be of more than usual interest 
to Leo's devs.

*The immediate problem*

Perfect import failed for reveal.js because *i.gen_lines* failed to 
allocate lines properly to nodes. This caused lines to appear out of order.

i.gen_lines is part of the *pipeline*, defined in the base *Importer class*. 
At present, the *JS_Importer* class and most other importers are 
subclasses of the Importer class. Iirc, all importers use i.gen_lines 
unchanged. The JS importer overrides i.starts_block, one of i.gen_lines's 
helpers.

i.gen_lines uses a *parse state* to determine the start and end of *blocks*. 
For Javascript, these blocks correspond (roughly) to classes and functions. 
Alas, there are many ways to define a class or function in JS. JS is, by 
far, the most difficult language to parse of all the languages handled by 
Leo's importers.

Perfect import failed for reveal.js because the importer mistook a line 
like if('function'){ for the start of a function. The present fix will 
usually work (Leo can now import reveal.js), but not always, as I'll now 
explain...
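
The failure mode can be sketched in a few lines of Python. This is only an illustration: NAIVE_FUNCTION and naive_starts_block are hypothetical names, and the pattern is not the regex actually used by js_i.starts_block. It shows how a pattern that looks only for the word "function" cannot tell the keyword from a string that contains it.

```python
import re

# Hypothetical pattern, for illustration only.
# It is NOT the regex used by Leo's js_i.starts_block.
NAIVE_FUNCTION = re.compile(r'\bfunction\b')

def naive_starts_block(line):
    """Return True if the line merely contains the word 'function'."""
    return bool(NAIVE_FUNCTION.search(line))

# A real function definition matches, as intended.
print(naive_starts_block("function startEmbeddedContent( element ) {"))

# The culprit also matches, although 'function' is only a string here.
print(naive_starts_block("if('function') {"))
```

Both calls report a match, so the second line would wrongly open a new block.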

*Tokenizing Javascript*

You ask, how hard can it be to recognize strings like 'function' in JS? The 
answer is "very very hard". Tokenizing JS depends on context. In 
particular, it is difficult to determine whether a '/' character is the 
"div" arithmetic operator or the start of a regular expression! JS is, by 
far, the most difficult language to scan (tokenize) of all the languages 
handled by Leo's importers.
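
To make the context dependence concrete, here is a minimal sketch of the usual heuristic: guess what '/' means from the previous significant token. The names slash_starts_regex and KEYWORDS_BEFORE_REGEX are hypothetical. This is not the logic in js_i.scan_line or in JsLex, and it is only a guess that can be wrong (for example, after a ')' that closes an if condition).

```python
# Hypothetical helper, for illustration only; not part of Leo or JsLex.

# Keywords after which a '/' normally begins a regex literal.
KEYWORDS_BEFORE_REGEX = {
    'return', 'typeof', 'instanceof', 'in', 'of', 'new',
    'delete', 'void', 'throw', 'case', 'do', 'else',
}

def slash_starts_regex(prev_token):
    """
    Guess whether a '/' that follows prev_token starts a regex literal.

    prev_token is the previous significant (non-blank, non-comment)
    token as a non-empty string, or None at the start of the input.
    """
    if prev_token is None:
        return True  # Nothing to divide yet.
    if prev_token in KEYWORDS_BEFORE_REGEX:
        return True  # e.g. return /abc/.test(s)
    if prev_token in (')', ']'):
        return False  # Usually the end of a value, so '/' divides.
    if prev_token[-1] in ('"', "'", '`'):
        return False  # A string literal is a value.
    first = prev_token[0]
    if first.isalnum() or first in '_$':
        return False  # An identifier or a number is a value.
    return True  # After an operator or '(' , '/' starts a regex.
```

For example, slash_starts_regex('x') is False (as in x / 2), while slash_starts_regex('=') is True (as in re = /ab+c/).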

*i.scan_line* updates the parse state after *carefully* tokenizing each 
line. The JS importer method *js_i.scan_line* overrides i.scan_line to 
handle '/' properly. As you can see, it's not pretty. The bug happened 
because *js_i.starts_block* contained regexes that didn't distinguish the 
"function" keyword from a string containing "function". The partial fix is 
mostly a hack. js_i.starts_block now keeps track of whether function *looks* 
like it is in a string. But this faux tokenizing could fail if quotes (or 
"function") appear in a regex, as they very well might.

A proper fix would involve fully tokenizing each line, in *both* js_i.scan_line 
and in js_i.starts_block. Therefore, we want a stand-alone JavaScript 
tokenizer, written in Python. Present revs include a copy of JsLex.py 
<https://bitbucket.org/ned/jslex/src/default/jslex.py>, but it isn't hooked 
up yet. The JsLex code contains a note that it doesn't handle non-ascii 
characters properly, so it may need substantial revision. Happily, JsLex 
does contain a suite of unit tests.
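
The weakness of the faux tokenizing mentioned above can be sketched as follows. This assumes a simple quote-counting check; looks_like_in_string is a hypothetical name and is not the code in js_i.starts_block. The check handles the reveal.js culprit but is fooled by a quote inside a regex literal, which is why a real tokenizer is wanted.

```python
# Hypothetical helper, for illustration only; not Leo's actual code.

def looks_like_in_string(line, index):
    """
    Crude guess: is line[index] inside a quoted string?

    Scans the line up to index, toggling on unescaped quotes.
    It knows nothing about regex literals or comments.
    """
    quote = None  # The open quote character, if any.
    i = 0
    while i < index:
        ch = line[i]
        if quote:
            if ch == '\\':
                i += 1  # Skip the escaped character.
            elif ch == quote:
                quote = None  # The string ends.
        elif ch in ('"', "'", '`'):
            quote = ch  # A string starts.
        i += 1
    return quote is not None

# The culprit: 'function' is correctly seen as part of a string.
line1 = "if('function') {"
print(looks_like_in_string(line1, line1.index('function')))

# A quote inside a regex literal fools the check: the keyword
# 'function' below is wrongly reported as being inside a string.
line2 = "x = s.replace(/'/g, ''); function f() {"
print(looks_like_in_string(line2, line2.index('function')))
```

Both calls report "in a string". The first answer is right; the second is wrong.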

*Other problems*

The lines of reveal.js that caused the perfect import to fail are not 
handled very well even after perfect import succeeds. Here are the 
complications...

The JS importer must be completely immune from indentation, and it is. As a 
direct consequence, perfect import tests ignore leading whitespace. 
However, once lines are allocated to nodes, the importer tries to *adjust* 
nodes to make them as pleasing as possible. In particular, the importer 
removes *common leading whitespace* from all nodes. In addition, under special 
circumstances, the importer may try to move one or more lines from the end 
of one node to the start of the previous or following sibling node. The *post 
pass* part of the pipeline handles all these adjustments. It would be 
difficult/impossible to handle them in gen_lines.
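
Removing common leading whitespace can be sketched with Python's standard textwrap.dedent. This assumes a node's body is available as a list of lines; strip_common_lws is a hypothetical name, and Leo's post pass may well do this differently.

```python
import textwrap

# Hypothetical helper, for illustration only; not Leo's post pass.
def strip_common_lws(lines):
    """Remove the leading whitespace shared by all non-blank lines."""
    text = textwrap.dedent(''.join(lines))
    return text.splitlines(keepends=True)

body = [
    "                el.addEventListener('play', function() {\n",
    "                    el.controls = false;\n",
    "                } );\n",
]
for line in strip_common_lws(body):
    print(line, end='')
```

The relative indentation inside the node survives; only the shared prefix is dropped.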

At present, the JS importer only generates @others directives, never 
section references. This could be changed, but imo using @others is much 
better. However, @others does impose additional constraints on what the 
post pass can do. It's time for an extended example.  Here is the gist of 
the code that caused perfect import to fail:

function startEmbeddedContent( element ) {
    toArray( element.querySelectorAll( 'video, audio' ) ).forEach( function( el ) {
        if('function') { // The culprit
            promise.catch( function() {
                el.addEventListener( 'play', function() {
                    el.controls = false;
                } );
            } );
        }
    }
}

To have any chance of understanding what is going on, the post pass must be 
disabled.  Here are the results, without the post pass.  Headlines are 
preceded by ===:

=== function startEmbeddedContent

function startEmbeddedContent( element ) {
    @others
}

=== toArray( element.querySelectorAll('video, audio')).forEach function

    toArray( element.querySelectorAll('video, audio')).forEach( function(el) {
        if('function') {
            @others
    }

=== promise.catch function

            promise.catch(function() {
                @others
            } );
        } // <=====

=== el.addEventListener('play', function

                el.addEventListener('play', function() {
                    el.controls = false;
                } );


Now you can see why a post pass is desirable.

It's "obvious" that the last line of the node "promise.catch function" 
belongs in the *previous* node. However, *such a move cannot be done in 
general!*

Only the post pass has any chance of having enough data to make this 
adjustment. This adjustment can only be done because the node 
"promise.catch function" is the *last* node under the range of the @others 
in the node "toArray...". The adjustment would be invalid if the node 
"promise.catch function" had any following siblings.

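The constraint above can be sketched as follows, under several assumptions. Node and move_tail_to_parent are hypothetical names; Leo's real importers work on Leo positions, not on this toy class. The sketch treats a trailing line as belonging to the parent only when it is indented less than the child's first line, and it moves the line only when the child is the last child covered by the parent's @others.

```python
from dataclasses import dataclass, field

# Hypothetical toy model, for illustration only; not Leo's node classes.
@dataclass
class Node:
    headline: str
    lines: list = field(default_factory=list)
    children: list = field(default_factory=list)

def lws(line):
    """Return the width of the leading whitespace of a line."""
    return len(line) - len(line.lstrip())

def move_tail_to_parent(parent, child):
    """
    Move the child's last line to just after @others in the parent.

    Valid only if the child is the parent's *last* child: otherwise
    the moved line would land after the following siblings' lines.
    """
    if not parent.children or parent.children[-1] is not child:
        return False
    if len(child.lines) < 2 or lws(child.lines[-1]) >= lws(child.lines[0]):
        return False  # The tail does not look like the parent's line.
    for i, line in enumerate(parent.lines):
        if line.strip() == '@others':
            parent.lines.insert(i + 1, child.lines.pop())
            return True
    return False

parent = Node('toArray...', [
    "    toArray(...).forEach( function(el) {\n",
    "        if('function') {\n",
    "            @others\n",
    "    }\n",
])
child = Node('promise.catch function', [
    "            promise.catch(function() {\n",
    "                @others\n",
    "            } );\n",
    "        } // <=====\n",
])
parent.children.append(child)
print(move_tail_to_parent(parent, child))
print(''.join(parent.lines), end='')
```

If a sibling followed the child, the function would refuse to move the line, which mirrors the restriction described above.
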
*Summary*

The present code now imports reveal.js without error. However, the latest 
code is a hack. A proper fix entails carefully tokenizing lines in two 
places. I plan to use JsLex to do this. JsLex will need work to handle 
non-ascii characters properly. I'll do that as part of the fix for #1481.

The post pass attempts to reallocate lines to make the result more 
palatable. The JS importer uses @others instead of section references. This 
imposes constraints on possible adjustments. I'll attempt to improve the 
post pass as another part of #1481. This may involve a rewrite/rethink. 
This post has been part of the rethinking process.

The present problems with Leo's JS importer arise from well-known 
infelicities in JS itself. The fixes to #1481 actually show the strengths 
of Leo's importer architecture. Fixes are straightforward and will be 
confined to the JS_Importer class.

Edward
