ENB: Big speedups and Aha's

Edward K. Ream Mon, 28 Nov 2016 08:55:00 -0800

This is an Engineering Notebook post.  Feel free to ignore.

Otoh, anyone who wants to understand the importers should read this post 
carefully.


*tl;dr:* 

1. i.scan_dict should result in an 5x speed improvement over i.scan_table. 
Maybe more.

2. A clever trick, *discovered while writing this post*, makes it possible 
to scan comments and strings *quickly*, *without changing the line-oriented 
nature of the code*.

*The new code*

Rev 78413a4 contains the new code. It mostly works, but it is still 
disabled.

Up until now, the importers have used a table-driven scanner, *i.scan_table*.  
This scanner knows about *context*: strings, comments, etc. For *each* 
character of the input file, the present importer calls i.scan_table to 
update brackets and set the context.

i.scan_table tries to match each pattern of the table against the input 
character.  Most tables contain 8-10 entries.  Most input characters *fail* 
to match, so on average, 8-10 fairly complex comparisons are done for *each 
*character of the input file.

Instead, *i.scan_dict* looks up the incoming character in a *dictionary*. 
Most characters do *not* appear in the dict, so i.scan_dict fails 
immediately. So on average, i.scan_dict must be about 8x faster than 
i.scan_table.

*Aha: how to scan strings and comments*

For years I have been noodling whether there is some way to scan comments 
and strings *quickly* while using the line-oriented scanning code. We would 
naively expect that going into "string-scanning mode" would speed up the 
scan compared to calling i.scan_dict for every character.

But actually, it wouldn't help all that much now that i.scan_dict is so 
fast.  And furthermore, that would complicate x.v2_gen_lines.

The problem is this: if i.scan_dict scanned the comment or string 
completely, i.scan_dict might return an index *i* that is *past *the line 
that x.v2_gen_lines is handling.  Somehow, x.v2_gen_lines would have to 
re-sync to the next *complete* line. It could be done, but it's complex 
enough that I have been wary of that approach.

But as I was writing this post I suddenly saw a beautiful solution:

Aha: i.scan_dict doesn't need to scan strings and comments *completely*!
Just scan to the *end* of the present line, or the end of the string or 
comment,
whichever comes first.

This is so good.  It speeds up strings and comments almost as much as 
possible, without changing anything else. The line orientation of the 
importer remains exactly as it was.

This may be the way it is written in The Book.  i.scan_dict will do this 
soon.

*When can importer code ignore context?*

It's easy to get confused about this.  In fact, however, the principles are 
simple:

1. Importers that *only* count brackets, {}, (), and/or [], can 

*always ignore context.*The reason is simple: the scanner ignores brackets 
in strings, comments, etc., so those counts are *always* accurate.

2. Importers that use pattern matches to determine the start of classes, 
defs, etc., must avoid pattern matches within strings and comments.

That is, patterns should fail if the previous line ends in a context.

There probably are languages in which a pattern could match on a line 
*after* the close of a comment or string. (C comes to mind).  I'll deal 
with such pathologies only if someone files a bug report :-)



*Faux optimizations of the ScanState classes*I'm tempted to include all 
bracket counts, indentation levels, backslash/newline data into a single 
generic class. No need to create a dedicated class for each importer.  It 
might simplify the code.  

But no, it's just as likely to make it harder to understand.  It probably 
won't happen.

And we don't actually need to create a new ScanState instance for every 
line.  We could just swap the present/previous states when changing lines.

However, we would also have to remember to copy the present state when 
pushing it on the stack. Again, this looks like a dubious "improvement".  
Clarity, simplicity and robustness are likely more important than a bit 
less stress on the GC.

*Summary*

i.scan_dict is should speed up the import code by 5x or more.

i.scan_dict will soon scan to comments or strings up to the end of the 
present line. This trick will speed up the scan (a bit) without requiring 
changes to any other code.

More importantly, this trick proves that:

1. Dedicated, character-oriented string/comments scanners are not needed.
2. There is no need to entangle character-oriented code with line-oriented 
code.
3. The new Importer code is the simplest (and fastest!) code that could 
possibly work.

Edward

-- 
You received this message because you are subscribed to the Google Groups 
"leo-editor" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/leo-editor.
For more options, visit https://groups.google.com/d/optout.

ENB: Big speedups and Aha's

Reply via email to