This is an Engineering Notebook post. Feel free to ignore.
Otoh, anyone who wants to understand the importers should read this post
carefully.
*tl;dr:*
1. i.scan_dict should result in an 5x speed improvement over i.scan_table.
Maybe more.
2. A clever trick, *discovered while writing this post*, makes it possible
to scan comments and strings *quickly*, *without changing the line-oriented
nature of the code*.
*The new code*
Rev 78413a4 contains the new code. It mostly works, but it is still
disabled.
Up until now, the importers have used a table-driven scanner, *i.scan_table*.
This scanner knows about *context*: strings, comments, etc. For *each*
character of the input file, the present importer calls i.scan_table to
update brackets and set the context.
i.scan_table tries to match each pattern of the table against the input
character. Most tables contain 8-10 entries. Most input characters *fail*
to match, so on average, 8-10 fairly complex comparisons are done for *each
*character of the input file.
Instead, *i.scan_dict* looks up the incoming character in a *dictionary*.
Most characters do *not* appear in the dict, so i.scan_dict fails
immediately. So on average, i.scan_dict must be about 8x faster than
i.scan_table.
*Aha: how to scan strings and comments*
For years I have been noodling whether there is some way to scan comments
and strings *quickly* while using the line-oriented scanning code. We would
naively expect that going into "string-scanning mode" would speed up the
scan compared to calling i.scan_dict for every character.
But actually, it wouldn't help all that much now that i.scan_dict is so
fast. And furthermore, that would complicate x.v2_gen_lines.
The problem is this: if i.scan_dict scanned the comment or string
completely, i.scan_dict might return an index *i* that is *past *the line
that x.v2_gen_lines is handling. Somehow, x.v2_gen_lines would have to
re-sync to the next *complete* line. It could be done, but it's complex
enough that I have been wary of that approach.
But as I was writing this post I suddenly saw a beautiful solution:
Aha: i.scan_dict doesn't need to scan strings and comments *completely*!
Just scan to the *end* of the present line, or the end of the string or
comment,
whichever comes first.
This is so good. It speeds up strings and comments almost as much as
possible, without changing anything else. The line orientation of the
importer remains exactly as it was.
This may be the way it is written in The Book. i.scan_dict will do this
soon.
*When can importer code ignore context?*
It's easy to get confused about this. In fact, however, the principles are
simple:
1. Importers that *only* count brackets, {}, (), and/or [], can
*always ignore context.*The reason is simple: the scanner ignores brackets
in strings, comments, etc., so those counts are *always* accurate.
2. Importers that use pattern matches to determine the start of classes,
defs, etc., must avoid pattern matches within strings and comments.
That is, patterns should fail if the previous line ends in a context.
There probably are languages in which a pattern could match on a line
*after* the close of a comment or string. (C comes to mind). I'll deal
with such pathologies only if someone files a bug report :-)
*Faux optimizations of the ScanState classes*I'm tempted to include all
bracket counts, indentation levels, backslash/newline data into a single
generic class. No need to create a dedicated class for each importer. It
might simplify the code.
But no, it's just as likely to make it harder to understand. It probably
won't happen.
And we don't actually need to create a new ScanState instance for every
line. We could just swap the present/previous states when changing lines.
However, we would also have to remember to copy the present state when
pushing it on the stack. Again, this looks like a dubious "improvement".
Clarity, simplicity and robustness are likely more important than a bit
less stress on the GC.
*Summary*
i.scan_dict is should speed up the import code by 5x or more.
i.scan_dict will soon scan to comments or strings up to the end of the
present line. This trick will speed up the scan (a bit) without requiring
changes to any other code.
More importantly, this trick proves that:
1. Dedicated, character-oriented string/comments scanners are not needed.
2. There is no need to entangle character-oriented code with line-oriented
code.
3. The new Importer code is the simplest (and fastest!) code that could
possibly work.
Edward
--
You received this message because you are subscribed to the Google Groups
"leo-editor" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/leo-editor.
For more options, visit https://groups.google.com/d/optout.