On Saturday, 5 October 2013 at 00:24:22 UTC, Andrei Alexandrescu
wrote:
Vision
======
I'd been following the related discussions for a while, but I
made up my mind today as I was working on a C++ lexer. The C++
lexer is for Facebook's internal linter. I'm translating the
lexer from C++.
Before long I realized two simple things. First, I can't reuse
anything from Brian's code (without copying it and doing
surgery on it), although it is extremely similar to what I'm
doing.
Second, I figured that it is almost trivial to implement a
simple, generic, and reusable (across languages and tasks)
static trie searcher that takes a compile-time array with all
tokens and keywords and returns the token at the front of a
range with minimum comparisons.
Such a trie searcher is not intelligent, but is very composable
and extremely fast. It is just smart enough to do maximum munch
(e.g. interprets "==" and "foreach" as one token each, not
two), but is not smart enough to distinguish an identifier
"whileTrue" from the keyword "while" (it claims "while" was
found and stops right at the beginning of "True" in the
stream). This is for generality so applications can define how
identifiers work (e.g. Lisp allows "-" in identifiers but D
doesn't, etc.). The trie finder doesn't do numbers or comments
either. No regexen of any kind.
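Roughly, the interface could look like this sketch (the names
and the linear scan below are placeholders for the generated
trie, not an actual implementation):

import std.algorithm : startsWith;

// The compile-time token/keyword list the trie finder is built from.
static immutable string[] tokens = ["=", "==", "/*", "while", "foreach"];

// Maximum-munch finder: returns the token at the front of `input`,
// or null, and advances `input` past it. A real trie searcher would
// mix in nested switch statements generated from `tokens`; the
// linear scan here only keeps the sketch short.
string munch(ref string input)
{
    string best = null;
    foreach (tok; tokens)
        if (tok.length > best.length && input.startsWith(tok))
            best = tok;
    if (best !is null)
        input = input[best.length .. $];
    return best;
}

unittest
{
    auto src = "==foo";
    assert(munch(src) == "==" && src == "foo");      // one "==", not two "="
    src = "whileTrue";
    assert(munch(src) == "while" && src == "True");  // stops right before "True"
}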
The beauty of it all is that all of these more involved bits
(many of which are language specific) can be implemented
modularly and trivially as a postprocessing step after the trie
finder. For example the user specifies "/*" as a token to the
trie finder. Whenever a comment starts, the trie finder will
find and return it; then the user implements the alternate
grammar of multiline comments.
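Concretely, a comment handler layered on top might look like
this (lexBlockComment is a made-up name, not an actual
std.d.lexer API):

import std.string : indexOf;

// Illustrative postprocessing handler: once the trie finder has
// reported a "/*" token, the caller applies the multiline comment
// grammar itself and resumes lexing after "*/".
string lexBlockComment(ref string input)
{
    auto end = input.indexOf("*/");
    assert(end >= 0, "unterminated block comment");
    auto text = input[0 .. end];
    input = input[end + 2 .. $];
    return text;
}

unittest
{
    // Input as the trie finder leaves it, just past the "/*".
    auto src = " hello */int x;";
    assert(lexBlockComment(src) == " hello ");
    assert(src == "int x;");
}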
That is more or less how SDC's lexer works. You pass it two
AAs: one mapping strings to token types, and one mapping
strings to the names of functions that return the actual token
(for instance to handle /*), plus a fallback for when nothing
matches.
A giant three-headed monster of a mixin is generated from these
data.
That has been really handy so far.
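Roughly the shape of it, as a sketch only (the names, tables,
and generator below are illustrative, not SDC's actual code):

import std.conv : to;

enum TokenType { Assign, Equal, BlockComment, Identifier, Invalid }

// String -> token type, for tokens the generated matcher emits directly.
enum tokenMap = [
    "="  : TokenType.Assign,
    "==" : TokenType.Equal,
];

// String -> name of the handler that lexes the rest (e.g. after "/*").
enum handlerMap = [
    "/*" : "lexBlockComment",
];

// Handler name used when nothing in either table matches.
enum fallbackHandler = "lexIdentifier";

// CTFE helper that turns the tables into the body of one big switch;
// mixing in the returned string produces the dispatch code.
// makeToken, lexBlockComment and lexIdentifier are made-up names that
// only appear inside the generated string.
string generateDispatch()
{
    string code;
    foreach (prefix, type; tokenMap)
        code ~= `case "` ~ prefix ~ `": return makeToken(TokenType.`
            ~ type.to!string ~ `);` ~ "\n";
    foreach (prefix, handler; handlerMap)
        code ~= `case "` ~ prefix ~ `": return ` ~ handler ~ `(input);` ~ "\n";
    code ~= `default: return ` ~ fallbackHandler ~ `(input);` ~ "\n";
    return code;
}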
If what we need at this point is a conventional lexer for the D
language, std.d.lexer is the ticket. But I think it wouldn't be
difficult to push our ambitions way beyond that. What say you?
Yup, I do agree.