On Wednesday, 11 September 2013 at 19:56:45 UTC, H. S. Teoh wrote:
2. The example uses an if-then sequence of isBuiltType, isKeyword,
etc. Should be an enum so a switch can be done for speed.

I believe this is probably a result of having a separate enum value
for each token, and at the same time needing to group them into
categories for syntax highlighting purposes. I'd suggest a function
for converting the TokenType enum value into a TokenCategory enum.
Perhaps something like:

    enum TokenCategory { BuiltInType, Keyword, ... }

    // Convert TokenType into TokenCategory
    TokenCategory category(TokenType tt) { ... }

Then in user code, you call category() on the token type, and switch
over that. This maximizes performance.
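For instance, a syntax highlighter could then dispatch on the
category (a sketch only; the HTML output and the enum members are
illustrative, not part of the module):

    import std.stdio : writefln;

    // Hypothetical highlighter dispatching on TokenCategory.
    void highlight(Token t)
    {
        switch (category(t.type))
        {
            case TokenCategory.BuiltInType:
                writefln(`<span class="type">%s</span>`, t.value);
                break;
            case TokenCategory.Keyword:
                writefln(`<span class="keyword">%s</span>`, t.value);
                break;
            default:
                writefln("%s", t.value);
                break;
        }
    }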
Implementation-wise, I'd suggest either a hash table on TokenType, or
perhaps even encoding the category into the TokenType enum values,
for example:

    enum TokenCategory {
        BuiltInType, Keyword, ...
    }

    enum TokenType {
        IntToken   = (TokenCategory.BuiltInType << 16) | 0x0001,
        FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
        ...
        FuncToken  = (TokenCategory.Keyword << 16) | 0x0001,
    }

Then the category function can be simply:

    TokenCategory category(TokenType tt) {
        return cast(TokenCategory)(tt >> 16);
    }

Though admittedly, this is a bit hackish. But if you're going for
speed, this would be quite fast.
There are already plenty of hackish things in that module, so
this would fit right in.
4. When naming tokens like '..' 'slice', it is giving it a
syntactic/semantic name rather than a token name. This would be
awkward if '..' took on new meanings in D. Calling it 'dotdot' would
be clearer. Ditto for the rest. An example where this is done better:
'*' is called 'star', rather than 'dereference'.

I agree. Especially since '*' can also mean multiplication, depending
on context. It would be weird (and unnecessarily obscure) for parser
code to refer to 'dereference' when parsing expressions. :)
If you read the docs/code you'll see that "*" is called "star"
:-). The slice -> dotdot rename is pretty simple to do.
5. The LexerConfig initialization should be a constructor rather than
a sequence of assignments. LexerConfig documentation is awfully thin.
For example, 'tokenStyle' is explained as being 'Token style',
whatever that is.
I'm on the fence about this one. Setting up the configuration before
starting the lexing process makes sense to me.
Perhaps one way to improve this is to rename LexerConfig to DLexer,
and make byToken a member function (or call it via UFCS):

    DLexer lex;
    lex.iterStyle = ...;
    // etc.

    foreach (token; lex.byToken()) {
        ...
    }

This reads better: you're getting a list of tokens from the lexer
'lex', as opposed to getting something from byToken(config), which
doesn't really *say* what it's doing: is it tokenizing the config
object?
byToken is a free function because its first parameter is the range
to tokenize. This allows you to use UFCS on the range (e.g.
"sourceCode.byToken()"). Putting it in a struct/class would break
this.
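In other words, the call chains like any other range pipeline. A
minimal sketch, assuming byToken takes the source range followed by a
LexerConfig as in the documentation under review:

    // Sketch: UFCS pipeline style enabled by the free function.
    void printTokens(string sourceCode)
    {
        LexerConfig config; // default settings
        foreach (token; sourceCode.byToken(config))
        {
            // handle one token at a time
        }
    }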
6. No clue how lookahead works with this. Parsing D requires
arbitrary lookahead.

What has this got to do with lexing? The parser needs to keep track
of its state independently of the lexer anyway, doesn't it? I'm not
sure how DMD's parser works, but the way I usually write parsers is
that their state encodes the series of tokens encountered so far, so
they don't need the lexer to save any tokens for them. If they need
to refer to tokens, they create partial AST trees on the parser stack
that reference said tokens. I don't see why it should be the lexer's
job to keep track of such things.
For parsing, you'll likely want to use array() to grab all the
tokens. But there are other uses such as syntax highlighting that
only need one token at a time.
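A sketch of that pattern, using std.array.array to buffer the whole
token stream so a parser can index ahead arbitrarily (byToken and
LexerConfig as above):

    import std.array : array;

    // Buffer every token up front; tokens[i + k] is then an
    // O(1) lookahead of k tokens for the parser.
    auto lexAll(string sourceCode)
    {
        LexerConfig config;
        return sourceCode.byToken(config).array();
    }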
9. Need to insert intra-page navigation links, such as when
'byToken()' appears in the text, it should be a link to where byToken
is described.
Agreed.
I'll work on this later this evening.
[...]
I believe the state of the documentation is a showstopper, and needs
to be extensively fleshed out before it can be considered ready for
voting.

I think the API can be improved. The LexerConfig -> DLexer rename is
an important readability issue IMO.
I'm not convinced (yet?).
Also, it's unclear what types of input are supported -- the code
example only uses string input, but what about file input? Does it
support byLine? Or does it need me to slurp the entire file contents
into an in-memory buffer first?
The input is a range of bytes, which the lexer assumes is UTF-8.
The lexer works much faster on arrays than it does on arbitrary
ranges though. Dmitry Olshansky implemented a circular buffer in
this module that's used to cache the input range internally. This
buffer is then used for the lexing process.
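So a file can be lexed either by handing the lexer any byte range, or
by slurping it into an array for the fast path. A sketch of the slurp
approach, with the byToken/LexerConfig names as above:

    import std.file : read;

    // Read the whole file into memory and lex its raw bytes,
    // which the lexer treats as UTF-8.
    auto lexFile(string path)
    {
        LexerConfig config;
        auto bytes = cast(ubyte[]) read(path);
        return bytes.byToken(config);
    }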
Now, somebody pointed out that there's currently no way to tell it
that the data you're feeding to it is a partial slice of the full
source. I think this should be an easy fix: LexerConfig (or DLexer
after the rename) can have a field for specifying the initial line
number / column number, defaulting to (1, 1) but can be changed by
the caller for parsing code fragments.
That's simple enough to add.
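Something along these lines, say (the field names here are
hypothetical, not current members of LexerConfig):

    struct LexerConfig
    {
        // ... existing configuration fields ...

        // Hypothetical additions for lexing code fragments:
        uint startLine   = 1; // line of the fragment's first token
        uint startColumn = 1; // column of the fragment's first token
    }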
A minor concern I have about the current implementation (I didn't
look at the code yet, but this is just based on the documentation),
is that there's no way to choose what kind of data you want in the
token stream. Right now, it appears that it always computes
startIndex, line, and column. What if I don't care about this
information, and only want, say, the token type and value (say I'm
pretty-printing the source, so I don't care how the original was
formatted)? Would it be possible to skip the additional computations
required to populate these fields?
It's possible, but I fear that it would make the code a mess.
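One way to limit the mess might be compile-time flags -- purely a
sketch, not how the module is written:

    // Sketch: choose at compile time whether positions are tracked.
    enum TrackPosition { no, yes }

    struct Lexer(TrackPosition track)
    {
        static if (track == TrackPosition.yes)
        {
            uint line   = 1;
            uint column = 1;
        }

        void advance(dchar c)
        {
            static if (track == TrackPosition.yes)
            {
                // This bookkeeping disappears entirely when
                // position tracking is not requested.
                if (c == '\n') { line++; column = 1; }
                else column++;
            }
            // ... consume the character ...
        }
    }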