On Wednesday, 11 September 2013 at 19:56:45 UTC, H. S. Teoh wrote:
2. The example uses an if-then sequence of isBuiltType, isKeyword,
etc. Should be an enum so a switch can be done for speed.

I believe this is probably a result of having a separate enum value of each token, and at the same time needing to group them into categories for syntax highlighting purposes. I'd suggest a function for converting the TokenType enum value into a TokenCategory enum. Perhaps something
like:

        enum TokenCategory { BuiltInType, Keyword, ... }

        // Convert TokenType into TokenCategory
        TokenCategory category(TokenType tt) { ... }

Then in user code, you call category() on the token type, and switch
over that. This maximizes performance.
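For concreteness, a minimal sketch of that pattern (the enum members and the
highlightClass function are invented names for illustration, not part of the
module):

        enum TokenCategory { BuiltInType, Keyword, Identifier }
        enum TokenType { IntToken, FuncToken, IdentifierToken }

        // invented mapping, for illustration only
        TokenCategory category(TokenType tt) {
                final switch (tt) {
                        case TokenType.IntToken:        return TokenCategory.BuiltInType;
                        case TokenType.FuncToken:       return TokenCategory.Keyword;
                        case TokenType.IdentifierToken: return TokenCategory.Identifier;
                }
        }

        // user code: one switch over the category instead of a chain of checks
        string highlightClass(TokenType tt) {
                final switch (category(tt)) {
                        case TokenCategory.BuiltInType: return "type";
                        case TokenCategory.Keyword:     return "keyword";
                        case TokenCategory.Identifier:  return "identifier";
                }
        }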

Implementation-wise, I'd suggest either a hash table on TokenType, or perhaps even encoding the category into the TokenType enum values, for
example:

        enum TokenCategory {
                BuiltInType, Keyword, ...
        }

        enum TokenType {
                IntToken = (TokenCategory.BuiltInType << 16) | 0x0001,
                FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
                ...
                FuncToken = (TokenCategory.Keyword << 16) | 0x0001,
        }

Then the category function can be simply:

        TokenCategory category(TokenType tt) {
                return cast(TokenCategory)(tt >> 16);
        }

Though admittedly, this is a bit hackish. But if you're going for speed,
this would be quite fast.

There are already plenty of hackish things in that module, so this would fit right in.

4. When naming tokens like .. 'slice', it is giving it a
syntactic/semantic name rather than a token name. This would be
awkward if .. took on new meanings in D. Calling it 'dotdot' would be clearer. Ditto for the rest. For an example where that is done better, '*' is called 'star' rather than 'dereference'.

I agree. Especially since '*' can also mean multiplication, depending on context. It would be weird (and unnecessarily obscure) for parser code
to refer to 'dereference' when parsing expressions. :)

If you read the docs/code you'll see that "*" is called "star" :-). The slice -> dotdot rename is pretty simple to do.

5. The LexerConfig initialization should be a constructor rather than a sequence of assignments. LexerConfig documentation is awfully thin.
For example, 'tokenStyle' is explained as being 'Token style',
whatever that is.

I'm on the fence about this one. Setting up the configuration before
starting the lexing process makes sense to me.

Perhaps one way to improve this is to rename LexerConfig to DLexer, and
make byToken a member function (or call it via UFCS):

        DLexer lex;
        lex.iterStyle = ...;
        // etc.

        foreach (token; lex.byToken()) {
                ...
        }

This reads better: you're getting a list of tokens from the lexer 'lex', as opposed to getting something from byToken(config), which doesn't really *say* what it's doing: is it tokenizing the config object?

byToken is a free function because its first parameter is the range to tokenize. This allows you to use UFCS on the range (e.g. "sourceCode.byToken()"). Putting it in a struct/class would break this.
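A hedged sketch of why the free-function form matters for UFCS; the signature
and the Token type below are invented stand-ins, not the actual std.d.lexer
declarations:

        import std.range : isInputRange;

        struct Token { string text; }

        // invented signature: the point is only that the source range is the
        // first parameter, so UFCS applies to it
        Token[] byToken(R)(R sourceCode) if (isInputRange!R) {
                return [Token("placeholder")]; // a real lexer yields real tokens
        }

        void main() {
                string sourceCode = "int x;";
                auto a = sourceCode.byToken(); // UFCS: "the tokens of sourceCode"
                auto b = byToken(sourceCode);  // same call, spelled as a free function
        }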

6. No clue how lookahead works with this. Parsing D requires arbitrary
lookahead.

What has this got to do with lexing? The parser needs to keep track of its state independently of the lexer anyway, doesn't it? I'm not sure how DMD's parser works, but the way I usually write parsers is that their state encodes the series of tokens encountered so far, so they don't need the lexer to save any tokens for them. If they need to refer to tokens, they create partial AST trees on the parser stack that reference said tokens. I don't see why it should be the lexer's job to
keep track of such things.

For parsing, you'll likely want to use array() to grab all the tokens. But there are other uses such as syntax highlighting that only need one token at a time.
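As a hedged illustration of the array() approach (the token range here is a
stand-in for whatever byToken returns):

        import std.array : array;
        import std.range : only;

        struct Token { string text; }

        void main() {
                // stand-in for sourceCode.byToken(config): a lazy range of tokens
                auto lazyTokens = only(Token("int"), Token("x"), Token(";"));

                // buffer everything up front; Token[] has random access, so the
                // parser gets arbitrary lookahead by plain indexing
                Token[] tokens = lazyTokens.array();
                auto twoAhead = tokens[2];
        }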

9. Need to insert intra-page navigation links, such as when
'byToken()' appears in the text, it should be link to where byToken
is described.

Agreed.

I'll work on this later this evening.


[...]
I believe the state of the documentation is a showstopper, and needs to be extensively fleshed out before it can be considered ready for
voting.

I think the API can be improved. The LexerConfig -> DLexer rename is an
important readability issue IMO.

I'm not convinced (yet?).

Also, it's unclear what types of input are supported -- the code example only uses string input, but what about file input? Does it support byLine? Or does it need me to slurp the entire file contents into an
in-memory buffer first?

The input is a range of bytes, which the lexer assumes is UTF-8. The lexer works much faster on arrays than it does on arbitrary ranges though. Dmitry Olshansky implemented a circular buffer in this module that's used to cache the input range internally. This buffer is then used for the lexing process.
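For whole-file input, a sketch of what that suggests (assuming the lexer
accepts a ubyte array, and that slurping the file is the fast path, as
described above):

        import std.file : read;

        void main() {
                // read the whole file into memory as UTF-8 bytes; per the above,
                // the lexer is much faster on arrays than on arbitrary ranges
                ubyte[] source = cast(ubyte[]) read("example.d");

                // hypothetical usage, following the documented entry point:
                // auto tokens = source.byToken(config);
        }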

Now, somebody pointed out that there's currently no way to tell it that the data you're feeding to it is a partial slice of the full source. I think this should be an easy fix: LexerConfig (or DLexer after the rename) could have a field for specifying the initial line and column numbers, defaulting to (1, 1) but changeable by the caller when
parsing code fragments.

That's simple enough to add.
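A hedged sketch of what such fields might look like (the names are invented;
the module may pick different ones):

        // invented field names; the (1, 1) defaults correspond to lexing a full file
        struct LexerConfig {
                uint startLine   = 1; // line number assigned to the first token
                uint startColumn = 1; // column number assigned to the first token
        }

        void main() {
                LexerConfig config;
                config.startLine = 42; // lexing a fragment that begins at line 42
        }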

A minor concern I have about the current implementation (I haven't looked at the code yet; this is based only on the documentation) is that there's no way to choose what kind of data you want in the token stream. Right now, it appears that it always computes startIndex, line, and column. What if I don't care about this information and only want, say, the token type and value (say I'm pretty-printing the source, so I don't care how the original was formatted)? Would it be possible to skip the
additional computations required to populate these fields?

It's possible, but I fear that it would make the code a mess.
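Purely as a hypothetical illustration of what "possible" could mean here (this
is not how the module is written), the positional bookkeeping could be gated
behind a compile-time flag so the unwanted fields and computations disappear:

        // hypothetical: a template parameter chooses which fields exist,
        // so positional bookkeeping compiles away when it isn't wanted
        struct Token(bool withPosition = true) {
                string text;
                static if (withPosition) {
                        size_t startIndex;
                        uint line;
                        uint column;
                }
        }

        void main() {
                Token!false lean; // just the token text, no positional data
                Token!true full;  // startIndex / line / column included
                full.line = 1;
                lean.text = "x";
        }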
