On Wednesday, 11 September 2013 at 19:56:45 UTC, H. S. Teoh wrote:
2. The example uses an if-then sequence of isBuiltType, isKeyword,
etc. Should be an enum so a switch can be done for speed.

I believe this is probably a result of having a separate enum value of each token, and at the same time needing to group them into categories for syntax highlighting purposes. I'd suggest a function for converting the TokenType enum value into a TokenCategory enum. Perhaps something
like:

        enum TokenCategory { BuiltInType, Keyword, ... }

        // Convert TokenType into TokenCategory
        TokenCategory category(TokenType tt) { ... }

Then in user code, you call category() on the token type, and switch
over that. This maximizes performance.
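For concreteness, a minimal sketch of that pattern (the enum members and the
highlightClass function are invented names for illustration, not part of the
module):

        enum TokenCategory { BuiltInType, Keyword, Identifier }
        enum TokenType { IntToken, FuncToken, IdentifierToken }

        // invented mapping, for illustration only
        TokenCategory category(TokenType tt) {
                final switch (tt) {
                        case TokenType.IntToken:        return TokenCategory.BuiltInType;
                        case TokenType.FuncToken:       return TokenCategory.Keyword;
                        case TokenType.IdentifierToken: return TokenCategory.Identifier;
                }
        }

        // user code: one switch over the category instead of a chain of checks
        string highlightClass(TokenType tt) {
                final switch (category(tt)) {
                        case TokenCategory.BuiltInType: return "type";
                        case TokenCategory.Keyword:     return "keyword";
                        case TokenCategory.Identifier:  return "identifier";
                }
        }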

Implementation-wise, I'd suggest either a hash table on TokenType, or perhaps even encoding the category into the TokenType enum values, for
example:

        enum TokenCategory {
                BuiltInType, Keyword, ...
        }

        enum TokenType {
                IntToken = (TokenCategory.BuiltInType << 16) | 0x0001,
                FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
                ...
                FuncToken = (TokenCategory.Keyword << 16) | 0x0001,
        }

Then the category function can be simply:

        TokenCategory category(TokenType tt) {
                return cast(TokenCategory)(tt >> 16);
        }

Though admittedly, this is a bit hackish. But if you're going for speed,
this would be quite fast.

There are already plenty of hackish things in that module, so this would fit right in.

4. When naming tokens like .. 'slice', it is giving it a
syntactic/semantic name rather than a token name. This would be
awkward if .. took on new meanings in D. Calling it 'dotdot' would be clearer. Ditto for the rest. For an example where that is done better, '*' is called 'star' rather than 'dereference'.

I agree. Especially since '*' can also mean multiplication, depending on context. It would be weird (and unnecessarily obscure) for parser code
to refer to 'dereference' when parsing expressions. :)

If you read the docs/code you'll see that "*" is called "star" :-). The slice -> dotdot rename is pretty simple to do.

5. The LexerConfig initialization should be a constructor rather than a sequence of assignments. LexerConfig documentation is awfully thin.
For example, 'tokenStyle' is explained as being 'Token style',
whatever that is.

I'm on the fence about this one. Setting up the configuration before
starting the lexing process makes sense to me.

Perhaps one way to improve this is to rename LexerConfig to DLexer, and
make byToken a member function (or call it via UFCS):

        DLexer lex;
        lex.iterStyle = ...;
        // etc.

        foreach (token; lex.byToken()) {
                ...
        }

This reads better: you're getting a list of tokens from the lexer 'lex', as opposed to getting something from byToken(config), which doesn't really *say* what it's doing: is it tokenizing the config object?

byToken is a free function because its first parameter is the range to tokenize. This allows you to use UFCS on the range (e.g. "sourceCode.byToken()"). Putting it in a struct/class would break this.
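A hedged sketch of why the free-function form matters for UFCS; the signature
and the Token type below are invented stand-ins, not the actual std.d.lexer
declarations:

        import std.range : isInputRange;

        struct Token { string text; }

        // invented signature: the point is only that the source range is the
        // first parameter, so UFCS applies to it
        Token[] byToken(R)(R sourceCode) if (isInputRange!R) {
                return [Token("placeholder")]; // a real lexer yields real tokens
        }

        void main() {
                string sourceCode = "int x;";
                auto a = sourceCode.byToken(); // UFCS: "the tokens of sourceCode"
                auto b = byToken(sourceCode);  // same call, spelled as a free function
        }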

6. No clue how lookahead works with this. Parsing D requires arbitrary
lookahead.

What has this got to do with lexing? The parser needs to keep track of its state independently of the lexer anyway, doesn't it? I'm not sure how DMD's parser works, but the way I usually write parsers is that their state encodes the series of tokens encountered so far, so they don't need the lexer to save any tokens for them. If they need to refer to tokens, they create partial AST trees on the parser stack that reference said tokens. I don't see why it should be the lexer's job to
keep track of such things.

For parsing, you'll likely want to use array() to grab all the tokens. But there are other uses such as syntax highlighting that only need one token at a time.
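As a hedged illustration of the array() approach (the token range here is a
stand-in for whatever byToken returns):

        import std.array : array;
        import std.range : only;

        struct Token { string text; }

        void main() {
                // stand-in for sourceCode.byToken(config): a lazy range of tokens
                auto lazyTokens = only(Token("int"), Token("x"), Token(";"));

                // buffer everything up front; Token[] has random access, so the
                // parser gets arbitrary lookahead by plain indexing
                Token[] tokens = lazyTokens.array();
                auto twoAhead = tokens[2];
        }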

9. Need to insert intra-page navigation links, such as when
'byToken()' appears in the text, it should be link to where byToken
is described.

Agreed.

I'll work on this later this evening.


[...]
I believe the state of the documentation is a showstopper, and needs to be extensively fleshed out before it can be considered ready for
voting.

I think the API can be improved. The LexerConfig -> DLexer rename is an
important readability issue IMO.

I'm not convinced (yet?).

Also, it's unclear what types of input are supported -- the code example only uses string input, but what about file input? Does it support byLine? Or does it need me to slurp the entire file contents into an
in-memory buffer first?

The input is a range of bytes, which the lexer assumes is UTF-8. The lexer works much faster on arrays than it does on arbitrary ranges though. Dmitry Olshansky implemented a circular buffer in this module that's used to cache the input range internally. This buffer is then used for the lexing process.
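For whole-file input, a sketch of what that suggests (assuming the lexer
accepts a ubyte array, and that slurping the file is the fast path, as
described above):

        import std.file : read;

        void main() {
                // read the whole file into memory as UTF-8 bytes; per the above,
                // the lexer is much faster on arrays than on arbitrary ranges
                ubyte[] source = cast(ubyte[]) read("example.d");

                // hypothetical usage, following the documented entry point:
                // auto tokens = source.byToken(config);
        }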

Now, somebody pointed out that there's currently no way to tell it that the data you're feeding to it is a partial slice of the full source. I think this should be an easy fix: LexerConfig (or DLexer after the rename) could have a field for specifying the initial line and column numbers, defaulting to (1, 1) but changeable by the caller when
parsing code fragments.

That's simple enough to add.
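A hedged sketch of what such fields might look like (the names are invented;
the module may pick different ones):

        // invented field names; the (1, 1) defaults correspond to lexing a full file
        struct LexerConfig {
                uint startLine   = 1; // line number assigned to the first token
                uint startColumn = 1; // column number assigned to the first token
        }

        void main() {
                LexerConfig config;
                config.startLine = 42; // lexing a fragment that begins at line 42
        }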

A minor concern I have about the current implementation (I haven't looked at the code yet; this is based only on the documentation) is that there's no way to choose what kind of data you want in the token stream. Right now, it appears that it always computes startIndex, line, and column. What if I don't care about this information and only want, say, the token type and value (say I'm pretty-printing the source, so I don't care how the original was formatted)? Would it be possible to skip the
additional computations required to populate these fields?

It's possible, but I fear that it would make the code a mess.
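Purely as a hypothetical illustration of what "possible" could mean here (this
is not how the module is written), the positional bookkeeping could be gated
behind a compile-time flag so the unwanted fields and computations disappear:

        // hypothetical: a template parameter chooses which fields exist,
        // so positional bookkeeping compiles away when it isn't wanted
        struct Token(bool withPosition = true) {
                string text;
                static if (withPosition) {
                        size_t startIndex;
                        uint line;
                        uint column;
                }
        }

        void main() {
                Token!false lean; // just the token text, no positional data
                Token!true full;  // startIndex / line / column included
                full.line = 1;
                lean.text = "x";
        }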
