On Wed, Sep 11, 2013 at 10:37:34PM +0200, Brian Schott wrote:
> On Wednesday, 11 September 2013 at 19:56:45 UTC, H. S. Teoh wrote:
> >> 2. The example uses an if-then sequence of isBuiltType, isKeyword,
> >> etc. Should be an enum so a switch can be done for speed.
> >
> > I believe this is probably a result of having a separate enum value
> > for each token, and at the same time needing to group them into
> > categories for syntax highlighting purposes. I'd suggest a function
> > for converting the TokenType enum value into a TokenCategory enum.
> > Perhaps something like:
> >
> >     enum TokenCategory { BuiltInType, Keyword, ... }
> >
> >     // Convert TokenType into TokenCategory
> >     TokenCategory category(TokenType tt) { ... }
> >
> > Then in user code, you call category() on the token type and switch
> > over that. This maximizes performance.
> >
> > Implementation-wise, I'd suggest either a hash table keyed on
> > TokenType, or perhaps even encoding the category into the TokenType
> > enum values, for example:
> >
> >     enum TokenCategory {
> >         BuiltInType, Keyword, ...
> >     }
> >
> >     enum TokenType {
> >         IntToken   = (TokenCategory.BuiltInType << 16) | 0x0001,
> >         FloatToken = (TokenCategory.BuiltInType << 16) | 0x0002,
> >         ...
> >         FuncToken  = (TokenCategory.Keyword << 16) | 0x0001,
> >     }
> >
> > Then the category function can be simply:
> >
> >     TokenCategory category(TokenType tt) {
> >         return cast(TokenCategory)(tt >> 16);
> >     }
> >
> > Though admittedly, this is a bit hackish. But if you're going for
> > speed, this would be quite fast.
>
> There are already plenty of hackish things in that module, so this
> would fit right in.

Better hope it passes code review, though. :)
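(For concreteness, here's a rough, self-contained sketch of the
encoding plus a category-based switch on the consumer side. The enum
members and the cssClass() function below are invented for
illustration; they are not the actual names in the proposed module.)

    // Hypothetical sketch of the "category in the high bits" idea;
    // names are made up, not the proposed module's actual API.
    enum TokenCategory : ushort { BuiltInType, Keyword, Operator }

    enum TokenType : uint
    {
        int_   = (TokenCategory.BuiltInType << 16) | 0x0001,
        float_ = (TokenCategory.BuiltInType << 16) | 0x0002,
        if_    = (TokenCategory.Keyword     << 16) | 0x0001,
        star   = (TokenCategory.Operator    << 16) | 0x0001,
    }

    // Recover the category from the high 16 bits of the token type.
    TokenCategory category(TokenType tt) pure nothrow
    {
        return cast(TokenCategory)(tt >> 16);
    }

    // A syntax highlighter can then switch on the category instead of
    // calling isBuiltInType/isKeyword/... one after another.
    string cssClass(TokenType tt)
    {
        final switch (category(tt))
        {
            case TokenCategory.BuiltInType: return "type";
            case TokenCategory.Keyword:     return "keyword";
            case TokenCategory.Operator:    return "operator";
        }
    }

    unittest
    {
        assert(category(TokenType.float_) == TokenCategory.BuiltInType);
        assert(cssClass(TokenType.if_) == "keyword");
    }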
> >> 4. When naming tokens like .. 'slice', it is giving it a
> >> syntactic/semantic name rather than a token name. This would be
> >> awkward if .. took on new meanings in D. Calling it 'dotdot' would
> >> be clearer. Ditto for the rest. An example of where that is done
> >> better: '*' is called 'star' rather than 'dereference'.
> >
> > I agree. Especially since '*' can also mean multiplication, depending
> > on context. It would be weird (and unnecessarily obscure) for parser
> > code to refer to 'dereference' when parsing expressions. :)
>
> If you read the docs/code you'll see that "*" is called "star" :-).
> The slice -> dotdot rename is pretty simple to do.

I was just going by what Walter said, but OK, fair enough.

> >> 5. The LexerConfig initialization should be a constructor rather
> >> than a sequence of assignments. LexerConfig documentation is awfully
> >> thin. For example, 'tokenStyle' is explained as being 'Token style',
> >> whatever that is.
> >
> > I'm on the fence about this one. Setting up the configuration before
> > starting the lexing process makes sense to me.
> >
> > Perhaps one way to improve this is to rename LexerConfig to DLexer,
> > and make byToken a member function (or call it via UFCS):
> >
> >     DLexer lex;
> >     lex.iterStyle = ...;
> >     // etc.
> >
> >     foreach (token; lex.byToken()) {
> >         ...
> >     }
> >
> > This reads better: you're getting a list of tokens from the lexer
> > 'lex', as opposed to getting something from byToken(config), which
> > doesn't really *say* what it's doing: is it tokenizing the config
> > object?
>
> byToken is a free function because its first parameter is the range to
> tokenize. This allows you to use UFCS on the range (e.g.
> "sourceCode.byToken()"). Putting it in a struct/class would break this.

Very good point. In that case, would it be possible to make byToken
accept its configuration parameters inline? That is:

    auto tokens = File(mysource)
                  .byLine()    // does this actually work??
                  .byToken(IterationStyle.everything, ...);

This seems better suited to UFCS-style chains than having to create a
separate config object outside the chain just so that you can use it
inside. Writing LexerConfig(...) inline seems a bit cumbersome:

    auto tokens = File(mysource)
                  .byLine()    // does this actually work??
                  .byToken(LexerConfig(IterationStyle.everything, ...));

though, in retrospect, perhaps not as bad as it initially sounds.
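(For illustration, here is roughly what such a convenience overload
might look like. It assumes the existing byToken(range, LexerConfig)
entry point and config fields named iterStyle/tokenStyle as in the
snippets above; the actual type, field, and parameter names in the
module may differ, so treat this purely as a sketch.)

    // Hypothetical overload: build the LexerConfig from inline
    // arguments and forward to the existing byToken(range, config).
    auto byToken(R)(R sourceCode, IterationStyle style, TokenStyle tokenStyle)
    {
        LexerConfig config;
        config.iterStyle  = style;       // assumed field name
        config.tokenStyle = tokenStyle;  // assumed field name
        return byToken(sourceCode, config);
    }

    // which would permit chains like:
    //     auto tokens = sourceCode.byToken(IterationStyle.everything, ...);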
> >> 6. No clue how lookahead works with this. Parsing D requires
> >> arbitrary lookahead.
> >
> > What has this got to do with lexing? The parser needs to keep track
> > of its state independently of the lexer anyway, doesn't it? I'm not
> > sure how DMD's parser works, but the way I usually write parsers is
> > that their state encodes the series of tokens encountered so far, so
> > they don't need the lexer to save any tokens for them. If they need
> > to refer to tokens, they create partial AST trees on the parser stack
> > that reference said tokens. I don't see why it should be the lexer's
> > job to keep track of such things.
>
> For parsing, you'll likely want to use array() to grab all the
> tokens. But there are other uses such as syntax highlighting that
> only need one token at a time.

I don't think that's necessary. An appropriately-constructed parser
doesn't need arbitrary slicing of tokens; its internal state should
encode what tokens have been seen thus far so that it never needs to
backtrack.

[...]
> > I think the API can be improved. The LexerConfig -> DLexer rename is
> > an important readability issue IMO.
>
> I'm not convinced (yet?).

OK, I take that back. UFCS chaining is an important emerging D paradigm
that we should support, and the current API seems better suited to that
purpose.

> > Also, it's unclear what types of input are supported -- the code
> > example only uses string input, but what about file input? Does it
> > support byLine? Or does it need me to slurp the entire file contents
> > into an in-memory buffer first?
>
> The input is a range of bytes, which the lexer assumes is UTF-8. The
> lexer works much faster on arrays than it does on arbitrary ranges,
> though. Dmitry Olshansky implemented a circular buffer in this module
> that's used to cache the input range internally. This buffer is then
> used for the lexing process.

OK, if this can be indicated in the docs, that'd be great. :)  (A
signature constraint of the form `if (is(ElementType!R == char))` should
be good enough, I think, since DDOC in DMD git HEAD will include
signature constraints in the generated docs, and it should be clear
enough what is expected.)

[...]
> > A minor concern I have about the current implementation (I didn't
> > look at the code yet, but this is just based on the documentation)
> > is that there's no way to choose what kind of data you want in the
> > token stream. Right now, it appears that it always computes
> > startIndex, line, and column. What if I don't care about this
> > information and only want, say, the token type and value (say I'm
> > pretty-printing the source, so I don't care how the original was
> > formatted)? Would it be possible to skip the additional computations
> > required to populate these fields?
>
> It's possible, but I fear that it would make the code a mess.

Fair enough. It's not a significant overhead anyway (I believe), so it
shouldn't be a big issue.


T

--
Once bitten, twice cry...