On Wed, Sep 11, 2013 at 01:31:45PM -0700, Walter Bright wrote:
> On 9/11/2013 12:55 PM, H. S. Teoh wrote:
> >> 3. I assumed TokenType is a type. But it's not, it's an enum. Even
> >> the document says it's a 'type', but it's not a type.
> >
> > I don't understand this complaint. I thought it's pretty clear that
> > it's referring to what type of token it is (or would you prefer
> > TokenKind?).
>
> A type is a well-understood concept in D. This is using 'type' to
> mean something completely different. This is very jarring. Your idea
> of using 'kind' is vastly superior.
No, in typical D code, user-defined type names are capitalized, such as
InputRange, Duration, Complex, ByLine, etc. They do not have 'Type' in
their names (we don't call them InputRangeType, for example). So this
usage doesn't conflict at all.

> >> 5. The LexerConfig initialization should be a constructor rather
> >> than a sequence of assignments. LexerConfig documentation is awfully
> >> thin. For example, 'tokenStyle' is explained as being 'Token
> >> style', whatever that is.
> >
> > I'm on the fence about this one. Setting up the configuration before
> > starting the lexing process makes sense to me.
>
> Yes, using a constructor :-)

Constructors with many arguments are evil, because you can't just
modify one setting and use the defaults for the rest; you have to
specify every argument up to the last one you need to change.

> >> 6. No clue how lookahead works with this. Parsing D requires
> >> arbitrary lookahead.
> >
> > What has this got to do with lexing?
>
> The lexer needs to support backing up and going forward again.
> That's how the dmd lexer works.

Then the parser should be fixed. Keeping track of what tokens have been
seen so far is the parser's job, not the lexer's. LALR(1) parsers, for
example, deal with multiple similar-looking constructs without
requiring more than a single token of lookahead. The trick is that you
do not commit to a particular grammar production until it has been
fully disambiguated. If the grammar is not conducive to this, it can
usually be rearranged to make it possible. For example, if the grammar
has two production rules:

	StatementA ::= TokenA TokenB TokenC TokenD
	StatementB ::= TokenA TokenB TokenC TokenE

then you may claim that you need a 4-token lookahead in order to tell,
given TokenA in the current input, which type of statement it's going
to be.
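To make the claim concrete, here is a minimal D sketch (the Tok members and statement names are invented for illustration, matching the productions above) showing that nothing before the 4th token can distinguish the two productions:

```d
enum Tok { A, B, C, D, E }

// Both productions share the prefix TokenA TokenB TokenC; only the
// 4th token disambiguates, so a parser working directly from this
// grammar must buffer 4 tokens before committing to either rule.
string classify(Tok[] input)
{
    assert(input.length == 4 && input[0 .. 3] == [Tok.A, Tok.B, Tok.C]);
    return input[3] == Tok.D ? "StatementA" : "StatementB";
}
```
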
However, these rules can be rewritten like this:

	StatementABPrefix ::= TokenA TokenB TokenC
	StatementA ::= StatementABPrefix TokenD
	StatementB ::= StatementABPrefix TokenE

Now the parser only needs a 1-token lookahead to do its job, and no
backtracking is ever needed. Any information encoded by the first 3
tokens can be stored in the AST node of StatementABPrefix, so it's
recoverable without needing to save previous tokens.

Unless you're saying that the D grammar is more complex than LALR(1)...
If so, I'd like to see an example of where this is the case.

> > Also, it's unclear what types of input is supported -- the code
> > example only uses string input, but what about file input? Does it
> > support byLine? Or does it need me to slurp the entire file contents
> > into an in-memory buffer first?
>
> The point of having the input be an InputRange is you DO NOT CARE if
> the input is coming from a string, file, etc.

Yes, but the docs currently don't say what form this input range must
take. Must it be a range of characters? Am I allowed to pass in a range
of *strings* (representing source lines)? This needs to be clearly
stated.

The fact that tokens slice the input also needs to be clearly stated,
since that precludes the use of transient ranges like byLine (your
token values will be garbled by the time you get to them).


T

-- 
Let's call it an accidental feature. -- Larry Wall