On Tuesday, July 31, 2012 23:39:38 Philippe Sigaud wrote:
> On Tue, Jul 31, 2012 at 11:20 PM, Jonathan M Davis <jmdavisp...@gmx.com> wrote:
> > On Tuesday, July 31, 2012 23:10:37 Philippe Sigaud wrote:
> >> Having std.lexer in Phobos would be quite good. With a pre-compiled
> >> lexer for D.
> >
> > I'm actually quite far along with one now - one which is specifically
> > written and optimized for lexing D. I'll probably be done with it not
> > too long after the 2.060 release (though we'll see).
>
> That was quick! Cool!
Yeah. Once I started on it, I made a lot of progress really quickly. There's
still a fair bit to do (primarily having to do with literals), but it probably
won't take all that much longer. Certainly, I'd expect to have it done within
a couple of weeks if not sooner, unless something goes wrong.

> > Writing it has been going surprisingly quickly actually, and I've
> > already found some bugs in the spec as a result (some of which have
> > been fixed, some of which I still need to create pull requests for).
> > So, regardless of what happens with my lexer, at least the spec will
> > be more accurate.
>
> Could you please describe the kind of token it produces?
> Can it build a symbol table?
> Does it recognize all kinds of strings (including q{ } ones)?
> How does it deal with comments, particularly nested ones?
> Does it automatically discard whitespace or produce it as a token?
> I'd favor this approach, if only because wrapping the lexer in a
> filter!noWS(tokenRange) is easy.
> Does it produce a lazy range btw?

Well, it's still a work in progress, so it can certainly be adjusted as
necessary. I intend for it to fully implement the spec (and to make sure that
both it and the spec match what dmd is doing) as far as lexing goes. The idea
is that you should be able to build a fully compliant D parser on top of it
and a fully compliant D compiler on top of that.

It already supports all of the comment types and several of the string
literal types. I haven't sorted out q{} yet, but I will before I'm done, and
that may or may not affect how some things work, since I'm not quite sure how
to handle q{} yet (it may end up being done with tokens marking the beginning
and end of the token sequence encompassed by q{}, but we'll see).

I'm in the middle of dealing with the named entity stuff at the moment, which
unfortunately has revealed a rather nasty compiler bug with regard to
template compile times, which I still need to report (I intend to do that
this evening). The file generating the table of named entities currently
takes over 6 minutes to compile on my Phenom II thanks to that bug, so I'm
not quite sure how that's going to affect things. Regardless, the lexer
should support _everything_ required for fully lexing D by the time that I'm
done.

I don't have the code with me at the moment, but I believe that the token
type looks something like

struct Token
{
    TokenType    type;
    string       str;
    LiteralValue value;
    SourcePos    pos;
}

struct SourcePos
{
    size_t line;
    size_t col;
    size_t tabWidth = 8;
}

The type is an enum which gives the type of the token (obviously), including
the various comment types and an error type (so errors are reported by
returning a token that is an error token rather than by throwing or anything
like that, which should make lexing past malformed code easy). str holds the
exact text which was lexed (including the entire comment for the various
comment token types). When lexing a string rather than another range type,
str would normally (always? - I don't remember) be a slice of the string
being lexed, which should make lexing strings very efficient. It may or may
not make sense to change that to the range type being used rather than
string.

For nesting block comments, the whole comment is one token (with the token
type which is specifically for nested comments), regardless of whether
there's any nesting going on. But that could be changed if there were a need
to get separate tokens for the comments inside.
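To give you an idea of the usage I have in mind for the error tokens,
something along these lines (purely a sketch - lexD and the TokenType member
names are guesses on my part, since I don't have the code with me):

import std.stdio;

// Sketch only: assumes a lexD function returning a lazy range of the Token
// struct above and a TokenType.error member - neither name is final.
void printErrors(R)(R tokens)
{
    foreach(token; tokens)
    {
        // Errors come back as tokens rather than being thrown, so the
        // loop just keeps going past the malformed code.
        if(token.type == TokenType.error)
            stderr.writefln("%s(%s): %s",
                            token.pos.line, token.pos.col, token.str);
    }
}

The exact names could easily change, but the general shape should hold.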
LiteralValue is a VariantN of the types that a literal can be (long, ulong,
real, and string IIRC) and is empty unless the token is a literal type (the
various string postfixes - c, w, and d - are treated as different token types
rather than giving the literal value different string types, and the same
goes for the integral and floating point literals).

And pos holds the position in the text where the token started, which should
make it easy to use for syntax highlighting and the like (as well as for
indicating where an error occurred). The initial position is passed as an
optional argument to the lexing function, so it doesn't have to start at 1:1
(though that's the default), and it allows you to select the tab width.

So, you'll pass a range and an optional starting position to the lexing
function, and it'll return a lazy range of Tokens. Whitespace is stripped as
part of the lexing process, but if you take the pos properties of two
adjacent tokens, you should be able to determine how much whitespace was
between them.

I _think_ that that's how it currently works, but again, I don't have the
code with me at the moment, so it may not be 100% correct. And since it's a
work in progress, it's certainly open to changes.
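Just to illustrate what I mean about LiteralValue and the pos properties
(again, a sketch only - the exact set of types may differ, and spacesBetween
is merely an illustration, not part of the lexer):

import std.variant;

// Roughly what LiteralValue amounts to:
alias LiteralValue = Algebraic!(long, ulong, real, string);

// Reconstructing the whitespace between two adjacent tokens from their pos
// properties, assuming that both are on the same line and ignoring tabs
// (which is what tabWidth in SourcePos is for):
size_t spacesBetween(Token prev, Token next)
{
    assert(prev.pos.line == next.pos.line);
    return next.pos.col - (prev.pos.col + prev.str.length);
}

- Jonathan M Davis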