Jonathan M Davis, in message (digitalmars.D:174191), wrote:
> On Thursday, August 02, 2012 11:08:23 Walter Bright wrote:
>> The tokens are not kept, correct. But the identifier strings, and the string
>> literals, are kept, and if they are slices into the input buffer, then
>> everything I said applies.
>
> String literals often _can't_ be slices unless you leave them in their
> original state rather than giving the version that they translate to (e.g.
> leaving \&copy; in the string rather than replacing it with its actual,
> unicode value). And since you're not going to be able to create the literal
> using whatever type the range is unless it's a string of some variety, that
> means that the literals often can't be slices, which - depending on the
> implementation - would make it so that they can't _ever_ be slices.
>
> Identifiers are a different story, since they don't have to be translated at
> all, but regardless of whether keeping a slice would be better than creating
> a new string, the identifier table will be far superior, since then you only
> need one copy of each identifier. So, it ultimately doesn't make sense to use
> slices in either case even without considering issues like them being spread
> across memory.
>
> The only place that I'd expect a slice in a token is in the string which
> represents the text which was lexed, and that won't normally be kept around.
>
> - Jonathan M Davis

I thought it was not the lexer's job to process literals. Just split the
input into tokens, and provide minimal info: TokenType, line and col, along
with the representation from the input. That's enough for a syntax
highlighting tool, for example. Otherwise you'll end up doing complex
interpretation and the lexer will not be that efficient. Literal
interpretation can be done in a second step. Do you think doing literal
interpretation separately, when you need it, would be less efficient?

-- 
Christophe
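
To make the two-step idea concrete, here is a minimal sketch in Python (not D, and not any existing lexer's API). The toy grammar, the `Token` fields, and the names `lex` and `string_value` are all assumptions made up for illustration. The first step only records type, line, col and the raw text; the second step decodes a string literal on request.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str  # token type, e.g. "IDENT" or "STRING"
    line: int  # 1-based line of the first character
    col: int   # 1-based column of the first character
    text: str  # raw representation, exactly as written in the input


# Hypothetical toy grammar, just enough to show the idea.
TOKEN_RE = re.compile(r"""
    (?P<WS>\s+)
  | (?P<STRING>"(?:\\.|[^"\\])*")
  | (?P<NUMBER>\d+)
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<OP>.)
""", re.VERBOSE)


def lex(source: str) -> Iterator[Token]:
    """Step 1: split the input into tokens without interpreting literals."""
    line, col = 1, 1
    for m in TOKEN_RE.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind != "WS":
            yield Token(kind, line, col, text)
        # Advance the position past the text just consumed.
        newlines = text.count("\n")
        if newlines:
            line += newlines
            col = len(text) - text.rfind("\n")
        else:
            col += len(text)


# Only a handful of escapes are handled; a real lexer would need many more.
ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def string_value(token: Token) -> str:
    """Step 2: decode a STRING token, only when its value is needed."""
    assert token.kind == "STRING"
    body = token.text[1:-1]  # drop the surrounding quotes
    return re.sub(r"\\(.)", lambda m: ESCAPES.get(m.group(1), m.group(1)), body)


tokens = list(lex('x = "a\\tb"\nfoo'))
```

A syntax highlighter would consume `tokens` directly and never call `string_value`, while a parser would call it only for the literals it actually keeps. The sketch does not settle whether deferring the decoding is slower overall than doing it during lexing; it only shows that the two steps can be separated.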