Well *that* was weird. Sorry for the mis-send. I know this topic has come up before, and sorry to bring it up again.
Context: I'm bringing up BitC on CLI, and planning to use ANTLR to do it. BitC characters cover the full Unicode range (code points up to U+10FFFF, which need 21 bits).

The good news:

1. Characters above U+FFFF can only appear in character and string literals.
2. The language requires that units of compilation be encoded in UTF-8.
3. Both JVM and CLI carry strings as UTF-16, so if we carry character literals around as string payloads we can deal with this internally.
4. Outside of character and string literals, the legal input characters all fall within the 16-bit Unicode subset (the Basic Multilingual Plane).

When we dealt with this in the current, yacc-based implementation, we proceeded as follows:

1. We hand-wrote the lexer and had it process the raw input as a byte stream, hand-decoding UTF-8 sequences as appropriate.
2. To carry around string literal values, we encoded them internally as UTF-8 (because this was C). In JVM/CLR, obviously, we would encode in UTF-16.
3. We carried character literal values around internally as unsigned 32-bit integers.

So basically, we found that an "arm's length Unicode" approach worked out for us. I had thought to adopt a similar approach with ANTLR.

I've been reading the ANTLR Reference book, and I noted a comment to the effect that if you hand-write a lexer you lose the ability to do certain kinds of lookahead. Is this the case, or is it possible to hand-write a lexer in a fashion that cooperates with the regular behavior of ANTLR?

Thanks

Jonathan
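P.S. For concreteness, the "arm's length" handling described above amounts to roughly the following. This is only a sketch in Java, not the actual C code from the yacc-based implementation; the class name `Utf8Sketch` and its methods are made up for illustration and are not part of ANTLR. It decodes one UTF-8 sequence from a byte buffer into an integer code point, and converts a code point into the UTF-16 string form that JVM and CLI strings use. It assumes the buffer is positioned at the start of a sequence, and it does not reject overlong encodings or encoded surrogates.

```java
// Hypothetical helper for illustration only; not ANTLR API and not BitC source.
final class Utf8Sketch {
    private Utf8Sketch() {}

    /** Byte length of the UTF-8 sequence whose lead byte is b, or -1 if b cannot start a sequence. */
    static int sequenceLength(int b) {
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;                        // continuation byte or invalid lead
    }

    /** Decodes the single code point that starts at buf[pos]. */
    static int decode(byte[] buf, int pos) {
        int lead = buf[pos] & 0xFF;
        int len = sequenceLength(lead);
        if (len < 0 || pos + len > buf.length) {
            throw new IllegalArgumentException("malformed UTF-8 at byte " + pos);
        }
        if (len == 1) {
            return lead;
        }
        // Keep only the payload bits of the lead byte: 5, 4, or 3 of them.
        int cp = lead & (0xFF >> (len + 1));
        for (int i = 1; i < len; i++) {
            int b = buf[pos + i] & 0xFF;
            if ((b & 0xC0) != 0x80) {
                throw new IllegalArgumentException("bad continuation byte at " + (pos + i));
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp > 0x10FFFF) {
            throw new IllegalArgumentException("code point out of range at byte " + pos);
        }
        return cp;
    }

    /** UTF-16 form of a code point: one char in the BMP, a surrogate pair above U+FFFF. */
    static String toUtf16(int codePoint) {
        return new String(Character.toChars(codePoint));
    }
}
```

A lexer working this way would call `sequenceLength` on the lead byte to know how far to advance, keep character literal values as the `int` from `decode`, and build string literal payloads by appending `toUtf16` results.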
