[il-antlr-interest: 28225] [antlr-interest] Unicode lexing

Jonathan S. Shapiro Tue, 09 Mar 2010 15:59:27 -0800

Well *that* was weird. Sorry for the mis-send.

I know this topic has come up before, and sorry to bring it up again.


Context: I'm bringing up BitC on CLI, and planning to use ANTLR to do it.
BitC characters cover the full Unicode range (U+0000 through U+10FFFF, which
takes 21 bits).

The good news:

   1. Characters above U+FFFF can only appear in character and string
   literals.
   2. The language requires that units of compilation be encoded in UTF-8.
   3. Both JVM and CLI carry strings as UTF-16, so if we carry character
   literals around as string payloads we can deal with this internally.
   4. Outside of character and string literals the legal input characters
   all fall within the 16-bit Unicode subset (the Basic Multilingual Plane).
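
To make point 3 concrete, here is a minimal Java sketch of carrying a single
character literal as a UTF-16 string payload. The class and method names
(CharLiteralPayload, encode, decode) are hypothetical and come from neither
BitC nor ANTLR; only the java.lang.Character and String calls are standard.

```java
// Hypothetical helper, for illustration only.
final class CharLiteralPayload {
    /** Encode one Unicode code point as a UTF-16 string (one or two chars). */
    static String encode(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        // Character.toChars yields a surrogate pair for code points above U+FFFF.
        return new String(Character.toChars(codePoint));
    }

    /** Recover the code point from a payload that holds exactly one character. */
    static int decode(String payload) {
        if (payload.isEmpty()) {
            throw new IllegalArgumentException("empty payload");
        }
        int cp = payload.codePointAt(0);
        // A payload longer than the first code point holds more than one character.
        if (payload.length() != Character.charCount(cp)) {
            throw new IllegalArgumentException("payload is not a single code point");
        }
        return cp;
    }
}
```

A code point above U+FFFF occupies two UTF-16 code units, so anything that
measures such a payload with String.length() will see 2 rather than 1.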

When we dealt with this in the current, yacc-based implementation, we
proceeded as follows:

   1. We hand-wrote the lexer and had it process the raw input as a byte
   stream. We then hand-decoded UTF-8 sequences as appropriate.
   2. To carry around string literal values we encoded them internally as
   UTF-8 (because this was C). In JVM/CLR, obviously, we would encode in
   UTF-16.
   3. We internally carted character literal values around as an unsigned
   32-bit integer.
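
For readers who have not done step 1 by hand, the following is a rough Java
sketch of decoding one UTF-8 sequence from a raw byte buffer into a 32-bit
code point value. It is not the BitC lexer's actual code (that was written in
C); the names Utf8Decoder and decodeAt are made up for this illustration, and
the error handling is simply an assumption about what a strict decoder should
reject.

```java
// Hypothetical decoder, for illustration only.
final class Utf8Decoder {
    /** Decode the UTF-8 sequence that starts at bytes[pos]; return its code point. */
    static int decodeAt(byte[] bytes, int pos) {
        int b0 = bytes[pos] & 0xFF;
        if (b0 < 0x80) {
            return b0; // single-byte (ASCII) case
        }
        final int len; // total bytes in the sequence
        final int min; // smallest code point allowed at this length
        int cp;        // accumulated payload bits
        if ((b0 & 0xE0) == 0xC0) {
            len = 2; min = 0x80; cp = b0 & 0x1F;
        } else if ((b0 & 0xF0) == 0xE0) {
            len = 3; min = 0x800; cp = b0 & 0x0F;
        } else if ((b0 & 0xF8) == 0xF0) {
            len = 4; min = 0x10000; cp = b0 & 0x07;
        } else {
            throw new IllegalArgumentException("invalid lead byte at " + pos);
        }
        if (pos + len > bytes.length) {
            throw new IllegalArgumentException("truncated sequence at " + pos);
        }
        for (int i = 1; i < len; i++) {
            int b = bytes[pos + i] & 0xFF;
            if ((b & 0xC0) != 0x80) {
                throw new IllegalArgumentException("bad continuation byte at " + (pos + i));
            }
            cp = (cp << 6) | (b & 0x3F); // each continuation byte adds 6 bits
        }
        // Reject overlong encodings, surrogates, and values past U+10FFFF.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("invalid code point at " + pos);
        }
        return cp;
    }
}
```

The returned int plays the role of the unsigned 32-bit character value in
step 3. A caller that scans a whole buffer would also need the sequence
length to advance its position, which this sketch leaves out.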

So basically, we found that an "arm's length unicode" approach worked out
for us. I had thought to adopt a similar approach with Antlr.

I've been reading the Antlr Reference book, and I noted a comment to the
effect that if you hand-write a lexer you lose the ability to do certain
kinds of lookahead. Is this the case, or is it possible to hand-write a
lexer in a fashion that cooperates with the regular behavior of Antlr?

Thanks


Jonathan

