Jim, thanks for your response. On the EBCDIC system we already feed a Unicode stream into the lexer, so I'm pretty sure that when the generated lexer code I pasted earlier runs, it is operating on the 32-bit Unicode stream.
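To make the concern concrete, here is a minimal sketch in C. The helper names (is_latin_upper, is_latin_lower, ebcdic_letter_to_unicode) are made up for illustration and are not part of the ANTLR C runtime or its generated code. The sketch spells the code points numerically, so the comparisons do not depend on the compiler's execution character set. It also includes a deliberately partial EBCDIC-to-Unicode mapping that covers only the basic Latin letters.

```c
#include <stdint.h>

/* Unicode code points written as numbers, so their values are the same
   whether the compiler's execution character set is ASCII or EBCDIC. */
#define CP_UPPER_A 0x0041u
#define CP_UPPER_Z 0x005Au
#define CP_LOWER_A 0x0061u
#define CP_LOWER_Z 0x007Au

/* Hypothetical helper: is this UTF-32 code point in A..Z? */
int is_latin_upper(uint32_t cp)
{
    return cp >= CP_UPPER_A && cp <= CP_UPPER_Z;
}

/* Hypothetical helper: is this UTF-32 code point in a..z? */
int is_latin_lower(uint32_t cp)
{
    return cp >= CP_LOWER_A && cp <= CP_LOWER_Z;
}

/* Hypothetical, partial EBCDIC -> Unicode mapping (letters only).
   EBCDIC letters sit in three non-contiguous runs per case, which is why
   a native 'A'..'Z' range test covers more than 26 values there.
   Any byte that is not a basic Latin letter maps to U+FFFD here. */
uint32_t ebcdic_letter_to_unicode(uint8_t b)
{
    if (b >= 0xC1 && b <= 0xC9) return 0x0041u + (uint32_t)(b - 0xC1); /* A..I */
    if (b >= 0xD1 && b <= 0xD9) return 0x004Au + (uint32_t)(b - 0xD1); /* J..R */
    if (b >= 0xE2 && b <= 0xE9) return 0x0053u + (uint32_t)(b - 0xE2); /* S..Z */
    if (b >= 0x81 && b <= 0x89) return 0x0061u + (uint32_t)(b - 0x81); /* a..i */
    if (b >= 0x91 && b <= 0x99) return 0x006Au + (uint32_t)(b - 0x91); /* j..r */
    if (b >= 0xA2 && b <= 0xA9) return 0x0073u + (uint32_t)(b - 0xA2); /* s..z */
    return 0xFFFDu; /* not handled by this sketch */
}
```

With these, is_latin_upper(0x0041) holds on any platform, while the EBCDIC value of 'A' (0xC1) taken as a code point falls outside the range. A real input stream would presumably use a full 256-entry table for the specific code page rather than this range arithmetic.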
The problem is more about native C compilation on an EBCDIC system such as IBM z/OS on the mainframe. To check whether a character from the Unicode stream is an 'A', we have to compare it with the value 0x0041. If the generated code compares it with a native 'A' literal instead, it will not match in an EBCDIC C compilation, where 'A' compiles to 0xC1.

Best,
-Lego

On Fri, Oct 16, 2009 at 3:07 AM, Jim Idle <[email protected]> wrote:

> ANTLR works internally with 32 bit Unicode (UTF32), not EBCDIC, even if
> it is in 8 bit mode. So you need to convert the EBCDIC to Unicode 8 bits and
> use the ‘ASCII’ input stream. A simple way to do this would be to write your
> own EBCDIC input stream that just converted to Unicode code points
> (essentially EBCDIC->ASCII) on the fly via a lookup table. Trivial and
> should be pretty quick.
>
> Jim
>
> *From:* [email protected] [mailto:
> [email protected]] *On Behalf Of *Lego Haryanto
> *Sent:* Tuesday, October 13, 2009 3:51 AM
> *To:* [email protected]
> *Subject:* [antlr-interest] ANTLR C: Question regarding the portability of
> generated lexer C code
>
> I just recently noticed that the generated code from my lexer grammar
> contains something like the following snippet:
>
> .
> .
> else if ( (((LA17_0 >= 'A') && (LA17_0 <= 'Z'))) )
> {
>     alt17=2;
> }
> else if ( (((LA17_0 >= 'a') && (LA17_0 <= 'z'))) )
> {
>     alt17=3;
> }
> else if ( (((LA17_0 >= 0x00A0) && (LA17_0 <= 0xD7FF))) )
> {
>     alt17=4;
> }
> .
> .
>
> The generated code seems to comfortably use 'A' ... 'Z' literals. This may
> not be good if let's say I compile the generated code in an IBM z/OS EBCDIC
> environment as ['A' .. 'Z'] range contains more than just the 26 alphabet
> codes and the value of the codes are not the same as the ones in Unicode
> character set.
>
> I'm expecting something like in the third expression where 'A' is written
> explicitly as 0x0041 (Unicode for 'A').
>
> Please confirm.
>
> -Lego

--
Fear of the LORD is the beginning of knowledge (Proverbs 1:7)
List: http://www.antlr.org/mailman/listinfo/antlr-interest Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address
