[il-antlr-interest: 26348] Re: [antlr-interest] ANTLR C: Question regarding the portability of generated lexer C code

Jim Idle Sat, 17 Oct 2009 01:40:11 -0700

A couple of other things though on this now I think about it.


1)      To avoid problems with various systems’ ideas about what wchar_t is, 
character strings are encoded in their ASCII hex forms; you would have to 
avoid that by encoding them as ‘C’ ‘h’ ‘a’ ‘r’ instead;

2)      Use ‘\n’ etc rather than unicode code points in your rules;

3)      Use 3.2 and ensure that you get switches generated for everything and 
not DFA tables;
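
To illustrate point 1, here is a rough sketch of the two ways a literal such as 
"Char" could be spelled in generated C. The array names and the uint32_t 
element type are assumptions made for illustration; this is not a copy of what 
the C target actually emits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the element type is an assumption for illustration. */

/* Code-point form: the same values under any compiler character set. */
static const uint32_t lit_codepoints[] = { 0x43, 0x68, 0x61, 0x72, 0 };

/* Native form: values follow the compiler's execution character set,
   so they differ between an ASCII build and an EBCDIC build. */
static const uint32_t lit_native[] = { 'C', 'h', 'a', 'r', 0 };

/* Returns 1 when both spellings produce identical values, which is the
   case when the execution character set is ASCII-compatible. */
int literals_match(void)
{
    size_t n = sizeof lit_codepoints / sizeof lit_codepoints[0];
    for (size_t i = 0; i < n; i++) {
        if (lit_codepoints[i] != lit_native[i])
            return 0;
    }
    return 1;
}
```

On an EBCDIC compiler the second array would hold EBCDIC values, so which form 
is right depends on whether the lexer is fed Unicode or native characters.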

 

I cannot immediately think of anything else that would get in the way. However, 
if you hack something together, send me the changes you had to make; if they 
are reasonable, I am willing to add them to the C target :) If the main thing 
is the encoding in ASCII form of string literals, then it might be reasonable 
to add an ebcdic option to the command line, and only those targets where it is 
an issue would look at it. I just don’t want to embark on huge changes just to 
accommodate IBM.

 

I am not sure how true it is or not, but a friend of mine worked for IBM in 
Florida and met a guy who was (or at least said he was) on the development 
committee for EBCDIC. He claimed that he had deliberately thrown in outrageous 
suggestions (but I don’t know if it was out of disgust or a bizarre sense of 
humor), and that some of them were adopted. 

 

Anyway, you have my sympathies trying to work on the C stuff on z/OS. I was a 
pioneer in using it and I can only hope that it is more mature these days. If I 
can help you, then I will :)

 

Jim

 

From: Lego Haryanto [mailto:[email protected]] 
Sent: Thursday, October 15, 2009 8:27 PM
To: Jim Idle
Cc: [email protected]
Subject: Re: [antlr-interest] ANTLR C: Question regarding the portability of 
generated lexer C code

 

Jim, thanks for your response ...

I know that in the EBCDIC system we feed a Unicode stream into the lexer, so 
I'm pretty sure that when the generated lexer code I pasted before is executed, 
it is already operating on the 32-bit Unicode stream.

The problem is more about the native C compilation in an EBCDIC system like IBM 
z/OS mainframe.

To see if a character from the Unicode stream is an 'A', we have to compare 
with a value 0x0041 ... If we match it with a native 'A' in the code, this will 
not be a match in an EBCDIC C compilation.
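
As a small sketch of the comparison described above, the checks can be written 
with numeric code points only, so the result does not depend on how the 
compiler encodes character literals. The function names are hypothetical and 
are not part of the ANTLR runtime.

```c
#include <stdint.h>

/* Hypothetical helpers: classify a UTF-32 code point using numeric
   constants only, so the result does not depend on how the compiler
   encodes 'A' or 'z'. */
int is_upper_latin(uint32_t cp)
{
    return cp >= 0x0041 && cp <= 0x005A;   /* U+0041 'A' .. U+005A 'Z' */
}

int is_lower_latin(uint32_t cp)
{
    return cp >= 0x0061 && cp <= 0x007A;   /* U+0061 'a' .. U+007A 'z' */
}
```

With these, is_upper_latin(0x0041) is true under any compiler, whereas a test 
against a native 'A' compares with 0xC1 when compiled for EBCDIC.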

Best,
-Lego

On Fri, Oct 16, 2009 at 3:07 AM, Jim Idle <[email protected]> wrote:

ANTLR works internally with 32-bit Unicode (UTF-32), not EBCDIC, even if it is 
in 8-bit mode. So you need to convert the EBCDIC to 8-bit Unicode and use the 
‘ASCII’ input stream. A simple way to do this would be to write your own EBCDIC 
input stream that just converts to Unicode code points (essentially 
EBCDIC->ASCII) on the fly via a lookup table. Trivial and should be pretty 
quick.
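
A minimal sketch of such a lookup table follows. It is deliberately partial: 
only the space, digits and unaccented Latin letters are mapped, and everything 
else falls back to U+FFFD. A real table would need all 256 entries for the 
specific EBCDIC code page in use. All names here are hypothetical, and the 
hookup to an ANTLR input stream is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table: EBCDIC byte -> Unicode code point. */
static uint32_t ebcdic_to_ucs[256];

/* Map `count` consecutive EBCDIC bytes onto consecutive code points. */
static void map_run(unsigned char first_ebcdic, uint32_t first_ucs, int count)
{
    for (int i = 0; i < count; i++)
        ebcdic_to_ucs[first_ebcdic + i] = first_ucs + (uint32_t)i;
}

/* Fill the table. Only space, digits and A-Z / a-z are mapped; every
   other byte becomes U+FFFD (assumption: a placeholder for this sketch). */
void ebcdic_table_init(void)
{
    for (int i = 0; i < 256; i++)
        ebcdic_to_ucs[i] = 0xFFFD;

    map_run(0x40, 0x0020, 1);    /* space */
    map_run(0xF0, 0x0030, 10);   /* 0-9   */
    map_run(0xC1, 0x0041, 9);    /* A-I   */
    map_run(0xD1, 0x004A, 9);    /* J-R   */
    map_run(0xE2, 0x0053, 8);    /* S-Z   */
    map_run(0x81, 0x0061, 9);    /* a-i   */
    map_run(0x91, 0x006A, 9);    /* j-r   */
    map_run(0xA2, 0x0073, 8);    /* s-z   */
}

/* Convert n EBCDIC bytes to UTF-32 code points, one table lookup each. */
void ebcdic_to_utf32(const unsigned char *in, size_t n, uint32_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = ebcdic_to_ucs[in[i]];
}
```

Call ebcdic_table_init() once at startup; after that each character costs a 
single array lookup.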

 

Jim

 

From: [email protected] 
[mailto:[email protected]] On Behalf Of Lego Haryanto
Sent: Tuesday, October 13, 2009 3:51 AM
To: [email protected]
Subject: [antlr-interest] ANTLR C: Question regarding the portability of 
generated lexer C code

 

I just recently noticed that the generated code from my lexer grammar contains 
something like the following snippet:

            .
            .
            else if ( (((LA17_0 >= 'A') && (LA17_0 <= 'Z'))) ) 
            {
                alt17=2;
            }
            else if ( (((LA17_0 >= 'a') && (LA17_0 <= 'z'))) ) 
            {
                alt17=3;
            }
            else if ( (((LA17_0 >= 0x00A0) && (LA17_0 <= 0xD7FF))) ) 
            {
                alt17=4;
            }
            .
            .

The generated code seems to comfortably use 'A' ... 'Z' literals.  This may not 
be good if, say, I compile the generated code in an IBM z/OS EBCDIC 
environment, as the ['A' .. 'Z'] range there contains more than just the 26 
letter codes, and the values of the codes are not the same as the ones in the 
Unicode character set.
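
To make the concern concrete, here is a small sketch of where the capital 
letters sit in EBCDIC. The function names are hypothetical, and the byte values 
are the usual EBCDIC positions for the unaccented Latin capitals.

```c
/* EBCDIC capital letters sit in three separate runs:
   A-I = 0xC1-0xC9, J-R = 0xD1-0xD9, S-Z = 0xE2-0xE9. */
int is_ebcdic_upper(unsigned char c)
{
    return (c >= 0xC1 && c <= 0xC9) ||
           (c >= 0xD1 && c <= 0xD9) ||
           (c >= 0xE2 && c <= 0xE9);
}

/* Count the values in the naive range ['A' .. 'Z'] as an EBCDIC compiler
   would see it (0xC1 .. 0xE9) that are not capital letters. */
int ebcdic_range_extras(void)
{
    int extras = 0;
    for (int c = 0xC1; c <= 0xE9; c++) {
        if (!is_ebcdic_upper((unsigned char)c))
            extras++;
    }
    return extras;
}
```

The naive range spans 41 values, so 15 of them are not capital letters, and 
none of the 41 equals the Unicode code point of the letter it stands for.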

I'm expecting something like in the third expression where 'A' is written 
explicitly as 0x0041 (Unicode for 'A').

Please confirm.

-Lego




-- 
Fear of the LORD is the beginning of knowledge (Proverbs 1:7)



