Please note that the documentation for the C runtime in 3.4 is yet to be updated. In the meantime, if you wish to try it, then there is one change that you need to be aware of:
1) The distinction between ASCII and UCS2 input streams is now removed. There is a single function, antlr3FileStreamNew(), to replace the file related input streams, and a single function, antlr3StringStreamNew(), to replace the memory related input streams.

Prototypes and usage:

  antlr3FileStreamNew(pANTLR3_UINT8 fileName, ANTLR3_UINT32 encoding)
  antlr3StringStreamNew(pANTLR3_UINT8 data, ANTLR3_UINT32 encoding, ANTLR3_UINT32 size, pANTLR3_UINT8 name)

  fileName - path to the input file in 8 bit characters. Used to call fopen()
  data     - pointer to the input data in any encoded form (note that I will change this to void * in the next beta/release)
  size     - the size of the input data (always in bytes, regardless of encoding)
  name     - the name to use for the string stream (passed to error handlers, for instance); may be NULL

Then the encoding values are:

  ANTLR3_ENC_8BIT    - 8 bit encoding (ASCII/latin1/etc); replaces the existing ASCII stream
  ANTLR3_ENC_UTF8    - UTF8 encoding (eats any BOM that may be present)
  ANTLR3_ENC_UTF16   - UTF16 encoding (also handles UCS2); byte order is determined from the BOM, or is machine natural if there is no BOM
  ANTLR3_ENC_UTF16BE - UTF16 encoding (also handles UCS2), big endian but no BOM
  ANTLR3_ENC_UTF16LE - UTF16 encoding (also handles UCS2), little endian but no BOM
  ANTLR3_ENC_UTF32   - UTF32 encoding; byte order is determined from the BOM, or is machine natural if there is no BOM
  ANTLR3_ENC_UTF32BE - UTF32 encoding, big endian but no BOM
  ANTLR3_ENC_UTF32LE - UTF32 encoding, little endian but no BOM
  ANTLR3_ENC_EBCDIC  - EBCDIC encoding (8 bit)

Note that EBCDIC encoding means that the input is in EBCDIC and it is not changed. The LA() method for EBCDIC encoding converts each character to ASCII before matching. The pointers to the first character of a token in the input stream therefore still point at EBCDIC data, and you are responsible for converting the token strings if you need them in another encoding.
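To make the "byte order from BOM or machine natural" rule for ANTLR3_ENC_UTF16 concrete, here is a minimal standalone sketch of that decision. It is not the runtime's code: byte_order, native_order() and utf16_order() are hypothetical names invented for this illustration, and the only assumption is the standard UTF-16 BOM bytes (FE FF for big endian, FF FE for little endian).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not part of the ANTLR3 C runtime. */
typedef enum { ORDER_BIG, ORDER_LITTLE } byte_order;

/* Byte order of the machine running this code. */
byte_order native_order(void)
{
    const uint16_t probe = 1;
    /* On a little endian machine the low byte is stored first. */
    return (*(const unsigned char *)&probe == 1) ? ORDER_LITTLE : ORDER_BIG;
}

/*
 * Pick the byte order for a UTF-16 buffer: use the BOM if one is present,
 * otherwise fall back to the machine's natural order. *skip receives the
 * number of BOM bytes to step over (0 or 2).
 */
byte_order utf16_order(const unsigned char *data, size_t size, size_t *skip)
{
    assert(data != NULL && skip != NULL);
    *skip = 0;
    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) { *skip = 2; return ORDER_BIG; }
        if (data[0] == 0xFF && data[1] == 0xFE) { *skip = 2; return ORDER_LITTLE; }
    }
    return native_order();
}
```

If you already know the byte order and the data carries no BOM, this is the decision that the explicit ANTLR3_ENC_UTF16BE and ANTLR3_ENC_UTF16LE values let you skip.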
Encoding is as per the Unicode standards and supports the full Unicode character range; all surrogate pairs are decoded in UTF16. Note, however, that for performance reasons errors in the encoding are usually ignored (for instance, a valid hi surrogate that does not have a lo surrogate), but invalid sequences that cannot be ignored may screw up your input. You can of course override any of the LA methods and report such things as errors, should you need to do so. The purpose of LA() is to return the 32 bit integer Unicode code point for the specified character. How it does that is irrelevant to the lexer, which is just matching 32 bit numbers. This means you should not code your lexer to match surrogates, just the code points.

Jim
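As an illustration of why the lexer only ever sees code points, here is a small standalone sketch of turning UTF16 code units into a 32 bit code point. It is not the runtime's LA() implementation: utf16_code_point() is a hypothetical helper, and passing an unpaired surrogate through unchanged is just one plausible reading of "errors are usually ignored".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the runtime's LA(): return the code point that
 * starts at units[pos] and store how many 16 bit units it used in *used.
 * An unpaired surrogate is returned as-is instead of being reported.
 */
uint32_t utf16_code_point(const uint16_t *units, size_t count, size_t pos, size_t *used)
{
    assert(units != NULL && used != NULL && pos < count);
    uint32_t hi = units[pos];
    *used = 1;
    /* A hi surrogate followed by a lo surrogate encodes one code point. */
    if (hi >= 0xD800 && hi <= 0xDBFF && pos + 1 < count) {
        uint32_t lo = units[pos + 1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            *used = 2;
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    return hi;
}
```

With this kind of decoding in place, a lexer rule matches U+1F600 as the single value 0x1F600 and never needs to mention 0xD83D or 0xDE00.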
