Re: [antlr-dev] \uFFFF

Johannes Luber Wed, 03 Sep 2008 13:47:59 -0700

Terence Parr wrote:
> 
> On Sep 3, 2008, at 1:31 PM, Johannes Luber wrote:
> 
>> Terence Parr wrote:
>>> Hi, I designed everything to be 32-bit clean in terms of token types
>>> and character input so, while \uFFFF is not a valid character, there is
>>> no reason we can't allow it as input. Currently we do not. I set the
>>> maximum to \uFFFE, but I am changing it to:
>>>
>>>     public static final int MAX_CHAR_VALUE = '\uFFFF';
>>>
>>> My unit tests and examples directory seemed to work okay.  The Java.g
>>> grammar for Sun needs to mimic what the javac compiler does; it allows
>>> '\uFFFF' and, more importantly, converts that to the single Unicode
>>> character code point BEFORE the compiler sees it. It is done in the
>>> character string. Anyway, ANTLR says that is an invalid character at
>>> the moment. I don't think we will have a problem... can anyone think
>>> of an issue? I do all of my checks using -1, not '\uFFFF', for EOF... we
>>> *should* be okay...
>>>
>>> Ter
>>
>> Does Unicode restrict the use of \uffff? Will you at some point add
>> native handling for chars bigger than \uffff in ANTLR, even if you need
>> some translation functions to make things work in Java?
> 
> I have to figure out what all the 32-bit character stuff means... The
> machinery of ANTLR should be able to handle 32-bit signed integers as
> characters or token types... representing them in a Java string for the
> Java target is another matter ;)
> 
> Ter
> 
Unicode assigns a code point to each defined character. The UTF-x
encodings define how a particular code point is encoded. For UTF-16 this
means that the integers in '\uD800'..'\uDBFF' (high surrogates) and
'\uDC00'..'\uDFFF' (low surrogates) are reserved as surrogate code
points. A code point above '\uFFFF' is encoded as a pair of them, which
takes 2 chars/4 bytes in Java (and C#/.NET). Such a pair can be
transformed from and to the UTF-32 encoding, which is the 1:1
representation of code point numbers.
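
To make the surrogate mechanics concrete for the Java side, here is a
minimal sketch of the round trip between a code point and its UTF-16
form. The class and method names are made up for illustration; only
Character.toChars and String.codePointAt are real JRE calls.

```java
// Sketch only: SurrogateSketch and its method names are hypothetical.
final class SurrogateSketch {
    /** Encodes a valid code point as one or two UTF-16 chars. */
    static String toUtf16(int codePoint) {
        // Character.toChars returns one char for the BMP and a
        // high/low surrogate pair for anything above U+FFFF.
        return new String(Character.toChars(codePoint));
    }

    /** Decodes the code point that starts at index 0 of a non-empty string. */
    static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        String clef = toUtf16(0x1D11E); // MUSICAL SYMBOL G CLEF
        System.out.printf("%04X %04X%n", (int) clef.charAt(0), (int) clef.charAt(1));
    }
}
```

For U+1D11E this gives the pair D834 DD1E, and firstCodePoint maps the
pair back to 0x1D11E. A code point up to U+FFFF stays a single char.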


I'd prefer it if you added '\Uxxxxxxxx' alongside '\uxxxx'; that is how
C# encodes code points above '\uFFFF'. I wrote some functions to deal
with this for my own lexer at an earlier stage. You may find them useful,
even if you have to adapt them to the JRE. While my file is under the MIT
license, you can license snippets under BSD.
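
As a rough idea of what the JRE-side equivalent could look like, here is
a small sketch that decodes both escape forms into a Java string. It is
not taken from ANTLR or from my grammar; UnicodeEscapeSketch and decode
are invented names, and the strict 4/8 digit check is an assumption
modelled on the C# rules.

```java
import java.util.regex.Pattern;

// Sketch only: class and method names are hypothetical.
final class UnicodeEscapeSketch {
    // Backslash + 'u' + 4 hex digits, or backslash + 'U' + 8 hex digits.
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\u[0-9A-Fa-f]{4}|\\\\U[0-9A-Fa-f]{8}");

    /** Decodes one escape sequence into its UTF-16 representation. */
    static String decode(String escape) {
        if (!ESCAPE.matcher(escape).matches()) {
            throw new IllegalArgumentException("Not a Unicode escape: " + escape);
        }
        // Skip the two-character prefix and parse the hex digits.
        int codePoint = Integer.parseUnsignedInt(escape.substring(2), 16);
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Code point out of range: " + escape);
        }
        // One char for the BMP, a surrogate pair above U+FFFF.
        return new String(Character.toChars(codePoint));
    }
}
```

The short form always yields one char; the long form yields two chars
whenever the value is above U+FFFF and is rejected above U+10FFFF.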

Johannes

/*
The MIT License

Copyright (c) 2007-2008 Johannes Luber

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

Acknowledgements: There have been quite a few people helping me with the
implementation. I'd like to thank Terence Parr, Gavin Lambert, Jim Idle,
Robin Davies, Jamie Penney and those I've forgotten to mention for their
insight and the time they spent on me.

Converted original grammar in Ecma 334 into ANTLR syntax and collapsed rules
like A: B?, B: C+ into A: C* unless that caused ambiguities. Further grammar
changes have been made to remove left-recursion, non-determinisms and
ambiguities.

Important changes from the original grammar: The usage of the original
lexical grammar would have resulted in creating another pass as this lexical
grammar basically not only includes parser rules, but also is a complete
source file description (without deciphering the meanings of those tokens).
As I changed this grammar into a pure ANTLR lexer grammar, I removed some
parser-equivalent structures, moved some parser rules into the parser.
The preprocessor directives are missing for now.

Implementation notes:
        - Passing a volatile field (§17.4.3) as a reference parameter or output
          parameter causes a warning, since the field may not be treated as
          volatile by the invoked method.
        - §14.3 explains member lookup. §14.4.2 explains overload resolution.
          Parts of §14 explain the algorithms for lookups of various things
          which can be done only in the AST grammar.
        - Property, indexer and event accessors require code which prevents
          the manual declaration of a variable named value or of a variable
          with the same name as one of the parameters.

*/

lexer grammar CSharp3Lexer;

// Grammar Ambiguities described in §9.2.3 in Ecma 334


options {
        language=CSharp2;
        //backtrack=true;
        //memoize=true;
        filter=true;
}

@namespace {
        Kerriv.CSharpML
}

@lexer::header {
using System.Collections.Generic;   // List<T>
using System.Globalization;         // UnicodeCategory
}

@lexer::members {
// Required additional members

// constants
private const int LEAD_OFFSET = 0xD800 - (0x10000 >> 10);

// Required additional functions

public void Init() {
        characterClasses = new List<UnicodeCategory>
                [Enum.GetValues(typeof(CharacterKind)).Length];
        for (int i = 0; i < characterClasses.Length; i++) {
                characterClasses[i] = new List<UnicodeCategory>();
        }
        
        characterClasses[(int) CharacterKind.IdentifierStart].Add(UnicodeCategory.UppercaseLetter);
        characterClasses[(int) CharacterKind.IdentifierStart].Add(UnicodeCategory.LowercaseLetter);
        characterClasses[(int) CharacterKind.IdentifierStart].Add(UnicodeCategory.TitlecaseLetter);
        characterClasses[(int) CharacterKind.IdentifierStart].Add(UnicodeCategory.ModifierLetter);
        characterClasses[(int) CharacterKind.IdentifierStart].Add(UnicodeCategory.OtherLetter);
        characterClasses[(int) CharacterKind.IdentifierStart].Add(UnicodeCategory.LetterNumber);

        characterClasses[(int) CharacterKind.IdentifierPart].Add(UnicodeCategory.UppercaseLetter);
        characterClasses[(int) CharacterKind.IdentifierPart].Add(UnicodeCategory.LowercaseLetter);
        characterClasses[(int) CharacterKind.IdentifierPart].Add(UnicodeCategory.TitlecaseLetter);
        characterClasses[(int) CharacterKind.IdentifierPart].Add(UnicodeCategory.ModifierLetter);
        characterClasses[(int) CharacterKind.IdentifierPart].Add(UnicodeCategory.OtherLetter);
        characterClasses[(int) CharacterKind.IdentifierPart].Add(UnicodeCategory.LetterNumber);
        characterClasses[(int) CharacterKind.IdentifierPart].Add(UnicodeCategory.NonSpacingMark);
        characterClasses[(int) CharacterKind.IdentifierPart].Add(UnicodeCategory.SpacingCombiningMark);
        characterClasses[(int) CharacterKind.IdentifierPart].Add(UnicodeCategory.Format);
        characterClasses[(int) CharacterKind.IdentifierPart].Add(UnicodeCategory.ConnectorPunctuation);
        characterClasses[(int) CharacterKind.IdentifierPart].Add(UnicodeCategory.DecimalDigitNumber);

        characterClasses[(int) CharacterKind.Space].Add(UnicodeCategory.SpaceSeparator);
}

/// <summary>
/// Transforms a UTF-32 code point into 2 UTF-16 code units. Check the result
/// with Char.IsSurrogatePair() to verify that the given code point really
/// belongs to the range that is encoded with surrogates.
/// </summary>
/// <param name="codepoint">The UTF-32 codepoint.</param>
private string TransformUtf32ToUtf16(int codepoint) {
        char[] surrogatePair = new char[2];
        
        // computations (taken from http://www.unicode.org/faq/utf_bom.html#35,
        // second solution)
        surrogatePair[0] = (char) (LEAD_OFFSET + (codepoint >> 10));    // high surrogate
        surrogatePair[1] = (char) (0xDC00 + (codepoint & 0x3FF));       // low surrogate
        
        return new string(surrogatePair);
}

private int ConvertHexCharArrayIntoInt32(char[] hexString) {
        // Checking argument
        if (hexString == null)
                throw new ArgumentNullException("hexString", "May not be null!");
        if (hexString.Length == 0)
                // ArgumentException takes the message first, then the parameter name
                throw new ArgumentException("String may not be empty!", "hexString");
        
        int result = 0;
        int power = 1;
        
        for (int i = hexString.Length-1; i >= 0 ; i--) {
                result += power * ConvertHexCharIntoInt32(hexString[i]);
                power *= 16;
        }
        
        return result;
}

private int ConvertHexCharIntoInt32(char hexDigit) {
        switch (hexDigit) {
        case '0':
        case '1':
        case '2':
        case '3':
        case '4':
        case '5':
        case '6':
        case '7':
        case '8':
        case '9':
                return hexDigit - '0';
        case 'A':
        case 'B':
        case 'C':
        case 'D':
        case 'E':
        case 'F':
                return hexDigit - 'A' + 10;
        case 'a':
        case 'b':
        case 'c':
        case 'd':
        case 'e':
        case 'f':
                return hexDigit - 'a' + 10;
        default:
                // ArgumentException takes the message first, then the parameter name
                throw new ArgumentException("Only digits and letters [a-fA-F] are allowed!", "hexDigit");
        }
}
}

// $<Keywords
ABSTRACT : 'abstract';
AS : 'as';
BASE : 'base';
BOOL : 'bool';
BREAK : 'break';
BYTE : 'byte';
CASE : 'case';
CATCH : 'catch';
CHAR : 'char';
CHECKED : 'checked';
CLASS : 'class';
CONST : 'const';
CONTINUE : 'continue';
DECIMAL : 'decimal';
DEFAULT : 'default';
DELEGATE : 'delegate';
DO : 'do';
DOUBLE : 'double';
ELSE : 'else';
ENUM : 'enum';
EVENT : 'event';
EXPLICIT : 'explicit';
EXTERN : 'extern';
FALSE : 'false';
FINALLY : 'finally';
FIXED : 'fixed';
FLOAT : 'float';
FOR : 'for';
FOREACH : 'foreach';
GOTO : 'goto';
IF : 'if';
IMPLICIT : 'implicit';
IN : 'in';
INT : 'int';
INTERFACE : 'interface';
INTERNAL : 'internal';
IS : 'is';
LOCK : 'lock';
LONG : 'long';
NAMESPACE : 'namespace';
NEW : 'new';
NULL : 'null';
OBJECT : 'object';
OPERATOR : 'operator';
OUT : 'out';
OVERRIDE : 'override';
PARAMS : 'params';
PRIVATE : 'private';
PROTECTED : 'protected';
PUBLIC : 'public';
READONLY : 'readonly';
REF : 'ref';
RETURN : 'return';
SBYTE : 'sbyte';
SEALED : 'sealed';
SHORT : 'short';
SIZEOF : 'sizeof';
STACKALLOC : 'stackalloc';
STATIC : 'static';
STRING : 'string';
STRUCT : 'struct';
SWITCH : 'switch';
THIS : 'this';
THROW : 'throw';
TRUE : 'true';
TRY : 'try';
TYPEOF : 'typeof';
UINT : 'uint';
ULONG : 'ulong';
UNCHECKED : 'unchecked';
UNSAFE : 'unsafe';
USHORT : 'ushort';
USING : 'using';
VIRTUAL : 'virtual';
VOID : 'void';
VOLATILE : 'volatile';
WHILE : 'while';
// $>

// $<Lexical Analysis

WHITESPACE
        :       WHITESPACE_CHARACTER+
        ;

fragment WHITESPACE_CHARACTER
        :       UNICODE_CLASS_Zs 
        |       '\u0009' // Horizontal tab character
        |       '\u000B' // Vertical tab character
        |       '\u000C' // Form feed character
        ;

fragment UNICODE_CLASS_Zs // Any character with Unicode class Zs (18 characters known)
        :       '\u0020' // SPACE
        |       '\u00A0' // NO_BREAK SPACE
        |       '\u1680' // OGHAM SPACE MARK
        |       '\u180E' // MONGOLIAN VOWEL SEPARATOR
        |       '\u2000' // EN QUAD
        |       '\u2001' // EM QUAD
        |       '\u2002' // EN SPACE
        |       '\u2003' // EM SPACE
        |       '\u2004' // THREE_PER_EM SPACE
        |       '\u2005' // FOUR_PER_EM SPACE
        |       '\u2006' // SIX_PER_EM SPACE
        |       '\u2007' // FIGURE SPACE
        |       '\u2008' // PUNCTUATION SPACE
        |       '\u2009' // THIN SPACE
        |       '\u200A' // HAIR SPACE
        |       '\u202F' // NARROW NO_BREAK SPACE
        |       '\u205F' // MEDIUM MATHEMATICAL SPACE
        |       '\u3000' // IDEOGRAPHIC SPACE
        ;

NEW_LINE
        :       '\u000D' // Carriage return character
        |       '\u000A' // Line feed character
        |       '\u000D\u000A' // Carriage return character followed by line feed character
        |       '\u0085' // Next line character
        |       '\u2028' // Line separator character
        |       '\u2029' // Paragraph separator character
        ;
        
SINGLE_LINE_COMMENT
        :       '//' INPUT_CHARACTER*
        ;
        
        
fragment INPUT_CHARACTER
        :       ~NEW_LINE_CHARACTER // Any Unicode character except a new_line_character
        ;

fragment NEW_LINE_CHARACTER
        :       '\u000D' // Carriage return character
        |       '\u000A' // Line feed character
        |       '\u0085' // Next line character
        |       '\u2028' // Line separator character
        |       '\u2029' // Paragraph separator character
        ;
        
DELIMITED_COMMENT
        :       '/*' ( options {greedy=false;} : . )* '*/'
        ;
        
// This rule is supposed to catch all characters which may be used as a part of
// an identifier. Note that this rule matches a superset: it admits not only
// characters that are invalid in certain positions, but also characters that
// are always invalid. Sort the bad identifiers out in the parser where the
// symbol tables are built. Digits aren't included because allowing them in
// first position causes confusion with INTEGER_LITERAL and REAL_LITERAL. The
// way it is structured is to get the characters in the UTF-16 encoding.
//
// This is a hack to work around the ANTLR 3 limitation that one can't choose
// Unicode character classes directly. Also, known characters required by other
// rules are excluded.
fragment ANY_UNUSED_CHARACTER
        :       'A'..'Z'        // Use only alphabet characters below U+0080
        |       'a'..'z'
        |       '\u0080'..'\u009F'      // NO NO_BREAK SPACE
        |       '\u00A1'..'\u167F'      // NO OGHAM SPACE MARK
        |       '\u1681'..'\u180D'      // NO MONGOLIAN VOWEL SEPARATOR
        |       '\u180F'..'\u1FFF'      // NO EN QUAD .. HAIR SPACE (U+2000..U+200A, all class Zs,
                                        // including FIGURE SPACE U+2007)
        |       '\u200B'..'\u202E'      // NO NARROW NO_BREAK SPACE
        |       '\u2030'..'\u205E'      // NO MEDIUM MATHEMATICAL SPACE
        |       '\u2060'..'\u2FFF'      // NO IDEOGRAPHIC SPACE
        |       '\u3001'..'\uD7FF'
        |       '\uE000'..'\uFFFF'
        |       '\uD800'..'\uDBFF' '\uDC00'..'\uDFFF' // Surrogate code points
        ;
// $>

// $<Tokens

// $<identifier
IDENTIFIER
        :       AVAILABLE_IDENTIFIER
        |       VERBATIM_IDENTIFIER
        ;

fragment AVAILABLE_IDENTIFIER
        :       IDENTIFIER_OR_KEYWORD // An identifier_or_keyword that is not a keyword;
                                      // no check needed due to rule precedence in ANTLR grammars
        ;

fragment VERBATIM_IDENTIFIER
        :       '@' IDENTIFIER_OR_KEYWORD
        ;
        
fragment IDENTIFIER_OR_KEYWORD
        :       (ANY_UNUSED_CHARACTER | UNICODE_ESCAPE_SEQUENCE)
                (ANY_UNUSED_CHARACTER | DECIMAL_DIGIT | UNICODE_ESCAPE_SEQUENCE)*
        ;

fragment UNICODE_ESCAPE_SEQUENCE
        :       '\\u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT 
        |       '\\U' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
                      HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
        ;
// $>

// $<integer_literal
INTEGER_LITERAL 
        :       DECIMAL_INTEGER_LITERAL
        |       HEXADECIMAL_INTEGER_LITERAL
        ;

fragment DECIMAL_INTEGER_LITERAL
        :       DECIMAL_DIGIT+ INTEGER_TYPE_SUFFIX?
        ;

fragment DECIMAL_DIGIT
        : '0'..'9'
        ;

fragment DECIMAL_DIGITS
        :       DECIMAL_DIGIT+
        ;
                
fragment INTEGER_TYPE_SUFFIX
        :       'U'
        |       'u'
        |       'L'
        |       'l'
        |       'UL'
        |       'Ul'
        |       'uL'
        |       'ul'
        |       'LU'
        |       'Lu'
        |       'lU'
        |       'lu'
        ;
        
fragment HEXADECIMAL_INTEGER_LITERAL
        :       ('0x' | '0X') HEX_DIGIT+ INTEGER_TYPE_SUFFIX?
        ;
        
fragment HEX_DIGIT
        :       '0'..'9'
        |       'A'..'F'
        |       'a'..'f'
        ;
// $>
        
// $<real_literal
REAL_LITERAL
        :       DECIMAL_DIGITS '.' DECIMAL_DIGITS EXPONENT_PART? REAL_TYPE_SUFFIX?
        |       '.' DECIMAL_DIGITS EXPONENT_PART? REAL_TYPE_SUFFIX?
        |       DECIMAL_DIGITS EXPONENT_PART REAL_TYPE_SUFFIX?
        |       DECIMAL_DIGITS REAL_TYPE_SUFFIX
        ;

fragment EXPONENT_PART
        :       'e' SIGN? DECIMAL_DIGIT+
        |       'E' SIGN? DECIMAL_DIGIT+
        ;

fragment SIGN
        :       '+'
        |       '-'
        ;
        
fragment REAL_TYPE_SUFFIX
        :       'F'
        |       'f'
        |       'D'
        |       'd'
        |       'M'
        |       'm'
        ;
// $>

// $<character_literal

CHARACTER_LITERAL
        :       '\'' CHARACTER '\''
        ;
        
fragment CHARACTER
        :       SINGLE_CHARACTER
        |       SIMPLE_ESCAPE_SEQUENCE
        |       HEXADECIMAL_ESCAPE_SEQUENCE
        |       UNICODE_ESCAPE_SEQUENCE
        ;
        
fragment SINGLE_CHARACTER
        :       ~('\'' | '\\' | NEW_LINE_CHARACTER)
        ; 

fragment SIMPLE_ESCAPE_SEQUENCE
        :       '\\\''
        |       '\\' '\"'
        |       '\\\\'
        |       '\\0'
        |       '\\a'
        |       '\\b'
        |       '\\f'
        |       '\\n'
        |       '\\r'
        |       '\\t'
        |       '\\v'
        ;
         
fragment HEXADECIMAL_ESCAPE_SEQUENCE
        :       '\\x' HEX_DIGIT HEX_DIGIT? HEX_DIGIT? HEX_DIGIT? // Restrict the number of HEX_DIGITs to 1..4
        ;
// $>
        
// $<string_literal
STRING_LITERAL
        :       REGULAR_STRING_LITERAL
        |       VERBATIM_STRING_LITERAL
        ;

fragment REGULAR_STRING_LITERAL
        :       '"' REGULAR_STRING_LITERAL_CHARACTER* '"'
        ;

fragment REGULAR_STRING_LITERAL_CHARACTER
        :       SINGLE_REGULAR_STRING_LITERAL_CHARACTER
        |       SIMPLE_ESCAPE_SEQUENCE
        |       HEXADECIMAL_ESCAPE_SEQUENCE
        |       UNICODE_ESCAPE_SEQUENCE
        ;

fragment SINGLE_REGULAR_STRING_LITERAL_CHARACTER
        :       ~( '"' | '\\' | NEW_LINE_CHARACTER )
        ;
        
fragment VERBATIM_STRING_LITERAL
        :       '@"' VERBATIM_STRING_LITERAL_CHARACTER* '"'
        ;

fragment VERBATIM_STRING_LITERAL_CHARACTER
        :       SINGLE_VERBATIM_STRING_LITERAL_CHARACTER
        |       QUOTE_ESCAPE_SEQUENCE
        ;
        
fragment SINGLE_VERBATIM_STRING_LITERAL_CHARACTER
        :       ~'"'
        ;
         
fragment QUOTE_ESCAPE_SEQUENCE
        :       '""'
        ;
// $>
        
// $<operator_or_punctuator
OPEN_BRACE : '{';
CLOSE_BRACE : '}';
OPEN_BRACKET : '[';
CLOSE_BRACKET : ']';
OPEN_PARENS : '(';
CLOSE_PARENS : ')';
DOT : '.';
COMMA : ',';
COLON : ':';
SEMICOLON : ';';
PLUS : '+';
MINUS : '-';
STAR : '*';
DIV : '/';
PERCENT : '%';
AMP : '&';
BITWISE_OR : '|';
CARET : '^';
BANG : '!';
TILDE : '~';
ASSIGNMENT : '=';
LT : '<';
GT : '>';
INTERR : '?';
DOUBLE_COLON : '::';
OP_COALESCING : '??';
OP_INC : '++';
OP_DEC : '--';
OP_AND : '&&';
OP_OR : '||';
OP_PTR : '->';
OP_EQ : '==';
OP_NE : '!=';
OP_LE : '<=';
OP_GE : '>=';
OP_ADD_ASSIGNMENT : '+=';
OP_SUB_ASSIGNMENT : '-=';
OP_MULT_ASSIGNMENT : '*=';
OP_DIV_ASSIGNMENT : '/=';
OP_MOD_ASSIGNMENT : '%=';
OP_AND_ASSIGNMENT : '&=';
OP_OR_ASSIGNMENT : '|=';
OP_XOR_ASSIGNMENT : '^=';
OP_LEFT_SHIFT : '<<';
OP_LEFT_SHIFT_ASSIGNMENT : '<<=';
// $>

// $>

OTHER : d=. {Console.Out.Write((char) $d);} ; // Add error diagnostics!

_______________________________________________
antlr-dev mailing list
[email protected]
http://www.antlr.org:8080/mailman/listinfo/antlr-dev
