I'm writing a parser generator for ANTLR grammars and have come across the rule

fragment Letter
    : [a-zA-Z$_] // these are below 0x7F
    | ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
    | [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
    ;

at

https://github.com/antlr/grammars-v4/blob/master/cto/CtoLexer.g4#L158
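For reference, the surrogate-pair arithmetic that the third alternative relies on can be sketched in D as below. This is only an illustration: `toSurrogatePair` is a hypothetical helper, not part of the grammar or of the generated parser, and it deliberately works on plain integer code units rather than character literals.

```d
import std.stdio : writefln;

/// Splits a code point in U+10000 .. U+10FFFF into its UTF-16 surrogate pair.
/// Hypothetical helper for illustration; returns [high, low] as integer code units.
ushort[2] toSurrogatePair(uint codePoint)
{
    assert(codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    uint v = codePoint - 0x10000;                      // 20-bit offset
    ushort high = cast(ushort)(0xD800 + (v >> 10));    // top 10 bits -> 0xD800 .. 0xDBFF
    ushort low  = cast(ushort)(0xDC00 + (v & 0x3FF));  // low 10 bits -> 0xDC00 .. 0xDFFF
    ushort[2] pair = [high, low];
    return pair;
}

void main()
{
    auto pair = toSurrogatePair(0x1F600);
    writefln("%04X %04X", pair[0], pair[1]); // prints "D83D DE00"
}
```

The high unit always falls in 0xD800 to 0xDBFF and the low unit in 0xDC00 to 0xDFFF, which are exactly the two ranges the rule matches in sequence.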

This rule is converted into

    Match m__Letter()
    {
        return alt(alt(rng('a', 'z'), rng('A', 'Z'), ch('$'), ch('_')),
                   not(alt(rng('\u0000', '\u007F'), rng('\uD800', '\uDBFF'))),
                   seq(rng('\uD800', '\uDBFF'), rng('\uDC00', '\uDFFF')));
    }

given suitable definitions of alt, rng, seq and not.

Compiling this fails with the errors

    CtoLexer_parser.d 665 57 error invalid UTF character \U0000d800
    CtoLexer_parser.d 665 67 error invalid UTF character \U0000dbff
    CtoLexer_parser.d 666 28 error invalid UTF character \U0000d800
    CtoLexer_parser.d 666 38 error invalid UTF character \U0000dbff
    CtoLexer_parser.d 666 53 error invalid UTF character \U0000dc00
    CtoLexer_parser.d 666 63 error invalid UTF character \U0000dfff

Doesn't DMD support these Unicode code points (the surrogate range) in character literals yet?
