On Tue, Nov 27, 2012 at 2:37 PM, Eli Friedman <[email protected]> wrote:
> On Tue, Nov 27, 2012 at 2:25 PM, Richard Smith <[email protected]> wrote:
>> I had a look at supporting UTF-8 in source files, and came up with the
>> attached approach. getCharAndSize maps UTF-8 characters down to a char
>> with the high bit set, representing the class of the character rather
>> than the character itself. (I've not done any performance measurements
>> yet, and the patch is generally far from being ready for review).
>>
>> Have you considered using a similar approach for lexing UCNs? We already
>> land in getCharAndSizeSlow, so it seems like it'd be cheap to deal with
>> them there. Also, validating the codepoints early would allow us to
>> recover better (for instance, from UCNs encoding whitespace or elements
>> of the basic source character set).
>
> That would affect the spelling of the tokens, and I don't think the C
> or C++ standard actually allows us to do that. Evil testcase:
>
> #define CONCAT(a,b) a ## b
> #define \U000100010\u00FD 1
> #if !CONCAT(\, U000100010\u00FD)
> #error "This should never happen"
> #endif
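For context, here's how I read the class-byte idea, as a sketch only: the
CC_* classes, the classifier, and the simplified signature below are my own
inventions for illustration, not anything from the actual patch.

#include <cstdint>
#include <cstdio>

// Class bytes have the high bit set, so they can never collide with ASCII.
enum {
  CC_IdentifierChar = 0x80, // extended character usable in an identifier
  CC_Whitespace     = 0x81, // extended whitespace (U+00A0, U+2028, ...)
  CC_Invalid        = 0x82  // extended character valid nowhere in a token
};

// Grossly simplified stand-in for the real classification tables.
static unsigned char classifyCodepoint(uint32_t CP) {
  if (CP == 0x00A0 || CP == 0x2028 || CP == 0x2029)
    return CC_Whitespace;
  if (CP >= 0x80 && CP <= 0x10FFFF)
    return CC_IdentifierChar;
  return CC_Invalid;
}

// Shape of the slow path (signature simplified): ASCII comes back
// unchanged; a multi-byte UTF-8 sequence is decoded (no validation here,
// for brevity) and collapsed to a single class byte, with Size reporting
// how many bytes were consumed.
static unsigned char getCharAndSizeSlow(const char *Ptr, unsigned &Size) {
  unsigned char Lead = (unsigned char)Ptr[0];
  if (Lead < 0x80) { Size = 1; return Lead; }
  unsigned Len = Lead >= 0xF0 ? 4 : Lead >= 0xE0 ? 3 : 2;
  uint32_t CP = Lead & (0x7F >> Len);
  for (unsigned I = 1; I != Len; ++I)
    CP = (CP << 6) | ((unsigned char)Ptr[I] & 0x3F);
  Size = Len;
  return classifyCodepoint(CP);
}

int main() {
  unsigned Size;
  unsigned char C = getCharAndSizeSlow("\xC3\xA9", Size); // U+00E9, é
  std::printf("class 0x%02X, size %u\n", (unsigned)C, Size); // 0x80, 2
  return 0;
}

Collapsing every extended character to one of a few high-bit bytes keeps
the hot switch-on-char paths byte-sized while still letting the lexer tell
identifier characters from whitespace from invalid ones.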
For this particular case it doesn't matter: "If a character sequence that
matches the syntax of a universal-character-name is produced by token
concatenation (16.3.3), the behavior is undefined." (2.2 Phases of
translation [lex.phases], paragraph 1, list item 4.)

For what it's worth, the standard also says: "An implementation may use
any internal encoding, so long as an actual extended character encountered
in the source file, and the same extended character expressed in the
source file as a universal-character-name (i.e., using the \uXXXX
notation), are handled equivalently except where this replacement is
reverted in a raw string literal."
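Concretely (assuming the source file is saved as UTF-8 and the execution
character set is also UTF-8, as with Clang's defaults), that equivalence
means something like the following must hold; the raw string literal shows
the one place the replacement is reverted:

#include <cassert>
#include <cstring>

int main() {
  // The UCN spelling and the literal extended character must be handled
  // equivalently, so these two literals encode identical byte sequences.
  assert(std::strcmp("caf\u00E9", "café") == 0);

  // ...except in a raw string literal, where the replacement is reverted:
  // the \u00E9 below stays as the six source characters backslash-u-0-0-E-9.
  assert(std::strcmp(R"(caf\u00E9)", "café") != 0);
  return 0;
}

(The same rule is what makes caf\u00E9 and café name the same identifier.)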
-- James