On Tue, Nov 27, 2012 at 2:37 PM, Eli Friedman <[email protected]> wrote:
> On Tue, Nov 27, 2012 at 2:25 PM, Richard Smith <[email protected]> wrote:
>> I had a look at supporting UTF-8 in source files, and came up with the
>> attached approach. getCharAndSize maps UTF-8 characters down to a char
>> with the high bit set, representing the class of the character rather
>> than the character itself. (I've not done any performance measurements
>> yet, and the patch is generally far from being ready for review).
>>
>> Have you considered using a similar approach for lexing UCNs? We already
>> land in getCharAndSizeSlow, so it seems like it'd be cheap to deal with
>> them there. Also, validating the codepoints early would allow us to
>> recover better (for instance, from UCNs encoding whitespace or elements
>> of the basic source character set).
>
> That would affect the spelling of the tokens, and I don't think the C
> or C++ standard actually allows us to do that. Evil testcase:
>
> #define CONCAT(a,b) a ## b
> #define \U000100010\u00FD 1
> #if !CONCAT(\, U000100010\u00FD)
> #error "This should never happen"
> #endif
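For context, here's how I read the class-byte idea, as a sketch only: the
CC_* classes, the classifier, and the simplified signature below are my own
inventions for illustration, not anything from the actual patch.

#include <cstdint>
#include <cstdio>

// Class bytes have the high bit set, so they can never collide with ASCII.
enum {
  CC_IdentifierChar = 0x80, // extended character usable in an identifier
  CC_Whitespace     = 0x81, // extended whitespace (U+00A0, U+2028, ...)
  CC_Invalid        = 0x82  // extended character valid nowhere in a token
};

// Grossly simplified stand-in for the real classification tables.
static unsigned char classifyCodepoint(uint32_t CP) {
  if (CP == 0x00A0 || CP == 0x2028 || CP == 0x2029)
    return CC_Whitespace;
  if (CP >= 0x80 && CP <= 0x10FFFF)
    return CC_IdentifierChar;
  return CC_Invalid;
}

// Shape of the slow path (signature simplified): ASCII comes back
// unchanged; a multi-byte UTF-8 sequence is decoded (no validation here,
// for brevity) and collapsed to a single class byte, with Size reporting
// how many bytes were consumed.
static unsigned char getCharAndSizeSlow(const char *Ptr, unsigned &Size) {
  unsigned char Lead = (unsigned char)Ptr[0];
  if (Lead < 0x80) { Size = 1; return Lead; }
  unsigned Len = Lead >= 0xF0 ? 4 : Lead >= 0xE0 ? 3 : 2;
  uint32_t CP = Lead & (0x7F >> Len);
  for (unsigned I = 1; I != Len; ++I)
    CP = (CP << 6) | ((unsigned char)Ptr[I] & 0x3F);
  Size = Len;
  return classifyCodepoint(CP);
}

int main() {
  unsigned Size;
  unsigned char C = getCharAndSizeSlow("\xC3\xA9", Size); // U+00E9, é
  std::printf("class 0x%02X, size %u\n", (unsigned)C, Size); // 0x80, 2
  return 0;
}

Collapsing every extended character to one of a few high-bit bytes keeps
the hot switch-on-char paths byte-sized while still letting the lexer tell
identifier characters from whitespace from invalid ones.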
For this particular case it doesn't matter: "If a character sequence that
matches the syntax of a universal-character-name is produced by token
concatenation (16.3.3), the behavior is undefined." (2.2 Phases of
translation [lex.phases], paragraph 1, list item 4.)

For what it's worth, the standard also says: "An implementation may use
any internal encoding, so long as an actual extended character encountered
in the source file, and the same extended character expressed in the
source file as a universal-character-name (i.e., using the \uXXXX
notation), are handled equivalently except where this replacement is
reverted in a raw string literal."
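Concretely (assuming the source file is saved as UTF-8 and the execution
character set is also UTF-8, as with Clang's defaults), that equivalence
means something like the following must hold; the raw string literal shows
the one place the replacement is reverted:

#include <cassert>
#include <cstring>

int main() {
  // The UCN spelling and the literal extended character must be handled
  // equivalently, so these two literals encode identical byte sequences.
  assert(std::strcmp("caf\u00E9", "café") == 0);

  // ...except in a raw string literal, where the replacement is reverted:
  // the \u00E9 below stays as the six source characters backslash-u-0-0-E-9.
  assert(std::strcmp(R"(caf\u00E9)", "café") != 0);
  return 0;
}

(The same rule is what makes caf\u00E9 and café name the same identifier.)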
-- James