On Tue, Nov 27, 2012 at 2:37 PM, Eli Friedman <[email protected]> wrote:

> On Tue, Nov 27, 2012 at 2:25 PM, Richard Smith <[email protected]>
> wrote:
> > I had a look at supporting UTF-8 in source files, and came up with the
> > attached approach. getCharAndSize maps UTF-8 characters down to a char
> > with the high bit set, representing the class of the character rather
> > than the character itself. (I've not done any performance measurements
> > yet, and the patch is generally far from being ready for review).
> >
> > Have you considered using a similar approach for lexing UCNs? We already
> > land in getCharAndSizeSlow, so it seems like it'd be cheap to deal with
> > them there. Also, validating the codepoints early would allow us to
> > recover better (for instance, from UCNs encoding whitespace or elements
> > of the basic source character set).
>
> That would affect the spelling of the tokens, and I don't think the C
> or C++ standard actually allows us to do that.


If I understand you correctly, you're concerned that we would end up with the
wrong string as the token's spelling? When we build a token, we take the
characters from the underlying source buffer, not the values returned by
getCharAndSize, so the original spelling is preserved.
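
To make that concrete, here is a rough sketch of the shape I have in mind
(purely for illustration; none of these names or values are taken from the
actual patch):

#include <string>

// Hypothetical sentinels: a char with the high bit set that encodes the
// class of the character, not the character itself.
static const char kIdentifierLikeChar = (char)(0x80 | 0x01);
static const char kOtherExtendedChar  = (char)(0x80 | 0x02);

// Stand-in for the slow path of getCharAndSize: work out how many bytes the
// UTF-8 sequence occupies, report that in Size, and hand back a class marker
// instead of the decoded code point.
char getCharAndSizeSlowUTF8(const char *Ptr, unsigned &Size) {
  unsigned char Lead = (unsigned char)*Ptr;
  if (Lead < 0x80) {          // plain ASCII: return it unchanged
    Size = 1;
    return (char)Lead;
  }
  Size = (Lead >= 0xF0) ? 4 : (Lead >= 0xE0) ? 3 : 2;
  // A real implementation would decode and classify the code point here;
  // this sketch just treats every extended character as identifier-like.
  return kIdentifierLikeChar;
}

// The spelling still comes from the raw buffer, so it contains the original
// UTF-8 bytes rather than the sentinel.
std::string formSpelling(const char *TokStart, const char *TokEnd) {
  return std::string(TokStart, TokEnd);
}

The sentinel only drives the lexer's decisions; the bytes that end up in the
token's spelling are the untouched ones from the source buffer.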


> Evil testcase:
>
> #define CONCAT(a,b) a ## b
> #define \U000100010\u00FD 1
> #if !CONCAT(\, U000100010\u00FD)
> #error "This should never happen"
> #endif
>
> -Eli
>
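
For what it's worth, the "validating the codepoints early" part could look
roughly like the following; again, this is only a sketch with invented names,
and (as your testcase shows) whatever checking we do while lexing must not
change the spelling of the token:

// Hypothetical early classification of a UCN's code point, so the lexer can
// diagnose and recover from suspicious UCNs without rewriting their spelling.
enum UCNClass { UCN_Valid, UCN_Whitespace, UCN_BasicSourceChar, UCN_Invalid };

UCNClass classifyUCN(unsigned CodePoint) {
  if (CodePoint == ' ' || CodePoint == '\t' || CodePoint == '\n' ||
      CodePoint == '\v' || CodePoint == '\f' || CodePoint == '\r')
    return UCN_Whitespace;
  if (CodePoint > 0x10FFFF ||
      (CodePoint >= 0xD800 && CodePoint <= 0xDFFF))
    return UCN_Invalid;           // out of range, or a surrogate
  if (CodePoint < 0x80)
    return UCN_BasicSourceChar;   // approximating the basic source
                                  // character set as ASCII here
  return UCN_Valid;
}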
