On Tue, Nov 27, 2012 at 2:37 PM, Eli Friedman <[email protected]> wrote:
> On Tue, Nov 27, 2012 at 2:25 PM, Richard Smith <[email protected]> wrote:
> > I had a look at supporting UTF-8 in source files, and came up with the
> > attached approach. getCharAndSize maps UTF-8 characters down to a char
> > with the high bit set, representing the class of the character rather
> > than the character itself. (I've not done any performance measurements
> > yet, and the patch is generally far from being ready for review).
> >
> > Have you considered using a similar approach for lexing UCNs? We already
> > land in getCharAndSizeSlow, so it seems like it'd be cheap to deal with
> > them there. Also, validating the codepoints early would allow us to
> > recover better (for instance, from UCNs encoding whitespace or elements
> > of the basic source character set).
>
> That would affect the spelling of the tokens, and I don't think the C
> or C++ standard actually allows us to do that.

If I understand you correctly, you're concerned that we would get the wrong
string in the token's spelling? When we build a token, we take the characters
from the underlying source buffer, not the value returned by getCharAndSize.

> Evil testcase:
>
> #define CONCAT(a,b) a ## b
> #define \U000100010\u00FD 1
> #if !CONCAT(\, U000100010\u00FD)
> #error "This should never happen"
> #endif
>
> -Eli
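
(For concreteness, here's a rough standalone sketch of what I mean by the
class-byte mapping and by taking the spelling from the raw buffer. This is
not the actual patch, and every name below is made up.)

// Standalone sketch, not Clang code: classify a decoded Unicode codepoint
// into a single byte with the high bit set, so the lexer's fast path can
// switch on the *kind* of character without caring which character it was.
#include <cstdint>
#include <cstdio>
#include <string>

// Made-up class bytes; all that matters is that the high bit is set and
// the low bits distinguish the classes we care about.
enum : unsigned char {
  CLASS_IDENT_CONTINUE = 0x80, // characters valid inside an identifier
  CLASS_WHITESPACE     = 0x81, // Unicode whitespace outside ASCII
  CLASS_INVALID        = 0x82  // not allowed here at all
};

// Map a decoded codepoint to a class byte (grossly simplified).
static unsigned char classifyCodepoint(uint32_t CP) {
  if (CP == 0x00A0 || CP == 0x2028 || CP == 0x2029)
    return CLASS_WHITESPACE;
  if (CP >= 0x00A8) // pretend everything else up here is identifier-ish
    return CLASS_IDENT_CONTINUE;
  return CLASS_INVALID;
}

int main() {
  // The lexer would switch on the class byte...
  unsigned char K = classifyCodepoint(0x00FD); // U+00FD, as in the testcase
  std::printf("class byte: 0x%02x\n", (unsigned)K);

  // ...but the token's spelling is still the raw bytes from the source
  // buffer, so the class byte never appears in the token text.
  std::string Buffer = "\xC3\xBD"; // UTF-8 encoding of U+00FD in the file
  std::string Spelling = Buffer;   // copied from the buffer, not from K
  std::printf("spelling is %zu bytes\n", Spelling.size());
  return 0;
}

The point is just that the class byte is only something the fast path
switches on; the bytes that end up in the token come straight out of the
source buffer.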
_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
