On Tue, Nov 27, 2012 at 2:25 PM, Richard Smith <[email protected]> wrote:
> I had a look at supporting UTF-8 in source files, and came up with the
> attached approach. getCharAndSize maps UTF-8 characters down to a char with
> the high bit set, representing the class of the character rather than the
> character itself. (I've not done any performance measurements yet, and the
> patch is generally far from being ready for review).
>
> Have you considered using a similar approach for lexing UCNs? We already
> land in getCharAndSizeSlow, so it seems like it'd be cheap to deal with them
> there. Also, validating the codepoints early would allow us to recover
> better (for instance, from UCNs encoding whitespace or elements of the basic
> source character set).
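For readers following along, a minimal sketch of the kind of classification
Richard describes might look like the following. This is not the actual
patch; the names (classifyExtendedChar, kClassIdentifier, etc.) and the
classification rules are made up for illustration, and validation of
continuation bytes and the real identifier tables are omitted.

// Hypothetical sketch: decode one UTF-8 sequence and return a single
// classification byte with the high bit set, so downstream lexer code can
// keep working in terms of 'char' rather than full codepoints.
#include <cstdint>

enum : unsigned char {
  kClassIdentifier = 0x80,  // codepoint allowed in an identifier
  kClassWhitespace = 0x81,  // codepoint that lexes as whitespace
  kClassOther      = 0x82   // everything else; diagnose later
};

// Decode the UTF-8 sequence at Ptr; Size receives its length in bytes.
// (No validation of continuation bytes in this sketch.)
static uint32_t decodeUTF8(const char *Ptr, unsigned &Size) {
  unsigned char C0 = Ptr[0];
  if (C0 < 0x80)           { Size = 1; return C0; }
  if ((C0 & 0xE0) == 0xC0) { Size = 2; return ((C0 & 0x1Fu) << 6) | (Ptr[1] & 0x3F); }
  if ((C0 & 0xF0) == 0xE0) { Size = 3; return ((C0 & 0x0Fu) << 12) |
                                              ((Ptr[1] & 0x3F) << 6) | (Ptr[2] & 0x3F); }
  Size = 4;
  return ((C0 & 0x07u) << 18) | ((Ptr[1] & 0x3F) << 12) |
         ((Ptr[2] & 0x3F) << 6) | (Ptr[3] & 0x3F);
}

// getCharAndSize-style helper: map the extended character down to a class
// byte instead of passing the codepoint itself through the lexer.
static char classifyExtendedChar(const char *Ptr, unsigned &Size) {
  uint32_t CP = decodeUTF8(Ptr, Size);
  if (CP == 0x00A0 || CP == 0x2028 || CP == 0x2029)  // NBSP, LS, PS
    return (char)kClassWhitespace;
  if (CP >= 0x00C0)  // grossly simplified stand-in for real identifier tables
    return (char)kClassIdentifier;
  return (char)kClassOther;
}

The idea is that the hot path of the lexer only ever sees a stable one-byte
class, and the precise codepoint is recomputed later only when needed.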
That would affect the spelling of the tokens, and I don't think the C or
C++ standard actually allows us to do that. Evil testcase:

#define CONCAT(a,b) a ## b
#define \U000100010\u00FD 1
#if !CONCAT(\, U000100010\u00FD)
#error "This should never happen"
#endif

Pasting \ together with U000100010\u00FD has to produce a token spelled
exactly like the macro name, so the UCN's original spelling has to survive
into the token.

-Eli
