On Jan 14, 2013, at 13:19 , Richard Smith <[email protected]> wrote:
> As a general point, please keep in mind how we might support UTF-8 in source
> code when working on this. The C++ standard requires that our observable
> behavior is that we treat extended characters in the source code and UCNs
> identically (modulo raw string literals), so the more code we can share
> between the two, the better.
>
> Please see the attached patch for a start on implementing UTF-8 support. One
> notable difference between this and the UCN patch is that the character
> validation happens in the lexer, not when we come to look up an
> IdentifierInfo; this is necessary in order to support error recovery for
> UTF-8 whitespace characters, and may be necessary to avoid accepts-invalids
> for UCNs which we never convert to identifiers.

I was trying to avoid using a sentinel char value; one reason is my
three-quarters-finished implementation of fixits for smart quotes. If we just
assume that UTF-8 characters are rare, we can handle them in
LexTokenInternal's 'default' case, and use a 'classifyUTF8()' helper rather
than smashing the character input stream with placeholders.

The main difference between UCNs and literal UTF-8 is that (valid) literal
UTF-8 will always appear literally in the source. But I guess it doesn't
matter so much since the only place Unicode is valid is in identifiers and as
whitespace, and neither of those will use the output of getCharAndSize.

I do, however, want to delay the check for if a backslash starts a UCN to
avoid Eli's evil recursion problem:

    char *\\ \\\ \\ \\\ \\ \\\ \u00FC;

If UCNs are processed in getCharAndSize, you end up with several recursive
calls asking if the first backslash starts a UCN. It doesn't, of course, but
if getCharAndSize calls isUCNAfterSlash you need to getCharAndSize all the
way to the character after the final backslash to prove it. After all, this
is a UCN, in C at least:

    char *\ \ u00FC;

And once we're delaying the backslash, I'm not sure it makes sense to
classify the Unicode until we hit LexTokenInternal. Once we get there,
though, I can see it making sense to do it there rather than in identifier
creation, and have a (mostly) unified Unicode path after that.

Jordan
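
For concreteness, here is a rough sketch of what a 'classifyUTF8()' helper called from the 'default' case might look like. Only the helper's name comes from the discussion above; the signature, the UTF8Class and UTF8Result types, and the decodeUTF8 function are hypothetical and are not Clang's API. The whitespace and identifier tables are deliberately tiny stand-ins (a few Unicode spaces and the Latin-1 letters), not the ranges the standard actually specifies.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical categories for this sketch; these are not Clang's types.
enum class UTF8Class { Invalid, Whitespace, IdentifierChar, Other };

struct UTF8Result {
  UTF8Class Kind;
  unsigned Size; // bytes consumed; 1 on malformed input so a caller can resync
};

// Decode the UTF-8 sequence at the start of Bytes. Returns its length in
// bytes and stores the code point in CP, or returns 0 if the sequence is
// malformed (truncated, overlong, a surrogate, or above U+10FFFF).
static unsigned decodeUTF8(const std::string &Bytes, std::uint32_t &CP) {
  assert(!Bytes.empty() && "caller must supply at least one byte");
  const unsigned char B0 = static_cast<unsigned char>(Bytes[0]);
  unsigned Len = 0;
  std::uint32_t Min = 0; // smallest code point that legitimately needs Len bytes
  if (B0 < 0x80) { CP = B0; return 1; }
  if ((B0 & 0xE0) == 0xC0)      { Len = 2; CP = B0 & 0x1F; Min = 0x80; }
  else if ((B0 & 0xF0) == 0xE0) { Len = 3; CP = B0 & 0x0F; Min = 0x800; }
  else if ((B0 & 0xF8) == 0xF0) { Len = 4; CP = B0 & 0x07; Min = 0x10000; }
  else return 0; // stray continuation byte or invalid lead byte
  if (Bytes.size() < Len) return 0;
  for (unsigned I = 1; I < Len; ++I) {
    const unsigned char B = static_cast<unsigned char>(Bytes[I]);
    if ((B & 0xC0) != 0x80) return 0; // not a continuation byte
    CP = (CP << 6) | (B & 0x3F);
  }
  if (CP < Min || CP > 0x10FFFF || (CP >= 0xD800 && CP <= 0xDFFF)) return 0;
  return Len;
}

// Classify the first character of Bytes. The tables below are toy subsets
// chosen for illustration only.
UTF8Result classifyUTF8(const std::string &Bytes) {
  if (Bytes.empty()) return {UTF8Class::Invalid, 0};
  std::uint32_t CP = 0;
  const unsigned Len = decodeUTF8(Bytes, CP);
  if (Len == 0) return {UTF8Class::Invalid, 1};
  if (CP == 0x00A0 || (CP >= 0x2000 && CP <= 0x200A) || CP == 0x3000)
    return {UTF8Class::Whitespace, Len};
  if ((CP >= 0x00C0 && CP <= 0x00D6) || (CP >= 0x00D8 && CP <= 0x00F6) ||
      (CP >= 0x00F8 && CP <= 0x00FF))
    return {UTF8Class::IdentifierChar, Len};
  return {UTF8Class::Other, Len};
}
```

With these toy tables, a literal U+00FC would come back as an identifier character, a no-break space as whitespace, and a smart quote such as U+201C as Other, which is the spot where a fixit could be attached. Because the helper runs only once the lexer has already fallen into the 'default' case, the common ASCII paths would not pay for any of it.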
