On Jan 14, 2013, at 13:19, Richard Smith <[email protected]> wrote:

> As a general point, please keep in mind how we might support UTF-8 in source 
> code when working on this. The C++ standard requires that our observable 
> behavior is that we treat extended characters in the source code and UCNs 
> identically (modulo raw string literals), so the more code we can share 
> between the two, the better.
> 
> Please see the attached patch for a start on implementing UTF-8 support. One 
> notable difference between this and the UCN patch is that the character 
> validation happens in the lexer, not when we come to look up an 
> IdentifierInfo; this is necessary in order to support error recovery for 
> UTF-8 whitespace characters, and may be necessary to avoid accepts-invalids 
> for UCNs which we never convert to identifiers.

I was trying to avoid using a sentinel char value; one reason is my 
three-quarters-finished implementation of fixits for smart quotes. If we just 
assume that UTF-8 characters are rare, we can handle them in LexTokenInternal's 
'default' case and use a 'classifyUTF8()' helper, rather than smashing 
placeholder values into the character input stream.
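As a sketch of what that 'default' case could dispatch on (the enum and helper 
name here are hypothetical, just to illustrate the shape, not anything in the 
patch):

```cpp
#include <cassert>

// Hypothetical classification for a byte that falls through to
// LexTokenInternal's 'default' case. Names are illustrative only.
enum class UTF8Class { ASCII, LeadByte, ContinuationByte, Invalid };

UTF8Class classifyUTF8(unsigned char C) {
  if (C < 0x80)
    return UTF8Class::ASCII;             // the common, fast case
  if (C <= 0xBF)
    return UTF8Class::ContinuationByte;  // 10xxxxxx: never a valid start
  if ((C >= 0xC2 && C <= 0xDF) ||       // 2-byte sequence lead
      (C >= 0xE0 && C <= 0xEF) ||       // 3-byte sequence lead
      (C >= 0xF0 && C <= 0xF4))         // 4-byte sequence lead
    return UTF8Class::LeadByte;
  return UTF8Class::Invalid;            // 0xC0/0xC1 (overlong), 0xF5..0xFF
}
```

The nice property is that the hot ASCII path stays a single compare, and 
everything >= 0x80 funnels into one classification routine.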

The main difference between UCNs and literal UTF-8 is that (valid) literal 
UTF-8 will always appear literally in the source. But I guess it doesn't matter 
so much, since the only places Unicode is valid are in identifiers and in 
whitespace, and neither of those uses the output of getCharAndSize. I do, 
however, want to delay the check for whether a backslash starts a UCN, to 
avoid Eli's evil recursion problem:

char *\\
\\\
\\
\\\
\\
\\\
\u00FC;

If UCNs are processed in getCharAndSize, you end up with several recursive 
calls asking whether the first backslash starts a UCN. It doesn't, of course, 
but if getCharAndSize calls isUCNAfterSlash, you need to getCharAndSize all the 
way to the character after the final backslash to prove it. After all, this is 
a UCN, in C at least:

char *\\
\
u00FC;
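For reference, translation phase 2 can be modeled eagerly as deleting every 
backslash-newline pair. Clang instead resolves escaped newlines lazily in 
getCharAndSize, but this toy version (not Clang's code) shows how far a lexer 
may have to look to decide what actually follows a backslash:

```cpp
#include <cassert>
#include <string>

// Toy model of translation phase 2: delete each backslash that is
// immediately followed by a newline, splicing physical source lines.
std::string spliceLines(const std::string &Src) {
  std::string Out;
  for (size_t I = 0, E = Src.size(); I != E; ++I) {
    if (Src[I] == '\\' && I + 1 != E && Src[I + 1] == '\n') {
      ++I; // drop both the backslash and the newline
      continue;
    }
    Out += Src[I];
  }
  return Out;
}
```

Run on the snippet above, the splices collapse to leave "\u00FC" adjacent, 
which is why the backslash check can't stop at the next physical character.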

And once we're delaying the backslash check, I'm not sure it makes sense to 
classify the Unicode until we hit LexTokenInternal. Once we get there, though, 
I can see doing the classification there rather than in identifier creation, 
and having a (mostly) unified Unicode path after that.
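A rough sketch of what that unified tail end might look like, once both the 
UCN path and the UTF-8 path have decoded down to a codepoint. The ranges below 
are a tiny illustrative subset (Latin-1 letters, roughly per C99 Annex D, plus 
a few whitespace codepoints), not the real allowed-character tables:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical shared back end: whether a codepoint came from \uXXXX or
// from raw UTF-8, the same classification applies. Ranges are a small
// illustrative subset only.
enum class CodepointClass { Identifier, Whitespace, Disallowed };

CodepointClass classifyCodepoint(uint32_t CP) {
  if (CP == 0x00A0 || CP == 0x2028 || CP == 0x2029)
    return CodepointClass::Whitespace;   // e.g. NBSP, LS, PS
  if ((CP >= 0x00C0 && CP <= 0x00D6) || // Latin-1 letters,
      (CP >= 0x00D8 && CP <= 0x00F6) || // skipping U+00D7 'x' (multiply)
      (CP >= 0x00F8 && CP <= 0x00FF))   // and U+00F7 (divide)
    return CodepointClass::Identifier;
  return CodepointClass::Disallowed;
}
```

With something like this, the UCN decoder and the UTF-8 decoder each reduce to 
"produce a codepoint, then ask the shared classifier," which is the sharing 
Richard is asking for.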

Jordan
_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
