On Jan 14, 2013, at 17:33 , Richard Smith <[email protected]> wrote:

> On Mon, Jan 14, 2013 at 4:54 PM, Jordan Rose <[email protected]> wrote:
> 
> On Jan 14, 2013, at 13:19 , Richard Smith <[email protected]> wrote:
> 
>> As a general point, please keep in mind how we might support UTF-8 in source 
>> code when working on this. The C++ standard requires that our observable 
>> behavior is that we treat extended characters in the source code and UCNs 
>> identically (modulo raw string literals), so the more code we can share 
>> between the two, the better.
>> 
>> Please see the attached patch for a start on implementing UTF-8 support. One 
>> notable difference between this and the UCN patch is that the character 
>> validation happens in the lexer, not when we come to look up an 
>> IdentifierInfo; this is necessary in order to support error recovery for 
>> UTF-8 whitespace characters, and may be necessary to avoid accepts-invalids 
>> for UCNs which we never convert to identifiers.
> 
> I was trying to avoid using a sentinel char value; one reason is my 
> three-quarters-finished implementation of fixits for smart quotes. If we just 
> assume that UTF-8 characters are rare, we can handle them in 
> LexTokenInternal's 'default' case, and use a 'classifyUTF8()' helper rather 
> than smashing the character input stream with placeholders.
> 
> I started off with an approach which handled them in the 'default' case, but 
> the current approach ended up much cleaner (in particular, it doesn't require 
> anything special in any identifier lexing code to cope with UTF-8 characters 
> which aren't identifier characters). I don't follow the significance of the 
> smart quotes fixit, I'm afraid; my patch performs a similar fix-up for 
> Unicode whitespace characters, so I would expect that fixit to be 
> implementable either way.

I guess you'd just move the smart quotes identification to the "classify this 
Unicode" function, so okay. (OTOH, a UCN for smart quotes shouldn't really get 
a fixit for straight quotes, ever.)
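(For concreteness, the mapping such a classify function would need is tiny. A sketch, with an invented name, not anything in the patch; the code points are from Unicode's General Punctuation block, where U+2018/U+2019 are the single quotation marks and U+201C/U+201D the double ones:

```cpp
// Hypothetical sketch: map a decoded Unicode code point to the ASCII
// quote character a fix-it would suggest, or 0 if no suggestion applies.
static char asciiQuoteFor(unsigned Codepoint) {
  switch (Codepoint) {
  case 0x2018: case 0x2019: return '\''; // left/right single quotation mark
  case 0x201C: case 0x201D: return '"';  // left/right double quotation mark
  default: return 0;                     // not a smart quote; no fix-it
  }
}
```

The UCN path would simply never consult this table, which gives the "never fix a UCN to a straight quote" behavior for free.)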


> My biggest concern with the getCharAndSize approach for UTF-8 is that 
> isObviouslySimpleCharacter is an extraordinarily hot function, and the patch 
> adds a branch to it. I've not measured the impact of that, but it could kill 
> that approach if the cost is too high. Maybe there's some clever 
> (branch-free?) way of detecting '?', '\', and values > 127 which doesn't 
> trigger for any common ASCII characters, though.

A larger jump table would do this, of course, but a memory lookup is probably 
worse than a branch. Still, that is an advantage of pushing UTF-8 handling back 
to LexTokenInternal: the hot path doesn't change, and a correct C/C++ program 
will never hit the new code.
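For what it's worth, the "clever (branch-free?)" test can at least be written as a single boolean expression. A sketch (the name is invented, not Clang's), relying on the compiler lowering the non-short-circuiting | into flag-setting instructions rather than conditional branches:

```cpp
// Hypothetical sketch: true exactly for '?' (0x3F), '\\' (0x5C), and any
// byte >= 0x80; false for every other ASCII character. Using | instead
// of || lets the compiler evaluate all three tests without branching.
static bool isNonObviousChar(unsigned char C) {
  return (C == '?') | (C == '\\') | (C >= 0x80);
}
```

Whether that beats the current two-comparison check in isObviouslySimpleCharacter would still need measuring, of course.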


> The main difference between UCNs and literal UTF-8 is that (valid) literal 
> UTF-8 will always appear literally in the source. But I guess it doesn't 
> matter so much since the only place Unicode is valid is in identifiers and as 
> whitespace, and neither of those will use the output of getCharAndSize. I do, 
> however, want to delay the check for whether a backslash starts a UCN to avoid 
> Eli's evil recursion problem:
> 
> char *\\
> \\\
> \\
> \\\
> \\
> \\\
> \u00FC;
> 
> If UCNs are processed in getCharAndSize, you end up with several recursive 
> calls asking if the first backslash starts a UCN.
> 
> It doesn't, of course, but if getCharAndSize calls isUCNAfterSlash you need 
> to getCharAndSize all the way to the character after the final backslash to 
> prove it.
> 
> Once you get to the third backslash and see it's not followed by whitespace, 
> you can stop, right? Is this just a performance concern for a pathological 
> case, or is there more to it than that?

If getCharAndSize is responsible for identifying UCNs, then you have to 
actually read the whole UCN to give a correct code for it. On the flip side, to 
properly read a UCN in C mode, you have to go through getCharAndSize instead of 
reading directly, because you want newline splicing to happen. So even asking 
if the next character is a UCN ends up requiring you to read all the 
backslashes, which is fairly stupid on Clang's part.

We could copy the newline-splicing logic into getCharAndSize, or we could just 
ignore this pathological case. Or we could go with my approach (handle UCNs in 
LexTokenInternal and later) which makes the problem go away because "backslash" 
doesn't imply "eagerly consume" unless spelling-followed by a newline.
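To make the recursion concrete, here is a sketch of a splice-aware peek (invented names, not Clang's actual code): deciding whether a backslash begins a UCN forces you to walk past every backslash-newline pair first, which is exactly the arbitrarily-long scan the evil example above provokes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: skip any run of backslash-newline splices
// starting at index I, returning the index of the next significant
// character.  Each iteration models one escaped newline being deleted.
static size_t skipSplices(const std::string &S, size_t I) {
  while (I + 1 < S.size() && S[I] == '\\' && S[I + 1] == '\n')
    I += 2; // consume one backslash-newline pair and keep looking
  return I;
}

// Does the backslash at index I begin a universal-character-name?
// Answering requires scanning through every splice that follows it.
static bool startsUCN(const std::string &S, size_t I) {
  size_t Next = skipSplices(S, I + 1); // may walk far past I
  return Next < S.size() && (S[Next] == 'u' || S[Next] == 'U');
}
```

Handling UCNs in LexTokenInternal sidesteps this because the backslash case there only has to answer the cheap question "is this an escaped newline?", not the expensive one.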


> After all, this is a UCN, in C at least:
> 
> char *\
> \
> u00FC;
> 
> And once we're delaying the backslash, I'm not sure it makes sense to 
> classify the Unicode until we hit LexTokenInternal. Once we get there, 
> though, I can see it making sense to do it there rather than in identifier 
> creation, and have a (mostly) unified Unicode path after that.
> 
> Yes, if we need to delay handling the backslash, that makes sense to me, but 
> it's likely to be quite an invasive change.

That's what my patch did. I didn't think it was that invasive.

_______________________________________________
cfe-commits mailing list
[email protected]
http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits