On Jan 17, 2013, at 17:43, Richard Smith <[email protected]> wrote:

> On Thu, Jan 17, 2013 at 11:31 AM, Jordan Rose <[email protected]> wrote:
>> How about this approach?
>> - LexUnicode mirrors LexTokenInternal, dispatching to the proper lex method 
>> based on the first Unicode character in a token.
>> - UCNs are validated in readUCN (called by LexTokenInternal and 
>> LexIdentifier). The specific identifier restrictions are checked in 
>> LexUnicode and LexIdentifier.
>> - UCNs are recomputed in Preprocessor::LookUpIdentifierInfo because we start 
>> with the spelling info there, but all the validation has already happened.
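>> 
>> In rough outline, the dispatch looks like this (a sketch only, not the 
>> patch itself; the two classification helpers stand in for the real 
>> Annex-based tables, of which only a tiny excerpt appears here):
>> 
>>   #include <cstdint>
>> 
>>   enum LexKind { LK_Identifier, LK_Whitespace, LK_Invalid };
>> 
>>   static bool isAllowedInitiallyIDChar(uint32_t C) {
>>     return C == 0x00A8 || (C >= 0x00C0 && C <= 0x00D6);  // excerpt only
>>   }
>>   static bool isUnicodeWhitespace(uint32_t C) {
>>     return C == 0x00A0 || C == 0x2028 || C == 0x2029;    // excerpt only
>>   }
>> 
>>   // Mirrors how LexUnicode picks a lex method from a token's
>>   // first code point.
>>   LexKind classifyTokenStart(uint32_t CodePoint) {
>>     if (isAllowedInitiallyIDChar(CodePoint)) return LK_Identifier;
>>     if (isUnicodeWhitespace(CodePoint))      return LK_Whitespace;
>>     return LK_Invalid;  // single-character token; diagnose
>>   }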
> 
> This seems reasonable to me. It looks like this will now correctly reject
> 
> #if 0
> \u0000
> #endif
> 
> ... which I believe the old approach accepted.
> 
>> With these known flaws:
>> - the classification of characters in LexUnicode should be more efficient.
>> - poor recovery for a non-identifier UCN in an identifier. Right now I just 
>> take that to mean "end of identifier", which is the most pedantically 
>> correct thing to do, but it's probably not what's intended.
>> - still needs more tests, of course
>> 
>> FWIW, though, I'm not sure unifying literal Unicode and UCNs is actually a 
>> great idea. The case where it matters most (validation of identifier 
>> characters) is pretty easy to separate out into a helper function (and 
>> indeed it already is). The other cases (accepting Unicode whitespace and 
>> fixits for accidental Unicode) only make sense for literal Unicode, not 
>> escaped Unicode.
> 
> For build systems which automatically convert UTF-8 to UCNs for gcc
> compatibility[0], it's not implausible that UCN-encoded smart quotes
> or whitespace might end up in Clang input.
> 
>  [0]: http://gcc.gnu.org/wiki/FAQ#utf8_identifiers
> 
>> Anyway, what do you think?
> 
> Generally, I like this. Some specific comments:
> 
> In readUCN, \u123 isn't an error; it's two separate tokens (a lone
> backslash followed by the identifier u123). Perhaps the diagnostics
> here should be warnings instead?

Since a lone backslash always produces an invalid token, it seems reasonable to 
leave these as errors, but yes, I now need to go back and not consume anything 
in that case.
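
Concretely, I'll advance a scratch pointer and only commit on success. A 
standalone sketch of the shape (diagnostics elided, and ignoring escaped 
newlines, trigraphs, and the code-point validity checks the real readUCN 
has to do):

  #include <cstdint>

  static int hexValue(char C) {
    if (C >= '0' && C <= '9') return C - '0';
    if (C >= 'a' && C <= 'f') return C - 'a' + 10;
    if (C >= 'A' && C <= 'F') return C - 'A' + 10;
    return -1;  // also hit at a NUL terminator, so we stop in time
  }

  // Advances CurPtr only past a complete \uXXXX or \UXXXXXXXX escape;
  // otherwise consumes nothing, so "\u123" lexes as a lone backslash
  // (still diagnosed) followed by the identifier "u123".
  bool tryReadUCN(const char *&CurPtr, uint32_t &CodePoint) {
    if (CurPtr[0] != '\\' || (CurPtr[1] != 'u' && CurPtr[1] != 'U'))
      return false;
    unsigned NumDigits = (CurPtr[1] == 'u') ? 4 : 8;
    const char *P = CurPtr + 2;           // scratch pointer
    uint32_t Value = 0;
    for (unsigned I = 0; I != NumDigits; ++I) {
      int Digit = hexValue(P[I]);
      if (Digit < 0) return false;        // incomplete UCN: consume nothing
      Value = (Value << 4) | uint32_t(Digit);
    }
    CurPtr = P + NumDigits;               // commit only on success
    CodePoint = Value;
    return true;
  }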


> Likewise, in LexIdentifier and LexTokenInternal, you shouldn't call
> ConsumeChar until you know that you have actually got a UCN.
> 
> Likewise, in LexUnicode, err_non_ascii can be at most a warning; we're
> required to produce a preprocessing-token for each character which
> can't be the start of another token. The err_invalid_utf8 part in
> LexTokenInternal is fine though, since that's covered by the
> implementation-defined mapping from physical source file characters to
> the basic source character set.

*sigh* This seems like more justification for distinguishing UCNs from actual 
Unicode in order to handle the "accidental" case, but I will make sure UCNs as 
written never fall into that block. (Actually, I'll just go with your way for 
now and handle that QoI issue later.)
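
For the record, here's the sort of conforming input that forces the warning 
(the stray character only ever exists as a preprocessing-token, because the 
macro is never expanded):

  // U+263A can't start any token, but it still has to form a
  // preprocessing-token of its own; since SMILEY is never expanded,
  // the program is conforming and deserves at most a warning.
  #define SMILEY ☺
  int main(void) { return 0; }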

