On Jan 17, 2013, at 17:43, Richard Smith <[email protected]> wrote:
> On Thu, Jan 17, 2013 at 11:31 AM, Jordan Rose <[email protected]> wrote:
>> How about this approach?
>> - LexUnicode mirrors LexTokenInternal, dispatching to the proper lex method
>> based on the first Unicode character in a token.
>> - UCNs are validated in readUCN (called by LexTokenInternal and
>> LexIdentifier). The specific identifier restrictions are checked in
>> LexUnicode and LexIdentifier.
>> - UCNs are recomputed in Preprocessor::LookUpIdentifierInfo because we start
>> with the spelling info there, but all the validation has already happened.
>
> This seems reasonable to me. It looks like this will now correctly reject
>
> #if 0
> \u0000
> #endif
>
> ... which I believe the old approach accepted.
>
>> With these known flaws:
>> - the classification of characters in LexUnicode should be more efficient.
>> - poor recovery for a non-identifier UCN in an identifier. Right now I just
>> take that to mean "end of identifier", which is the most pedantically
>> correct thing to do, but it's probably not what's intended.
>> - still needs more tests, of course
>>
>> FWIW, though, I'm not sure unifying literal Unicode and UCNs is actually a
>> great idea. The case where it matters most (validation of identifier
>> characters) is pretty easy to separate out into a helper function (and
>> indeed it already is). The other cases (accepting Unicode whitespace and
>> fix-its for accidental Unicode) only make sense for literal Unicode, not
>> escaped Unicode.
>
> For build systems that automatically convert UTF-8 to UCNs for gcc
> compatibility [0], it's not implausible that UCN-encoded smart quotes
> or whitespace might end up in Clang input.
>
> [0]: http://gcc.gnu.org/wiki/FAQ#utf8_identifiers
>
>> Anyway, what do you think?
>
> Generally, I like this. Some specific comments:
>
> In readUCN, \u123 isn't an error; it's two separate tokens. Perhaps
> the diagnostics here should be warnings instead?

Since a lone backslash always produces an invalid token, it seems reasonable to leave these as errors, but yes, I now need to go back and not consume anything in that case (see the sketch at the end of this message).

> Likewise, in LexIdentifier and LexTokenInternal, you shouldn't call
> ConsumeChar until you know that you have actually got a UCN.
>
> Likewise, in LexUnicode, err_non_ascii can be at most a warning; we're
> required to produce a preprocessing-token for each character which
> can't be the start of another token. The err_invalid_utf8 part in
> LexTokenInternal is fine, though, since that's covered by the
> implementation-defined mapping from physical source file characters to
> the basic source character set.

*sigh* This seems like more justification for distinguishing UCNs from actual Unicode in order to handle the "accidental" case, but I will make sure UCNs as written never fall into that block. (Actually, I'll just go with your way for now and handle that QoI issue later.)
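For concreteness, here is the rough shape I have in mind for the no-consume behavior. This is a standalone sketch with made-up names (tryReadUCN, a raw char pointer instead of the Lexer's CurPtr/ConsumeChar machinery), not the actual patch:

  #include <cstdint>
  #include <cstdio>

  static bool isHexDigit(char C) {
    return (C >= '0' && C <= '9') || (C >= 'a' && C <= 'f') ||
           (C >= 'A' && C <= 'F');
  }

  static uint32_t hexValue(char C) {
    if (C >= '0' && C <= '9') return C - '0';
    if (C >= 'a' && C <= 'f') return C - 'a' + 10;
    return C - 'A' + 10;
  }

  // Ptr points at the backslash. On success, advance Ptr past the whole
  // escape and return the code point; on failure, leave Ptr completely
  // untouched, so the backslash becomes its own (invalid) token and
  // "u123" is lexed as an ordinary identifier afterwards.
  static bool tryReadUCN(const char *&Ptr, uint32_t &CodePoint) {
    const char *P = Ptr;                 // speculative cursor
    if (*P != '\\') return false;
    ++P;
    unsigned NumDigits;
    if (*P == 'u')      NumDigits = 4;
    else if (*P == 'U') NumDigits = 8;
    else return false;                   // not a UCN at all
    ++P;

    uint32_t Value = 0;
    for (unsigned I = 0; I != NumDigits; ++I) {
      if (!isHexDigit(P[I]))
        return false;                    // too few digits, e.g. \u123
      Value = (Value << 4) | hexValue(P[I]);
    }

    Ptr = P + NumDigits;                 // commit only once it's really a UCN
    CodePoint = Value;
    return true;
  }

  int main() {
    uint32_t CP = 0;
    const char *P = "\\u0041x";
    bool OK = tryReadUCN(P, CP);         // succeeds, cursor now at 'x'
    std::printf("good: %d cp=U+%04X next='%c'\n", OK, (unsigned)CP, *P);
    const char *Bad = "\\u123";
    P = Bad;
    OK = tryReadUCN(P, CP);              // fails, cursor not moved at all
    std::printf("bad:  %d cursor-unmoved=%d\n", OK, (int)(P == Bad));
  }

The point is just that the speculative cursor is only committed once the entire escape has been verified, so a failed match consumes nothing and \u123 naturally falls out as two tokens.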
