On Jan 18, 2013, at 11:36, Richard Smith <[email protected]> wrote:
> On Fri, Jan 18, 2013 at 11:20 AM, Jordan Rose <[email protected]> wrote:
>>
>> On Jan 17, 2013, at 17:43, Richard Smith <[email protected]> wrote:
>>
>>> On Thu, Jan 17, 2013 at 11:31 AM, Jordan Rose <[email protected]> wrote:
>>>> How about this approach?
>>>> - LexUnicode mirrors LexTokenInternal, dispatching to the proper lex
>>>>   method based on the first Unicode character in a token.
>>>> - UCNs are validated in readUCN (called by LexTokenInternal and
>>>>   LexIdentifier). The specific identifier restrictions are checked in
>>>>   LexUnicode and LexIdentifier.
>>>> - UCNs are recomputed in Preprocessor::LookUpIdentifierInfo because
>>>>   we start with the spelling info there, but all the validation has
>>>>   already happened.
>>>
>>> This seems reasonable to me. It looks like this will now correctly
>>> reject
>>>
>>>   #if 0
>>>   \u0000
>>>   #endif
>>>
>>> ... which I believe the old approach accepted.
>>>
>>>> With these known flaws:
>>>> - the classification of characters in LexUnicode should be more
>>>>   efficient.
>>>> - poor recovery for a non-identifier UCN in an identifier. Right now
>>>>   I just take that to mean "end of identifier", which is the most
>>>>   pedantically correct thing to do, but it's probably not what's
>>>>   intended.
>>>> - still needs more tests, of course
>>>>
>>>> FWIW, though, I'm not sure unifying literal Unicode and UCNs is
>>>> actually a great idea. The case where it matters most (validation of
>>>> identifier characters) is pretty easy to separate out into a helper
>>>> function (and indeed it already is). The other cases (accepting
>>>> Unicode whitespace and fix-its for accidental Unicode) only make
>>>> sense for literal Unicode, not escaped Unicode.
>>>
>>> For build systems which automatically convert UTF-8 to UCNs for gcc
>>> compatibility [0], it's not implausible that UCN-encoded smart quotes
>>> or whitespace might end up in Clang's input.
>>>
>>> [0]: http://gcc.gnu.org/wiki/FAQ#utf8_identifiers
>>>
>>>> Anyway, what do you think?
>>>
>>> Generally, I like this. Some specific comments:
>>>
>>> In readUCN, \u123 isn't an error; it's two separate tokens. Perhaps
>>> the diagnostics here should be warnings instead?
>>
>> Since a lone backslash always produces an invalid token, it seems
>> reasonable to leave these as errors, but yes, I now need to go back
>> and not consume anything in that case.
>
> The issue is that it's not an invalid token at this phase of
> translation. If the preprocessing-token is not converted to a token,
> it's not an error. For instance, it might be skipped in a #if 0, or it
> might be stringized, or it might be in a macro which is never
> expanded.

*sigh* Right. Warnings it is.

>>> Likewise, in LexIdentifier and LexTokenInternal, you shouldn't call
>>> ConsumeChar until you know that you've actually got a UCN.
>>>
>>> Likewise, in LexUnicode, err_non_ascii can be at most a warning;
>>> we're required to produce a preprocessing-token for each character
>>> which can't be the start of another token. The err_invalid_utf8 part
>>> in LexTokenInternal is fine, though, since that's covered by the
>>> implementation-defined mapping from physical source file characters
>>> to the basic source character set.
>>
>> *sigh* This seems like more justification for distinguishing UCNs from
>> actual Unicode to handle the "accidental" case, but I will make sure
>> UCNs as written never fall into that block.

(Actually, I'll just go with your way for now and handle that QoI issue
later.)
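
Expanding on the "don't consume anything" point above, here's roughly
the shape I have in mind for readUCN. This is a standalone sketch, not
the actual patch: tryReadUCN, isHexDigitChar, and hexDigitValue are
placeholder names, and the real version would still need to go through
getCharAndSize so trigraphs and escaped newlines inside the escape keep
working.

  #include <cstdint>
  #include <optional>

  // Locale-independent hex-digit test (isxdigit depends on the locale).
  static bool isHexDigitChar(char C) {
    return (C >= '0' && C <= '9') || (C >= 'a' && C <= 'f') ||
           (C >= 'A' && C <= 'F');
  }

  static unsigned hexDigitValue(char C) {
    if (C >= '0' && C <= '9') return C - '0';
    if (C >= 'a' && C <= 'f') return C - 'a' + 10;
    return C - 'A' + 10;
  }

  // Try to read a UCN (\uXXXX or \UXXXXXXXX) starting at *Ptr. On
  // success, advance Ptr past the UCN and return the code point. On
  // failure, leave Ptr untouched and return nothing, so the caller can
  // lex the backslash as its own preprocessing-token.
  static std::optional<uint32_t> tryReadUCN(const char *&Ptr,
                                            const char *End) {
    const char *P = Ptr;  // speculative cursor; commit only on success
    if (End - P < 2 || P[0] != '\\' || (P[1] != 'u' && P[1] != 'U'))
      return std::nullopt;
    const unsigned NumDigits = (P[1] == 'u') ? 4 : 8;
    P += 2;
    uint32_t CodePoint = 0;
    for (unsigned I = 0; I != NumDigits; ++I, ++P) {
      if (P == End || !isHexDigitChar(*P))
        return std::nullopt;  // too few digits: not a UCN, consume nothing
      CodePoint = (CodePoint << 4) | hexDigitValue(*P);
    }
    Ptr = P;  // a complete UCN; now it's safe to consume the characters
    return CodePoint;
  }

With that shape, \u123 costs us nothing: we return without moving the
cursor, the backslash becomes its own preprocessing-token, u123 lexes as
an ordinary identifier, and the diagnostic can be a warning instead of
an error.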
>
> One thing I missed before: please don't use iscntrl or isxdigit;
> they're dependent on the current locale (and, to be safe, don't use
> isascii either).

I understand for iscntrl, but isascii appears a lot and the expanded
form is ugly. I guess I'll write my own wrapper.
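Probably something along these lines; this is just a sketch of the
shape, since I haven't picked final names or a home for it yet:

  // Locale-independent classification: keyed off raw character values,
  // so the answers can't change under setlocale() the way iscntrl and
  // isxdigit can.
  static inline bool isASCII(char C) {
    return static_cast<unsigned char>(C) <= 127;
  }

  static inline bool isControlChar(char C) {
    unsigned char UC = static_cast<unsigned char>(C);
    return UC < 0x20 || UC == 0x7F;  // C0 controls plus DEL
  }

  static inline bool isHexDigitChar(char C) {
    return (C >= '0' && C <= '9') || (C >= 'a' && C <= 'f') ||
           (C >= 'A' && C <= 'F');
  }

That keeps call sites as readable as isascii(c) was, without the locale
dependence.

Jordan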
