On Jan 18, 2013, at 11:36, Richard Smith <[email protected]> wrote:
> On Fri, Jan 18, 2013 at 11:20 AM, Jordan Rose <[email protected]> wrote:
>>
>> On Jan 17, 2013, at 17:43, Richard Smith <[email protected]> wrote:
>>
>>> On Thu, Jan 17, 2013 at 11:31 AM, Jordan Rose <[email protected]> wrote:
>>>> How about this approach?
>>>> - LexUnicode mirrors LexTokenInternal, dispatching to the proper lex
>>>>   method based on the first Unicode character in a token.
>>>> - UCNs are validated in readUCN (called by LexTokenInternal and
>>>>   LexIdentifier). The specific identifier restrictions are checked in
>>>>   LexUnicode and LexIdentifier.
>>>> - UCNs are recomputed in Preprocessor::LookUpIdentifierInfo because
>>>>   we start with the spelling info there, but all the validation has
>>>>   already happened.
>>>
>>> This seems reasonable to me. It looks like this will now correctly
>>> reject
>>>
>>>   #if 0
>>>   \u0000
>>>   #endif
>>>
>>> ... which I believe the old approach accepted.
>>>
>>>> With these known flaws:
>>>> - the classification of characters in LexUnicode should be more
>>>>   efficient.
>>>> - poor recovery for a non-identifier UCN in an identifier. Right now
>>>>   I just take that to mean "end of identifier", which is the most
>>>>   pedantically correct thing to do, but it's probably not what's
>>>>   intended.
>>>> - still needs more tests, of course
>>>>
>>>> FWIW, though, I'm not sure unifying literal Unicode and UCNs is
>>>> actually a great idea. The case where it matters most (validation of
>>>> identifier characters) is pretty easy to separate out into a helper
>>>> function (and indeed it already is). The other cases (accepting
>>>> Unicode whitespace and fix-its for accidental Unicode) only make
>>>> sense for literal Unicode, not escaped Unicode.
>>>
>>> For build systems which automatically convert UTF-8 to UCNs for gcc
>>> compatibility [0], it's not implausible that UCN-encoded smart quotes
>>> or whitespace might end up in Clang's input.
>>>
>>> [0]: http://gcc.gnu.org/wiki/FAQ#utf8_identifiers
>>>
>>>> Anyway, what do you think?
>>>
>>> Generally, I like this. Some specific comments:
>>>
>>> In readUCN, \u123 isn't an error; it's two separate tokens. Perhaps
>>> the diagnostics here should be warnings instead?
>>
>> Since a lone backslash always produces an invalid token, it seems
>> reasonable to leave these as errors, but yes, I now need to go back
>> and not consume anything in that case.
>
> The issue is that it's not an invalid token at this phase of
> translation. If the preprocessing-token is not converted to a token,
> it's not an error. For instance, it might be skipped in a #if 0, or it
> might be stringized, or it might be in a macro which is never
> expanded.

*sigh* Right. Warnings it is.

>>> Likewise, in LexIdentifier and LexTokenInternal, you shouldn't call
>>> ConsumeChar until you know that you've actually got a UCN.
>>>
>>> Likewise, in LexUnicode, err_non_ascii can be at most a warning;
>>> we're required to produce a preprocessing-token for each character
>>> which can't be the start of another token. The err_invalid_utf8 part
>>> in LexTokenInternal is fine, though, since that's covered by the
>>> implementation-defined mapping from physical source file characters
>>> to the basic source character set.
>>
>> *sigh* This seems like more justification for distinguishing UCNs from
>> actual Unicode to handle the "accidental" case, but I will make sure
>> UCNs as written never fall into that block.

(Actually, I'll just go with your way for now and handle that QoI issue
later.)
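
Expanding on the "don't consume anything" point above, here's roughly
the shape I have in mind for readUCN. This is a standalone sketch, not
the actual patch: tryReadUCN, isHexDigitChar, and hexDigitValue are
placeholder names, and the real version would still need to go through
getCharAndSize so trigraphs and escaped newlines inside the escape keep
working.

  #include <cstdint>
  #include <optional>

  // Locale-independent hex-digit test (isxdigit depends on the locale).
  static bool isHexDigitChar(char C) {
    return (C >= '0' && C <= '9') || (C >= 'a' && C <= 'f') ||
           (C >= 'A' && C <= 'F');
  }

  static unsigned hexDigitValue(char C) {
    if (C >= '0' && C <= '9') return C - '0';
    if (C >= 'a' && C <= 'f') return C - 'a' + 10;
    return C - 'A' + 10;
  }

  // Try to read a UCN (\uXXXX or \UXXXXXXXX) starting at *Ptr. On
  // success, advance Ptr past the UCN and return the code point. On
  // failure, leave Ptr untouched and return nothing, so the caller can
  // lex the backslash as its own preprocessing-token.
  static std::optional<uint32_t> tryReadUCN(const char *&Ptr,
                                            const char *End) {
    const char *P = Ptr;  // speculative cursor; commit only on success
    if (End - P < 2 || P[0] != '\\' || (P[1] != 'u' && P[1] != 'U'))
      return std::nullopt;
    const unsigned NumDigits = (P[1] == 'u') ? 4 : 8;
    P += 2;
    uint32_t CodePoint = 0;
    for (unsigned I = 0; I != NumDigits; ++I, ++P) {
      if (P == End || !isHexDigitChar(*P))
        return std::nullopt;  // too few digits: not a UCN, consume nothing
      CodePoint = (CodePoint << 4) | hexDigitValue(*P);
    }
    Ptr = P;  // a complete UCN; now it's safe to consume the characters
    return CodePoint;
  }

With that shape, \u123 costs us nothing: we return without moving the
cursor, the backslash becomes its own preprocessing-token, u123 lexes as
an ordinary identifier, and the diagnostic can be a warning instead of
an error.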
>
> One thing I missed before: please don't use iscntrl or isxdigit;
> they're dependent on the current locale (and, to be safe, don't use
> isascii either).

I understand for iscntrl, but isascii appears a lot and the expanded
form is ugly. I guess I'll write my own wrapper.
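Probably something along these lines; this is just a sketch of the
shape, since I haven't picked final names or a home for it yet:

  // Locale-independent classification: keyed off raw character values,
  // so the answers can't change under setlocale() the way iscntrl and
  // isxdigit can.
  static inline bool isASCII(char C) {
    return static_cast<unsigned char>(C) <= 127;
  }

  static inline bool isControlChar(char C) {
    unsigned char UC = static_cast<unsigned char>(C);
    return UC < 0x20 || UC == 0x7F;  // C0 controls plus DEL
  }

  static inline bool isHexDigitChar(char C) {
    return (C >= '0' && C <= '9') || (C >= 'a' && C <= 'f') ||
           (C >= 'A' && C <= 'F');
  }

That keeps call sites as readable as isascii(c) was, without the locale
dependence.

Jordan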
