On Thu, Nov 15, 2012 at 7:17 PM, Eli Friedman <eli.fried...@gmail.com> wrote: > Patch attached. Adds support universal character names in identifiers, e.g.: > > char * \u00FC = "u-umlaut"; > > Not that it's particularly useful, but it's a longstanding hole in our > C99 support. > > The general outline of the approach is that the spelling of the > identifier token contains the UCN, but the IdentifierInfo for the > identifier token contains pure UTF-8. I think this is reasonable > given the C phases of translation, and consistent with the way we > handle UCNs in other contexts.
This seems like a good approach to me. > I'm intentionally leaving out most of the support for universal > character names in user-defined literals, to try and reduce the size > of the patch. Index: include/clang/Lex/Lexer.h =================================================================== --- include/clang/Lex/Lexer.h (revision 168014) +++ include/clang/Lex/Lexer.h (working copy) @@ -573,6 +573,10 @@ void cutOffLexing() { BufferPtr = BufferEnd; } bool isHexaLiteral(const char *Start, const LangOptions &LangOpts); + + bool isUCNAfterSlash(const char *CurPtr, unsigned Size, unsigned SizeTmp[5]); + void ConsumeUCNAfterSlash(const char *&CurPtr, unsigned SizeTmp[5], + Token &Result); These [5]s should be [9]s. Also, how about wrapping the unsigned[9] in a struct so it doesn't have to be repeated in so many places, or at least passing it by reference so we'll get a compile error if the caller's array is the wrong size? Index: include/clang/Lex/Token.h =================================================================== --- include/clang/Lex/Token.h (revision 168014) +++ include/clang/Lex/Token.h (working copy) @@ -74,9 +74,10 @@ StartOfLine = 0x01, // At start of line or only after whitespace. LeadingSpace = 0x02, // Whitespace exists before this token. DisableExpand = 0x04, // This identifier may never be macro expanded. - NeedsCleaning = 0x08, // Contained an escaped newline or trigraph. + NeedsCleaning = 0x08, // Contained an escaped newline or trigraph. LeadingEmptyMacro = 0x10, // Empty macro exists before this token. - HasUDSuffix = 0x20 // This string or character literal has a ud-suffix. + HasUDSuffix = 0x20, // This string or character literal has a ud-suffix. + HasUCN = 0x40 // This identifier contains a UCN Missing full stop. ;-) The set of permitted characters appears to be correct only for C11 and C++11: it seems that C99 (+TR1,2,3) and C++98 (+TC1) permitted smaller sets (and not even the same smaller set!). C++98 used the list from ISO/IEC PDTR 10176 and C99 used ISO/IEC TR 10176:1998 (surprisingly, C++03 didn't move from the PDTR to the 1998 TR). If you're doing this to have a complete C99 (and C++98, modulo 'export') implementation, then maybe you care about this... :) + if (UCNIdentifierBuffer.empty() ? !isAllowedInitiallyIDChar(UcnVal) : + !isAllowedIDChar(UcnVal)) { + StringRef CurCharacter = CleanedStr.substr(i, NumChars); + Diag(Identifier, diag::err_ucn_invalid_in_id) << CurCharacter; It'd be nice for the diagnostic to be different for UCNs which can't appear at all versus UCNs which can't appear at the start of an identifier. > I know this patch is a little lacking in terms of tests, but I'm not > really sure what tests we need; suggestions welcome. UCNs which resolve to characters in the basic source character set. Identifier emission in diagnostics. Stringization of tokens containing UCNs. (If I'm reading this right, we have a pre-existing bug here, in that characters outside the basic source character set must be converted into UCNs in the resulting string literal.) ud-suffixes for integer and floating-point. Do you want to ExtWarn on this in C89? _______________________________________________ cfe-commits mailing list cfe-commits@cs.uiuc.edu http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits