On Wed, Dec 04, 2002 at 08:55:03AM -0500, Jungshik Shin wrote:
>
> On Wed, 4 Dec 2002, Keld Jørn Simonsen wrote:
>
> > On Tue, Dec 03, 2002 at 10:33:19PM -0800, H. Peter Anvin wrote:
>
> > > Maybe a --normalize-utf option to the linker might be a good idea, but
> > > it should be an option, IMO.
> >
> > First of all, the standard does not refer to Unicode, but to 10646.
> > And the C standard does not use Unicode normalization.
> > There is a list in the ISO C standard of 10646 characters that are
> > allowed in identifiers, and these do not have alternate representations.
>
> Thank you for the note.
>
> I found FCD of ISO/IEC 9899 1999 (N2794 at
> http://wwwold.dkuug.dk/jtc1/sc22/open/n2794). It dates from Aug.,
> 1998. In Annex I 'Universal Character names for identifiers' (page
> 487. If you use Acroread to view PDF version, it's 499), a set of
> characters allowed are listed. (More or less identical list is found at
> http://std.dkuug.dk/JTC1/SC22/WG20/docs/standards#10176) Basically ISO C99
> seems to avoid problems arising from multiple representation issues by
> allowing only precomposed characters in identifiers (is there any change in
> this regard in the finally approved ISO/IEC 9899 1999?)

Yes, they do not allow combining characters in the final approved ISO C99.

> Keld's statement
> that they do not have alternate representations is not right.
> If that's the case, characters like 'Latin Small Letter with Macron'
> or 'Hangul Syllable Gga' for which there are alternate representations
> should not be present in the list, but they are listed as allowed.

The C99 standard only allows one representation of a character.
Actually each character is a unique character according to IS 10646, so
the base standard does not have alternative representations, by definition.

>
> What ISO C99 seems to do is to shift the burden of normalization to
> editors or whatever tool used by programmers to edit source files from
> compilers and linkers. That's fine (editors can do that) and is perhaps
> a wise decision (preventing potential troubles from propagating thru
> a compiler-linker chain at the earliest stage by issuing an error and
> stopping compilation), but there's a little trouble with allowing only
> precomposed characters. Both ISO/IEC JTC1/SC2/WG2 and UTC would not encode
> any more precomposed characters which can be represented with existing
> base characters followed by one or more combining characters.

Yes, that is an unfortunate policy, but I am not sure it holds.

> However,
> 'combining diacritical marks' (e.g. \u0300 - \u0362) are not allowed in
> identifiers so that 'any character' that's not encoded as a precomposed
> form can't be used in identifiers. Some people would resent not being able
> to use 'their characters' in identifiers and may use it to make a case for
> encoding precomposed forms of theirs in ISO 10646. How about references
> to filenames (as in '#include directive') with combining diacritic
> marks that are parts of characters NOT encoded in precomposed form?
> Aha, they can use '\unnnn, or \Unnnnnnnn)...

Filenames are not covered by C99 extended identifiers specifications, AFAIK.

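To make the precomposed-versus-combining point concrete, here is a small
sketch in Python (chosen only for illustration; the thread itself is about
C99). It uses the standard unicodedata module to show that U+0101 (Latin
small letter a with macron) and U+AE4C (Hangul syllable GGA) each have a
decomposed form under Unicode normalization, even though 10646 treats every
code point as a distinct character. The helper names are made up for this
example, and the U+0300..U+0362 range is simply the one quoted above, not a
complete list of combining characters.

```python
import unicodedata


def representations(text):
    """Return (NFC, NFD) code point lists for text, formatted as U+XXXX."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return ([f"U+{ord(c):04X}" for c in nfc],
            [f"U+{ord(c):04X}" for c in nfd])


def has_combining_mark(text):
    """True if any code point lies in U+0300..U+0362 (range quoted in the thread)."""
    return any(0x0300 <= ord(c) <= 0x0362 for c in text)


# Hypothetical sample set: precomposed characters of the kind discussed above.
A_MACRON = "\u0101"     # LATIN SMALL LETTER A WITH MACRON
HANGUL_GGA = "\uAE4C"   # HANGUL SYLLABLE GGA

if __name__ == "__main__":
    for ch in (A_MACRON, HANGUL_GGA):
        nfc, nfd = representations(ch)
        print(unicodedata.name(ch), "precomposed:", nfc, "decomposed:", nfd)
```

An identifier check that accepts only the precomposed code points and
rejects the U+0300..U+0362 range would accept "\u0101" but reject the
canonically equivalent sequence "a\u0304", which is the situation the
quoted text describes.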
Kind regards
keld
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
