Re: gcc identifiers

Keld Jørn Simonsen Wed, 04 Dec 2002 06:59:40 -0800

On Wed, Dec 04, 2002 at 08:55:03AM -0500, Jungshik Shin wrote:
> 
> On Wed, 4 Dec 2002, Keld Jørn Simonsen wrote:
> 
> > On Tue, Dec 03, 2002 at 10:33:19PM -0800, H. Peter Anvin wrote:
> 
> > > Maybe a --normalize-utf option to the linker might be a good idea, but
> > > it should be an option, IMO.
> >
> > First of all, the standard does not refer to Unicode, but to 10646.
> > And the C standard does not use Unicode normalization.
> > There is a list in the ISO C standard of 10646 characters that are
> > allowed in identifiers, and these do not have alternate representations.
> 
>   Thank you for the note.
> 
>   I found FCD of ISO/IEC 9899 1999 (N2794 at
> http://wwwold.dkuug.dk/jtc1/sc22/open/n2794). It dates from Aug.,
> 1998.  In Annex I 'Universal Character names for identifiers'(page
> 487. If you use Acroread  to view PDF version, it's 499), a set of
> characters allowed are listed. (More or less identical list is found at
> http://std.dkuug.dk/JTC1/SC22/WG20/docs/standards#10176) Basically ISO C99
> seems to avoid problems arising from multiple representation issues by
> allowing only precomposed characters in identifiers(is there any change in
> this regard in the finally approved ISO/IEC 9899 1999?)


Yes, the finally approved ISO C99 does not allow combining characters
in identifiers.

> Keld's statement
> that they do not have alternate representations is not right.
> If that's the case, characters like 'Latin Small Letter with Macron'
> or 'Hangul Syllable Gga' for which there are alternate representations
> should not be present in the list, but they are listed as allowed.

The C99 standard only allows one representation of a character.
Actually, each character is a unique character according to IS 10646,
so the base standard does not have alternative representations,
by definition.
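
To make the two viewpoints concrete, here is a minimal sketch in Python
(not part of the original exchange). It assumes Python's standard
unicodedata module and Unicode's NFD normalization form, which 10646 itself
does not define. The helper name show_decomposition is hypothetical. It
shows that the two characters Jungshik mentions are single code points in
10646, yet have a canonically equivalent decomposed sequence under Unicode
normalization.

```python
import unicodedata

def show_decomposition(ch):
    """Return the code points of the NFD (canonically decomposed) form of ch."""
    decomposed = unicodedata.normalize("NFD", ch)
    return [f"U+{ord(c):04X}" for c in decomposed]

# LATIN SMALL LETTER A WITH MACRON: one code point in 10646,
# but Unicode defines a canonical decomposition for it.
a_macron = "\u0101"

# HANGUL SYLLABLE GGA: likewise one code point, decomposable into jamo.
hangul_gga = "\uAE4C"

print(show_decomposition(a_macron))    # ['U+0061', 'U+0304']
print(show_decomposition(hangul_gga))  # ['U+1101', 'U+1161']
```

Whether these sequences count as "alternate representations" depends on
whether one reasons from 10646 code points alone or from Unicode canonical
equivalence, which seems to be the point of disagreement here.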
> 
>   What ISO C99 seems to do is to shift the burden of normalization to
> editors or whatever tool used by programmers to edit source files from
> compilers and linkers.  That's fine(editors can do that) and is perhaps
> a wise decision (preventing potential troubles from propagating thru
> a compiler-linker chain at the earliest stage by issuing an error and
> stopping compilation), but there's a little trouble with allowing only
> precomposed characters. Both ISO/IEC JTC1/SC2/WG2 and UTC would not encode
> any more precomposed characters which can be represented with existing
> base characters followed by one or more combining characters.

Yes, that is an unfortunate policy, but I am not sure it holds.

> However,
> 'combining diacritical marks'(e.g. \u0300 - \u0362) are not allowed in
> identifiers  so that 'any character' that's not encoded as a precomposed
> form can't be used in identifiers. Some people would resent not being able
> to use 'their characters' in identifiers and may use it to make a case for
> encoding precomposed forms of theirs in ISO 10646.  How about references
> to filenames (as in '#include directive') with combining diacritic
> marks that are parts of characters NOT encoded in precomposed form?
> Aha, they can use '\unnnn, or \Unnnnnnnn)...

Filenames are not covered by C99 extended identifiers specifications,
AFAIK.
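
As an illustration of the identifier restriction discussed above, the
following is a rough sketch in Python of how an editor or pre-compile check
might flag combining diacritical marks. The function name
find_combining_marks is hypothetical, and the range U+0300 to U+0362 is
taken from the quoted message rather than from the annex of the standard,
so treat it as an assumption and not as the C99 rule itself.

```python
# Range of combining diacritical marks as cited in the quoted message
# (an assumption for this sketch; the standard's annex is authoritative).
COMBINING_START = 0x0300
COMBINING_END = 0x0362

def find_combining_marks(identifier):
    """Return (index, code point) pairs for combining marks in identifier."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(identifier)
        if COMBINING_START <= ord(ch) <= COMBINING_END
    ]

# Decomposed form: 'a' followed by COMBINING MACRON is flagged.
print(find_combining_marks("a\u0304b"))  # [(1, 'U+0304')]

# Precomposed form: LATIN SMALL LETTER A WITH MACRON passes this check.
print(find_combining_marks("\u0101b"))   # []
```

A tool along these lines would report the problem at the editing stage,
before anything reaches the compiler or linker.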

Kind regards
keld
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
