I am not sure this is on-topic here.

Jungshik Shin wrote:
> Basically ISO C99
> seems to avoid problems arising from multiple representation issues by
> allowing only precomposed characters in identifiers

Correct. This is not a C99 (or C++98) decision, but comes from WG20
(the WG in charge of internationalisation), which issued this
"recommendation" in TR 10176. The "motivation" is to avoid, as far
as possible, the problem of normalisation.

However, there are some differences between the two standards. In C,
we try (hard) to allow conforming implementations to be "UTFx dumb",
i.e. to have some encoding for Unicode on input, to accept any
character (however bullshit it may be, e.g. \u03A2), and to stay
conforming. At the same time, we promote better implementations, able
to distinguish between "bullshit" characters and "correct" ones, and
to reject the former. But to do that, the compiler needs a huge
knowledge of the "gray area" between them, and of all the
compatibility problems, such as decomposition, digits, etc.
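
To make the difference concrete, here is a rough sketch of the kind of
test a "better" implementation would apply and a "dumb" one would skip.
The ranges are only a small Latin/Greek excerpt of the identifier list
as I remember it from Annex D, so take the table as an assumption to be
checked against the standard; the names id_range, sample_ranges and
in_sample_ranges are made up for the example.

```c
#include <stddef.h>

/* Hypothetical helper type, for illustration only. */
struct id_range { unsigned long lo, hi; };

/* Excerpt (Latin and Greek) of the Annex D identifier ranges, quoted
   from memory: an assumption to verify, not the complete list. */
const struct id_range sample_ranges[] = {
    { 0x00C0UL, 0x00D6UL }, { 0x00D8UL, 0x00F6UL }, { 0x00F8UL, 0x01F5UL },
    { 0x038EUL, 0x03A1UL }, { 0x03A3UL, 0x03CEUL }
};

/* Returns 1 if cp lies in one of the sample ranges, 0 otherwise. */
int in_sample_ranges(unsigned long cp)
{
    size_t i;
    size_t n = sizeof sample_ranges / sizeof sample_ranges[0];

    for (i = 0; i < n; i++)
        if (cp >= sample_ranges[i].lo && cp <= sample_ranges[i].hi)
            return 1;
    return 0;
}
```

With this table, \u03A3 is accepted while \u03A2 is rejected, because
03A2 falls in the gap between the two Greek ranges. A "dumb" compiler
simply never calls such a function.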

The result is that the minimum set of programs (the strictly
conforming ones, which must obey all the rules of the standard;
these are intended to be the maximally portable ones, by the way)
should restrict themselves to a set of characters which is intended
to avoid any problem (at least, unless the problems are almost
unavoidable, such as a variable named саѕе, i.e.
\u0441\u0430\u0455\u0435, which only looks like the Latin "case" ...)
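
The look-alike problem can be shown with a few lines of C. This is
only a sketch: it assumes the source is encoded in UTF-8, and the
function names utf8_encode_bmp and same_spelling are invented for the
example. It encodes a sequence of code points and compares the bytes
with an ASCII spelling.

```c
#include <stddef.h>
#include <string.h>

/* Encode one code point below U+10000 (surrogates not checked) as
   UTF-8. Returns the number of bytes written to out (1 to 3). */
size_t utf8_encode_bmp(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80UL) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {
        out[0] = (unsigned char)(0xC0UL | (cp >> 6));
        out[1] = (unsigned char)(0x80UL | (cp & 0x3FUL));
        return 2;
    }
    out[0] = (unsigned char)(0xE0UL | ((cp >> 12) & 0x0FUL));
    out[1] = (unsigned char)(0x80UL | ((cp >> 6) & 0x3FUL));
    out[2] = (unsigned char)(0x80UL | (cp & 0x3FUL));
    return 3;
}

/* Returns 1 if the n code points in cps encode to exactly the bytes
   of the ASCII string ascii, 0 otherwise (or if they do not fit). */
int same_spelling(const unsigned long *cps, size_t n, const char *ascii)
{
    unsigned char buf[64];
    size_t len = 0, i;

    for (i = 0; i < n; i++) {
        if (len + 3 > sizeof buf)
            return 0;               /* too long for this sketch */
        len += utf8_encode_bmp(cps[i], buf + len);
    }
    return len == strlen(ascii) && memcmp(buf, ascii, len) == 0;
}
```

Fed with 0441 0430 0455 0435, same_spelling() reports that the
identifier is not "case": the two names are displayed alike but are
eight bytes against four, and no restriction on the character set
will prevent a programmer from writing such a thing.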


> If that's the case, characters like 'Latin Small Letter with Macron'
> or 'Hangul Syllable Gga' for which there are alternate representations
> should not be present in the list, but they are listed as allowed.

There is no problem with the restricted set, since the alternate
representations are not allowed in portable programs. And "good"
compilers, which extend the standard, are allowed to treat the
alternate representation as identical to the precomposed version
(i.e., they are allowed to use NKC).
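
For the Hangul case, the mapping from the alternate representation to
the precomposed one is purely arithmetic. The following is a sketch
of the syllable composition step as described in the Unicode Standard;
the function name compose_hangul is mine, and a real compiler would of
course do this as part of a complete normalisation pass, not in
isolation.

```c
/* Constants of the Unicode Hangul syllable composition algorithm. */
#define S_BASE  0xAC00UL
#define L_BASE  0x1100UL
#define V_BASE  0x1161UL
#define T_BASE  0x11A7UL
#define L_COUNT 19UL
#define V_COUNT 21UL
#define T_COUNT 28UL

/* Compose a leading consonant l, a vowel v and an optional trailing
   consonant t (pass 0 if absent) into one precomposed syllable.
   Returns 0 if an argument is outside the composable jamo ranges. */
unsigned long compose_hangul(unsigned long l, unsigned long v,
                             unsigned long t)
{
    unsigned long ti = 0;

    if (l < L_BASE || l >= L_BASE + L_COUNT)
        return 0;
    if (v < V_BASE || v >= V_BASE + V_COUNT)
        return 0;
    if (t != 0) {
        if (t <= T_BASE || t >= T_BASE + T_COUNT)
            return 0;
        ti = t - T_BASE;
    }
    return S_BASE + ((l - L_BASE) * V_COUNT + (v - V_BASE)) * T_COUNT + ti;
}
```

For the syllable you mention, compose_hangul(0x1101, 0x1161, 0) gives
0xAE4C, i.e. the precomposed Hangul Syllable Gga. A compiler that goes
this far can accept both spellings and still see one identifier.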


>   What ISO C99 seems to do is to shift the burden of normalization to
> editors or whatever tool used by programmers to edit source files from
> compilers and linkers.

You are missing the purpose of a programming language standard.
It does not intend to "shift the burden". In fact, considering the
time we have spent (and, as implementers, are still spending) on this,
versus the interest from the "customers", this is perhaps an
overworked problem!

But the content of the standard defines at the same time something
(the minimum set, maximally portable) and a framework to implement
the actual solutions, in a way that should allow the best
interoperation, and at the same time the easiest way to use it, and
to implement it too. (Think about the C compilers for embedded
systems, which are required by law to support the ISO standard
because of government requirements, while no one cares about i18n
characters... They really need a cheap solution.)

Another solution is the "validators", i.e. compilers that only accept
strictly conforming programs, logically to assure maximum
portability. The problem is that the rules are so strict that no
useful program (for example, one that uses "open()") can pass...

GCC is definitely something else: it aims at the best support of the
standard, so they do _not_ want to implement the cheapest solution;
they really want something useful, which will go quite a bit further
than the minimum implementation, something that fulfills both the
letter _and_ the spirit of the standard(s). And they will get it, but
it will take some time.


> Both ISO/IEC JTC1/SC2/WG2 and UTC would not encode
> any more precomposed characters which can be represented with existing
> base characters followed by one or more combining characters.

Well... Certainly they do not intend to. But sometimes they get
it wrong. Look at U+17A4 (Khmer QAA). I am sure other examples will come.
For the coming years, perhaps 10 years, Unicode/10646 will be
evolving standards, and so moving targets. We have to deal with that.

> However,
> 'combining diacritical marks'(e.g. \u0300 - \u0362) are not allowed in
> identifiers  so that 'any character' that's not encoded as a precomposed
> form can't be used in identifiers.

Again: they *are* allowed (see 6.4.3: only \uxxxx escapes for
characters of the basic set are explicitly forbidden). But programs
that include them are not "strictly conforming", i.e. not maximally
portable.
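
As a sketch of what 6.4.3 actually forbids, as I read the constraint
there: a universal character name may not designate a code point
below 00A0, except 0024, 0040 and 0060, nor one in the range D800 to
DFFF. The function name ucn_permitted is made up, and the function
checks this one constraint only, not the identifier list.

```c
/* Returns 1 if cp may be written as a universal character name
   under the 6.4.3 constraint as described above, 0 otherwise. */
int ucn_permitted(unsigned long cp)
{
    if (cp >= 0xD800UL && cp <= 0xDFFFUL)
        return 0;                       /* surrogate range */
    if (cp < 0x00A0UL)                  /* basic set and controls */
        return cp == 0x0024UL || cp == 0x0040UL || cp == 0x0060UL;
    return 1;
}
```

So \u0041 is rejected, while the combining mark \u0300 passes this
test; whether a given compiler then accepts it in an identifier is the
portability question discussed above.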


Antoine


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
