I am not sure this is on-topic here.

Jungshik Shin wrote:
> Basically ISO C99
> seems to avoid problems arising from multiple representation issues by
> allowing only precomposed characters in identifiers

Correct. This is not a C99 (or C++98) decision; it comes from WG20 (the WG in charge of internationalisation), which issued this "recommendation" in TR 10176. The "motivation" is to avoid the problem of normalisation as far as possible.

However, there are some differences between the two standards. In C, we try (hard) to allow conforming implementations to be "UTFx dumb", i.e. to have some encoding for Unicode on input, to accept any character (however bullshit it may be, e.g. \u03A2), and still be conforming. At the same time we promote better implementations, able to distinguish between "bullshit" characters and "correct" ones, and to reject the former. But to do that, the compiler needs a huge amount of knowledge about the "gray area" between them, and about all the compatibility problems, such as decomposition, digits, etc.

The result is that the minimum set of programs (the strictly conforming ones, which must obey all the rules of the standards; these are intended to be the maximally portable ones, by the way) should restrict themselves to a set of characters which is intended to avoid any problem (except the ones that are almost unavoidable, such as a variable named саѕе, i.e. \u0441\u0430\u0455\u0435 ...)

> If that's the case, characters like 'Latin Small Letter with Macron'
> or 'Hangul Syllable Gga' for which there are alternate representations
> should not be present in the list, but they are listed as allowed.

There is no problem with the restricted set, since the alternate representations are not allowed in portable programs. And "good" compilers, which extend the standard, are allowed to treat the alternate representation as identical to the precomposed version (i.e., they are allowed to use NKC).

> What ISO C99 seems to do is to shift the burden of normalization to
> editors or whatever tool used by programmers to edit source files from
> compilers and linkers.

You are missing the purpose of a programming language standard. It does not intend to "shift the burden".
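
To make the "alternate representation" issue concrete, here is a minimal sketch of what a "UTFx dumb" implementation would see. It assumes UTF-8 input and a purely bytewise comparison of identifier spellings; the helper name and the choice of "a with macron" as the example letter are mine, for illustration only, and nothing here comes from the standard.

```c
#include <string.h>

/* Hypothetical helper: compares two identifier spellings byte by byte,
 * with no normalisation at all. Assumes both spellings are UTF-8. */
static int same_identifier_bytewise(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* U+0101 LATIN SMALL LETTER A WITH MACRON, precomposed: bytes C4 81 */
static const char precomposed[] = "\xC4\x81";

/* U+0061 'a' followed by U+0304 COMBINING MACRON: bytes 61 CC 84 */
static const char decomposed[] = "a\xCC\x84";
```

With this comparison, same_identifier_bytewise(precomposed, decomposed) is false, so the two spellings would name two different identifiers even though they display the same. A "good" compiler that normalises before comparing would treat them as one.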
In fact, considering the time we have spent on this (and, as implementers, are still spending), versus the interest from the "customers", this is perhaps an overworked problem! But the content of the standard defines at the same time something concrete (the minimum set, maximally portable) and a framework in which to implement the actual solutions, in a way that should allow the best interoperation, and at the same time be the easiest to use, and to implement too. (Think about the C compilers for embedded systems, which are required by law to support the ISO standard because of government requirements, while no-one there cares about i18n characters... They really need a cheap solution.)

Another solution is the "validators", i.e. compilers that only accept strictly conforming programs, logically to ensure maximum portability. The problem is that the rules are so strict that no useful program (for example, one that uses "open()") can pass...

GCC is definitely something else. It aims at the best support of the standard, so they do _not_ want to implement the cheapest solution; they really want something useful, which will go quite a bit further than the minimum implementation, something that fulfills both the letter _and_ the spirit of the standard(s). And they will get it, but it will take some time.

> Both ISO/IEC JTC1/SC2/WG2 and UTC would not encode
> any more precomposed characters which can be represented with existing
> base characters followed by one or more combining characters.

Well... Certainly that is not their intention. But sometimes they get it wrong. Look at U+17A4 (Khmer QAA). I am sure other examples will come. For the coming years, perhaps 10 years, Unicode/10646 will be evolving standards, hence moving targets. We have to deal with that.

> However,
> 'combining diacritical marks'(e.g. \u0300 - \u0362) are not allowed in
> identifiers so that 'any character' that's not encoded as a precomposed
> form can't be used in identifiers.

Again: they *are* allowed (look at 6.4.3: only \uxxxx for characters in the basic set is explicitly forbidden). But programs that include them are not "strictly conforming", i.e. not maximally portable.

Antoine

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
