John Meacham wrote:
> > > > Sure, but as I've been saying, the implementation of glibc doesn't do
> > > > this. In the C or POSIX locale, the ctype macros only recognise ASCII.
> > >
> > > Should this be considered a bug in glibc?
> > >
> > > hmm.. how odd. I would consider it a bug, I think. I don't have a copy
> > > of the ISO spec handy but will be sure to look up whether that is
> > > conforming... It is certainly a malfeature if it is not a bug...
> >
> > It certainly isn't a violation of ANSI/ISO C; that simply states that
> > "The behavior of these functions is affected by the LC_CTYPE category
> > of the current locale". It's perfectly legal for the implementation to
> > use different wide encodings depending upon the locale.
>
> no, glibc #defines __STDC_ISO_10646__, so wchar_t is guaranteed to
> hold UCS-4 values, independent of locale.
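For concreteness, the behaviour in question can be demonstrated with a
few lines of C (a sketch assuming glibc and an installed en_US.UTF-8
locale; the locale name is only an example):

    #include <locale.h>
    #include <stdio.h>
    #include <wctype.h>

    #ifndef __STDC_ISO_10646__
    #error "wchar_t values are not guaranteed to be ISO 10646 code points"
    #endif

    int main(void)
    {
        wint_t c = 0x00E9;  /* U+00E9 LATIN SMALL LETTER E WITH ACUTE */

        /* In the C/POSIX locale, glibc classifies only ASCII. */
        setlocale(LC_CTYPE, "C");
        printf("C locale:     iswalpha(U+00E9) = %d\n", iswalpha(c) != 0);

        /* In a UTF-8 locale, the same code point is alphabetic. */
        if (setlocale(LC_CTYPE, "en_US.UTF-8") != NULL)
            printf("UTF-8 locale: iswalpha(U+00E9) = %d\n", iswalpha(c) != 0);
        else
            fprintf(stderr, "en_US.UTF-8 locale not installed\n");

        return 0;
    }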
OK; although the draft which I have only says:

    __STDC_ISO_10646__ A decimal constant of the form yyyymmL      |
                       (for example, 199712L), intended to         |
                       indicate that values of type wchar_t are    |
                       the coded representations of the            |
                       characters defined by ISO/IEC 10646,        |
                       along with all amendments and technical     |
                       corrigenda as of the specified year and     |
                       month.                                      |

That's the only reference to that macro in the entire document. It
doesn't explicitly contradict (or even reference) the comments about
the semantics of the <wctype.h> functions.

> the LC_CTYPE only affects what multibyte encoding is used. What was
> curious was that the character classification routines changed
> behavior based on LC_CTYPE (despite the encoding still being UCS-4).
>
> this might make sense for the classification routines dealing with
> upper and lower case, actually, since I believe that might depend on
> the language you are expressing. however, other character
> classification routines (such as wcwidth) should not depend on the
> current locale.

There are some variations between wcwidth() implementations; e.g. the
XFree86 version of xterm includes two implementations, and the comment:

     * The following functions are the same as mk_wcwidth() and
     * mk_wcwidth_cjk(), except that spacing characters in the East Asian
     * Ambiguous (A) category as defined in Unicode Technical Report #11
     * have a column width of 2. This variant might be useful for users of
     * CJK legacy encodings who want to migrate to UCS without changing
     * the traditional terminal character-width behaviour. It is not
     * otherwise recommended for general use.

I suppose that it's possible that some systems might wish to make the
behaviour locale-dependent. However, this is all a long way from the
glibc behaviour, i.e. that for the C/POSIX locale, and for locales
without an LC_CTYPE data file, everything outside of the ASCII range
is undefined (not a member of any category, not translated by
towupper() etc).

> it is unclear what the correct thing for a Haskell implementation to
> do is. possibilities are:
>
> 1) determine some locale-independent semantics for the classification
>    functions and implement that
>
> 2) guarantee the validity of the character classification routines
>    only when the character is representable in the current locale
>
> 3) link against another library such as libunicode which provides its
>    own classification routines (this could be done optionally at
>    compile time...)
>
> 4) split the classification routines into locale-dependent and
>    locale-independent ones, guarantee that the locale-independent
>    ones will always work, and use one of the two above solutions for
>    the rest...
>
> In any case, solution 2 seems to be what we have now, which is
> probably an okay interim solution as long as we add an
> isRepresentable to determine whether a Char can be expressed in the
> current locale and whether we can trust the classification
> functions... I have an implementation of one in the CWString library
> I posted earlier...
>
> in any case, anything is better than the current 'ignore the locale'
> situation :)

Not necessarily. E.g. there are reasons why most programs don't just
call setlocale(LC_ALL, "") to make everything behave according to the
locale settings. I18N complicates many things sufficiently that I
would favour forcing the programmer to explicitly ask for it (i.e.
don't change the semantics of any existing functions, but provide new
functions for use in internationalised code).
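As a concrete example of the sort of breakage involved: adopting the
user's locale wholesale changes the behaviour of existing stdio
functions behind the program's back. A sketch (the de_DE.UTF-8 locale
name is illustrative; any decimal-comma locale shows the same effect):

    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        /* In the default C locale, the decimal separator is ".". */
        printf("%.2f\n", 3.14);           /* prints "3.14" */

        /* After adopting a locale with a decimal comma, the same call
           prints "3,14" -- which can silently break machine-readable
           output and the corresponding scanf() parsing. */
        if (setlocale(LC_ALL, "de_DE.UTF-8") != NULL)
            printf("%.2f\n", 3.14);       /* prints "3,14" */

        return 0;
    }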
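Coming back to option 2: a representability test along the lines of
the isRepresentable mentioned above can be expressed at the C level by
asking whether wcrtomb() can encode the character in the current
locale. This is a sketch of one plausible implementation, not the
CWString code:

    #include <limits.h>
    #include <stdbool.h>
    #include <string.h>
    #include <wchar.h>

    /* True iff wc can be encoded in the current locale's multibyte
       encoding. In the C/POSIX locale under glibc this rejects
       everything outside the ASCII range, matching the classification
       behaviour described above. */
    static bool is_representable(wchar_t wc)
    {
        char buf[MB_LEN_MAX];
        mbstate_t state;

        memset(&state, 0, sizeof state);
        return wcrtomb(buf, wc, &state) != (size_t)-1;
    }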
--
Glynn Clements <[EMAIL PROTECTED]>