On Thu, Aug 28, 2003 at 05:34:09AM +0100, Glynn Clements wrote:
> John Meacham wrote:
> > > > > > In our new implementation of Data.Char.isUpper and friends, I made the
> > > > > > simplifying assumption that Char==wchar_t==Unicode.  With glibc, this
> > > > > > appears to be valid as long as (a) you set LANG to something other than
> > > > > > "C" or "POSIX", and (b) you call setlocale() first.
> > > > > The glibc Info file says:
> > > > >       The wide character character set always is UCS4, at least on
> > > > >       GNU systems.
> > > > yes. with glibc, wchar_t is always unicode no matter what the locale.
> > > > better yet, all ISO C implementations  define a handy C symbol to test
> > > > for this. if __STDC_ISO_10646__ is defined then wchar_t is always
> > > > unicode no matter what.
> > > 
> > > Sure, but as I've been saying, the implementation of glibc doesn't do
> > > this.  In the C or POSIX locale, the ctype macros only recognise ASCII.
> >  
> > > Should this be considered a bug in glibc?
> > 
> > hmm.. how odd. I would consider it a bug, I think. I don't have a copy
> > of the ISO spec handy but will be sure to look up whether that is
> > conforming... It is certainly a malfeature if it is not a bug...
> It certainly isn't a violation of ANSI/ISO C; that simply states that
> "The behavior of these functions is affected by the LC_CTYPE category
> of the current locale". It's perfectly legal for the implementation to
> use different wide encodings depending upon the locale.

no, glibc #defines __STDC_ISO_10646__, so wchar_t is guaranteed to
hold UCS-4 values regardless of the locale; LC_CTYPE only affects
what multibyte encoding is used. What was curious was that the character
classification routines changed behavior based on LC_CTYPE (despite the
wide encoding still being UCS-4).

this might actually make sense for the classification routines dealing
with upper and lower case, since I believe case might depend on the
language being expressed. however, other character classification
routines (such as wcwidth) should not depend on the current locale.

it is unclear what the correct thing for a Haskell implementation to
do is. possibilities are:
1) determine some locale-independent semantics for the classification
functions and implement that
2) guarantee the validity of the character classification routines only
when the character is representable in the current locale
3) link against another library such as libunicode which provides its
own classification routines (this could be done optionally at compile
time)
4) split the classification routines into locale-dependent and
locale-independent ones, guarantee the locale-independent ones will
always work, and apply one of the solutions above to the rest...

In any case, solution 2 seems to be what we have now, which is probably
an okay interim solution as long as we add an isRepresentable function
to determine whether a Char can be expressed in the current locale and
hence whether we can trust the classification functions... I have an
implementation of one in the CWString library I posted earlier...

in any case, anything is better than the current 'ignore the locale'
situation :)


John Meacham - California Institute of Technology, Alum. - [EMAIL PROTECTED]
FFI mailing list
