On Thu, 16 Sep 2004, Udo Stenzel wrote:
> Having a separate byte-based API is far better. If you don't know the
> encoding, all you have is bytes, no text.
Okay, after reading large chunks of this discussion, I'm going to rock the
boat a bit by suggesting that bytes *are* text, and *do* belong in the Char
type, and hence that the current binary file API is actually correct, after
a fashion. In fact, I think we can resolve many of the problems in this
thread by abandoning the conceptual distinction between characters and bytes.

Suppose I invoke

    gcc -o XXX YYY.c

where XXX and YYY are strings of Japanese characters. It has been pointed
out that if GCC treats its filename arguments as opaque byte strings,
passing them back unchanged to the appropriate file-opening functions, then
it will work even if the current locale isn't Japanese. But that's only
true on Posix-like systems. On NT, filenames are made of Unicode code
points, and argv is encoded according to the current locale. If GCC uses
argv, it will fail on the example above. I've run into this problem many
times on my desktop XP box, which uses a US-English locale but contains
some filenames with Japanese characters in them.

But in any case GCC's arguments aren't really opaque: it needs to check
each argument to see whether it's an option, and it needs to look at the
extensions of files like YYY.c to figure out which subprogram to invoke.
Nevertheless, the opaque-filename approach does work on Posix, because --
this is the important bit -- the characters GCC cares about (like '-', 'o',
'.', and 'c') have the same representation in every encoding it is likely
to meet. In other words, the character encoding is neither transparent nor
opaque to GCC, but sort of "band-limited": it understands the values from
0 to 127, while the higher values remain mysterious to it. They could be
Latin-1 code points; they could be EUC half-characters; they could be
Unicode code points. It doesn't know, and it doesn't *need* to know. It
will fail only if given an encoding that doesn't follow this rule (e.g.
EBCDIC).

We could make GCC (were it implemented in Haskell) work with all filenames
on both major platforms, without platform-specific code, by representing
command-line arguments and pathnames as Strings = [Char]s, where Char
covers the byte values 0-255 on Posix but the UTF-16 code units on Win32.
Clearly this is very fragile, but the type system provides a solution:

    newtype {- TransASCIIEncoding a => -} Char a = Chr Word32
    type String a = [Char a]

    class TransASCIIEncoding a where
      maxValueUsedByEncoding :: a -> Word32  -- dummy argument

    instance TransASCIIEncoding Unicode     where ...
    instance TransASCIIEncoding UTF16       where ...
    instance TransASCIIEncoding UTF8        where ...
    instance TransASCIIEncoding GenericByte where ...

    'x'      :: Char a
    '\u1234' :: Char Unicode
    '\q789'  :: Char WeirdCompilerSupportedEncoding

    instance TransASCIIEncoding a => Bounded (Char a) where
      minBound = Chr 0
      maxBound = Chr (maxValueUsedByEncoding (undefined :: a))

    class CharTranscoding a b where
      transcode :: String a -> String b

    ord :: Char a -> Maybe Int  -- Nothing if the argument isn't ASCII
    ordUnicode :: Char Unicode -> Int

Obvious problems: backward compatibility, and codings like ISO 2022 and
Shift-JIS that break the fundamental assumption (their multibyte sequences
reuse byte values below 128). I don't think either problem is fatal. A
more flexible subtyping mechanism would also be nice, so that (e.g.)
byte-writing functions could accept any Char type with a sufficiently
small maxValueUsedByEncoding.

-- Ben
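P.S. To make the "band-limited" idea concrete, here is one way the sketch
might be rendered in actual GHC Haskell today. The names (AChar, ascii,
ordA, and so on) are illustrative stand-ins of my own, renamed so they
don't shadow the Prelude; this is a sketch of the idea, not a worked-out
library:

    {-# LANGUAGE ScopedTypeVariables #-}

    import Data.Word (Word32)

    -- The phantom parameter 'enc' records the (trans-ASCII) encoding;
    -- the payload is just a code value in that encoding.
    newtype AChar enc = AChr Word32 deriving (Eq, Show)
    type AString enc = [AChar enc]

    -- Encoding tags; they carry no runtime data.
    data Byte     -- Posix: opaque bytes, values 0-255
    data UTF16    -- Win32: UTF-16 code units
    data Unicode  -- full code points

    class TransASCII enc where
      maxCode :: enc -> Word32  -- dummy argument, as in the sketch

    instance TransASCII Byte    where maxCode _ = 0xFF
    instance TransASCII UTF16   where maxCode _ = 0xFFFF
    instance TransASCII Unicode where maxCode _ = 0x10FFFF

    instance TransASCII enc => Bounded (AChar enc) where
      minBound = AChr 0
      maxBound = AChr (maxCode (undefined :: enc))

    -- Inject a genuinely ASCII character into any trans-ASCII encoding;
    -- this is the only way band-limited code creates characters.
    ascii :: Char -> AChar enc
    ascii c | fromEnum c < 128 = AChr (fromIntegral (fromEnum c))
            | otherwise        = error "ascii: not an ASCII character"

    -- The partial ord from the sketch: Nothing outside the shared range.
    ordA :: AChar enc -> Maybe Int
    ordA (AChr w) | w < 128   = Just (fromIntegral w)
                  | otherwise = Nothing

    -- The gcc-style tests: these typecheck (and behave identically) at
    -- *any* encoding, because they only ever look at values below 128.
    isOption :: AString enc -> Bool
    isOption arg = take 1 arg == [ascii '-']

    hasExtension :: String -> AString enc -> Bool
    hasExtension ext name = isSuffix (map ascii ('.' : ext)) name
      where isSuffix s xs = drop (length xs - length s) xs == s

    main :: IO ()
    main = do
      -- Pretend this came from Posix argv: the first three bytes happen
      -- to be UTF-8 for a Japanese character, but we never decode them.
      let arg = map AChr [0xE6, 0x97, 0xA5, 0x2E, 0x63] :: AString Byte
      print (isOption arg)          -- False
      print (hasExtension "c" arg)  -- True, high bytes left untouched
      print (map ordA arg)          -- [Nothing,Nothing,Nothing,Just 46,Just 99]
      print (maxBound :: AChar UTF16)

The point of the demo is that isOption and hasExtension never construct or
inspect a character above 127, so the same code is correct whether the
phantom parameter says "Posix bytes" or "NT UTF-16 units".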
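Continuing the same module, the transcoding class might come out something
like the following. Note the instance is only honest under the explicit
assumption that the bytes are Latin-1, whose 256 codes coincide with the
first 256 Unicode code points -- exactly the kind of assumption the types
now force us to write down:

    class Transcodes a b where
      transcode :: AString a -> AString b

    instance Transcodes Byte Unicode where
      transcode = map (\(AChr w) -> AChr w)  -- identity on codes; Latin-1 only

The reverse direction (Unicode to Byte) can't be total, which suggests
transcode should perhaps return a Maybe, or that total and partial
transcodings want separate classes; I haven't thought that through.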