Re: [Haskell-cafe] Writing binary files?
On Mon, Sep 13, 2004 at 12:01:58PM +0100, Glynn Clements wrote:

> My view is that, right now, we have the worst of both worlds, and
> taking a short step backwards (i.e. narrow the Char type and leave the
> rest alone) is a lot simpler (and more feasible) than the long journey
> towards real I18N.

This being Haskell, I can't imagine a consensus on a step backwards. In any case, a Char type distinct from bytes is the most valuable part of the current situation. The rest is just libraries, and the solution to that is to create other libraries. (It's true that the Prelude is harder to work around, but even that can be done, as with the new exception interface.)

Indeed, more than one approach can proceed concurrently, and that's probably what's going to happen.

The Right Thing proceeds in stages:

1. new byte-based libraries
2. conversions sitting on top of these
3. the ultimate I18N API

The Quick Fix: alter the existing implementation to use the encoding determined by the current locale at the borders. When the Right Thing is finished, the Quick Fix can be recast as a special case.

The Right Thing might take a very long (possibly infinite) time, because this is the sort of thing people can argue about endlessly. Still, the first stage would deal with most of the scenarios you raised. It just needs a group of people who care about it to get together and do it.

The Quick Fix is the most logical implementation of the current definition of Haskell, and entirely consistent with its general philosophy of presenting the programmer with an idealized (some might say oversimplified) model of computation. From the start, Haskell has supported only character-based I/O, with whatever translations were required to present a uniform view on all platforms. And that's not an entirely bad thing: it won't work all the time, but it will be simple, and good enough for most people. Its existence will not rule out binary I/O or more sophisticated alternatives.
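The staged plan sketches naturally in today's Haskell. A minimal, hypothetical illustration of stages 1 and 2 (the names here are invented for illustration, not an existing library): byte-based I/O would traffic in [Word8], with conversions layered on top, ISO-8859-1 being the simplest because its decoder cannot fail.

```haskell
import Data.Word (Word8)
import Data.Char (chr, ord)

-- Stage 2 of the Right Thing: a conversion sitting on top of
-- byte-based I/O. ISO-8859-1 (Latin-1) is the simplest possible codec:
-- each byte maps directly to the Unicode code point of the same value,
-- so decoding never fails and a round trip reproduces the bytes exactly.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)
```

A stage-1 library would supply the [Word8] from a handle; the point here is only that the codec is an ordinary pure function, so "other libraries" can be built without touching the Prelude.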
Those who need more may be motivated to help finish the Right Thing.

_______________________________________________
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Writing binary files?
I've not been following this debate, but I think I agree with Ross. In particular, the idea of narrowing the Char type really seems like a bad idea to me (if I understand the intent correctly).

Not so long ago, I did a whole load of work on the HaXml parser so that, among other things, it would support UTF-8 and UTF-16 Unicode (as required by the XML spec). Doing this depends upon having a Char type that can represent the full repertoire of Unicode characters. Other languages have been forced into this (maybe painful) transition; I don't think Haskell can reasonably go backwards if it is to have any hope of surviving.

#g
--

At 12:31 15/09/04 +0100, [EMAIL PROTECTED] wrote:

> On Mon, Sep 13, 2004 at 12:01:58PM +0100, Glynn Clements wrote:
> > My view is that, right now, we have the worst of both worlds, and
> > taking a short step backwards (i.e. narrow the Char type and leave
> > the rest alone) is a lot simpler (and more feasible) than the long
> > journey towards real I18N.
>
> This being Haskell, I can't imagine a consensus on a step backwards.
> [...]

Graham Klyne
For email: http://www.ninebynine.org/#Contact
Re: [Haskell-cafe] Writing binary files?
Glynn Clements [EMAIL PROTECTED] writes:

> Unless you are the sole user of a system, you have no control over
> what filenames may occur on it (and even if you are the sole user, you
> may wish to use packages which don't conform to your rules).

For these occasions you may set the encoding to ISO-8859-1. But then you can't sensibly show them to the user in a GUI, nor in ncurses using the wide character API, nor can you sensibly store them in a file which is always to be encoded in UTF-8 (e.g. an XML file, where you can't put raw bytes without knowing their encoding).

There are two paradigms: manipulating bytes without knowing their encoding, and manipulating characters explicitly encoded in various encodings (possibly UTF-8). The world is slowly migrating from the first to the second.

> There are limits to the extent to which this can be achieved. E.g.
> what happens if you set the encoding to UTF-8, then call
> getDirectoryContents for a directory which contains filenames which
> aren't valid UTF-8 strings?

The library fails. Don't do that. This environment is internally inconsistent.

> Call it what you like, it's a reality, and one which programs need to
> deal with.

The reality is that filenames are encoded in different encodings depending on the system. Sometimes it's ISO-8859-1, sometimes ISO-8859-2, sometimes UTF-8. We should not ignore the possibility of UTF-8-encoded filenames.

In CLisp it fails silently (undecodable filenames are skipped), which is bad. It should fail loudly.

> Most programs don't care whether any filenames which they deal with
> are valid in the locale's encoding (or any other encoding). They just
> receive lists (i.e. NUL-terminated arrays) of bytes and pass them
> directly to the OS or to libraries.

And this is why I can't switch my home environment to UTF-8 yet. Too many programs are broken: almost all terminal programs which use more than stdin and stdout in default modes, i.e. which use line editing or work in full screen.
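The "fail loudly" behaviour argued for here can be pushed into the decoder's type. A hypothetical sketch, not an existing library function; only 1- and 2-byte UTF-8 sequences are handled, for brevity:

```haskell
import Data.Word (Word8)
import Data.Char (chr)
import Data.Bits ((.&.), (.|.), shiftL)

-- A decoder that fails loudly: Nothing on any byte sequence that is
-- not well-formed UTF-8, instead of silently skipping undecodable
-- input. Only 1- and 2-byte sequences are covered in this sketch.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just ""
decodeUtf8 (b:bs)
  | b < 0x80 = (chr (fromIntegral b) :) <$> decodeUtf8 bs
decodeUtf8 (b1:b2:bs)
  | b1 .&. 0xE0 == 0xC0 && b2 .&. 0xC0 == 0x80 =
      let n = (fromIntegral (b1 .&. 0x1F) `shiftL` 6)
              .|. fromIntegral (b2 .&. 0x3F)
      in if n >= 0x80                    -- reject overlong encodings
           then (chr n :) <$> decodeUtf8 bs
           else Nothing
decodeUtf8 _ = Nothing                   -- invalid or unsupported sequence
```

For example, the ISO-8859-2 byte 0xB1 on its own is not valid UTF-8, so the decoder reports failure rather than guessing.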
How would you display a filename in a full-screen text editor, such that it works in a UTF-8 environment?

> If the assumed encoding is ISO-8859-*, this program will work
> regardless of the filenames which it is passed or the contents of the
> file (modulo the EOL translation on Windows). OTOH, if it were to use
> UTF-8 (e.g. because that was the locale's encoding), it wouldn't work
> correctly if either the filename or the file's contents weren't valid
> UTF-8.

A program is not supposed to encounter filenames which are not representable in the locale's encoding. In your setting it's impossible to display a filename in a way other than printing to stdout.

> More accurately, it specifies which encoding to assume when you *need*
> to know the encoding (i.e. ctype.h etc.), but you can't obtain that
> information from a more reliable source.

In the case of filenames there is no more reliable source.

> My central point is that the existing API forces the encoding to be an
> issue when it shouldn't be.

It is an unavoidable issue because not every interface in a given computer system uses the same encoding. Gtk+ uses UTF-8: you must convert text to UTF-8 in order to display it, and in order to convert you must know its encoding.

> Well, to an extent it is an implementation issue. Historically, curses
> never cared about encodings. A character is a byte, you draw bytes on
> the screen, curses sends them directly to the terminal.

This is the old API. But the newer ncurses API is prepared even for combining accents: a character is coded as a sequence of wchar_t values, such that all except the first one are combining characters.

> Furthermore, the curses model relies upon monospaced fonts, and falls
> down once you encounter CJK text (where a monospaced font means one
> whose glyphs are an integer multiple of the cell size, not necessarily
> a single cell).

It doesn't fall down. Characters may span several columns: there is wcwidth(), and the curses specification in X/Open says how it should behave for wide CJK characters. I haven't tested it, but I believe ncurses supports them.

> Extending something like curses to handle encoding issues is far from
> trivial; which is probably why it hasn't been finished yet.

It's almost finished. The API specification was ready in 1997. It works in ncurses modulo unfixed bugs. But programs can't use it unless they use Unicode internally.

> Although, if you're going to have implicit String -> [Word8]
> converters, there's no reason why you can't do the reverse, and have
> isAlpha :: Word8 -> IO Bool. Although, like ctype.h, this will only
> work for single-byte encodings.

We should not ignore multibyte encodings like UTF-8, which means that Haskell should have a Unicode character type. And it's already specified in Haskell 98 that Char is such a type! What is missing is an API for manipulating binary files, and for conversion between byte streams and character streams using particular text encodings.

A mail client is expected to respect the encoding set in
Re: [Haskell-cafe] Writing binary files?
Marcin 'Qrczak' Kowalczyk wrote:

> > Unless you are the sole user of a system, you have no control over
> > what filenames may occur on it (and even if you are the sole user,
> > you may wish to use packages which don't conform to your rules).
>
> For these occasions you may set the encoding to ISO-8859-1. But then
> you can't sensibly show them to the user in a GUI, nor in ncurses
> using the wide character API, nor can you sensibly store them in a
> file which is always to be encoded in UTF-8 (e.g. an XML file, where
> you can't put raw bytes without knowing their encoding).

If you need to preserve the data exactly, you can use octal escapes (\337), URL encoding (%DF) or similar. If you don't, you can just approximate it (e.g. display unrepresentable characters as ?). But this is an inevitable consequence of filenames being bytes rather than chars.

[Actually, regarding on-screen display, this is also an issue for Unicode. How many people actually have all of the Unicode glyphs? I certainly don't.]

> There are two paradigms: manipulating bytes without knowing their
> encoding, and manipulating characters explicitly encoded in various
> encodings (possibly UTF-8). The world is slowly migrating from the
> first to the second.

This migration isn't a process which will ever be complete. There will always be plenty of cases where bytes really are just bytes. And even to the extent that it can be done, it will take a long time. Outside of the Free Software ghetto, long-term backward compatibility still means a lot.

[E.g.: EBCDIC has been in existence longer than I have and, in spite of the fact that it's about the only widely-used encoding in existence which doesn't have ASCII as a subset, it shows no sign of dying out any time soon.]

> > There are limits to the extent to which this can be achieved. E.g.
> > what happens if you set the encoding to UTF-8, then call
> > getDirectoryContents for a directory which contains filenames which
> > aren't valid UTF-8 strings?
>
> The library fails. Don't do that. This environment is internally
> inconsistent.

Call it what you like, it's a reality, and one which programs need to deal with.

> The reality is that filenames are encoded in different encodings
> depending on the system. Sometimes it's ISO-8859-1, sometimes
> ISO-8859-2, sometimes UTF-8. We should not ignore the possibility of
> UTF-8-encoded filenames.

I'm not suggesting we do.

> In CLisp it fails silently (undecodable filenames are skipped), which
> is bad. It should fail loudly.

No, it shouldn't fail at all.

Most programs don't care whether any filenames which they deal with are valid in the locale's encoding (or any other encoding). They just receive lists (i.e. NUL-terminated arrays) of bytes and pass them directly to the OS or to libraries.

> And this is why I can't switch my home environment to UTF-8 yet. Too
> many programs are broken; almost all terminal programs which use more
> than stdin and stdout in default modes, i.e. which use line editing or
> work in full screen. How would you display a filename in a full-screen
> text editor, such that it works in a UTF-8 environment?

So, what are you suggesting? That the whole world switches to UTF-8? Or that every program should pass everything through iconv() (and handle the failures)? Or what?

> > If the assumed encoding is ISO-8859-*, this program will work
> > regardless of the filenames which it is passed or the contents of
> > the file (modulo the EOL translation on Windows). OTOH, if it were
> > to use UTF-8 (e.g. because that was the locale's encoding), it
> > wouldn't work correctly if either the filename or the file's
> > contents weren't valid UTF-8.
>
> A program is not supposed to encounter filenames which are not
> representable in the locale's encoding.

Huh? What does "supposed to" mean in this context? That everything would be simpler if reality wasn't how it is? If that's your position, then my response is essentially: yes, but so what?

> In your setting it's impossible to display a filename in a way other
> than printing to stdout.

Writing to stdout doesn't amount to displaying anything; stdout doesn't have to be a terminal.

> > More accurately, it specifies which encoding to assume when you
> > *need* to know the encoding (i.e. ctype.h etc.), but you can't
> > obtain that information from a more reliable source.
>
> In the case of filenames there is no more reliable source.

Sure; but that doesn't automatically mean that the locale's encoding is correct for any given filename.

The point is that you often don't need to know the encoding. Converting a byte string to a character string when you're just going to be converting it back to the original byte string is pointless. And it introduces unnecessary errors. If the only difference between (decode . encode) and the identity function is that the former sometimes fails, what's the point?

> > My central point is that the existing API forces the encoding to be
> > an issue when it shouldn't be.
>
> It is an unavoidable
Re: [Haskell-cafe] Writing binary files?
Udo Stenzel wrote:

> > Note that this needs to include all of the core I/O functions, not
> > just reading/writing streams. E.g. FilePath is currently an alias
> > for String, but (on Unix, at least) filenames are strings of bytes,
> > not characters. Ditto for argv, environment variables, possibly
> > other cases which I've overlooked.
>
> I don't think so. They all are sequences of CChars, and C isn't
> particularly known for keeping bytes and chars apart.

CChar is a C char, which is a byte (not necessarily an octet, and not necessarily a character either).

> I believe Windows NT has (alternate) filename handling functions that
> use Unicode strings.

Almost all of the Win32 API functions which handle strings exist in both char and wide-char versions.

> This would strengthen the view that a filename is a sequence of
> characters.

It would be reasonable to make FilePath equivalent to String on Windows, but not on Unix.

> Ditto for argv, env, whatnot; they are typically entered from the
> shell and therefore are characters in the local encoding.

Both argv and envp are char**, i.e. lists of byte strings. There is no guarantee that the values can be successfully decoded according to the locale's encoding. The environment is typically set on login, and inherited thereafter. It's typically limited to ASCII, but this isn't guaranteed.

Similarly, a program may need to access files which it didn't create, and which have filenames which aren't valid strings according to its locale. E.g. a user may choose a locale which uses UTF-8, but the sysadmin has installed files with ISO-8859-1 filenames. If a Haskell program tries to coerce everything to String using the user's locale, the program will be unable to access such files.

> > > 3. The default encoding is settable from Haskell, defaults to
> > > ISO-8859-1.
> >
> > Agreed.
>
> Oh no, please don't do that. A global, settable encoding is, well,
> dys-functional. Hidden state makes programs hard to understand and
> Haskell imho shouldn't go that route.

There's already plenty of hidden state in the system libraries upon which a Haskell program depends.

> And please don't introduce the notion of a default encoding.

It isn't an issue of *introducing* it. Many Haskell98 functions (i.e. much of IO, System and Directory) accept or return Strings, yet have to be implemented on top of an OS which accepts or provides char*s. There *has* to be an encoding between the two, and currently it's hardwired to ISO-8859-1.

The alternative to a global encoding is for *all* functions which interface to the OS to always either accept or return [CChar] or, if they accept or return Strings, accept an additional argument which specifies the encoding.

Also, bear in mind that the functions under discussion are all I/O functions which, by their nature, deal with state (e.g. the state of the filesystem).

> I'd like to see the following:
>
> - Duplicate the IO library. The duplicate should work with [Byte]
>   everywhere where the old library uses String. Byte is some suitable
>   unsigned integer; on most (all?) platforms this will be Word8.

Technically it should be CChar. However, it's fairly safe to assume that a byte will always be 8 bits; almost nobody writes code which works on systems where it isn't.

However: if we go this route, I suspect that we will also need a convenient method for specifying literal byte strings in Haskell source code.

> - Provide an explicit conversion between encodings. A simple
>   conversion of type [Word8] -> String would suit me; iconv would
>   provide all that is needed.

For the general case, you need to allow for stateful encodings (e.g. ISO-2022). Actually, even UTF-8 needs to deal with state if you need to decode byte streams which are split into chunks and the breaks can occur in the middle of a character (e.g. if you're using non-blocking I/O).

> - iconv takes names of encodings as arguments. Provide some names as
>   constants: one name for the internal encoding (probably UCS4), one
>   name for the canonical external encoding (probably locale
>   dependent).
>
> - Then redefine the old IO API in terms of the new API and appropriate
>   conversions.

The old API requires an implicit encoding. The OS accepts or provides bytes, the old API functions accept or return Chars, and the old API functions don't accept an encoding argument. This is why we are (or, at least, I am) suggesting a settable current encoding: the existing API *needs* a current encoding, and I'm assuming that there may be some reluctance to just discarding it completely.

> While we're at it, do away with the annoying CR/LF problem on Windows;
> this should simply be part of the local encoding. This way a file can
> always be opened as binary, and hSetBinary can be dropped. (This won't
> work on ancient platforms where text files and binary files are
> genuinely different, but these are probably not interesting anyway.)

Apart from OS-specific issues, it would be useful to treat EOL conventions
Re: [Haskell-cafe] Writing binary files?
Glynn Clements [EMAIL PROTECTED] writes:

> [Actually, regarding on-screen display, this is also an issue for
> Unicode. How many people actually have all of the Unicode glyphs? I
> certainly don't.]

If I don't have a particular character in my fonts, I will not create files with it in their filenames. Actually I only use 9 Polish letters in addition to ASCII, and even them rarely. Usually it's only a subset of ASCII.

Some programs use UTF-8 in filenames no matter what the locale is. For example the Evolution mail program, which stores mail folders as files under the names the user entered in a GUI. I had to rename some of these files in order to import them to Gnus, as it choked on filenames with strange characters, never mind that it didn't display them correctly (maybe because it tried to map them to virtual newsgroup names, or maybe because they are control characters in ISO-8859-x). If all programs consistently used the locale encoding for filenames, this would have worked.

When I switch my environment to UTF-8, which may happen in a few years, I will convert my filenames to UTF-8 and set up mount options to translate vfat filenames to/from UTF-8 instead of to ISO-8859-2. I expect good programs to understand that and display them correctly no matter what technique they are using for the display.

For example the Epiphany web browser, when I open the file:/home/users/qrczak URL, displays ISO-8859-2-encoded filenames correctly. The virtual HTML file it created from the directory listing has &#x105; in its title where the directory name had 0xB1 in ISO-8859-2. When I run Epiphany with the locale set to pl_PL.UTF-8, it displays UTF-8 filenames correctly and ISO-8859-2 filenames are not shown at all. That's fine with me: it doesn't deal with wrongly encoded filenames, but it treats well-encoded filenames correctly, and for a web page rendered on the screen it makes no sense to display raw bytes. Epiphany treats filenames as sequences of characters encoded according to the locale.
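The Epiphany example can be checked by hand: byte 0xB1 in ISO-8859-2 is U+0105 (LATIN SMALL LETTER A WITH OGONEK), which renders as the numeric reference &#x105;. A hypothetical sketch with a deliberately tiny ISO-8859-2 table:

```haskell
import Data.Word (Word8)
import Data.Char (chr, ord)
import Numeric (showHex)

-- A deliberately partial ISO-8859-2 decoder: the lower half coincides
-- with ASCII; only a few Polish letters from the upper half are listed
-- here (a real table has an entry for every byte above 0x9F).
iso88592 :: Word8 -> Maybe Char
iso88592 b
  | b < 0xA0  = Just (chr (fromIntegral b))
  | otherwise = lookup b [ (0xB1, '\x105')   -- ą
                         , (0xE6, '\x107')   -- ć
                         , (0xB3, '\x142')   -- ł
                         , (0xF1, '\x144')   -- ń
                         ]

-- The HTML numeric character reference for a decoded character, as in
-- the title of the generated directory listing.
htmlRef :: Char -> String
htmlRef c = "&#x" ++ showHex (ord c) "" ++ ";"
```

So a directory name containing the byte 0xB1, decoded per the locale, comes out as the character U+0105 and is rendered as &#x105; in the generated HTML.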
> And even to the extent that it can be done, it will take a long time.
> Outside of the Free Software ghetto, long-term backward compatibility
> still means a lot.

Windows has already switched most of its internals to Unicode, and it did it faster than Linux.

> > In CLisp it fails silently (undecodable filenames are skipped),
> > which is bad. It should fail loudly.
>
> No, it shouldn't fail at all.

Since it uses Unicode as its string representation, accepting filenames not encoded in the locale encoding would imply making garbage from filenames correctly encoded in the locale encoding. In a UTF-8 environment the character U+00E1 in a filename means bytes 0xC3 0xA1 on an ext2 filesystem (and 0x00E1 on a vfat filesystem), so it can't at the same time mean the byte 0xE1 on an ext2 filesystem.

> > And this is why I can't switch my home environment to UTF-8 yet.
> > Too many programs are broken; almost all terminal programs which
> > use more than stdin and stdout in default modes, i.e. which use
> > line editing or work in full screen. How would you display a
> > filename in a full screen text editor, such that it works in a
> > UTF-8 environment?
>
> So, what are you suggesting? That the whole world switches to UTF-8?

No, each computer system decides for itself, and announces it in the locale setting. I'm suggesting that programs should respect that and correctly handle all correctly encoded texts, including filenames. Better programs may offer to choose the encoding explicitly when it makes sense (e.g. text file editors, for opening a file), but if they don't, they should at least accept the locale encoding.

> Or that every program should pass everything through iconv() (and
> handle the failures)?

If it uses Unicode as its internal string representation, yes (because the OS API on Unix generally uses byte encodings rather than Unicode). This should be done transparently in the libraries of the respective languages, instead of in each program independently.
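Doing the conversion "transparently in libraries" amounts to the runtime inspecting the locale once and choosing a decoder. A hypothetical sketch of just that dispatch (the function names and the ISO-8859-1 fallback are assumptions of this sketch, not how any particular RTS behaves):

```haskell
import Data.Word (Word8)
import Data.Char (chr)

-- Extract the codeset name from a locale string such as "pl_PL.UTF-8".
-- Falling back to ISO-8859-1 when no codeset is named is an assumption
-- of this sketch.
localeCodeset :: String -> String
localeCodeset locale = case dropWhile (/= '.') locale of
                         ('.':cs) -> cs
                         _        -> "ISO-8859-1"

-- A runtime library would consult a table like this once, at startup,
-- so that individual programs never have to set an encoding by hand.
decoderFor :: String -> Maybe ([Word8] -> String)
decoderFor "ISO-8859-1" = Just (map (chr . fromIntegral))
decoderFor _            = Nothing  -- UTF-8, ISO-8859-2, ... would go here
```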
> > A program is not supposed to encounter filenames which are not
> > representable in the locale's encoding.
>
> Huh? What does "supposed to" mean in this context? That everything
> would be simpler if reality wasn't how it is?

It means that if a program encounters a filename encoded differently, it's usually not the fault of the program but of whoever caused the mismatch in the first place.

> > In your setting it's impossible to display a filename in a way
> > other than printing to stdout.
>
> Writing to stdout doesn't amount to displaying anything; stdout
> doesn't have to be a terminal.

I know; that's not the point. The point is that display channels other than stdout connected to a terminal often work in terms of characters rather than bytes in some implicit encoding. For example various GUI frameworks, and wide-character ncurses.

> Sure; but that doesn't automatically mean that the locale's encoding
> is correct for any given filename. The point is that you often don't
> need to know the encoding.

What if I do need to know the encoding? I must assume
[Haskell-cafe] Re: Writing binary files?
Glynn Clements [EMAIL PROTECTED] writes:

> > 3. The default encoding is settable from Haskell, defaults to
> > ISO-8859-1.
>
> Agreed.

So every Haskell program that does more than just passing raw bytes from stdin to stdout should decode the appropriate environment variables, and set the encoding by itself? IMO that's too much redundancy; the RTS should actually do that.

> There are limits to the extent to which this can be achieved. E.g.
> what happens if you set the encoding to UTF-8, then call
> getDirectoryContents for a directory which contains filenames which
> aren't valid UTF-8 strings?

Then you _seriously_ messed up. Your terminal would produce garbage, Nautilus would break, ...

> > 5. The default encoding is settable from Haskell, defaults to the
> > locale encoding.
>
> I feel that the default encoding should be one whose decoder cannot
> fail, e.g. ISO-8859-1. You should have to explicitly request the use
> of the locale's encoding (analogous to calling setlocale(LC_CTYPE, "")
> at the start of a C program; there's a good reason why C doesn't do
> this without being explicitly told to).

So any Haskell program that doesn't call setlocale and outputs anything other than US-ASCII will produce garbage on a UTF-8 system?

> Actually, the more I think about it, the more I think that simple,
> stupid programs probably shouldn't be using Unicode at all.

Care to give any examples? Everything that has been mentioned until now would break with a UTF-8 locale:
- ls (sorting would break),
- env (sorting too).

> I.e. Char, String, string literals, and the I/O functions in Prelude,
> IO etc should all be using bytes,

I don't want the same mess as in C, where strings and raw data are the very same thing. Haskell has a nice type system and nicely defined types for binary data ([Word8]) and for text (String); why not use them?

> with a distinct wide-character API available for people who want to
> make the (substantial) effort involved in writing (genuinely)
> internationalised programs.

If you introduce an entirely new i18n-only API, then it'll surely become difficult. :-)

> Anything that isn't ISO-8859-1 just doesn't work for the most part,
> and anyone who wants to provide real I18N first has to work around
> the pseudo-I18N that's already there (e.g. convert Chars back into
> Word8s so that they can decode them into real Chars).

One more reason to fix the I/O functions to handle encodings, and to have a separate/underlying binary I/O API.

> Oh, and because bytes are being stored in Chars, the type system
> won't help if you neglect to decode a string, or if you decode it
> twice.

Yes, that's the problem with the current approach, i.e. that there's no easy way to get a list of Word8's out of a handle.

> The advantage of assuming ISO-8859-* is that the decoder can't fail;
> every possible stream of bytes is valid.

Assuming. But UTF-8 is not ISO-8859-*. When someday I change most of my files and filenames from ISO-8859-2 to UTF-8, and change the locale, the assumption will be wrong. I can't change that now, because too many programs would break. The current ISO-8859-1 assumption is also wrong: a program written in Haskell which sorts strings would break for non-ASCII letters even now that they are ISO-8859-2, unless specified otherwise.

> 1. In that situation, you can't avoid the encoding issues. It doesn't
> matter what the default is, because you're going to have to set the
> encoding anyhow.

Why do you always want me to set the encoding? That should be the job of the RTS. It's OK to use a different API to get Strings instead of Word8's out of a handle, but _manually_ having to set the encoding? IIRC, Haskell is meant to be portable, and locale handling is pretty platform-dependent.

> 2. If you assume ISO-8859-1, you can always convert back to Word8,

If I want a list of Word8's, then I should be able to get them without extracting them from a string.

> then re-decode as UTF-8.
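Point 2 of the quoted argument (assume ISO-8859-1, convert back to Word8, then re-decode) works precisely because Latin-1 decoding is lossless. A hypothetical sketch, with a deliberately minimal UTF-8 decoder (1- and 2-byte sequences only, no overlong check):

```haskell
import Data.Word (Word8)
import Data.Char (chr, ord)
import Data.Bits ((.&.), (.|.), shiftL)

-- Because ISO-8859-1 maps bytes to code points 1:1, a String read
-- under that assumption can always be turned back into the original
-- bytes...
toBytes :: String -> [Word8]
toBytes = map (fromIntegral . ord)

-- ...and then re-decoded under the encoding that was actually
-- intended. Nothing signals a malformed sequence.
fromUtf8 :: [Word8] -> Maybe String
fromUtf8 [] = Just ""
fromUtf8 (b:bs)
  | b < 0x80 = (chr (fromIntegral b) :) <$> fromUtf8 bs
fromUtf8 (b1:b2:bs)
  | b1 .&. 0xE0 == 0xC0 && b2 .&. 0xC0 == 0x80 =
      (chr ((fromIntegral (b1 .&. 0x1F) `shiftL` 6)
            .|. fromIntegral (b2 .&. 0x3F)) :) <$> fromUtf8 bs
fromUtf8 _ = Nothing

-- A UTF-8 text misread as Latin-1, e.g. "\xC4\x85", re-decoded:
reDecode :: String -> Maybe String
reDecode = fromUtf8 . toBytes
```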
> If you assume UTF-8, anything which is neither UTF-8 nor ASCII will
> fail far more severely than just getting the collation order wrong.

If I use Strings to handle binary data, then I should expect things to break. If I want to get text, and it's not in the expected encoding, then the user has messed up.

> Well, my view is essentially that files should be treated as
> containing bytes unless you explicitly choose to decode them, at
> which point you have to specify the encoding.

Why do you always want me to _manually_ specify an encoding? If I want bytes, I'll use the binary I/O API (currently being discussed, see the beginning of this thread); if I want Strings (i.e. text), I'll use the current I/O API (which is pretty text-oriented anyway, see hPutStrLn, hGetLine, ...).

> completely new wide-character API for those who wish to use it.

Which would make it horrendously difficult to do even basic I18N.

> That gets the failed attempt at I18N out of everyone's way with a
> minimum of effort and with maximum backwards compatibility for
Re: [Haskell-cafe] Writing binary files?
Marcin 'Qrczak' Kowalczyk wrote:

> > [Actually, regarding on-screen display, this is also an issue for
> > Unicode. How many people actually have all of the Unicode glyphs? I
> > certainly don't.]
>
> If I don't have a particular character in fonts, I will not create
> files with it in filenames. Actually I only use 9 Polish letters in
> addition to ASCII, and even them rarely. Usually it's only a subset
> of ASCII.

But this seems to be assuming a closed world, i.e. that the only files which the program will ever see are those which were created by you, or by others who are compatible with your conventions.

> Some programs use UTF-8 in filenames no matter what the locale is.
> For example the Evolution mail program which stores mail folders as
> files under names the user entered in a GUI.

This is entirely reasonable for a file which a program creates. If a filename is just a string of bytes, a program can use whatever encoding it wants.

> I had to rename some of these files in order to import them to Gnus,
> as it choked on filenames with strange characters, never mind that it
> didn't display them correctly (maybe because it tried to map them to
> virtual newsgroup names, or maybe because they are control characters
> in ISO-8859-x).

If it had just treated them as bytes, rather than trying to interpret them as characters, there wouldn't have been any problems.

> If all programs consistently used the locale encoding for filenames,
> this should have worked.

But again, for this to work in general, you have to assume a closed world.

> When I switch my environment to UTF-8, which may happen in a few
> years, I will convert filenames to UTF-8 and set up mount options to
> translate vfat filenames to/from UTF-8 instead of to ISO-8859-2.

But what about files which were created by other people, who don't use UTF-8?

> I expect good programs to understand that and display them correctly
> no matter what technique they are using for the display.

When it comes to display, you have to deal with encoding issues one way or another. But not all programs deal with display.

> For example the Epiphany web browser, when I open the
> file:/home/users/qrczak URL, displays ISO-8859-2-encoded filenames
> correctly. The virtual HTML file it created from the directory
> listing has &#x105; in its title where the directory name had 0xB1 in
> ISO-8859-2. When I run Epiphany with the locale set to pl_PL.UTF-8,
> it displays UTF-8 filenames correctly and ISO-8859-2 filenames are
> not shown at all.

For many (probably most) programs, omitting such files would be an unacceptable failure.

> > And even to the extent that it can be done, it will take a long
> > time. Outside of the Free Software ghetto, long-term backward
> > compatibility still means a lot.
>
> Windows has already switched most of its internals to Unicode, and it
> did it faster than Linux.

Microsoft is actively hostile to both backwards compatibility and cross-platform compatibility. I consider the fact that some Unix (primarily Linux) developers seem equally hostile to be a problem. Having said that, with Linux developers, the issue is usually due to not being bothered; assuming that everything is UTF-8 allows a lot of potential problems to be ignored.

Fortunately, the problem is mostly consigned to the periphery, i.e. the desktop, where most programs have to deal with display issues (so you *have* to decode bytes into characters), and it isn't too critical if they have limitations. The core OS and network server applications essentially remain encoding-agnostic.

> > > In CLisp it fails silently (undecodable filenames are skipped),
> > > which is bad. It should fail loudly.
> >
> > No, it shouldn't fail at all.
>
> Since it uses Unicode as string representation, accepting filenames
> not encoded in the locale encoding would imply making garbage from
> filenames correctly encoded in the locale encoding. In a UTF-8
> environment character U+00E1 in the filename means bytes 0xC3 0xA1 on
> an ext2 filesystem (and 0x00E1 on a vfat filesystem), so it can't at
> the same time mean 0xE1 on an ext2 filesystem.

But, as I keep pointing out, filenames are byte strings, not character strings. You shouldn't be converting them to character strings unless you have to.

> > > And this is why I can't switch my home environment to UTF-8 yet.
> > > Too many programs are broken; almost all terminal programs which
> > > use more than stdin and stdout in default modes, i.e. which use
> > > line editing or work in full screen. How would you display a
> > > filename in a full screen text editor, such that it works in a
> > > UTF-8 environment?
> >
> > So, what are you suggesting? That the whole world switches to
> > UTF-8?
>
> No, each computer system decides for itself, and announces it in the
> locale setting. I'm suggesting that programs should respect that and
> correctly handle all correctly encoded texts, including filenames.

1. Actually, each user decides which locale they wish to use. Nothing forces two users of a system to use the same locale.

2. Even if
[Haskell-cafe] Layered I/O
Binary i/o is not specifically a Haskell problem. Other programming systems, for example Scheme, have been struggling with the same issues. The Scheme binary I/O proposal may therefore be of some interest to this group. http://srfi.schemers.org/srfi-56/ It is deliberately made to be the easiest to implement and the least controversial. There exist more ambitious proposals. The key feature is a layered, or stackable, i/o. Enclosed is a justification message that I wrote two years ago for Haskell-Cafe, but somehow did not post. An early draft of the ambitious proposal, cast in the context of Scheme, is available here: http://pobox.com/~oleg/ftp/Scheme/io.txt More polished drafts exist, and even a prototype implementation. Unfortunately, once it became clear that the ideas were working out, the motivation fizzled. The discussion of i18n i/o highlighted the need for general overlay streams. We should be able to place a processing layer onto a handle -- and to peel it off and place another one. The layers can do character encoding, subranging (limiting the stream to the specified number of basic units), base64 and other decoding, signature collecting and verification, etc. Processing of a response from a web server illustrates a genuine need for such overlaid processing. Suppose we have established a connection to a web server, sent a GET or POST request and are now reading the response. It starts as follows:

    HTTP/1.1 200 Have it
    Content-type: text/plain; charset=iso-2022-jp
    Content-length: 12345
    Date: Tuesday, August 13, 2002
    <empty line>

To read the response line and the content headers, our stream must be in an ASCII, Latin-1 or UTF-8 encoding (regardless of the current locale). The body of the message is encoded in iso-2022-jp. This encoding may have nothing to do with the current locale. Furthermore, many character encodings cannot be reliably detected automatically. Therefore, after we read the headers we must forcibly set our stream to use the iso-2022-jp encoding. 
ISO-2022-JP is the encoding for Japanese characters [http://www.faqs.org/rfcs/rfc1554.html]. It is a variable-length stateful encoding: the start of the specifically Japanese encoding is indicated by \e$B. After that, the reader should read _two octets_ from the input stream (and pass them to the application as they are). The server has indicated that it was sending 12345 _octets_ of data. We cannot tell offhand how many _characters_ of data we will read, because of the variable-length encoding. However, we must not even attempt to read the 12346-th octet: HTTP/1.1 connections are, in general, persistent, and we must not read more data than were sent. Otherwise, we deadlock. Therefore, our stream must be able to give us Japanese characters and still must count octets. The HTTP stream will not, in general, give an EOF condition at the end of the data. That is not the only complication. Suppose the web server replied:

    Content-type: text/plain; charset=iso-2022-jp
    Content-transfer-encoding: chunked

We should expect to read a sequence of chunks of the format:

    length CRLF body CRLF

where length is a hexadecimal number, and body is encoded as indicated in the Content-type. Therefore, after we read the header, we should keep our stream in the ASCII mode to read the length field. After that, we should switch the encoding to ISO-2022-JP. After we have consumed length octets, we should switch the stream back to ASCII, verify the trailing CRLF, and read the length of the next chunk. The ISO-2022-JP encoding is stateful and wide (a character is formed from several octets). It may well happen that a wide character is split between two chunks: one octet of a character will be in one chunk and the other octets in the following chunk. Therefore, when we switch from the ISO-2022-JP encoding to ASCII and back, we must preserve the state of the encoding. This is not the end of the story, however. 
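The statefulness described above can be made concrete. Below is a toy Haskell sketch (not from any proposed library; all names are illustrative) of an ISO-2022-JP-like decoder whose mode must survive being suspended mid-stream, e.g. at a chunk boundary. It is simplified to two designations: ESC $ B enters the two-octet mode, ESC ( B returns to ASCII; real ISO-2022-JP has more.

```haskell
import Data.Word (Word8)

data Mode = AsciiMode | WideMode deriving (Eq, Show)

-- Decode as far as possible, returning decoded characters, the final
-- mode, and any trailing partial input (e.g. a lone first octet of a
-- two-octet pair). A toy: wide pairs are passed through undecoded.
decode :: Mode -> [Word8] -> (String, Mode, [Word8])
decode _ (0x1B:0x24:0x42:rest) = decode WideMode rest    -- ESC $ B
decode _ (0x1B:0x28:0x42:rest) = decode AsciiMode rest   -- ESC ( B
decode AsciiMode (b:rest) =
  let (out, m', left) = decode AsciiMode rest
  in (toEnum (fromIntegral b) : out, m', left)
decode WideMode (b1:b2:rest) =
  let (out, m', left) = decode WideMode rest
  in (toEnum (fromIntegral b1) : toEnum (fromIntegral b2) : out, m', left)
decode WideMode [b1] = ("", WideMode, [b1])   -- split wide char: keep state
decode m [] = ("", m, [])
```

Note how a split wide character is returned as leftover together with the current mode, so decoding can resume on the next chunk without losing the encoding state.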
A web server may send us a multi-part reply: a multi-part MIME entity made of several MIME entities, each with its own encoding and transfer modes. Neither of these encodings has anything to do with the current locale. Therefore, we may need to switch encodings back and forth quite a few times. Decoding of such complex streams becomes easier if we can overlay different processing layers on a stream. We start with a TCP handle, overlay an ASCII stream and read the headers, then overlay a stream that reads a specified number of units (and returns EOF when it has read that many). On top of the latter we place an ISO-2022-JP decoder. Or we choose a base64 decoder overlaid with a PKCS7 signed-entity decoder and with a signature-verification layer. OpenSSL is one package that offers i/o overlays and stream composition. Overlaying of parsers, encoders and hash accumulators is very common in that particular domain. I have implemented such a facility in two languages, e.g., to overlay an endian stream on the
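To make the layering idea concrete, here is a minimal Haskell sketch operating on pure byte lists rather than real handles. The names (Layer, limitLayer, dechunk) are hypothetical, and the chunk format is simplified to a decimal length terminated by a newline (real HTTP uses a hexadecimal length and CRLF):

```haskell
import Data.Word (Word8)

type Bytes = [Word8]

-- A layer maps an underlying byte stream to a decoded value plus the
-- unconsumed remainder, so layers can be peeled off and restacked.
type Layer a = Bytes -> (a, Bytes)

-- Content-length layer: expose exactly n octets and never read past
-- them (crucial on persistent HTTP/1.1 connections).
limitLayer :: Int -> Layer Bytes
limitLayer n = splitAt n

-- Simplified chunked-transfer decoding: a decimal ASCII length, a
-- '\n', the body, a '\n'; a zero length terminates the stream.
dechunk :: Bytes -> Bytes
dechunk [] = []
dechunk bs =
  let (lenDigits, rest) = span (/= 10) bs            -- read up to '\n'
      len = read (map (toEnum . fromIntegral) lenDigits) :: Int
  in if len == 0
       then []
       else let (body, rest') = splitAt len (drop 1 rest)
            in body ++ dechunk (drop 1 rest')        -- skip trailing '\n'
```

For example, `dechunk (map (fromIntegral . fromEnum) "5\nhello\n0\n")` yields the bytes of `"hello"`. A real implementation would additionally thread the decoder state (as sketched for ISO-2022-JP) through the chunk boundaries.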
Re: [Haskell-cafe] Layered I/O
My thoughts on I/O, binary and chars can be summarised: 1) Produce a new WordN-based IO library. 2) Character strings cannot be separated from their encodings (i.e. they must be encoded somehow, even if that encoding is ASCII). I would approach this using parameterised phantom types, so for example you would have:

    data Ascii
    data Chr a = Chr Char

Then the encoding becomes explicit in the type system, but of course functions like equality can still be expressed generically:

    charEq :: Chr a -> Chr a -> Bool

but of course, you must be comparing the same encodings (enforced by the type system)! The phantom type could be used in IO, i.e. to define a new encoding:

    data MyEncoding
    instance ChrToBinary MyEncoding where chrToBinary = ...
    instance BinaryToChr MyEncoding where binaryToChr = ...

Keean. ___ Haskell-Cafe mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell-cafe
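The fragment above can be fleshed out into a small runnable sketch. The Latin1 tag and the LANGUAGE pragma are additions for illustration; the point is that the encoding exists only at the type level:

```haskell
{-# LANGUAGE EmptyDataDecls #-}

-- Phantom tags: never inhabited, used only in types.
data Ascii
data Latin1

-- The character payload; the encoding lives purely in the type.
newtype Chr enc = Chr Char deriving (Eq, Show)

-- Generic over the encoding, but both arguments must share it:
charEq :: Chr enc -> Chr enc -> Bool
charEq (Chr a) (Chr b) = a == b

main :: IO ()
main = do
  let a = Chr 'x' :: Chr Ascii
      b = Chr 'x' :: Chr Ascii
  print (charEq a b)
  -- charEq a (Chr 'x' :: Chr Latin1)  -- rejected by the type checker
```

Mixing encodings, as in the commented-out line, fails at compile time, which is exactly the guarantee the message is after.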
Re: [Haskell-cafe] Writing binary files?
I modestly re-propose the I/O model which I first proposed last year: http://www.haskell.org/pipermail/haskell/2003-July/012312.html http://www.haskell.org/pipermail/haskell/2003-July/012313.html http://www.haskell.org/pipermail/haskell/2003-July/012350.html http://www.haskell.org/pipermail/haskell/2003-July/012352.html ... -- Ben
RE: [Haskell-cafe] Writing binary files?
On 15 September 2004 12:32, [EMAIL PROTECTED] wrote: On Mon, Sep 13, 2004 at 12:01:58PM +0100, Glynn Clements wrote: My view is that, right now, we have the worst of both worlds, and taking a short step backwards (i.e. narrow the Char type and leave the rest alone) is a lot simpler (and more feasible) than the long journey towards real I18N. This being Haskell, I can't imagine a consensus on a step backwards. In any case, a Char type distinct from bytes and the rest is the most valuable part of the current situation. The rest is just libraries, and the solution to that is to create other libraries. (It's true that the Prelude is harder to work around, but even that can be done, as with the new exception interface.) Indeed more than one approach can proceed concurrently, and that's probably what's going to happen: The Right Thing proceeds in stages: 1. new byte-based libraries 2. conversions sitting on top of these 3. the ultimate I18N API The Quick Fix: alter the existing implementation to use the encoding determined by the current locale at the borders. I wish I had some more time to work on this, but I implemented a prototype of an ultimate i18n API recently. This is derived from the API that Ben Rudiak-Gould proposed last year. It is in two layers: the InputStream/OutputStream classes provide raw byte I/O, and the TextStream class provides a conversion on top of that. The prototype uses iconv for conversions. You can make Streams from all sorts of things: files, sockets, pipes, and even Haskell arrays. InputStream and OutputStream are just classes, so you can implement your own. IIRC, I managed to get it working with speed comparable to GHC's current IO library. Here's a tarball that works with GHC 6.2.1 on a Unix platform; just --make to build it: http://www.haskell.org/~simonmar/new-io.tar.gz If anyone would like to pick this up and run with it, I'd be delighted. I'm not likely to get back to it in the short term, at least. 
Cheers, Simon
Re: [Haskell-cafe] Writing binary files?
Simon Marlow [EMAIL PROTECTED] writes: Here's a tarball that works with GHC 6.2.1 on a Unix platform, just --make to build it: http://www.haskell.org/~simonmar/new-io.tar.gz Found a bug already... In System/IO/Stream.hs, line 183:

    streamReadBufrer s 0 buf = return 0
    streamReadBuffer s len ptr = ...

Note the different spellings of the function name. Regards, Malcolm
Re: [Haskell-cafe] FilePath handling
Henning Thielemann [EMAIL PROTECTED] writes: I even plead for an abstract data type FilePath which supports operations like 'enter a directory', 'go one level higher' and so on. Beware of Common Lisp history: http://www.gigamonkeys.com/book/practical-a-portable-pathname-library.html As we discussed in chapter TK, Common Lisp provides an abstraction, the pathname, that is supposed to insulate us from the details of how different OS's and file systems name files. However the gods of abstraction give and they take away. While, with a bit of care, pathnames can be used to write code that will be portable between different OS's, it can, ironically, be quite tricky to write pathname code that will be portable between different Common Lisp implementations on the *same* OS. The root of the problem is that the pathname abstraction was designed to represent file names on a wide variety of file systems. The set of file systems still in general use at the time Common Lisp was being standardized was much more variegated than the set of file systems that are commonly used now. Unfortunately, by making pathnames abstract enough to account for a wide variety of file systems, Common Lisp's designers left implementors with a fair number of choices to be made when mapping the pathname abstraction onto a particular file system. Consequently different implementors, each implementing the pathname abstraction for the same file system, by making different choices at a few key junctions, could end up with conforming implementations that nonetheless provide different behavior for several of the main pathname-related functions. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
[Haskell-cafe] Unicoded filenames
Here is what happens when a language provides only a narrow-char API for filenames: Start of forwarded message Date: Wed, 15 Sep 2004 15:18:00 +0100 From: Peter Jolly [EMAIL PROTECTED] To: [EMAIL PROTECTED] CC: caml-list [EMAIL PROTECTED] Subject: Re: [Caml-list] How do I create files with UTF-8 file names? Mattias Waldau wrote: I have a filename as an UTF-8 encoded string. I need to be able to handle strange chars like accents, Asian chars etc. Is there any way to create a file with that name? I only need it on Win32. Windows uses UTF-16 for filenames, but provides a non-Unicode interface for legacy applications; the standard open() function that OCaml's open_out wraps appears to use the legacy interface. The precise codepage this uses is system-dependent, and AFAIK there's no way for a program to determine what it is without calling out to the Win32 API, but you can be pretty sure it won't be UTF-8. In other words, there is no reliable way to use a filename containing non-ASCII characters with OCaml's standard library. Or should I solve this problem by talking directly to the Win32 API? This is probably the best solution. A combination of CreateFileW() and MultiByteToWideChar() should do what you want. --- To unsubscribe, mail [EMAIL PROTECTED] Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners End of forwarded message -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: [Haskell-cafe] Writing binary files?
Graham Klyne wrote: In particular, the idea of narrowing the Char type really seems like a bad idea to me (if I understand the intent correctly). Not so long ago, I did a whole load of work on the HaXml parser so that, among other things, it would support UTF-8 and UTF-16 Unicode (as required by the XML spec). To do this depends upon having a Char type that can represent the full repertoire of Unicode characters. Note: I wasn't proposing doing away with wide character support altogether. Essentially, I was suggesting making Char a byte and having e.g. WideChar for wide characters. The reason being that the existing Haskell98 API uses Char for functions which are actually dealing with bytes. In an ideal world, the IO, System and Directory modules (and the Prelude I/O functions) would have used Byte, leaving Char to represent a (wide) character. However, that isn't the hand we've been dealt. -- Glynn Clements [EMAIL PROTECTED]
Re: [Haskell-cafe] FilePath handling [Was: Writing binary files?]
Henning Thielemann wrote: Udo Stenzel wrote: The same thoughts apply to filenames. Make them [Word8] and convert explicitly. By the way, I think a path should be a list of names (that is, of type [[Word8]]) and the library would be concerned with putting in the right path separator. Add functions to read and show pathnames in the local conventions and we'll never need to worry about path separators again. I even plead for an abstract data type FilePath which supports operations like 'enter a directory', 'go one level higher' and so on. Are you referring to pure operations on the FilePath, e.g. appending and removing entries? That's reasonable enough. But it needs to be borne in mind that there's a difference between: setCurrentDirectory ".." and: dir <- getCurrentDirectory; setCurrentDirectory $ parentDirectory dir [where parentDirectory is a pure FilePath -> FilePath function] if the last component in the path is a symlink. If you want to make FilePath an instance of Eq, the situation gets much more complicated. -- Glynn Clements [EMAIL PROTECTED]
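A purely textual parentDirectory (a hypothetical function, as in the message above) might look like the sketch below; precisely because it only manipulates the string, it can name a different directory than setCurrentDirectory ".." reaches when the last component is a symlink:

```haskell
-- Drop the last component of a '/'-separated path. This is pure
-- string surgery: it cannot consult the filesystem, so for a path
-- whose last component is a symlink it disagrees with chdir("..").
-- (Toy code: the root "/" and other corner cases are handled crudely.)
parentDirectory :: FilePath -> FilePath
parentDirectory p =
  case break (== '/') (reverse noSlash) of
    (_, [])     -> "."                 -- no separator: parent is "."
    (_, _:rest) -> if null rest then "/" else reverse rest
  where
    noSlash = reverse (dropWhile (== '/') (reverse p))  -- strip trailing '/'
```

So `parentDirectory "/home/user/dir"` gives `"/home/user"` regardless of whether `dir` is a symlink into an entirely different subtree, whereas the kernel resolves `..` against the symlink's target.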
Re: [Haskell-cafe] Unicoded filenames
Marcin 'Qrczak' Kowalczyk wrote: Here is what happens when a language provides only narrow-char API for filenames: I have a filename as an UTF-8 encoded string. I need to be able to handle strange chars like accents, Asian chars etc. Is there any way to create a file with that name? I only need it on Win32. Windows uses UTF-16 for filenames, but provides a non-Unicode interface for legacy applications; the standard open() function that OCaml's open_out wraps appears to use the legacy interface. The precise codepage this uses is system-dependent, and AFAIK there's no way for a program to determine what it is without calling out to the Win32 API, but you can be pretty sure it won't be UTF-8. In other words, there is no reliable way to use a filename containing non-ASCII characters with OCaml's standard library. No, this is what happens when an API imposes restrictions upon the filenames which it can handle. Essentially, it's due to two (or possibly three) factors: 1. The fact that Windows uses wide strings, rather than multi-byte strings, for filenames. 2. The fact that Windows' compatibility interface is broken, i.e. it only lets you access filenames which can be represented in the current codepage (which, to me, is highly analogous to only supporting filenames which are valid in the current locale). 3. Possibly that OCaml insists upon using UTF-8. [I don't know that this is the case, but the fact that they specifically mention UTF-8 suggests that it might be.] IOW, this incident seems to oppose, rather than support, the filenames-as-characters viewpoint. -- Glynn Clements [EMAIL PROTECTED]
Re: [Haskell-cafe] Re: Writing binary files?
Glynn Clements [EMAIL PROTECTED] writes: The RTS doesn't know the encoding. Assuming that the data will use the locale's encoding will be wrong too often. If the program wants to get bytes, it should get bytes explicitly, not some sort of pseudo-Unicode String. Like so many other people, you're making an argument based upon fiction (specifically, that you have a closed world where everything always uses the same encoding) then deeming anyone who is unable to maintain the fiction to be wrong. Everything's fine here with LANG=de_AT.utf8. And I can't recall having any problems with it. But well, YMMV. No. If a program just passes bytes around, everything will work so long as the inputs use the encoding which the outputs are assumed to use. And if the inputs aren't in the correct encoding, then you have to deal with encodings manually regardless of the default behaviour. The only programs that just pass bytes around that come to mind are basic Unix utilities. Basically everything else will somehow process the data. Sorting according to codepoints inevitably involves decoding. However, getting the order wrong is usually considered less problematic than failing outright. But more difficult to debug. Tough. You already have it, and will do for the foreseeable future. Many existing APIs (including the core Unix API), protocols and file formats are defined in terms of byte strings with no encoding specified or implied. Guess why I like Haskell (the language; the implementations are not up to that ideal yet). I'd like to. But many of the functions which provide or accept binary data (e.g. FilePath) insist on representing it using Strings. Good point. Adding functions that accept bytes instead of strings would be a major undertaking. I18N is inherently difficult. Lots of textual data exists in lots of different encodings, and the encoding is frequently unspecified. That's the problem with the current API. 
You can neither easily read/write bytes nor strings in a specified encoding. The problem is that we also need to fix them to handle *no encoding*. That's binary data. (assuming you didn't want to say 'unknown') Also, binary data and text aren't disjoint. Everything is binary; some of it is *also* text. Simon's new-io proposal does this very nicely. Stdin is by default a binary stream and you can obtain a TextInputStream for it using either the locale's encoding or a specified encoding. That's the way I'd like it to be. Or out of getDirectoryContents, getArgs, getEnv etc. Or to pass a list of Word8s to a handle, or to openFile, getEnv etc. That's a real issue. Adding new functions with a bin- prefix is the only solution that comes to my mind. The point is that, currently, you can't. Nothing in the core Haskell98 API actually uses Word8, it all uses Char/String. That's the intent of this thread. :-) Because we don't have an oracle which will magically determine the encoding for you. That oracle is called the locale setting. If I want to read text and can't determine the encoding by other means (protocol spec, ...), then it's what the user set his locale setting to. If you want text, well, tough; what comes out of most system calls and core library functions (not just read()) are bytes. Which need to be interpreted by the program depending on where these bytes come from. There isn't any magic wand which will turn them into characters without knowing the encoding. If I know the encoding, I should be able to set it. If I don't, it's the locale setting. completely new wide-character API for those who wish to use it. Which would make it horrendously difficult to do even basic I18N. Why? Having different types for single-byte and multi-byte strings together with separate functions to handle them (that's what I assume you mean by a new wide-character API) with single-byte strings being the preferred one (the cause of being a separate API) would make sorting, upper/lower case testing etc. 
not exactly easier. I know. That's what I'm saying. The problem is that the broken code is the Haskell98 API. No, it's not broken. It just has some missing features (i.e. I/O / env functions accepting bytes instead of strings). Strings are a list of Unicode characters, [Word8] is a list of bytes. And what comes out of (and goes into) most core library functions is the latter. Strictly speaking, the former comes out with the semantics of the latter. :-) Maybe bugs should be filed? Gabriel.
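The separation argued for throughout this exchange, with byte I/O as the primitive and decoding as an explicit, named step, can be sketched as follows (decodeLatin1 and encodeLatin1 are illustrative stand-ins for a real conversion library):

```haskell
import Data.Word (Word8)
import Data.Char (chr, ord)

-- Decoding Latin-1 is total: every byte is a valid code point.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Encoding is partial: only code points below 256 fit, so a real API
-- should signal failure rather than silently truncate.
encodeLatin1 :: String -> Maybe [Word8]
encodeLatin1 s
  | all ((< 256) . ord) s = Just (map (fromIntegral . ord) s)
  | otherwise             = Nothing
```

The asymmetry is the point: going from bytes to characters always requires naming an encoding, and going back can fail, which is exactly what a String-only API like the Haskell98 Prelude cannot express.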