Re: [Haskell-cafe] Writing binary files
On 21/08/06, Udo Stenzel [EMAIL PROTECTED] wrote:

> Neil Mitchell wrote:
> > I'm trying to write out a binary file, in particular I want the
> > following functions:
> >
> > hPutInt :: Handle -> Int -> IO ()
> > hGetInt :: Handle -> IO Int
> >
> > For the purposes of these functions, Int = 32 bits, and it's got to
> > round-trip - Put then Get must be the same. How would I do this? I
> > see Ptr, Storable and other things, but nothing which seems directly
> > usable for me.
>
> hPutInt h = hPutStr h . map chr . map (0xff .&.) . take 4 . iterate (`shiftR` 8)
>
> hGetInt h = replicateM 4 (hGetChar h) >>=
>             return . foldr (\d i -> i `shiftL` 8 .|. ord d) 0
>
> This of course assumes that a Char is read/written as a single
> low-order byte without any conversion. But you'd have to assume a lot
> more if you started messing with pointers. (Strange, somehow I get the
> feeling, the above is way too easy to be the answer you wanted.)
>
> Udo.

What's wrong with the following i.e. what assumptions is it making
(w.r.t. pointers) that I've missed? Is endian-ness an issue here?

Alistair

hPutInt :: Handle -> Int32 -> IO ()
hGetInt :: Handle -> IO Int32

int32 :: Int32
int32 = 0

hPutInt h i = do
  alloca $ \p -> do
    poke p i
    hPutBuf h p (sizeOf i)

hGetInt h = do
  alloca $ \p -> do
    bytes <- hGetBuf h p (sizeOf int32)
    when (bytes < sizeOf int32) (error "too few bytes read")
    peek p

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Re[2]: [Haskell-cafe] Writing binary files
Hello Alistair,

Tuesday, August 22, 2006, 1:29:22 PM, you wrote:

> What's wrong with the following i.e. what assumptions is it making
> (w.r.t. pointers) that I've missed? Is endian-ness an issue here?

Data written by your module on a big-endian machine can't be read by
the same module on a little-endian machine.

> bytes <- hGetBuf h p (sizeOf int32)

or

bytes <- hGetBuf h p (sizeOf (0::Int32))

--
Best regards,
Bulat  mailto:[EMAIL PROTECTED]
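Bulat's endianness objection can be addressed by fixing the on-disk byte order explicitly, rather than poking a whole `Int32` in host order. The following is a minimal sketch, not code from the thread; the names `hPutInt32LE`/`hGetInt32LE` and the choice of little-endian are my own assumptions. It assembles the value byte by byte, so files round-trip across machines of either endianness:

```haskell
import Control.Monad (when)
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Int (Int32)
import Data.Word (Word8)
import Foreign.Marshal.Array (allocaArray, peekArray, pokeArray)
import System.IO

-- Always write the four bytes least-significant first, independent of
-- the host's byte order.
hPutInt32LE :: Handle -> Int32 -> IO ()
hPutInt32LE h i =
  allocaArray 4 $ \p -> do
    pokeArray p [fromIntegral (i `shiftR` k) :: Word8 | k <- [0, 8, 16, 24]]
    hPutBuf h p 4

-- Read four bytes and reassemble them, most-significant byte last.
hGetInt32LE :: Handle -> IO Int32
hGetInt32LE h =
  allocaArray 4 $ \p -> do
    n <- hGetBuf h p 4
    when (n < 4) $ ioError (userError "too few bytes read")
    bs <- peekArray 4 p
    return (foldr (\b acc -> acc `shiftL` 8 .|. fromIntegral (b :: Word8)) 0 bs)
```

Because the reassembly is done in `Int32` arithmetic (which wraps), negative values such as `-1` round-trip correctly as well.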
[Haskell-cafe] Writing binary files
Hi,

I'm trying to write out a binary file, in particular I want the
following functions:

hPutInt :: Handle -> Int -> IO ()
hGetInt :: Handle -> IO Int

For the purposes of these functions, Int = 32 bits, and it's got to
round-trip - Put then Get must be the same. How would I do this? I see
Ptr, Storable and other things, but nothing which seems directly usable
for me.

Thanks

Neil
Re: [Haskell-cafe] Writing binary files
Neil Mitchell wrote:

> I'm trying to write out a binary file, in particular I want the
> following functions:
>
> hPutInt :: Handle -> Int -> IO ()
> hGetInt :: Handle -> IO Int
>
> For the purposes of these functions, Int = 32 bits, and it's got to
> round-trip - Put then Get must be the same. How would I do this? I see
> Ptr, Storable and other things, but nothing which seems directly
> usable for me.

hPutInt h = hPutStr h . map chr . map (0xff .&.) . take 4 . iterate (`shiftR` 8)

hGetInt h = replicateM 4 (hGetChar h) >>=
            return . foldr (\d i -> i `shiftL` 8 .|. ord d) 0

This of course assumes that a Char is read/written as a single
low-order byte without any conversion. But you'd have to assume a lot
more if you started messing with pointers. (Strange, somehow I get the
feeling, the above is way too easy to be the answer you wanted.)

Udo.

--
Worrying is like rocking in a rocking chair -- It gives you something
to do, but it doesn't get you anywhere.
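For completeness, Udo's pair can be turned into a compilable module by adding the imports it relies on; the demo `main`, the value list, and the file name `ints.bin` below are mine, not from the thread. The one caveat his "no conversion" remark points at: the Handle must be opened in binary mode (e.g. `openBinaryFile`), otherwise bytes above 0x7f may be mangled by the locale encoding. Note also that on a 64-bit `Int`, negative values come back as their unsigned 32-bit equivalent unless you add sign extension:

```haskell
import Control.Monad (replicateM)
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import System.IO

-- Udo's functions, verbatim apart from the added type signatures.
hPutInt :: Handle -> Int -> IO ()
hPutInt h = hPutStr h . map chr . map (0xff .&.) . take 4 . iterate (`shiftR` 8)

hGetInt :: Handle -> IO Int
hGetInt h = replicateM 4 (hGetChar h) >>=
            return . foldr (\d i -> i `shiftL` 8 .|. ord d) 0

main :: IO ()
main = do
  h <- openBinaryFile "ints.bin" WriteMode   -- binary mode is essential
  mapM_ (hPutInt h) [0, 42, 0x12345678]
  hClose h
  h2 <- openBinaryFile "ints.bin" ReadMode
  xs <- replicateM 3 (hGetInt h2)
  hClose h2
  print xs   -- prints [0,42,305419896]
```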
Re: [Haskell-cafe] Writing binary files
Hi,

> hPutInt h = hPutStr h . map chr . map (0xff .&.) . take 4 . iterate (`shiftR` 8)
>
> hGetInt h = replicateM 4 (hGetChar h) >>=
>             return . foldr (\d i -> i `shiftL` 8 .|. ord d) 0
>
> This of course assumes that a Char is read/written as a single
> low-order byte without any conversion. But you'd have to assume a lot
> more if you started messing with pointers. (Strange, somehow I get the
> feeling, the above is way too easy to be the answer you wanted.)

It's exactly the answer I was hoping for!

Thanks

Neil
Re: [Haskell-cafe] Writing binary files?
Glynn Clements [EMAIL PROTECTED] writes:

> > Ok, but let it be in addition to, not instead treating them as
> > character strings.
> Provided that you know the encoding, nothing stops you converting them
> to strings, should you have a need to do so. There are already APIs
> which use Strings for filenames.

I meant to keep them, let them use a program-settable encoding which
defaults to the locale encoding - this is the only sane interpretation
of this interface on Unix I can imagine. And in addition to them we may
have APIs which use byte strings, for those who prefer the ability to
handle all filenames to using a uniform string representation inside
the program.

> > Such encodings are not suitable for filenames.
> Regardless of whether they are suitable, they are used.

Usage of ISO-2022 as a filename encoding is a bad and unsupported idea.
The '/' byte does not necessarily mean that the '/' character is there,
so some random subset of characters is excluded. Statefulness means
that the same filename may be interpreted as different characters
depending on context.

There is no need to support ISO-2022 as a filename encoding in
languages and tools. The fact that some tool doesn't support ISO-2022
in filenames is not a flaw in the tool, so there is no need to check
what happens when filenames are represented in ISO-2022. If they are,
someone should fix his system.

> I haven't addressed any of the other stuff about ISO-2022, as it isn't
> really relevant. Whether ISO-2022 is good or bad doesn't matter; what
> matters is that it is likely to remain in use for the foreseeable
> future.

For transportation, not for the locale encoding nor for filenames.
There are no ISO-2022 locales. A program may support it when data it
operates on requests recoding between explicit encodings, e.g. if it's
found in an email, but there is no need to support it as the default
encoding of a program (which e.g. the withCString function should use).

> > IMHO it's more important to make them compatible with the
> > representation of strings used in other parts of the program.
> Why?

To limit conversion hassle to I/O, instead of scattering it through the
program when filenames and other strings are met.

> > But otherwise programs would continuously have bugs in handling text
> > which is not ISO-8859-1, especially with multibyte encodings where
> > pretending that ISO-8859-2 is ISO-8859-1 too often doesn't work.
> Why?

Because some channels talk in terms of characters, or bytes in a known
encoding, instead of bytes in an implicit encoding. E.g. most display
channels, apart from raw stdin/stdout and narrow-character ncurses;
many Internet protocols, apart from IRC; .NET and Java; file formats
like XML; some databases. And the world is slowly shifting to have more
such channels, which replace byte streams in an implicit encoding,
because after reaching a critical mass (where encoding-less channels
don't get in the way, losing information about the encoding or losing
some characters) they make multilingual handling more robust.

--
__("<  Marcin Kowalczyk
\__/   [EMAIL PROTECTED]
 ^^    http://qrnik.knm.org.pl/~qrczak/
Re: [Haskell-cafe] Writing binary files?
On Sat, Sep 18, 2004 at 10:58:21AM +0200, Marcin 'Qrczak' Kowalczyk wrote:
> Glynn Clements [EMAIL PROTECTED] writes:
> > > Ok, but let it be in addition to, not instead treating them as
> > > character strings.
> > Provided that you know the encoding, nothing stops you converting
> > them to strings, should you have a need to do so. There are already
> > APIs which use Strings for filenames.
> I meant to keep them, let them use a program-settable encoding which
> defaults to the locale encoding - this is the only sane interpretation
> of this interface on Unix I can imagine. And in addition to them we
> may have APIs which use byte strings, for those who prefer the ability
> to handle all filenames to using a uniform string representation
> inside the program.

Keep in mind, if you make this change to the IO libraries, you also
will have to simultaneously fix the Foreign.C.String module to use the
same locale as is used by the IO libraries when they deal with
FilePaths. Incidentally, this change will break existing darcs
repositories... but that should at least be repairable.

According to the FFI the CString functions do a locale-based
conversion, but of course in practice that isn't the case.

--
David Roundy
http://www.abridgegame.org
Re: [Haskell-cafe] Writing binary files?
Glynn Clements [EMAIL PROTECTED] writes:

> What I'm suggesting in the above is to sidestep the encoding issue by
> keeping filenames as byte strings wherever possible.

Ok, but let it be in addition to, not instead treating them as
character strings.

> And program-generated email notifications frequently include text with
> no known encoding (i.e. binary data).

No, programs don't dump binary data among diagnostic messages. If they
output binary data to stdout, it's their only output and it's
redirected to a file or another process.

> Or are you going to demand that anyone who tries to hack into your
> system only sends it UTF-8 data so that the alert messages are
> displayed correctly in your mail program?

The email protocol is text-only. It may mangle newlines, it has a
maximum line length, some texts may be escaped during transport (e.g.
"From" at the beginning of a line). Arbitrary binary data should be put
in base64-or-otherwise-encoded attachments. If the cron program embeds
the output as the email body, the cron job should not dump arbitrary
binary data to stdout. Encoding is not the only problem.

Processing data in their original byte encodings makes supporting
multiple languages harder. Filenames which are inexpressible as
character strings get in the way of clean APIs. When considering only
filenames, using bytes would be sufficient, but overall it's more
convenient to Unicodize them like other strings.

> It also harms reliability. Depending upon the encoding, two distinct
> byte strings may have the same Unicode representation.

Such encodings are not suitable for filenames.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg00376.html

| ISO-2022-JP will never be a satisfactory terminal encoding (like
| ISO-8859-*, EUC-*, UTF-8, Shift_JIS) because
|
| 1) It is a stateful encoding. What happens when a program starts some
| terminal output and then is interrupted using Ctrl-C or Ctrl-Z? The
| terminal will remain in the shifted state, while other programs start
| doing output. But these programs expect that when they start, the
| terminal is in the initial state. The net result will be garbage on
| the screen.
|
| 2) ISO-2022-JP is not filesystem safe. Therefore filenames will never
| be able to carry Japanese characters in this encoding.
|
| Robert Brady writes:
| > Does ISO-2022 see much/any use as the locale encoding, or is it just
| > used for interchange?
|
| Just for interchange.
|
| Paul Eggert searched for uses of ISO-2022-JP as locale encodings (in
| order to convince me), and only came up with a handful of questionable
| URLs. He didn't convince me. And there are no plans to support
| ISO-2022-JP as a locale encoding in glibc - because of 1) and 2)
| above.

For me ISO-2022 is a brain-damaged concept and should die. Almost
nothing supports it anyway.

> > Such tarballs are not portable across systems using different
> > encodings.
> Well, programs which treat filenames as byte strings to be read from
> argv[] and passed directly to open() won't have any problems with
> this.

The OS itself may have problems with this; only some filesystems accept
arbitrary bytes apart from '\0' and '/' (and with the special meaning
for '.'). Exotic characters in filenames are not very portable.

> > A Haskell program in my world can do that too. Just set the encoding
> > to Latin1.
> But programs should handle this by default, IMHO.

IMHO it's more important to make them compatible with the
representation of strings used in other parts of the program.

> Filenames are, for the most part, just tokens to be passed around.

Filenames are often stored in text files, whose bytes are interpreted
as characters. Applying QP to non-ASCII parts of filenames is suitable
only if humans won't edit these files by hand.

> My specific point is that the Haskell98 API has a very big problem due
> to the assumption that the encoding is always known. Existing
> implementations work around the problem by assuming that the encoding
> is always ISO-8859-1.

The API is incomplete and needs to be enhanced.

> Programs written using the current API will be limited to using the
> locale encoding. That just adds unnecessary failure modes.

But otherwise programs would continuously have bugs in handling text
which is not ISO-8859-1, especially with multibyte encodings where
pretending that ISO-8859-2 is ISO-8859-1 too often doesn't work.

I can't switch my environment to UTF-8 yet precisely because too many
programs were written with the attitude you are promoting: they don't
care about the encoding, they just pass bytes around. Bugs range from
small annoyances like tabular output which doesn't line up, through
mangled characters on a graphical display, to full-screen interactive
programs being unusable on a UTF-8 terminal.

> > This encoding would be incompatible with most other texts seen by
> > the program. In particular reading a filename from a file would not
> > work without manual recoding.
> We already have that problem; you can't read non-Latin1
Re: [Haskell-cafe] Writing binary files?
Marcin 'Qrczak' Kowalczyk wrote:

> > What I'm suggesting in the above is to sidestep the encoding issue
> > by keeping filenames as byte strings wherever possible.
> Ok, but let it be in addition to, not instead treating them as
> character strings.

Provided that you know the encoding, nothing stops you converting them
to strings, should you have a need to do so.

> Processing data in their original byte encodings makes supporting
> multiple languages harder. Filenames which are inexpressible as
> character strings get in the way of clean APIs. When considering only
> filenames, using bytes would be sufficient, but overall it's more
> convenient to Unicodize them like other strings.

> > It also harms reliability. Depending upon the encoding, two distinct
> > byte strings may have the same Unicode representation.
> Such encodings are not suitable for filenames.

Regardless of whether they are suitable, they are used.

> For me ISO-2022 is a brain-damaged concept and should die.

Well, it isn't likely to.

I haven't addressed any of the other stuff about ISO-2022, as it isn't
really relevant. Whether ISO-2022 is good or bad doesn't matter; what
matters is that it is likely to remain in use for the foreseeable
future.

> > > Such tarballs are not portable across systems using different
> > > encodings.
> > Well, programs which treat filenames as byte strings to be read from
> > argv[] and passed directly to open() won't have any problems with
> > this.
> The OS itself may have problems with this; only some filesystems
> accept arbitrary bytes apart from '\0' and '/' (and with the special
> meaning for '.'). Exotic characters in filenames are not very
> portable.

No, but most Unix programs manage to handle them without problems.

> > > A Haskell program in my world can do that too. Just set the
> > > encoding to Latin1.
> > But programs should handle this by default, IMHO.
> IMHO it's more important to make them compatible with the
> representation of strings used in other parts of the program.

Why? Filenames are, for the most part, just tokens to be passed around.

> Filenames are often stored in text files,

True.

> whose bytes are interpreted as characters.

Sometimes true, sometimes not. Where filenames occur in data files,
e.g. configuration files, the program which reads the configuration
file typically passes the bytes directly to the OS without
interpretation.

> Applying QP to non-ASCII parts of filenames is suitable only if humans
> won't edit these files by hand.

Who said anything about QP?

> > My specific point is that the Haskell98 API has a very big problem
> > due to the assumption that the encoding is always known. Existing
> > implementations work around the problem by assuming that the
> > encoding is always ISO-8859-1.
> The API is incomplete and needs to be enhanced.

> > Programs written using the current API will be limited to using the
> > locale encoding. That just adds unnecessary failure modes.
> But otherwise programs would continuously have bugs in handling text
> which is not ISO-8859-1, especially with multibyte encodings where
> pretending that ISO-8859-2 is ISO-8859-1 too often doesn't work.

Why?

> I can't switch my environment to UTF-8 yet precisely because too many
> programs were written with the attitude you are promoting: they don't
> care about the encoding, they just pass bytes around.

That's all that many programs should be doing.

> Bugs range from small annoyances like tabular output which doesn't
> line up, through mangled characters on a graphical display, to
> full-screen interactive programs being unusable on a UTF-8 terminal.

IOW: 1. display doesn't work correctly, 2. display doesn't work
correctly, and 3. display doesn't work correctly.

You keep citing cases involving graphical display as a reason why all
programs should be working with characters all of the time. I haven't
suggested that programs should never deal with characters, yet you keep
insinuating that is my argument, then proceed to attack it.
--
Glynn Clements [EMAIL PROTECTED]
Re: [Haskell-cafe] Writing binary files?
Glynn Clements [EMAIL PROTECTED] writes:

> But this seems to be assuming a closed world. I.e. the only files
> which the program will ever see are those which were created by you,
> or by others who are compatible with your conventions.

Yes, unless you set the default encoding to Latin1.

> > Some programs use UTF-8 in filenames no matter what the locale is.
> > For example the Evolution mail program which stores mail folders as
> > files under names the user entered in a GUI.
> This is entirely reasonable for a file which a program creates. If a
> filename is just a string of bytes, a program can use whatever
> encoding it wants.

But then they display wrong in any other program.

> If it had just treated them as bytes, rather than trying to interpret
> them as characters, there wouldn't have been any problems.

I suspect it treats some characters in these synthesized newsgroup
names, like dots, specially, so it won't work unless it was designed
differently.

> > When I switch my environment to UTF-8, which may happen in a few
> > years, I will convert filenames to UTF-8 and set up mount options to
> > translate vfat filenames to/from UTF-8 instead of to ISO-8859-2.
> But what about files which were created by other people, who don't use
> UTF-8?

All people sharing a filesystem should use the same encoding.

BTW, when ftping files between Windows and Unix, a good ftp client
should convert filenames to keep the same characters rather than the
same bytes, so CP-1250 encoded names don't come out as garbage in the
encoding used on Unix, which is definitely different (ISO-8859-2 or
UTF-8), or vice versa.

> > I expect good programs to understand that and display them correctly
> > no matter what technique they are using for the display.
> When it comes to display, you have to deal with encoding issues one
> way or another. But not all programs deal with display.

So you advocate using multiple encodings internally. This is in general
more complicated than what I advocate: using only Unicode internally,
limiting other encodings to the I/O boundary.

> Assuming that everything is UTF-8 allows a lot of potential problems
> to be ignored.

I don't assume UTF-8 when the locale doesn't say this.

> The core OS and network server applications essentially remain
> encoding-agnostic.

Which is a problem when they generate an email, e.g. to send the
non-empty output of a cron job, or report unauthorized use of sudo. If
the data involved is not pure ASCII, it will often be mangled. It's
rarely a problem in practice because filenames, command arguments,
error messages, user full names etc. are usually pure ASCII. But this
is slowly changing.

> But, as I keep pointing out, filenames are byte strings, not character
> strings. You shouldn't be converting them to character strings unless
> you have to.

Processing data in their original byte encodings makes supporting
multiple languages harder. Filenames which are inexpressible as
character strings get in the way of clean APIs. When considering only
filenames, using bytes would be sufficient, but overall it's more
convenient to Unicodize them like other strings.

> 1. Actually, each user decides which locale they wish to use. Nothing
> forces two users of a system to use the same locale.

Locales may be different, but they should use the same encoding when
they share files. This applies to file contents too - various formats
don't have a fixed encoding and don't specify the encoding explicitly,
so these files are assumed to be in the locale encoding.

> 2. Even if the locale was constant for all users on a system, there's
> still the (not exactly minor) issue of networking.

Depends on the networking protocols. They might insist that filenames
are represented in UTF-8, for example.

> Or that every program should pass everything through iconv() (and
> handle the failures)?

If it uses Unicode as its internal string representation, yes (because
the OS API on Unix generally uses byte encodings rather than Unicode).

> The problem with that is that you need to *know* the source and
> destination encodings. The program gets to choose one of them, but it
> may not even know the other one.

If it can't know the encoding, it should process the data as a sequence
of bytes, and can output it only to another channel which accepts raw
bytes. But usually it's either known or can be assumed to be the locale
encoding.

> The term mismatch implies that there have to be at least two things.
> If they don't match, which one is at fault? If I make a tar file
> available for you to download, and it contains non-UTF-8 filenames, is
> that my fault or yours?

Such tarballs are not portable across systems using different
encodings. If I tar a subdirectory stored on an ext2 partition, and you
untar it on a vfat partition, whose fault is it that files which differ
only in case are conflated?

> In any case, if a program refuses to deal with a file because it
> cannot convert the filename to characters, even when it doesn't have
> to, it's the program which
Re: [Haskell-cafe] Writing binary files?
Marcin 'Qrczak' Kowalczyk wrote:

> > > When I switch my environment to UTF-8, which may happen in a few
> > > years, I will convert filenames to UTF-8 and set up mount options
> > > to translate vfat filenames to/from UTF-8 instead of to
> > > ISO-8859-2.
> > But what about files which were created by other people, who don't
> > use UTF-8?
> All people sharing a filesystem should use the same encoding.

Again, this is just hand waving the issues away.

> BTW, when ftping files between Windows and Unix, a good ftp client
> should convert filenames to keep the same characters rather than the
> same bytes, so CP-1250 encoded names don't come out as garbage in the
> encoding used on Unix, which is definitely different (ISO-8859-2 or
> UTF-8), or vice versa.

Which is fine if the FTP client can figure out which encoding is used
on the remote end. In practice, you have to tell it, i.e. have a list
of which servers (or even which directories on which servers) use which
encoding.

> > > I expect good programs to understand that and display them
> > > correctly no matter what technique they are using for the display.
> > When it comes to display, you have to deal with encoding issues one
> > way or another. But not all programs deal with display.
> So you advocate using multiple encodings internally. This is in
> general more complicated than what I advocate: using only Unicode
> internally, limiting other encodings to the I/O boundary.

How do you draw that conclusion from what I wrote here? There are cases
where it's advantageous to use multiple encodings, but I wasn't
suggesting that in the above. What I'm suggesting in the above is to
sidestep the encoding issue by keeping filenames as byte strings
wherever possible. The core OS and network server applications
essentially remain encoding-agnostic.

> Which is a problem when they generate an email, e.g. to send the
> non-empty output of a cron job, or report unauthorized use of sudo. If
> the data involved is not pure ASCII, it will often be mangled.

It only gets mangled if you feed it to a program which is making
assumptions about the encoding. Non-MIME messages neither specify nor
imply an encoding. MIME messages can use either text/plain;
charset=x-unknown or application/octet-stream if they don't understand
the encoding. And program-generated email notifications frequently
include text with no known encoding (i.e. binary data). Or are you
going to demand that anyone who tries to hack into your system only
sends it UTF-8 data so that the alert messages are displayed correctly
in your mail program?

> It's rarely a problem in practice because filenames, command
> arguments, error messages, user full names etc. are usually pure
> ASCII. But this is slowly changing.

To the extent that non-ASCII filenames are used, I've encountered far
more filenames in both Latin1 and ISO-2022 than in UTF-8. Japanese FTP
sites typically use ISO-2022 for everything; even ASCII names may have
\e(B prepended to them.

> > But, as I keep pointing out, filenames are byte strings, not
> > character strings. You shouldn't be converting them to character
> > strings unless you have to.
> Processing data in their original byte encodings makes supporting
> multiple languages harder. Filenames which are inexpressible as
> character strings get in the way of clean APIs. When considering only
> filenames, using bytes would be sufficient, but overall it's more
> convenient to Unicodize them like other strings.

It also harms reliability. Depending upon the encoding, two distinct
byte strings may have the same Unicode representation. E.g. if you are
interfacing to a server which uses ISO-2022 for filenames, you have to
get the escapes correct even when they are no-ops in terms of the
string representation. If you obtain a directory listing, receive the
filename \e(Bfoo.txt, and convert it to Unicode, you get foo.txt. If
you then convert it back without the leading escape, the server is
going to say "file not found".

> > The term mismatch implies that there have to be at least two things.
> > If they don't match, which one is at fault? If I make a tar file
> > available for you to download, and it contains non-UTF-8 filenames,
> > is that my fault or yours?
> Such tarballs are not portable across systems using different
> encodings.

Well, programs which treat filenames as byte strings to be read from
argv[] and passed directly to open() won't have any problems with this.
It's only a problem if you make it a problem.

> If I tar a subdirectory stored on an ext2 partition, and you untar it
> on a vfat partition, whose fault is it that files which differ only in
> case are conflated?

Arguably, it's Microsoft's fault for not considering the problems
caused by multiple encodings when they decided that filenames were
going to be case-folded.

> > In any case, if a program refuses to deal with a file because it
> > cannot convert the filename to characters, even when it doesn't have
> > to, it's the program which is at fault.
> Only if it's a low-level utility,
Re: [Haskell-cafe] Writing binary files?
On Mon, Sep 13, 2004 at 12:01:58PM +0100, Glynn Clements wrote:
> My view is that, right now, we have the worst of both worlds, and
> taking a short step backwards (i.e. narrow the Char type and leave the
> rest alone) is a lot simpler (and more feasible) than the long journey
> towards real I18N.

This being Haskell, I can't imagine a consensus on a step backwards. In
any case, a Char type distinct from bytes and the rest is the most
valuable part of the current situation. The rest is just libraries, and
the solution to that is to create other libraries. (It's true that the
Prelude is harder to work around, but even that can be done, as with
the new exception interface.)

Indeed more than one approach can proceed concurrently, and that's
probably what's going to happen:

The Right Thing proceeds in stages:
1. new byte-based libraries
2. conversions sitting on top of these
3. the ultimate I18N API

The Quick Fix: alter the existing implementation to use the encoding
determined by the current locale at the borders. When the Right Thing
is finished, the Quick Fix can be recast as a special case.

The Right Thing might take a very long (possibly infinite) time,
because this is the sort of thing people can argue about endlessly.
Still, the first stage would deal with most of the scenarios you
raised. It just needs a group of people who care about it to get
together and do it.

The Quick Fix is the most logical implementation of the current
definition of Haskell, and entirely consistent with its general
philosophy of presenting the programmer with an idealized (some might
say oversimplified) model of computation. From the start, Haskell has
supported only character-based I/O, with whatever translations were
required to present a uniform view on all platforms. And that's not an
entirely bad thing. It won't work all the time, but it will be simple,
and good enough for most people. Its existence will not rule out binary
I/O or more sophisticated alternatives. Those who need more may be
motivated to help finish the Right Thing.
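Stage 1 of the Right Thing (new byte-based libraries) could be sketched even on the then-current API, on top of hGetBuf/hPutBuf. The names `hPutBytes`/`hGetBytes` below are hypothetical, my own illustration rather than an actual proposal from the thread: primitives that move octets only, with no Char and no encoding anywhere in sight.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Array (allocaArray, peekArray, pokeArray)
import System.IO

-- Hypothetical byte-based primitives: no Char, no encoding, just octets.
hPutBytes :: Handle -> [Word8] -> IO ()
hPutBytes h ws =
  allocaArray (length ws) $ \p -> do
    pokeArray p ws
    hPutBuf h p (length ws)

-- Returns the bytes actually read, which may be fewer than requested
-- at end of file.
hGetBytes :: Handle -> Int -> IO [Word8]
hGetBytes h n =
  allocaArray n $ \p -> do
    got <- hGetBuf h p n
    peekArray got p
```

Conversion layers (stage 2) would then sit on top of functions like these, turning `[Word8]` into `String` for a chosen encoding, instead of the encoding being baked into the I/O primitives themselves.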
Re: [Haskell-cafe] Writing binary files?
I've not been following this debate, but I think I agree with Ross. In particular, the idea of narrowing the Char type really seems like a bad idea to me (if I understand the intent correctly). Not so long ago, I did a whole load of work on the HaXml parser so that, among other things, it would support UTF-8 and UTF-16 Unicode (as required by the XML spec). To do this depends upon having a Char type that can represent the full repertoire of Unicode characters. Other languages have been forced into this (maybe painful) transition; I don't think Haskell can reasonably go backwards if it is to have any hope of surviving. #g -- At 12:31 15/09/04 +0100, [EMAIL PROTECTED] wrote: On Mon, Sep 13, 2004 at 12:01:58PM +0100, Glynn Clements wrote: My view is that, right now, we have the worst of both worlds, and taking a short step backwards (i.e. narrow the Char type and leave the rest alone) is a lot simpler (and more feasible) than the long journey towards real I18N. This being Haskell, I can't imagine a consensus on a step backwards. In any case, a Char type distinct from bytes and the rest is the most valuable part of the current situation. The rest is just libraries, and the solution to that is to create other libraries. (It's true that the Prelude is harder to work around, but even that can be done, as with the new exception interface.) Indeed more than one approach can proceed concurrently, and that's probably what's going to happen: The Right Thing proceeds in stages: 1. new byte-based libraries 2. conversions sitting on top of these 3. the ultimate I18N API The Quick Fix: alter the existing implementation to use the encoding determined by the current locale at the borders. When the Right Thing is finished, the Quick Fix can be recast as a special case. The Right Thing might take a very long (possibly infinite) time, because this is the sort of thing people can argue about endlessly. Still, the first stage would deal with most of the scenarios you raised. 
It just needs a group of people who care about it to get together and do it. The Quick Fix is the most logical implementation of the current definition of Haskell, and entirely consistent with its general philosophy of presenting the programmer with an idealized (some might say oversimplified) model of computation. From the start, Haskell has supported only character-based I/O, with whatever translations were required to present a uniform view on all platforms. And that's not an entirely bad thing. It won't work all the time, but it will be simple, and good enough for most people. Its existence will not rule out binary I/O or more sophisticated alternatives. Those who need more may be motivated to help finish the Right Thing. ___ Haskell-Cafe mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell-cafe Graham Klyne For email: http://www.ninebynine.org/#Contact
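The staged plan above (byte-based libraries first, explicit conversions on top) can be sketched in a few lines of Haskell. This is purely illustrative: the names `decodeLatin1` and `encodeLatin1` are hypothetical, not part of any existing library, and ISO-8859-1 is used only because its byte-to-code-point mapping is the identity.

```haskell
import Data.Word (Word8)
import Data.Char (chr, ord)

-- Stage 1 would give us byte-based I/O trafficking in [Word8].
-- Stage 2 layers explicit conversions on top; ISO-8859-1 (Latin-1)
-- is the trivial case, since byte value and code point coincide.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)  -- only safe for chars below 256
```

Under this scheme the current implicit ISO-8859-1 behaviour becomes just one conversion among many, which is exactly how the Quick Fix could later be recast as a special case.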
Re: [Haskell-cafe] Writing binary files?
Glynn Clements [EMAIL PROTECTED] writes: Unless you are the sole user of a system, you have no control over what filenames may occur on it (and even if you are the sole user, you may wish to use packages which don't conform to your rules). For these occasions you may set the encoding to ISO-8859-1. But then you can't sensibly show them to the user in a GUI, nor in ncurses using the wide character API, nor can you sensibly store them in a file which is to be always encoded in UTF-8 (e.g. XML file where you can't put raw bytes without knowing their encoding). There are two paradigms: manipulating bytes not knowing their encoding, and manipulating characters explicitly encoded in various encodings (possibly UTF-8). The world is slowly migrating from the first to the second. There are limits to the extent to which this can be achieved. E.g. what happens if you set the encoding to UTF-8, then call getDirectoryContents for a directory which contains filenames which aren't valid UTF-8 strings? The library fails. Don't do that. This environment is internally inconsistent. Call it what you like, it's a reality, and one which programs need to deal with. The reality is that filenames are encoded in different encodings depending on the system. Sometimes it's ISO-8859-1, sometimes ISO-8859-2, sometimes UTF-8. We should not ignore the possibility of UTF-8-encoded filenames. In CLisp it fails silently (undecodable filenames are skipped), which is bad. It should fail loudly. Most programs don't care whether any filenames which they deal with are valid in the locale's encoding (or any other encoding). They just receive lists (i.e. NUL-terminated arrays) of bytes and pass them directly to the OS or to libraries. And this is why I can't switch my home environment to UTF-8 yet. Too many programs are broken; almost all terminal programs which use more than stdin and stdout in default modes, i.e. which use line editing or work in full screen.
How would you display a filename in a full screen text editor, such that it works in a UTF-8 environment? If the assumed encoding is ISO-8859-*, this program will work regardless of the filenames which it is passed or the contents of the file (modulo the EOL translation on Windows). OTOH, if it were to use UTF-8 (e.g. because that was the locale's encoding), it wouldn't work correctly if either filename or the file's contents weren't valid UTF-8. A program is not supposed to encounter filenames which are not representable in the locale's encoding. In your setting it's impossible to display a filename in a way other than printing to stdout. More accurately, it specifies which encoding to assume when you *need* to know the encoding (i.e. ctype.h etc), but you can't obtain that information from a more reliable source. In the case of filenames there is no more reliable source. My central point is that the existing API forces the encoding to be an issue when it shouldn't be. It is an unavoidable issue because not every interface in a given computer system uses the same encoding. Gtk+ uses UTF-8; you must convert text to UTF-8 in order to display it, and in order to convert you must know its encoding. Well, to an extent it is an implementation issue. Historically, curses never cared about encodings. A character is a byte, you draw bytes on the screen, curses sends them directly to the terminal. This is the old API. But newer ncurses API is prepared even for combining accents. A character is coded with a sequence of wchar_t values, such that all except the first one are combining characters. Furthermore, the curses model relies upon monospaced fonts, and falls down once you encounter CJK text (where a monospaced font means one whose glyphs are an integer multiple of the cell size, not necessarily a single cell). It doesn't fall. Characters may span several columns. There is wcwidth(), and curses specification in X/Open says how it should behave for wide CJK characters. 
I haven't tested it but I believe ncurses supports them. Extending something like curses to handle encoding issues is far from trivial; which is probably why it hasn't been finished yet. It's almost finished. The API specification was ready in 1997. It works in ncurses modulo unfixed bugs. But programs can't use it unless they use Unicode internally. Although, if you're going to have implicit String - [Word8] converters, there's no reason why you can't do the reverse, and have isAlpha :: Word8 - IO Bool. Although, like ctype.h, this will only work for single-byte encodings. We should not ignore multibyte encodings like UTF-8, which means that Haskell should have a Unicoded character type. And it's already specified in Haskell 98 that Char is such a type! What is missing is API for manipulating binary files, and conversion between byte streams and character streams using particular text encodings. A mail client is expected to respect the encoding set in
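The missing piece named above — conversion between byte streams and character streams for a particular encoding — is mechanical for UTF-8. A minimal encoder sketch follows (illustrative only, assuming valid code points and not any existing library API):

```haskell
import Data.Word (Word8)
import Data.Char (ord)
import Data.Bits (shiftR, (.&.), (.|.))

-- Encode one Unicode code point as its UTF-8 byte sequence.
-- Assumes a valid code point (<= U+10FFFF); a real library would
-- also reject surrogates.
encodeCharUTF8 :: Char -> [Word8]
encodeCharUTF8 c
  | n < 0x80    = [b n]
  | n < 0x800   = [b (0xC0 .|. shiftR n 6), cont n]
  | n < 0x10000 = [b (0xE0 .|. shiftR n 12), cont (shiftR n 6), cont n]
  | otherwise   = [b (0xF0 .|. shiftR n 18), cont (shiftR n 12),
                   cont (shiftR n 6), cont n]
  where
    n      = ord c
    b      = fromIntegral
    cont m = b (0x80 .|. (m .&. 0x3F))  -- continuation byte: 10xxxxxx

-- A [Char] -> [Word8] conversion of the kind the thread asks for.
encodeUTF8 :: String -> [Word8]
encodeUTF8 = concatMap encodeCharUTF8
```

Since Haskell 98 already defines Char as a Unicode code point, nothing more than functions of this shape (plus their decoding inverses, and iconv for the many non-Unicode encodings) is needed at the API level.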
Re: [Haskell-cafe] Writing binary files?
Marcin 'Qrczak' Kowalczyk wrote: Unless you are the sole user of a system, you have no control over what filenames may occur on it (and even if you are the sole user, you may wish to use packages which don't conform to your rules). For these occasions you may set the encoding to ISO-8859-1. But then you can't sensibly show them to the user in a GUI, nor in ncurses using the wide character API, nor can you sensibly store them in a file which is to be always encoded in UTF-8 (e.g. XML file where you can't put raw bytes without knowing their encoding). If you need to preserve the data exactly, you can use octal escapes (\337), URL encoding (%DF) or similar. If you don't, you can just approximate it (e.g. display unrepresentable characters as ?). But this is an inevitable consequence of filenames being bytes rather than chars. [Actually, regarding on-screen display, this is also an issue for Unicode. How many people actually have all of the Unicode glyphs? I certainly don't.] There are two paradigms: manipulating bytes not knowing their encoding, and manipulating characters explicitly encoded in various encodings (possibly UTF-8). The world is slowly migrating from the first to the second. This migration isn't a process which will ever be complete. There will always be plenty of cases where bytes really are just bytes. And even to the extent that it can be done, it will take a long time. Outside of the Free Software ghetto, long-term backward compatibility still means a lot. [E.g.: EBCDIC has been in existence longer than I have and, in spite of the fact that it's about the only widely-used encoding in existence which doesn't have ASCII as a subset, it shows no sign of dying out any time soon.] There are limits to the extent to which this can be achieved. E.g. what happens if you set the encoding to UTF-8, then call getDirectoryContents for a directory which contains filenames which aren't valid UTF-8 strings? The library fails. Don't do that.
This environment is internally inconsistent. Call it what you like, it's a reality, and one which programs need to deal with. The reality is that filenames are encoded in different encodings depending on the system. Sometimes it's ISO-8859-1, sometimes ISO-8859-2, sometimes UTF-8. We should not ignore the possibility of UTF-8-encoded filenames. I'm not suggesting we do. In CLisp it fails silently (undecodable filenames are skipped), which is bad. It should fail loudly. No, it shouldn't fail at all. Most programs don't care whether any filenames which they deal with are valid in the locale's encoding (or any other encoding). They just receive lists (i.e. NUL-terminated arrays) of bytes and pass them directly to the OS or to libraries. And this is why I can't switch my home environment to UTF-8 yet. Too many programs are broken; almost all terminal programs which use more than stdin and stdout in default modes, i.e. which use line editing or work in full screen. How would you display a filename in a full screen text editor, such that it works in a UTF-8 environment? So, what are you suggesting? That the whole world switches to UTF-8? Or that every program should pass everything through iconv() (and handle the failures)? Or what? If the assumed encoding is ISO-8859-*, this program will work regardless of the filenames which it is passed or the contents of the file (modulo the EOL translation on Windows). OTOH, if it were to use UTF-8 (e.g. because that was the locale's encoding), it wouldn't work correctly if either filename or the file's contents weren't valid UTF-8. A program is not supposed to encounter filenames which are not representable in the locale's encoding. Huh? What does supposed to mean in this context? That everything would be simpler if reality wasn't how it is? If that's your position, then my response is essentially: Yes, but so what? In your setting it's impossible to display a filename in a way other than printing to stdout. 
Writing to stdout doesn't amount to displaying anything; stdout doesn't have to be a terminal. More accurately, it specifies which encoding to assume when you *need* to know the encoding (i.e. ctype.h etc), but you can't obtain that information from a more reliable source. In the case of filenames there is no more reliable source. Sure; but that doesn't automatically mean that the locale's encoding is correct for any given filename. The point is that you often don't need to know the encoding. Converting a byte string to a character string when you're just going to be converting it back to the original byte string is pointless. And it introduces unnecessary errors. If the only difference between (decode . encode) and the identity function is that the former sometimes fails, what's the point? My central point is that the existing API forces the encoding to be an issue when it shouldn't be. It is an unavoidable
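The octal-escape idea mentioned above (preserving a byte-string filename exactly by rendering undecodable bytes as \337 and similar) can be sketched directly. The names `escapeByte` and `showFileName` are hypothetical, invented for illustration:

```haskell
import Data.Word (Word8)
import Data.Char (chr)
import Numeric (showOct)

-- Render one filename byte: printable ASCII passes through,
-- everything else becomes a three-digit octal escape. The backslash
-- itself is doubled so the escaping is unambiguous and reversible.
escapeByte :: Word8 -> String
escapeByte w
  | w == 0x5C             = "\\\\"
  | w >= 0x20 && w < 0x7F = [chr (fromIntegral w)]
  | otherwise             = '\\' : pad (showOct w "")
  where pad s = replicate (3 - length s) '0' ++ s

-- Display a filename (a byte string) without ever decoding it.
showFileName :: [Word8] -> String
showFileName = concatMap escapeByte
```

This treats the filename as bytes throughout: no locale, no decoding, and therefore no possibility of failure — which is exactly the point being argued here.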
Re: [Haskell-cafe] Writing binary files?
Udo Stenzel wrote: Note that this needs to include all of the core I/O functions, not just reading/writing streams. E.g. FilePath is currently an alias for String, but (on Unix, at least) filenames are strings of bytes, not characters. Ditto for argv, environment variables, possibly other cases which I've overlooked. I don't think so. They all are sequences of CChars, and C isn't particularly known for keeping bytes and chars apart. CChar is a C char, which is a byte (not necessarily an octet, and not necessarily a character either). I believe, Windows NT has (alternate) filename handling functions that use Unicode strings. Almost all of the Win32 API functions which handle strings exist in both char and wide-char versions. This would strengthen the view that a filename is a sequence of characters. It would be reasonable to make FilePath equivalent to String on Windows, but not on Unix. Ditto for argv, env, whatnot; they are typically entered from the shell and therefore are characters in the local encoding. Both argv and envp are char**, i.e. lists of byte strings. There is no guarantee that the values can be successfully decoded according to the locale's encoding. The environment is typically set on login, and inherited thereafter. It's typically limited to ASCII, but this isn't guaranteed. Similarly, a program may need to access files which he didn't create, and which have filenames which aren't valid strings according to his locale. E.g. a user may choose a locale which uses UTF-8, but the sysadmin has installed files with ISO-8859-1 filenames. If a Haskell program tries to coerce everything to String using the user's locale, the program will be unable to access such files. 3. The default encoding is settable from Haskell, defaults to ISO-8859-1. Agreed. Oh no, please don't do that. A global, settable encoding is, well, dys-functional. Hidden state makes programs hard to understand and Haskell imho shouldn't go that route.
There's already plenty of hidden state in the system libraries upon which a Haskell program depends. And please don't introduce the notion of a default encoding. It isn't an issue of *introducing* it. Many Haskell98 functions (i.e. much of IO, System and Directory) accept or return Strings, yet have to be implemented on top of an OS which accepts or provides char*s. There *has* to be an encoding between the two, and currently it's hardwired to ISO-8859-1. The alternative to a global encoding is for *all* functions which interface to the OS to always either accept or return [CChar] or, if they accept or return Strings, accept an additional argument which specifies the encoding. Also, bear in mind that the functions under discussion are all I/O functions which, by their nature, deal with state (e.g. the state of the filesystem). I'd like to see the following: - Duplicate the IO library. The duplicate should work with [Byte] everywhere where the old library uses String. Byte is some suitable unsigned integer, on most (all?) platforms this will be Word8 Technically it should be CChar. However, it's fairly safe to assume that a byte will always be 8 bits; almost nobody writes code which works on systems where it isn't. However: if we go this route, I suspect that we will also need a convenient method for specifying literal byte strings in Haskell source code. - Provide an explicit conversion between encodings. A simple conversion of type [Word8] -> String would suit me, iconv would provide all that is needed. For the general case, you need to allow for stateful encodings (e.g. ISO-2022). Actually, even UTF-8 needs to deal with state if you need to decode byte streams which are split into chunks and the breaks can occur in the middle of a character (e.g. if you're using non-blocking I/O). - iconv takes names of encodings as arguments.
Provide some names as constants: one name for the internal encoding (probably UCS4), one name for the canonical external encoding (probably locale dependent). - Then redefine the old IO API in terms of the new API and appropriate conversions. The old API requires an implicit encoding. The OS accepts or provides bytes, the old API functions accept or return Chars, and the old API functions don't accept an encoding argument. This is why we are (or, at least, I am) suggesting a settable current encoding. Because the existing API *needs* a current encoding, and I'm assuming that there may be some reluctance to just discarding it completely. While we're at it, do away with the annoying CR/LF problem on Windows, this should simply be part of the local encoding. This way file can always be opened as binary, hSetBinary can be dropped. (This won't work on ancient platforms where text files and binary files are genuinely different, but these are probably not interesting anyway.) Apart from OS-specific issues, it would be useful to treat EOL conventions
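The point above about UTF-8 itself being stateful under chunked input deserves a concrete illustration. A chunk boundary can fall in the middle of a multi-byte sequence, so an incremental decoder must carry the incomplete tail over to the next chunk. The sketch below is illustrative only — it assumes well-formed UTF-8 and does no validation of continuation bytes:

```haskell
import Data.Word (Word8)
import Data.Char (chr)
import Data.Bits (shiftL, (.&.), (.|.))

-- Length of the UTF-8 sequence introduced by a leading byte
-- (assuming the byte really is a valid leading byte).
seqLen :: Word8 -> Int
seqLen b | b < 0x80  = 1
         | b < 0xE0  = 2
         | b < 0xF0  = 3
         | otherwise = 4

-- Decode as much of a chunk as possible; an incomplete trailing
-- sequence is returned as leftover state for the next chunk.
decodeChunk :: [Word8] -> (String, [Word8])
decodeChunk = go
  where
    go [] = ("", [])
    go xs@(x:_)
      | length pre < n = ("", xs)          -- incomplete: carry over
      | otherwise      = let (cs, rest) = go post
                         in (decode1 pre : cs, rest)
      where
        n = seqLen x
        (pre, post) = splitAt n xs
    decode1 [b]      = chr (fromIntegral b)
    decode1 (b:cont) = chr (foldl step (fromIntegral b .&. mask (length cont)) cont)
      where
        mask 1 = 0x1F; mask 2 = 0x0F; mask _ = 0x07
        step acc c = shiftL acc 6 .|. (fromIntegral c .&. 0x3F)
    decode1 []       = error "decode1: empty sequence"
```

A pure [Word8] -> String conversion has no room for the leftover bytes, which is why the general conversion API (as with iconv) has to thread this state explicitly.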
Re: [Haskell-cafe] Writing binary files?
Glynn Clements [EMAIL PROTECTED] writes: [Actually, regarding on-screen display, this is also an issue for Unicode. How many people actually have all of the Unicode glyphs? I certainly don't.] If I don't have a particular character in fonts, I will not create files with it in filenames. Actually I only use 9 Polish letters in addition to ASCII, and even them rarely. Usually it's only a subset of ASCII. Some programs use UTF-8 in filenames no matter what the locale is. For example the Evolution mail program which stores mail folders as files under names the user entered in a GUI. I had to rename some of these files in order to import them to Gnus, as it choked on filenames with strange characters, never mind that it didn't display them correctly (maybe because it tried to map them to virtual newsgroup names, or maybe because they are control characters in ISO-8859-x). If all programs consistently used the locale encoding for filenames, this should have worked. When I switch my environment to UTF-8, which may happen in a few years, I will convert filenames to UTF-8 and set up mount options to translate vfat filenames to/from UTF-8 instead of to ISO-8859-2. I expect good programs to understand that and display them correctly no matter what technique they are using for the display. For example the Epiphany web browser, when I open the file:/home/users/qrczak URL, displays ISO-8859-2-encoded filenames correctly. The virtual HTML file it created from the directory listing has &#x105; in its title where the directory name had 0xB1 in ISO-8859-2. When I run Epiphany with the locale set to pl_PL.UTF-8, it displays UTF-8 filenames correctly and ISO-8859-2 filenames are not shown at all. It's fine for me that it doesn't deal with wrongly encoded filenames, because it allowed it to treat well-encoded filenames correctly. For a web page rendered on the screen it makes no sense to display raw bytes. Epiphany treats filenames as sequences of characters encoded according to the locale.
And even to the extent that it can be done, it will take a long time. Outside of the Free Software ghetto, long-term backward compatibility still means a lot. Windows has already switched most of its internals to Unicode, and it did it faster than Linux. In CLisp it fails silently (undecodable filenames are skipped), which is bad. It should fail loudly. No, it shouldn't fail at all. Since it uses Unicode as string representation, accepting filenames not encoded in the locale encoding would imply making garbage from filenames correctly encoded in the locale encoding. In a UTF-8 environment character U+00E1 in the filename means bytes 0xC3 0xA1 on ext2 filesystem (and 0x00E1 on vfat filesystem), so it can't at the same time mean 0xE1 on ext2 filesystem. And this is why I can't switch my home environment to UTF-8 yet. Too many programs are broken; almost all terminal programs which use more than stdin and stdout in default modes, i.e. which use line editing or work in full screen. How would you display a filename in a full screen text editor, such that it works in a UTF-8 environment? So, what are you suggesting? That the whole world switches to UTF-8? No, each computer system decides for itself, and announces it in the locale setting. I'm suggesting that programs should respect that and correctly handle all correctly encoded texts, including filenames. Better programs may offer to choose the encoding explicitly when it makes sense (e.g. text file editors for opening a file), but if they don't, they should at least accept the locale encoding. Or that every program should pass everything through iconv() (and handle the failures)? If it uses Unicode as internal string representation, yes (because the OS API on Unix generally uses byte encodings rather than Unicode). This should be done transparently in libraries of respective languages instead of in each program independently. 
A program is not supposed to encounter filenames which are not representable in the locale's encoding. Huh? What does supposed to mean in this context? That everything would be simpler if reality wasn't how it is? It means that if it encounters a filename encoded differently, it's usually not the fault of the program but of whoever caused the mismatch in the first place. In your setting it's impossible to display a filename in a way other than printing to stdout. Writing to stdout doesn't amount to displaying anything; stdout doesn't have to be a terminal. I know, it's not the point. The point is that other display channels than stdout connected to a terminal often work in terms of characters rather than bytes of some implicit encoding. For example various GUI frameworks, and wide character ncurses. Sure; but that doesn't automatically mean that the locale's encoding is correct for any given filename. The point is that you often don't need to know the encoding. What if I do need to know the encoding? I must assume
Re: [Haskell-cafe] Writing binary files?
Marcin 'Qrczak' Kowalczyk wrote: [Actually, regarding on-screen display, this is also an issue for Unicode. How many people actually have all of the Unicode glyphs? I certainly don't.] If I don't have a particular character in fonts, I will not create files with it in filenames. Actually I only use 9 Polish letters in addition to ASCII, and even them rarely. Usually it's only a subset of ASCII. But this seems to be assuming a closed world. I.e. the only files which the program will ever see are those which were created by you, or by others who are compatible with your conventions. Some programs use UTF-8 in filenames no matter what the locale is. For example the Evolution mail program which stores mail folders as files under names the user entered in a GUI. This is entirely reasonable for a file which a program creates. If a filename is just a string of bytes, a program can use whatever encoding it wants. I had to rename some of these files in order to import them to Gnus, as it choked on filenames with strange characters, never mind that it didn't display them correctly (maybe because it tried to map them to virtual newsgroup names, or maybe because they are control characters in ISO-8859-x). If it had just treated them as bytes, rather than trying to interpret them as characters, there wouldn't have been any problems. If all programs consistently used the locale encoding for filenames, this should have worked. But again, for this to work in general, you have to assume a closed world. When I switch my environment to UTF-8, which may happen in a few years, I will convert filenames to UTF-8 and set up mount options to translate vfat filenames to/from UTF-8 instead of to ISO-8859-2. But what about files which were created by other people, who don't use UTF-8? I expect good programs to understand that and display them correctly no matter what technique they are using for the display.
When it comes to display, you have to deal with encoding issues one way or another. But not all programs deal with display. For example the Epiphany web browser, when I open the file:/home/users/qrczak URL, displays ISO-8859-2-encoded filenames correctly. The virtual HTML file it created from the directory listing has &#x105; in its title where the directory name had 0xB1 in ISO-8859-2. When I run Epiphany with the locale set to pl_PL.UTF-8, it displays UTF-8 filenames correctly and ISO-8859-2 filenames are not shown at all. For many (probably most) programs, omitting such files would be an unacceptable failure. And even to the extent that it can be done, it will take a long time. Outside of the Free Software ghetto, long-term backward compatibility still means a lot. Windows has already switched most of its internals to Unicode, and it did it faster than Linux. Microsoft is actively hostile to both backwards compatibility and cross-platform compatibility. I consider the fact that some Unix (primarily Linux) developers seem equally hostile to be a problem. Having said that, with Linux developers, the issue is usually due to not being bothered. Assuming that everything is UTF-8 allows a lot of potential problems to be ignored. Fortunately, the problem is mostly consigned to the periphery, i.e. the desktop, where most programs have to deal with display issues (so you *have* to decode bytes into characters), and it isn't too critical if they have limitations. The core OS and network server applications essentially remain encoding-agnostic. In CLisp it fails silently (undecodable filenames are skipped), which is bad. It should fail loudly. No, it shouldn't fail at all. Since it uses Unicode as string representation, accepting filenames not encoded in the locale encoding would imply making garbage from filenames correctly encoded in the locale encoding.
In a UTF-8 environment character U+00E1 in the filename means bytes 0xC3 0xA1 on ext2 filesystem (and 0x00E1 on vfat filesystem), so it can't at the same time mean 0xE1 on ext2 filesystem. But, as I keep pointing out, filenames are byte strings, not character strings. You shouldn't be converting them to character strings unless you have to. And this is why I can't switch my home environment to UTF-8 yet. Too many programs are broken; almost all terminal programs which use more than stdin and stdout in default modes, i.e. which use line editing or work in full screen. How would you display a filename in a full screen text editor, such that it works in a UTF-8 environment? So, what are you suggesting? That the whole world switches to UTF-8? No, each computer system decides for itself, and announces it in the locale setting. I'm suggesting that programs should respect that and correctly handle all correctly encoded texts, including filenames. 1. Actually, each user decides which locale they wish to use. Nothing forces two users of a system to use the same locale. 2. Even if
Re: [Haskell-cafe] Writing binary files?
I modestly re-propose the I/O model which I first proposed last year: http://www.haskell.org/pipermail/haskell/2003-July/012312.html http://www.haskell.org/pipermail/haskell/2003-July/012313.html http://www.haskell.org/pipermail/haskell/2003-July/012350.html http://www.haskell.org/pipermail/haskell/2003-July/012352.html ... -- Ben
RE: [Haskell-cafe] Writing binary files?
On 15 September 2004 12:32, [EMAIL PROTECTED] wrote: On Mon, Sep 13, 2004 at 12:01:58PM +0100, Glynn Clements wrote: My view is that, right now, we have the worst of both worlds, and taking a short step backwards (i.e. narrow the Char type and leave the rest alone) is a lot simpler (and more feasible) than the long journey towards real I18N. This being Haskell, I can't imagine a consensus on a step backwards. In any case, a Char type distinct from bytes and the rest is the most valuable part of the current situation. The rest is just libraries, and the solution to that is to create other libraries. (It's true that the Prelude is harder to work around, but even that can be done, as with the new exception interface.) Indeed more than one approach can proceed concurrently, and that's probably what's going to happen: The Right Thing proceeds in stages: 1. new byte-based libraries 2. conversions sitting on top of these 3. the ultimate I18N API The Quick Fix: alter the existing implementation to use the encoding determined by the current locale at the borders. I wish I had some more time to work on this, but I implemented a prototype of an ultimate i18n API recently. This is derived from the API that Ben Rudiak-Gould proposed last year. It is in two layers: the InputStream/OutputStream classes provide raw byte I/O, and the TextStream class provides a conversion on top of that. The prototype uses iconv for conversions. You can make Streams from all sorts of things: files, sockets, pipes, and even Haskell arrays. InputStream and OutputStreams are just classes, so you can implement your own. IIRC, I managed to get it working with speed comparable to GHC's current IO library. Here's a tarball that works with GHC 6.2.1 on a Unix platform, just --make to build it: http://www.haskell.org/~simonmar/new-io.tar.gz If anyone would like to pick this up and run with it, I'd be delighted. I'm not likely to get back to it in the short term, at least.
Cheers, Simon
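The two-layer design described above — byte streams as type classes, with text conversion layered on top — can be illustrated in miniature. This is not Simon's actual new-io API, just a hedged sketch of the idea, including a stream backed by an ordinary Haskell list to show why "streams from all sorts of things" falls out of using classes:

```haskell
import Data.IORef
import Data.Word (Word8)
import Data.Char (chr)

-- Layer 1: raw byte input as a class, so files, sockets, pipes, or
-- plain Haskell data can all be stream sources.
class InputStream s where
  readByte :: s -> IO (Maybe Word8)   -- Nothing on end of stream

-- A stream backed by a Haskell list of bytes.
newtype ListStream = ListStream (IORef [Word8])

instance InputStream ListStream where
  readByte (ListStream r) = do
    bs <- readIORef r
    case bs of
      []      -> return Nothing
      (b:t)   -> writeIORef r t >> return (Just b)

-- Layer 2: a text operation parameterised by a decoder, standing in
-- for the TextStream/iconv layer.
readAllText :: InputStream s => s -> ([Word8] -> String) -> IO String
readAllText s dec = go []
  where
    go acc = do
      mb <- readByte s
      case mb of
        Nothing -> return (dec (reverse acc))
        Just b  -> go (b : acc)
```

The key property is that the byte layer knows nothing about encodings; the conversion is supplied from outside, which is what makes the layering clean.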
Re: [Haskell-cafe] Writing binary files?
Simon Marlow [EMAIL PROTECTED] writes: Here's a tarball that works with GHC 6.2.1 on a Unix platform, just --make to build it: http://www.haskell.org/~simonmar/new-io.tar.gz Found a bug already... In System/IO/Stream.hs, line 183:

streamReadBufrer s 0 buf = return 0
streamReadBuffer s len ptr = ...

Note the different spellings of the function name. Regards, Malcolm
Re: [Haskell-cafe] Writing binary files?
Graham Klyne wrote: In particular, the idea of narrowing the Char type really seems like a bad idea to me (if I understand the intent correctly). Not so long ago, I did a whole load of work on the HaXml parser so that, among other things, it would support UTF-8 and UTF-16 Unicode (as required by the XML spec). To do this depends upon having a Char type that can represent the full repertoire of Unicode characters. Note: I wasn't proposing doing away with wide character support altogether. Essentially, I was suggesting making Char a byte and having e.g. WideChar for wide characters. The reason being that the existing Haskell98 API uses Char for functions which are actually dealing with bytes. In an ideal world, the IO, System and Directory modules (and the Prelude I/O functions) would have used Byte, leaving Char to represent a (wide) character. However, that isn't the hand we've been dealt. -- Glynn Clements [EMAIL PROTECTED]
Re: [Haskell-cafe] Writing binary files?
Glynn Clements wrote: Marcin 'Qrczak' Kowalczyk wrote: [...] Note that this needs to include all of the core I/O functions, not just reading/writing streams. E.g. FilePath is currently an alias for String, but (on Unix, at least) filenames are strings of bytes, not characters. Ditto for argv, environment variables, possibly other cases which I've overlooked. I don't think so. They all are sequences of CChars, and C isn't particularly known for keeping bytes and chars apart. I believe, Windows NT has (alternate) filename handling functions that use Unicode strings. This would strengthen the view that a filename is a sequence of characters. Ditto for argv, env, whatnot; they are typically entered from the shell and therefore are characters in the local encoding. 3. The default encoding is settable from Haskell, defaults to ISO-8859-1. Agreed. Oh no, please don't do that. A global, settable encoding is, well, dys-functional. Hidden state makes programs hard to understand and Haskell imho shouldn't go that route. And please don't introduce the notion of a default encoding. I'd like to see the following: - Duplicate the IO library. The duplicate should work with [Byte] everywhere where the old library uses String. Byte is some suitable unsigned integer, on most (all?) platforms this will be Word8 - Provide an explicit conversion between encodings. A simple conversion of type [Word8] -> String would suit me, iconv would provide all that is needed. - iconv takes names of encodings as arguments. Provide some names as constants: one name for the internal encoding (probably UCS4), one name for the canonical external encoding (probably locale dependent). - Then redefine the old IO API in terms of the new API and appropriate conversions. While we're at it, do away with the annoying CR/LF problem on Windows, this should simply be part of the local encoding. This way file can always be opened as binary, hSetBinary can be dropped.
(This won't work on ancient platforms where text files and binary files are genuinely different, but these are probably not interesting anyway.) The same thoughts apply to filenames. Make them [Word8] and convert explicitly. By the way, I think a path should be a list of names (that is of type [[Word8]]) and the library would be concerned with putting in the right path separator. Add functions to read and show pathnames in the local conventions and we'll never need to worry about path separators again. There are limits to the extent to which this can be achieved. E.g. what happens if you set the encoding to UTF-8, then call getDirectoryContents for a directory which contains filenames which aren't valid UTF-8 strings? Well, then you did something stupid, didn't you? If you don't know the encoding you shouldn't decode anything. That's a strong point against any implicit decoding, I think. Also, if efficiency is a concern, lists probably shouldn't be passed between filesystem operations and iconv. I think, we need a better representation here (like PackedString for Word8), not a convoluted API. Regards, Udo. -- If Perl is the solution, you're solving the wrong problem. -- Erik Naggum
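The path-as-list-of-names proposal above is simple to sketch. The names `joinPath'` and `splitPath'` are hypothetical, and the separator is passed explicitly (0x2F is '/' on Unix; a Windows implementation would presumably use the backslash byte):

```haskell
import Data.Word (Word8)
import Data.List (intercalate)

type Name = [Word8]   -- one path component, as raw bytes
type Path = [Name]    -- a path as a list of components

-- Join components with the platform's separator byte.
joinPath' :: Word8 -> Path -> Name
joinPath' sep = intercalate [sep]

-- Split a byte-string pathname on the separator byte.
splitPath' :: Word8 -> Name -> Path
splitPath' sep = foldr step [[]]
  where
    step b acc@(a:as)
      | b == sep  = [] : acc      -- start a new component
      | otherwise = (b : a) : as
    step _ []     = error "splitPath': impossible"
```

With reading and showing confined to these two functions, the rest of a program never sees a separator at all, which is the "never worry about path separators again" property being proposed.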
Re: [Haskell-cafe] Writing binary files?
David Menendez wrote: I'd like to see the following: - Duplicate the IO library. The duplicate should work with [Byte] everywhere the old library uses String. Byte is some suitable unsigned integer; on most (all?) platforms this will be Word8. - Provide an explicit conversion between encodings. A simple conversion of type [Word8] -> String would suit me; iconv would provide all that is needed. I like this idea, but I say there should be a bit-oriented layer beneath everything. The byte stream is inherent, as that's (usually) what the OS gives you. Everything else is synthesised. -- Glynn Clements [EMAIL PROTECTED]
Re: [Haskell-cafe] Writing binary files?
Glynn Clements writes: David Menendez wrote: I'd like to see the following: - Duplicate the IO library. The duplicate should work with [Byte] everywhere the old library uses String. Byte is some suitable unsigned integer; on most (all?) platforms this will be Word8. - Provide an explicit conversion between encodings. A simple conversion of type [Word8] -> String would suit me; iconv would provide all that is needed. I like this idea, but I say there should be a bit-oriented layer beneath everything. The byte stream is inherent, as that's (usually) what the OS gives you. Everything else is synthesised. I was unclear. I meant the bit layer would be beneath everything conceptually. On today's machines, it would be implemented in terms of a byte stream and the conversion to the byte stream type would get compiled away. -- David Menendez [EMAIL PROTECTED] http://www.eyrie.org/~zednenem/
Re: [Haskell-cafe] Writing binary files?
Glynn Clements [EMAIL PROTECTED] writes: Right now, the attempt at providing I18N for free, by defining Char to mean Unicode, has essentially backfired, IMHO. Anything that isn't ISO-8859-1 just doesn't work for the most part, and anyone who wants Basically, I'm inclined to agree with what you say. A minor point is that the other ISO-8859 encodings (or really, any single-byte encoding) works equally well, as long as you don't want to mix them. So I guess you really want to say Anything that isn't a single-byte encoding... (Except for string constants, I guess, but perhaps you could just use its byte representation in the source? The length could be slightly surprising, though.) -kzm -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Writing binary files?
Marcin 'Qrczak' Kowalczyk wrote: 1. API for manipulating byte sequences in I/O (without representing them in String type). Note that this needs to include all of the core I/O functions, not just reading/writing streams. E.g. FilePath is currently an alias for String, but (on Unix, at least) filenames are strings of bytes, not characters. Ditto for argv, environment variables, possibly other cases which I've overlooked. They don't hold binary data; they hold data intended to be interpreted as text. No. They frequently hold data intended to be passed to system functions which interpret them simply as bytes, without regard to encoding. If the encoding of the text doesn't agree with the locale, the environment setup is broken and 'ls' and 'env' misbehave on an UTF-8 terminal. ls and env just write bytes to stdout (which may or may not refer to the terminal). A particular terminal may not display them correctly, but that's a separate issue. Unless you are the sole user of a system, you have no control over what filenames may occur on it (and even if you are the sole user, you may wish to use packages which don't conform to your rules). As environment variables frequently contain pathnames, this fact may get propagated to the environment (however, system directories are usually restricted to ASCII, so this aspect is less likely to be an issue). A program can explicitly set the default encoding to ISO-8859-1 if it wishes to do something in a broken environment. 4. Libraries are reviewed to ensure that they work with various encoding settings. There are limits to the extent to which this can be achieved. E.g. what happens if you set the encoding to UTF-8, then call getDirectoryContents for a directory which contains filenames which aren't valid UTF-8 strings? The library fails. Don't do that. This environment is internally inconsistent. Call it what you like, it's a reality, and one which programs need to deal with. 
Most programs don't care whether any filenames which they deal with are valid in the locale's encoding (or any other encoding). They just receive lists (i.e. NUL-terminated arrays) of bytes and pass them directly to the OS or to libraries. I feel that the default encoding should be one whose decoder cannot fail, e.g. ISO-8859-1. But filenames on my filesystem and most file contents are *not* encoded in ISO-8859-1. Assuming that they are ISO-8859-1 is plainly wrong. For the most part, assuming that they are encoded in *any* coding system is wrong. However, if you treat them as ISO-8859-* (it doesn't matter which one, so long as you're consistent), the Haskell I/O functions will at least pass them through unmodified. Consider a trivial cp program:

main = do
    [src, dst] <- getArgs
    text <- readFile src
    writeFile dst text

If the assumed encoding is ISO-8859-*, this program will work regardless of the filenames which it is passed or the contents of the file (modulo the EOL translation on Windows). OTOH, if it were to use UTF-8 (e.g. because that was the locale's encoding), it wouldn't work correctly if either filename or the file's contents weren't valid UTF-8. You should have to explicitly request the use of the locale's encoding (analogous to calling setlocale(LC_CTYPE, "") at the start of a C program; there's a good reason why C doesn't do this without being explicitly told to). C usually uses the paradigm of representing texts in their original 8-bit encodings. This is why getting C programs to work in a UTF-8 locale is such a pain. Only some programs use wchar_t internally. Many C programs don't care about encodings. It's only if you actually have to interpret the bytes (e.g. ctype.h, strcoll) that encodings start to matter. At which point, you have to know the encoding. Java and C# use the paradigm of representing text in Unicode internally, recoding it on boundaries with the external world.
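The trivial cp above can be made byte-transparent today by opening both files in binary mode, which also removes the Windows EOL caveat. A hedged sketch (copyBinary is a hypothetical helper, not a library function; it assumes current implementation behaviour of passing each byte through the low 8 bits of a Char unmodified):

```haskell
import System.IO

-- Copy a file byte-for-byte: binary handles avoid CR/LF translation,
-- and no encoding is ever assumed or decoded.
copyBinary :: FilePath -> FilePath -> IO ()
copyBinary src dst = do
  hIn  <- openBinaryFile src ReadMode
  hOut <- openBinaryFile dst WriteMode
  hGetContents hIn >>= hPutStr hOut  -- lazy read, forced by the write
  hClose hOut
  hClose hIn
```

A main wrapper would just do `[src, dst] <- getArgs` and call `copyBinary src dst`; the point is that such a program never needs to know what encoding, if any, the file contents are in.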
The second paradigm has a cost that you must be aware what encodings are used in texts you manipulate. And that cost can be pretty high; e.g. gratuitously failing in the case where you have no idea which encoding is used but where you shouldn't actually need to know. Locale gives a reasonable default for simple programs which aren't supposed to work with multiple encodings, and it specifies the encoding of texts which don't have an encoding specified elsewhere (terminal I/O, filenames, environment variables). More accurately, it specifies which encoding to assume when you *need* to know the encoding (i.e. ctype.h etc.), but you can't obtain that information from a more reliable source. My central point is that the existing API forces the encoding to be an issue when it shouldn't be. ncurses' wide character API is still broken. I reported bugs, the author acknowledged them, but hasn't fixed them. (Attributes are ignored on add_wch; get_wch is wrong for
Re: [Haskell-cafe] Writing binary files?
Glynn Clements [EMAIL PROTECTED] writes: 1. API for manipulating byte sequences in I/O (without representing them in String type). Note that this needs to include all of the core I/O functions, not just reading/writing streams. E.g. FilePath is currently an alias for String, but (on Unix, at least) filenames are strings of bytes, not characters. Ditto for argv, environment variables, possibly other cases which I've overlooked. They don't hold binary data; they hold data intended to be interpreted as text. If the encoding of the text doesn't agree with the locale, the environment setup is broken and 'ls' and 'env' misbehave on an UTF-8 terminal. A program can explicitly set the default encoding to ISO-8859-1 if it wishes to do something in a broken environment. 4. Libraries are reviewed to ensure that they work with various encoding settings. There are limits to the extent to which this can be achieved. E.g. what happens if you set the encoding to UTF-8, then call getDirectoryContents for a directory which contains filenames which aren't valid UTF-8 strings? The library fails. Don't do that. This environment is internally inconsistent. I feel that the default encoding should be one whose decoder cannot fail, e.g. ISO-8859-1. But filenames on my filesystem and most file contents are *not* encoded in ISO-8859-1. Assuming that they are ISO-8859-1 is plainly wrong. You should have to explicitly request the use of the locale's encoding (analogous to calling setlocale(LC_CTYPE, ) at the start of a C program; there's a good reason why C doesn't do this without being explicitly told to). C usually uses the paradigm of representing text in their original 8-bit encodings. This is why getting C programs to work in a UTF-8 locale is such a pain. Only some programs use wchar_t internally. Java and C# uses the paradigm of representing text in Unicode internally, recoding it on boundaries with the external world. 
The second paradigm has a cost that you must be aware what encodings are used in texts you manipulate. Locale gives a reasonable default for simple programs which aren't supposed to work with multiple encodings, and it specifies the encoding of texts which don't have an encoding specified elsewhere (terminal I/O, filenames, environment variables). It also has benefits: 1. It's easier to work with multiple encodings, because the internal representation can represent text decoded from any of them and is the same in all places of the program. 2. It's much easier to work in a UTF-8 environment, and to work with libraries which use Unicode internally (e.g. Gtk+ or Qt). 3. isAlpha, toUpper etc. are true pure functions. (Haskell API is broken in a different way here: toUpper should be defined in terms of strings, not characters.) Actually, the more I think about it, the more I think that simple, stupid programs probably shouldn't be using Unicode at all. This attitude causes them to break in a UTF-8 environment, which is why I can't use it as a default yet. ncurses wide character API is still broken. I reported bugs, the author acknowledged them, but hasn't fixed them. (Attributes are ignored on add_wch; get_wch is wrong for non-ASCII keys pressed if the locale is different from ISO-8859-1 and UTF-8.) It seems people don't use that API yet, because C traditionally uses the model of representing texts in byte sequences. But the narrow character API of ncurses is unusable with UTF-8 - this is not an implementation limitation but inherent limitation of the interface. I.e. Char, String, string literals, and the I/O functions in Prelude, IO etc should all be using bytes, with a distinct wide-character API available for people who want to make the (substantial) effort involved in writing (genuinely) internationalised programs. This would cause excessive duplication of APIs. Look, Java and C# don't do that. 
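Marcin's aside that toUpper should be defined on strings rather than characters is easy to demonstrate: some case mappings change the length of the string, which no Char -> Char function can express. A sketch (upcaseStr is a made-up name; the special case shown is German 'ß', whose uppercase form is "SS"):

```haskell
import Data.Char (toUpper)

-- String-level uppercasing: 'ß' maps to the two-character "SS",
-- so the result can be longer than the input. A Char -> Char
-- toUpper is forced to leave 'ß' unchanged.
upcaseStr :: String -> String
upcaseStr = concatMap up
  where
    up 'ß' = "SS"
    up c   = [toUpper c]
```

This is only the simplest case; full Unicode case mapping is also locale-sensitive (e.g. Turkish dotless i), which strengthens the argument that it doesn't belong in a per-character function.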
Only file contents handling needs a byte API, because many files don't contain text. This would imply isAlpha :: Char - IO Bool. Right now, the attempt at providing I18N for free, by defining Char to mean Unicode, has essentially backfired, IMHO. Because it needs to be accompanied with character recoders, both invoked explicitly (also lazily) and attached to file handles, and with a way to obtain recoders for various encodings. Assuming that the same encoding is used everywhere and programs can just copy bytes without interpreting them no longer works today. A mail client is expected to respect the encoding set in headers. Oh, and because bytes are being stored in Chars, the type system won't help if you neglect to decode a string, or if you decode it twice. This is why I said 1. API for manipulating byte sequences in I/O (without representing them in String type). 2. If you assume ISO-8859-1, you can always convert back to Word8 then re-decode as UTF-8. If you assume UTF-8, anything which is neither UTF-8 nor ASCII will fail far more
Re: [Haskell-cafe] Writing binary files?
Glynn Clements wrote: Actually, the more I think about it, the more I think that simple, stupid programs probably shouldn't be using Unicode at all. I.e. Char, String, string literals, and the I/O functions in Prelude, IO etc should all be using bytes, with a distinct wide-character API available for people who want to make the (substantial) effort involved in writing (genuinely) internationalised programs. I have become very sympathetic to this viewpoint. In particular, it is nigh-impossible to just move bits into and out of a Haskell program. There certainly isn't a simple, portable way to do so. In the absence of such a mechanism, there are *no* *portable* *workarounds*. The absence of real internationalization is galling, but not nearly as galling as the absence of simple, stupid, reproducible I/O facilities. Forget even ISO-8859-1. Just give me bytes. Personally, I would take the C approach: redefine Char to mean a byte (i.e. CChar), treat string literals as bytes, keep the existing type signatures on all of the existing Haskell98 functions, and provide a completely new wide-character API for those who wish to use it. Much as we may hate to admit it, this would probably just work in 99% of all cases, and would make it a simple exercise to build working solutions. Given the frequency with which this issue crops up, and the associated lack of action to date, I'd rather not have to wait until someone finally gets around to designing the new, improved, genuinely-I18N-ised API before we can read/write arbitrary files without too much effort. This is simple stuff, and far more basic than the details of text representation. Why must it be so hard? And why must we drag students through the muck of allocating and managing explicit byte buffers and (God forbid) understanding the FFI just to get a few bytes on and off disk without the system monkeying with them en route? Simplify. Please.
-Jan-Willem Maessen
Re: [Haskell-cafe] Writing binary files?
Glynn Clements writes: The problem is that API for that yet is not even designed, so programs can't be written such that they will work after the default encoding change. Personally, I would take the C approach: redefine Char to mean a byte (i.e. CChar), treat string literals as bytes, keep the existing type signatures on all of the existing Haskell98 functions, and provide a completely new wide-character API for those who wish to use it. I really don't like having a type called Char that's defined to be a byte, rather than a character. On the other hand, I don't know how much of a pain it would be to replace Char with Word8 in the file IO library, which would be my preferred temporary solution. Given the frequency with which this issue crops up, and the associated lack of action to date, I'd rather not have to wait until someone finally gets around to designing the new, improved, genuinely-I18N-ised API before we can read/write arbitrary files without too much effort. Any I18N-ized API would need a bit-level layer underneath, right? In fact, a good low-level IO library could support multiple higher-level APIs. Has there been any progress on something like that? -- David Menendez [EMAIL PROTECTED] | In this house, we obey the laws http://www.eyrie.org/~zednenem |of thermodynamics! ___ Haskell-Cafe mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Writing binary files?
Glynn Clements wrote: [...]

main :: IO ()
main = do
    h <- openBinaryFile "out.dat" WriteMode
    hPutStr h $ map (octetToChar . bitsToOctet) bits
    hClose h

Hmmm, using string I/O when one really wants to do binary I/O gives me a bad feeling. Haskell characters are defined to be Unicode characters, so the above only works because current Haskell implementations usually get this wrong (either no Unicode support at all and/or ignoring any encodings and doing I/O only with the lower 8 bits of the characters)... hGetBuf/hPutBuf plus their non-blocking variants are the only way to *really* do binary I/O currently. Cheers, S.
Re: [Haskell-cafe] Writing binary files?
Abraham Egnor wrote: Passing a Ptr isn't that onerous; it's easy enough to make functions that have the signature you'd like:

import System.IO
import Data.Word (Word8)
import Foreign.Marshal.Array

hPutBytes :: Handle -> [Word8] -> IO ()
hPutBytes h ws = withArray ws $ \p -> hPutBuf h p $ length ws

hGetBytes :: Handle -> Int -> IO [Word8]
hGetBytes h c = allocaArray c $ \p -> do
    c' <- hGetBuf h p c
    peekArray c' p

The problem with this approach is that the entire array has to be held in memory, which could be an issue if the amount of data involved is large. -- Glynn Clements [EMAIL PROTECTED]
Re: [Haskell-cafe] Writing binary files?
Glynn Clements wrote: The problem with this approach is that the entire array has to be held in memory, which could be an issue if the amount of data involved is large. Simple reasoning: If the amount of data is large, you don't want the overhead of lists because it kills performance. If the amount of data is small, you can easily use similar code to read/write a single byte. :-) Of course things are a bit different when you are in the blissful position where lazy I/O is what you want. This implies that you expect a stream of data of a single type. How often is this really the case? And I'm not sure if this the correct way of doing things even when the data involved wouldn't fit into memory all at once. I'd prefer something mmap-like then... Cheers, S. ___ Haskell-Cafe mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell-cafe
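One way to get bounded memory use without lazy I/O is to keep hGetBuf/hPutBuf but reuse a single fixed buffer. A sketch (copyChunked and the 8 KiB buffer size are my own choices for illustration, not an existing library function):

```haskell
import System.IO
import Foreign.Marshal.Alloc (allocaBytes)

-- Stream between two handles through one fixed-size buffer, so the
-- resident data never exceeds bufSize bytes regardless of file size.
copyChunked :: Handle -> Handle -> IO ()
copyChunked hIn hOut = allocaBytes bufSize loop
  where
    bufSize = 8192
    loop p = do
      n <- hGetBuf hIn p bufSize   -- returns 0 at end of file
      if n == 0
        then return ()
        else hPutBuf hOut p n >> loop p
```

This handles the "data too large for memory" case that kills the [Word8]-list approach, at the cost of working with raw pointers internally; the pointer never escapes the helper, though, so callers see an ordinary Handle -> Handle action.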
Re: [Haskell-cafe] Writing binary files?
Sven Panne wrote: Also, changing the existing functions to deal with encodings is likely to break a lot of things (i.e. anything which reads or writes data which is in neither UTF-8 nor the locale-specified encoding). Hmmm, the Unicode tables start with ISO-Latin-1, so what would exactly break when we stipulate that the standard encoding for string I/O in Haskell is ISO-Latin-1? That would essentially be formally specifying the existing behaviour, which wouldn't break anything, including the mechanism for reading/writing binary data which I suggested (and which is the only choice if your Haskell implementation doesn't have h{Get,Put}Buf). The problems would come if it was decided to change the existing behaviour, i.e. use something other than Latin1. -- Glynn Clements [EMAIL PROTECTED] ___ Haskell-Cafe mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Writing binary files?
Sven Panne [EMAIL PROTECTED] writes: Hmmm, the Unicode tables start with ISO-Latin-1, so what would exactly break when we stipulate that the standard encoding for string I/O in Haskell is ISO-Latin-1? Additional encodings could be specified e.g. via a new open variant. That the encoding of most file contents is not ISO-Latin-1 in practice. The locale mechanism specifies a default. It's also a default for other things: filenames (on Unix), program invocation arguments, environment variables etc. Some other places have an encoding hardwired (e.g. Gtk+ uses UTF-8 and Qt uses UTF-16), and yet others have it specified as a part of the protocol (email, usenet, WWW). Unfortunately changing a Haskell implementation to actually convert between the external encodings and Unicode must be done in all those places at once, otherwise there will be mismatches and e.g. printing program invocation arguments to a file will have a wrong effect. Most Haskell programs currently work because they misuse Chars to represent characters in the implicit default encoding. As long as they don't use isAlpha or toUpper on non-ASCII characters, and as long as they don't try to support several encodings at once. These two paradigms: A. Represent strings using their original encoding. B. Use Unicode internally, convert it at the boundaries. should not be mixed in one string type, or confusion will arise. For at least some of these places, e.g. file contents or socket data, a program must have a way to specify a different encoding, and also to manipulate raw bytes without recoding. But the default encoding should come from the locale instead of being ISO-8859-1. A Char value should always mean a Unicode code point and not e.g. an ISO-8859-2-coded value. This is the B paradigm and it must be applied consistently. I did this for my language http://kokogut.sourceforge.net/ and it works. Only some things are hard, e.g. 
reading a file whose encoding is specified inside it (trying to apply the default encoding might fail, even if the text before the encoding name is all ASCII, because of buffering); it's possible but needs care. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/
Re: [Haskell-cafe] Writing binary files?
Marcin 'Qrczak' Kowalczyk wrote: But the default encoding should come from the locale instead of being ISO-8859-1. The problem with that is that, if the locale's encoding is UTF-8, a lot of stuff is going to break (i.e. anything in ISO-8859-* which isn't limited to the 7-bit ASCII subset). The advantage of assuming ISO-8859-* is that the decoder can't fail; every possible stream of bytes is valid. This isn't the case for UTF-8. The advantage of ISO-8859-1 in particular is that it's trivial to convert the string back into the bytes which were actually read. The key problem with using the locale is that you frequently encounter files which aren't in the locale's encoding, and for which the encoding can't easily be deduced. If you assume ISO-8859-*, you can at least read them in, manipulate the contents (in any way that doesn't require interpreting any non-ASCII characters), and write out the results. OTOH, if you assume UTF-8 (e.g. because that happens to be the locale's encoding), the decoder is likely to abort shortly after the first non-ASCII character it finds (either that, or it will just silently drop characters). -- Glynn Clements [EMAIL PROTECTED] ___ Haskell-Cafe mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Writing binary files?
Glynn Clements [EMAIL PROTECTED] writes: But the default encoding should come from the locale instead of being ISO-8859-1. The problem with that is that, if the locale's encoding is UTF-8, a lot of stuff is going to break (i.e. anything in ISO-8859-* which isn't limited to the 7-bit ASCII subset). What about this transition path: 1. API for manipulating byte sequences in I/O (without representing them in String type). 2. API for conversion between explicitly specified encodings and byte sequences, including attaching converters to Handles. There is also a way to obtain the locale encoding. 3. The default encoding is settable from Haskell, defaults to ISO-8859-1. 4. Libraries are reviewed to ensure that they work with various encoding settings. 5. The default encoding is settable from Haskell, defaults to the locale encoding. Points 1-3 don't change the behavior of existing programs, but they allow to start writing libraries and programs which manipulate something other than texts in the default encoding and will work in future. After relevant libraries work with the default encoding changed, programs which use them may begin their main function with setting the default encoding to the locale encoding. Finally, when we consider libraries and programs which break in this setting obsolete, the default is changed. The advantage of assuming ISO-8859-* is that the decoder can't fail; every possible stream of bytes is valid. Assuming. But UTF-8 is not ISO-8859-*. When someday I change most of my files and filenames from ISO-8859-2 to UTF-8, and change the locale, the assumption will be wrong. I can't change that now, because too many programs would break. The current ISO-8859-1 assumption is also wrong. A program written in Haskell which sorts strings would break for non-ASCII letters even now that they are ISO-8859-2 unless specified otherwise. 
The key problem with using the locale is that you frequently encounter files which aren't in the locale's encoding, and for which the encoding can't easily be deduced. Programs should either explicitly set the encoding for I/O on these files to ISO-8859-1, or manipulate them as binary data. The problem is that API for that yet is not even designed, so programs can't be written such that they will work after the default encoding change. OTOH, if you assume UTF-8 (e.g. because that happens to be the locale's encoding), the decoder is likely to abort shortly after the first non-ASCII character it finds (either that, or it will just silently drop characters). Detectable errors should not be automatically silenced, so it would fail. So the change to the default encoding must be done some time after it's possible to write programs which would not fail. -- __( Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/ ___ Haskell-Cafe mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Writing binary files?
Marcin 'Qrczak' Kowalczyk wrote: But the default encoding should come from the locale instead of being ISO-8859-1. The problem with that is that, if the locale's encoding is UTF-8, a lot of stuff is going to break (i.e. anything in ISO-8859-* which isn't limited to the 7-bit ASCII subset). What about this transition path: 1. API for manipulating byte sequences in I/O (without representing them in String type). Note that this needs to include all of the core I/O functions, not just reading/writing streams. E.g. FilePath is currently an alias for String, but (on Unix, at least) filenames are strings of bytes, not characters. Ditto for argv, environment variables, possibly other cases which I've overlooked. 2. API for conversion between explicitly specified encodings and byte sequences, including attaching converters to Handles. There is also a way to obtain the locale encoding. 3. The default encoding is settable from Haskell, defaults to ISO-8859-1. Agreed. 4. Libraries are reviewed to ensure that they work with various encoding settings. There are limits to the extent to which this can be achieved. E.g. what happens if you set the encoding to UTF-8, then call getDirectoryContents for a directory which contains filenames which aren't valid UTF-8 strings? 5. The default encoding is settable from Haskell, defaults to the locale encoding. I feel that the default encoding should be one whose decoder cannot fail, e.g. ISO-8859-1. You should have to explicitly request the use of the locale's encoding (analogous to calling setlocale(LC_CTYPE, ) at the start of a C program; there's a good reason why C doesn't do this without being explicitly told to). Actually, the more I think about it, the more I think that simple, stupid programs probably shouldn't be using Unicode at all. I.e. 
Char, String, string literals, and the I/O functions in Prelude, IO etc should all be using bytes, with a distinct wide-character API available for people who want to make the (substantial) effort involved in writing (genuinely) internationalised programs. Right now, the attempt at providing I18N for free, by defining Char to mean Unicode, has essentially backfired, IMHO. Anything that isn't ISO-8859-1 just doesn't work for the most part, and anyone who wants to provide real I18N first has to work around the pseudo-I18N that's already there (e.g. convert Chars back into Word8s so that they can decode them into real Chars). Oh, and because bytes are being stored in Chars, the type system won't help if you neglect to decode a string, or if you decode it twice. The advantage of assuming ISO-8859-* is that the decoder can't fail; every possible stream of bytes is valid. Assuming. But UTF-8 is not ISO-8859-*. When someday I change most of my files and filenames from ISO-8859-2 to UTF-8, and change the locale, the assumption will be wrong. I can't change that now, because too many programs would break. The current ISO-8859-1 assumption is also wrong. A program written in Haskell which sorts strings would break for non-ASCII letters even now that they are ISO-8859-2 unless specified otherwise. 1. In that situation, you can't avoid the encoding issues. It doesn't matter what the default is, because you're going to have to set the encoding anyhow. 2. If you assume ISO-8859-1, you can always convert back to Word8 then re-decode as UTF-8. If you assume UTF-8, anything which is neither UTF-8 nor ASCII will fail far more severely than just getting the collation order wrong. The key problem with using the locale is that you frequently encounter files which aren't in the locale's encoding, and for which the encoding can't easily be deduced. Programs should either explicitly set the encoding for I/O on these files to ISO-8859-1, or manipulate them as binary data. 
Well, my view is essentially that files should be treated as containing bytes unless you explicitly choose to decode them, at which point you have to specify the encoding. The problem is that API for that yet is not even designed, so programs can't be written such that they will work after the default encoding change. Personally, I would take the C approach: redefine Char to mean a byte (i.e. CChar), treat string literals as bytes, keep the existing type signatures on all of the existing Haskell98 functions, and provide a completely new wide-character API for those who wish to use it. That gets the failed attempt at I18N out of everyone's way with a minimum of effort and with maximum backwards compatibility for existing code. Given the frequency with which this issue crops up, and the associated lack of action to date, I'd rather not have to wait until someone finally gets around to designing the new, improved, genuinely-I18N-ised API before we can read/write arbitrary files without too much effort. My main concern is that someone will get sick of waiting and make the wrong fix, i.e. keep the
[Haskell-cafe] Writing binary files?
Hi, I would like to write and read binary files in Haskell. I saw the System.IO module, but you need a (Ptr a) value for using that, and I don't need positions. I only want to read a complete binary file and write another binary file. In 2001 somebody else came up with the same subject, but then there wasn't a real solution. Now, 3 years later, I can imagine there's *something*. What's that *something*? Regards, Ron
Re: [Haskell-cafe] Writing binary files?
There's a Binary module that comes with GHC that you can get somewhere (I believe Simon M wrote it). I have hacked it up a bit and added support for bit-based writing, to bring it more in line with the NHC module. Mine, with various information, etc., is available at: http://www.isi.edu/~hdaume/haskell/NewBinary/ On Sat, 11 Sep 2004, Ron de Bruijn wrote: Hi, I would like to write and read binary files in Haskell. I saw the System.IO module, but you need a (Ptr a) value for using that, and I don't need positions. I only want to read a complete binary file and write another binary file. In 2001 somebody else came up with the same subject, but then there wasn't a real solution. Now, 3 years later, I can imagine there's *something*. What's that *something*? Regards, Ron -- Hal Daume III | [EMAIL PROTECTED] Arrest this man, he talks in maths. | www.isi.edu/~hdaume
Re: [Haskell-cafe] Writing binary files?
Ron de Bruijn wrote: I would like to write and read binary files in Haskell. I saw the System.IO module, but you need a (Ptr a) value for using that, and I don't need positions. I only want to read a complete binary file and write another binary file.

You just need to open the files with System.IO.openBinaryFile instead of openFile (files opened with the latter will have automatic LF/CRLF translation on Windows). Converting between Chars and octets (i.e. Word8) is just a matter of:

    import Char (ord, chr)
    import Word (Word8)

    charToOctet :: Char -> Word8
    charToOctet = fromIntegral . ord

    octetToChar :: Word8 -> Char
    octetToChar = chr . fromIntegral

-- Glynn Clements [EMAIL PROTECTED]
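Putting Glynn's two pieces together, a minimal round-trip sketch might look like the following. The file name "bytes.dat" is my own choice, and the sketch assumes GHC's behaviour that on a handle opened with openBinaryFile each Char is written as its low-order byte (all values here fit in one byte, so nothing is truncated):

```haskell
import Data.Char (ord, chr)
import Data.Word (Word8)
import System.IO

charToOctet :: Char -> Word8
charToOctet = fromIntegral . ord

octetToChar :: Word8 -> Char
octetToChar = chr . fromIntegral

main :: IO ()
main = do
  -- write three raw bytes via Char, with no newline translation
  h <- openBinaryFile "bytes.dat" WriteMode
  hPutStr h (map octetToChar [0x00, 0x7f, 0xff])
  hClose h
  -- read them back and recover the original octets
  h' <- openBinaryFile "bytes.dat" ReadMode
  s  <- hGetContents h'
  print (map charToOctet s)   -- [0,127,255]
  hClose h'
```

Note that print fully forces the lazy hGetContents result before the handle is closed, which is what makes the early hClose safe here.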
Re: [Haskell-cafe] Writing binary files?
Hal Daume III wrote: There's a Binary module that comes with GHC that you can get somewhere (I believe Simon M wrote it). I have hacked it up a bit and added support for bit-based writing, to bring it more in line with the NHC module. Mine, with various information, etc., is available at: http://www.isi.edu/~hdaume/haskell/NewBinary/ Hmmm, I'm not sure if that is what Ron asked for. What I guess is needed is support for things like: read the next 4 bytes as a little-endian unsigned integer; read the next 8 bytes as a big-endian IEEE 754 double; write the Int16 as a little-endian signed integer; write the (StorableArray Int Int32) as big-endian signed integers; ... plus perhaps some String I/O with a few encodings. Alas, we do *not* have something in our standard libs, although there were a few discussions about it. I know that one can debate ages about byte orders, external representation of arrays and floats, etc. Nevertheless, I guess covering only little-/big-endian systems, IEEE 754 floats, and arrays as a simple 0-based sequence of their elements (with an explicit length stated somehow) would make at least 90% of all users happy and would be sufficient for most real world file formats. Currently one is bound to hGetBuf/hPutBuf, which is not really a comfortable way of doing binary I/O. Cheers, S.
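The kind of primitive Sven describes can be sketched in a few lines over the Char-per-byte representation used elsewhere in this thread. The names getWord32le/putWord32le are my own (loosely modelled on the NHC/NewBinary style), not an existing library API:

```haskell
import Data.Bits (shiftL, shiftR, (.|.), (.&.))
import Data.Char (chr, ord)
import Data.Word (Word32)

-- Interpret four Chars (each holding one byte) as a
-- little-endian Word32: the first byte is least significant.
getWord32le :: [Char] -> Word32
getWord32le cs = foldr step 0 (take 4 cs)
  where step c acc = (acc `shiftL` 8) .|. fromIntegral (ord c)

-- The inverse: emit the four low-order bytes, least significant first.
putWord32le :: Word32 -> [Char]
putWord32le w = [ chr (fromIntegral ((w `shiftR` s) .&. 0xff)) | s <- [0, 8, 16, 24] ]

main :: IO ()
main = print (getWord32le (putWord32le 0x01020304))   -- 16909060 (= 0x01020304)
```

A big-endian variant only differs in the byte order, i.e. foldl instead of foldr (and the shift list reversed), which is exactly why a standard library covering both would be so convenient.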
Re: [Haskell-cafe] Writing binary files?
--- Sven Panne [EMAIL PROTECTED] wrote: [...] Basically, I just want to have a function that converts a list with zeros and ones to a binary file, and the other way around. If I write such a list of eight bits to a file naively, it takes 8 bytes. But I want it to take 8 bits.
Re: [Haskell-cafe] Writing binary files?
Ron de Bruijn wrote: Basically, I just want to have a function that converts a list with zeros and ones to a binary file, and the other way around. If I write such a list of eight bits to a file naively, it takes 8 bytes. But I want it to take 8 bits.

    import Char (digitToInt, chr)
    import Word (Word8)
    import System.IO (openBinaryFile)
    import IO (IOMode(..), hPutStr, hClose)

    bitsToOctet :: [Char] -> Word8
    bitsToOctet ds = fromIntegral $ sum $ zipWith (*) powers digits
      where powers = [2^n | n <- [7,6..0]]
            digits = map digitToInt ds

    octetToChar :: Word8 -> Char
    octetToChar = chr . fromIntegral

    bits :: [[Char]]
    bits = [ "01101000"  -- 0x68 'h'
           , "01100101"  -- 0x65 'e'
           , "01101100"  -- 0x6c 'l'
           , "01101100"  -- 0x6c 'l'
           , "01101111"  -- 0x6f 'o'
           , "00001010"  -- 0x0a '\n'
           ]

    main :: IO ()
    main = do
      h <- openBinaryFile "out.dat" WriteMode
      hPutStr h $ map (octetToChar . bitsToOctet) bits
      hClose h

-- Glynn Clements [EMAIL PROTECTED]
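For the "other way around" half of Ron's request, a decoder is just as short. This self-contained sketch (octetToBits is my own name, using Data.Bits.testBit) checks the round trip on a pure value rather than touching the file:

```haskell
import Data.Bits (testBit)
import Data.Char (digitToInt)
import Data.Word (Word8)

-- Pack eight '0'/'1' characters into one byte, most significant bit first.
bitsToOctet :: [Char] -> Word8
bitsToOctet = fromIntegral . foldl (\acc d -> acc * 2 + digitToInt d) 0

-- The inverse: render one byte as eight '0'/'1' characters, MSB first.
octetToBits :: Word8 -> [Char]
octetToBits w = [ if testBit w n then '1' else '0' | n <- [7,6..0] ]

main :: IO ()
main = print (octetToBits (bitsToOctet "01101000"))   -- "01101000"
```

Applied to a file read back through openBinaryFile, mapping octetToBits over the recovered Word8s reconstructs exactly the bit strings that were written.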