Glynn Clements <[EMAIL PROTECTED]> writes:

> What I'm suggesting in the above is to sidestep the encoding issue
> by keeping filenames as byte strings wherever possible.
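(As an aside, the byte-preserving trick this thread keeps circling — decode
each byte as the Latin-1 codepoint with the same number, so that encoding it
back is the identity and no byte sequence is ever rejected — fits in a few
lines of Haskell. This is only a sketch; the function names are mine, not
part of any proposed API:)

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Latin-1 maps each byte 0..255 to the Unicode codepoint with the
-- same number, so decoding can never fail and loses no information.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)

main :: IO ()
main = print (encodeLatin1 (decodeLatin1 bytes) == bytes)
  where bytes = [0x66, 0x6f, 0x6f, 0xa1, 0xff]  -- "foo" plus two non-ASCII bytes
```

The round trip is the identity on every byte list, which is exactly why
existing implementations can get away with pretending everything is
ISO-8859-1 — and why that pretence says nothing about the *characters* the
bytes were meant to represent.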
Ok, but let it be in addition to, not instead of, treating them as
character strings.

> And program-generated email notifications frequently include text with
> no known encoding (i.e. binary data).

No, programs don't dump binary data among diagnostic messages. If they
output binary data to stdout, that is their only output, and it's
redirected to a file or to another process.

> Or are you going to demand that anyone who tries to hack into your
> system only sends it UTF-8 data so that the alert messages are
> displayed correctly in your mail program?

The email protocol is text-only. It may mangle newlines, it imposes a
maximum line length, and some text may be escaped in transport
(e.g. "From " at the beginning of a line). Arbitrary binary data should
be put in base64- or otherwise-encoded attachments. If the cron program
embeds the output as the email body, then the cron job should not dump
arbitrary binary data to stdout. Encoding is not the only problem.

>> Processing data in their original byte encodings makes supporting
>> multiple languages harder. Filenames which are inexpressible as
>> character strings get in the way of clean APIs. When considering only
>> filenames, using bytes would be sufficient, but overall it's more
>> convenient to Unicodize them like other strings.
>
> It also harms reliability. Depending upon the encoding, two distinct
> byte strings may have the same Unicode representation.

Such encodings are not suitable for filenames.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg00376.html

| ISO-2022-JP will never be a satisfactory terminal encoding (like
| ISO-8859-*, EUC-*, UTF-8, Shift_JIS) because
|
| 1) It is a stateful encoding. What happens when a program starts some
| terminal output and then is interrupted using Ctrl-C or Ctrl-Z? The
| terminal will remain in the shifted state, while other programs start
| doing output. But these programs expect that when they start, the
| terminal is in the initial state.
| The net result will be garbage on
| the screen.
|
| 2) ISO-2022-JP is not filesystem safe. Therefore filenames will never
| be able to carry Japanese characters in this encoding.
|
| Robert Brady writes:
| > Does ISO-2022 see much/any use as the locale encoding, or is it just
| > used for interchange?
|
| Just for interchange.
|
| Paul Eggert searched for uses of ISO-2022-JP as locale encodings (in
| order to convince me), and only came up with a handful of questionable
| URLs. He didn't convince me. And there are no plans to support
| ISO-2022-JP as a locale encoding in glibc - because of 1) and 2) above.

For me ISO-2022 is a brain-damaged concept and should die. Almost
nothing supports it anyway.

>> Such tarballs are not portable across systems using different
>> encodings.
>
> Well, programs which treat filenames as byte strings to be read from
> argv[] and passed directly to open() won't have any problems with this.

The OS itself may have problems with this; only some filesystems accept
arbitrary bytes (apart from '\0' and '/', and the special meaning of
'.'). Exotic characters in filenames are not very portable.

>> A Haskell program in my world can do that too. Just set the encoding
>> to Latin1.
>
> But programs should handle this by default, IMHO.

IMHO it's more important to make them compatible with the representation
of strings used in other parts of the program.

> Filenames are, for the most part, just "tokens" to be passed around.

Filenames are often stored in text files, whose bytes are interpreted as
characters. Applying quoted-printable encoding to the non-ASCII parts of
filenames is suitable only if humans won't edit those files by hand.

>> > My specific point is that the Haskell98 API has a very big problem
>> > due to the assumption that the encoding is always known. Existing
>> > implementations work around the problem by assuming that the
>> > encoding is always ISO-8859-1.
>>
>> The API is incomplete and needs to be enhanced.
>> Programs written using the current API will be limited to using the
>> locale encoding.
>
> That just adds unnecessary failure modes.

But otherwise programs would continually have bugs in handling text
which is not ISO-8859-1, especially with multibyte encodings, where
pretending that ISO-8859-2 is ISO-8859-1 too often doesn't work. I can't
switch my environment to UTF-8 yet precisely because too many programs
were written with the attitude you are promoting: they don't care about
the encoding, they just pass bytes around. The bugs range from small
annoyances like tabular output which doesn't line up, through mangled
characters on a graphical display, to full-screen interactive programs
being unusable on a UTF-8 terminal.

>> This encoding would be incompatible with most other texts seen by the
>> program. In particular reading a filename from a file would not work
>> without manual recoding.
>
> We already have that problem; you can't read non-Latin1 strings from
> files.

This is going to be fixed. Some time after the API enhancements it
should become the default.

> BTW, that's why Emacs (and XEmacs) support ISO-2022 much better than
> they do UTF-8.

Because Mule was written by Japanese developers. And that's why I
haven't used Emacs for years.

The default installation of XEmacs (at least on the PLD Linux
Distribution) doesn't handle *any* non-ASCII characters properly. When I
enter some Polish letters and save a file, it produces ISO-2022 garbage
that nothing can read, including XEmacs itself. When I open the existing
file, remove the escaped nonsense, enter Polish letters again and save
the file again, all non-ASCII characters are replaced with tildes.

GNU Emacs is better, but it still doesn't respect the locale and must be
told explicitly about the encoding. The locale mechanism was invented
precisely to avoid informing each and every program, in its own
configuration, about the encoding and the other defaults to use. Emacs
ignores this.
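(Respecting the locale is cheap once the I/O library supports it. As a
sketch, using the TextEncoding machinery that later appeared in GHC's base
library — this API postdates the present thread and is shown only to
illustrate the point, not as the enhancement under discussion:)

```haskell
import GHC.IO.Encoding (getLocaleEncoding)
import System.IO (hSetEncoding, stdout)

main :: IO ()
main = do
  enc <- getLocaleEncoding   -- derived from LC_CTYPE/LANG at startup
  hSetEncoding stdout enc    -- explicit here, though it is already the default
  putStrLn ("locale encoding: " ++ show enc)
```

The program never needs its own configuration entry for the encoding: the
environment supplies it once, for every program, which is exactly what the
locale mechanism was invented for.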
>> > Which is one of the reasons why they are likely to persist for
>> > longer than UTF-8 "true believers" might like.
>>
>> My I/O design doesn't force UTF-8, it works with ISO-8859-x as well.
>
> But I was specifically addressing Unicode versus multiple encodings
> internally. The size of the Unicode "alphabet" effectively prohibits
> using codepoints as indices.

ISO-2022 is even less suitable: I can't imagine an ISO-2022 regexp. And
as long as more than 256 distinct characters are needed, ISO-8859-x are
not suitable at all, so it doesn't matter that they would be more
convenient if they worked.

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/

_______________________________________________
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe