Re: [Haskell-cafe] Re: Strings and utf-8
Am I wrong to think that UTF8 should be THE standard? I believe it can encode anything encoded by other encodings.

All the UTF-* encodings can encode the same code points. There are different trade-offs, though.

Can't we consider non-utf8 text as legacy? I don't like that word, but I do think it is the right way to go for text. If you know your text has a different encoding, just use 'iconv' to convert it, or a special Haskell library for conversion.

The important thing (I think) is to have an abstract concept that encompasses all the necessary characters (i.e. Unicode) and then a few well-specified encodings with different trade-offs. A Unicode Haskell library should handle at least a few of them (and, more importantly, keep track of the encoding).

-- Johan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
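To make the "same code points, different trade-offs" point concrete, here is a minimal sketch of UTF-8 and UTF-16 encoders written against base only. The function names (utf8Char, utf16Char, encodeUtf8, encodeUtf16) are made up for this illustration and are not taken from any library discussed in the thread. Surrogate code points are not rejected, which a real library would have to do.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word16, Word8)

-- | Encode one code point as 1 to 4 UTF-8 bytes.
-- Hypothetical helper; surrogates are passed through unchecked.
utf8Char :: Char -> [Word8]
utf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- Leading bits of the code point, placed in the lead byte.
    hi s   = fromIntegral (n `shiftR` s)
    -- Six payload bits tagged as a continuation byte (10xxxxxx).
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- | Encode one code point as 1 or 2 UTF-16 code units.
utf16Char :: Char -> [Word16]
utf16Char c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ 0xD800 .|. fromIntegral (m `shiftR` 10)
                  , 0xDC00 .|. fromIntegral (m .&. 0x3FF) ]
  where
    n = ord c
    m = n - 0x10000

encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap utf8Char

encodeUtf16 :: String -> [Word16]
encodeUtf16 = concatMap utf16Char
```

Both functions accept any String, so they cover the same code points. The trade-off shows up in size: ASCII text takes one byte per character in UTF-8 and one 16-bit unit in UTF-16, while U+20AC takes three bytes in UTF-8 and a single unit in UTF-16.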
Re: [Haskell-cafe] Re: Strings and utf-8
On Wed, 2007-11-28 at 17:38 -0200, Maurício wrote:

> (...) When it's phrased as "truncates to 8 bits" it sounds so simple, surely all we need to do is not truncate to 8 bits, right? The problem is, what encoding should it pick? UTF8, 16, 32, EBCDIC? (...)
>
> One sensible suggestion many people have made is that H98 file IO should use the locale encoding and do Unicode/String <-> locale conversion. (...)
>
> I'm really afraid of solutions where the behavior of your program changes with an environment variable that not everybody has configured properly, or even knows exists.

Be afraid of all your standard Unix utils in that case. They are all locale dependent, not just for encoding but also for sorting order and the language of messages.

Using the locale is standard Unix behaviour (and these days the locale usually specifies UTF8 encoding). On OSX the default should be UTF8. On Windows it's a bit less clear; supposedly text files should use UTF16, but nobody actually does that as far as I can see.

Duncan
Re: [Haskell-cafe] Re: Strings and utf-8
Duncan Coutts wrote:

> On Wed, 2007-11-28 at 17:38 -0200, Maurício wrote:
>
>> (...) When it's phrased as "truncates to 8 bits" it sounds so simple, surely all we need to do is not truncate to 8 bits, right? The problem is, what encoding should it pick? UTF8, 16, 32, EBCDIC? (...)
>>
>> One sensible suggestion many people have made is that H98 file IO should use the locale encoding and do Unicode/String <-> locale conversion. (...)
>>
>> I'm really afraid of solutions where the behavior of your program changes with an environment variable that not everybody has configured properly, or even knows exists.
>
> Be afraid of all your standard Unix utils in that case. They are all locale dependent, not just for encoding but also for sorting order and the language of messages.

Language of messages is quite different from the language of a file you read. Suppose I am English, and I have a Russian friend, Vlad. My default locale is, say, latin-1, and his is something Cyrillic. I might well open files including my own files, and his files. The locale of the current user is simply no guide to the correct encoding to read a file in, and not a particularly reliable guide to writing a file out.

Locale makes perfect sense for messages (you are communicating with the user, his locale tells you what language he speaks). It makes much less sense for file IO.

Jules
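The point about Vlad's files can be illustrated with a small sketch: the same bytes decode to different text depending on which encoding the reader assumes. ISO-8859-5 is picked here as one example of "something Cyrillic"; that choice, the decoder names, and the sample bytes are assumptions made for this illustration. The Cyrillic decoder is deliberately partial and covers only ASCII and the basic letters.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- | ISO-8859-1 (Latin-1): each byte is the code point with the same number.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- | Partial ISO-8859-5 decoder (hypothetical helper): ASCII plus the basic
-- Cyrillic letters, where bytes 0xB0..0xEF map to U+0410..U+044F.
-- Every other byte is replaced by U+FFFD.
decodeCyrillic :: [Word8] -> String
decodeCyrillic = map step
  where
    step b
      | b < 0x80               = chr (fromIntegral b)
      | b >= 0xB0 && b <= 0xEF = chr (fromIntegral b + 0x360)
      | otherwise              = '\xFFFD'

-- | Six bytes as they might sit in a file written under an ISO-8859-5 locale.
vladsFile :: [Word8]
vladsFile = [0xBF, 0xE0, 0xD8, 0xD2, 0xD5, 0xE2]
```

decodeCyrillic vladsFile yields the six Cyrillic letters U+041F U+0440 U+0438 U+0432 U+0435 U+0442, while decodeLatin1 vladsFile yields six accented Latin-1 characters with the same byte values. Nothing in the bytes themselves says which reading is right, which is why the reader's locale is a poor guide.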
Re: [Haskell-cafe] Re: Strings and utf-8
On Thu, 2007-11-29 at 13:05 +, Jules Bean wrote:

> Language of messages is quite different from the language of a file you read. Suppose I am English, and I have a Russian friend, Vlad. My default locale is, say, latin-1, and his is something Cyrillic. I might well open files including my own files, and his files. The locale of the current user is simply no guide to the correct encoding to read a file in, and not a particularly reliable guide to writing a file out.
>
> Locale makes perfect sense for messages (you are communicating with the user, his locale tells you what language he speaks). It makes much less sense for file IO.

Yes, it's a fundamental limitation of the unix locale system and multi-user systems. However it's no less wrong than just picking UTF8 all the time.

Obviously one needs a text file API that allows one to specify the encoding for the cases where you happen to know it, but for the H98 file API, where there is no way of specifying an encoding, what's better than using the unix default method? (at least on unix)

Duncan
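As a sketch of what "specify the encoding when you know it, fall back to the locale otherwise" can look like: more recent GHC releases ship hSetEncoding, localeEncoding, utf8 and latin1 in System.IO. These are GHC extensions to the H98 interface and postdate this discussion, so treat the code as an illustration of the idea rather than what the posters had available. The wrapper names readFileWith, readFileLocale and writeFileWith are hypothetical.

```haskell
import System.IO

-- | Read a whole file, decoding it with an explicitly chosen encoding.
readFileWith :: TextEncoding -> FilePath -> IO String
readFileWith enc path = do
  h <- openFile path ReadMode
  hSetEncoding h enc
  s <- hGetContents h
  -- hGetContents is lazy, so force the whole string before closing.
  length s `seq` hClose h
  return s

-- | Fallback for when nothing is known about the file: use the locale.
readFileLocale :: FilePath -> IO String
readFileLocale = readFileWith localeEncoding

-- | Write a file in an explicitly chosen encoding.
writeFileWith :: TextEncoding -> FilePath -> String -> IO ()
writeFileWith enc path s = withFile path WriteMode $ \h -> do
  hSetEncoding h enc
  hPutStr h s
```

A caller who knows the file is UTF-8 would use readFileWith utf8 path. Reading the same file with readFileWith latin1 path succeeds as well but returns one character per byte, so the wrong choice fails silently rather than loudly.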
Re: [Haskell-cafe] Re: Strings and utf-8
A translation of http://www.ahinea.com/en/tech/perl-unicode-struggle.html from perl to haskell would be a very useful piece of documentation, I think.

That explanation really helped me get to grips with the encoding stuff, in a perl context.

thomas.
Re: [Haskell-cafe] Re: Strings and utf-8
Thomas Hartman wrote:

> A translation of http://www.ahinea.com/en/tech/perl-unicode-struggle.html from perl to haskell would be a very useful piece of documentation, I think.

Perl encodes both Unicode and binary data as the same (dynamic) data type. Haskell - at least in theory - has two different types for them, namely [Char] for characters and [Word8] or ByteString for sequences of bytes. I think the Haskell approach is better, because the programmer in most cases knows whether he wants to treat his data as characters or as bytes. Perl does it the Perlish "We guess at what the coder means" way, which leads to a lot of frustration when Perl guesses wrong.

The problems of the Haskeller trying to use Unicode, I think, will be different from those of the Perl hacker trying to use Unicode: the Haskeller will have to search for third-party modules to do what he wants, and finding those modules is the problem. The Perl hacker has all the Unicode support built in, but has to fight Perl occasionally to keep it from doing byte operations on his Unicode data.

I had a colleague here go all but insane last week trying to use 'split' on a Unicode string in Perl on Windows. split would break the string in the middle of a UTF-8 wide character, crashing UTF-8 processing later on.

Reinier
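The 'split' failure described above can be reproduced in miniature. In this sketch the same three-character text is held once as [Char] and once as its UTF-8 bytes in [Word8]; splitting the characters is harmless, while splitting the bytes at the same index cuts through a multi-byte sequence. The names wellFormed, sampleChars and sampleBytes are made up for the example, and wellFormed only checks lead and continuation byte structure, not overlong forms or surrogates.

```haskell
import Data.Word (Word8)

-- | Structural UTF-8 check (hypothetical helper): every lead byte must be
-- followed by the right number of continuation bytes.
wellFormed :: [Word8] -> Bool
wellFormed [] = True
wellFormed (b : bs)
  | b < 0x80  = wellFormed bs   -- ASCII
  | b < 0xC0  = False           -- stray continuation byte
  | b < 0xE0  = conts 1
  | b < 0xF0  = conts 2
  | b < 0xF8  = conts 3
  | otherwise = False
  where
    conts n =
      let (cs, rest) = splitAt n bs
      in length cs == n && all isCont cs && wellFormed rest
    isCont c = c >= 0x80 && c < 0xC0

-- | The text 'a', U+20AC, 'b' as characters.
sampleChars :: [Char]
sampleChars = ['a', '\x20AC', 'b']

-- | The same text as UTF-8 bytes: U+20AC occupies the middle three bytes.
sampleBytes :: [Word8]
sampleBytes = [0x61, 0xE2, 0x82, 0xAC, 0x62]
```

splitAt 2 sampleChars gives two valid strings, because the unit of splitting is a character. splitAt 2 sampleBytes gives two halves that both fail wellFormed, since the cut lands inside the three-byte sequence. Because the two values have different types, the compiler at least forces the programmer to say which kind of split is meant.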