Re: [Haskell-cafe] getting crazy with character encoding
On Wed, Sep 12, 2007 at 05:19:22PM +0200, Andrea Rossato wrote:
> And so it's my job to convert it into what I need. Luckily I've just
> discovered (and now I'm reading) some of John Meacham's code on locale.
> This is going to be very helpful (unfortunately I don't see licenses
> coming with HsLocale, but if I'm reading correctly there is something
> like this in Riot - and this was BSD3 released).

It is BSD3. In general, pretty much everything I write is BSD3 except for large projects as a whole, which get GPL=2. Though I am more than happy to BSD3 any incidentally useful parts of my projects that others would find useful.

John

--
John Meacham - ⑆repetae.net⑆john⑈
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] getting crazy with character encoding
On Sep 12, 2007, at 10:18, Andrea Rossato wrote:
> Suppose that, in a Linux system, in a UTF-8 locale, you create a file
> with non-ASCII characters. For instance:
>
>     touch abèèè
>
> Now, I would expect that the output of a shell command such as
>
>     ls ab*
>
> would be a string/list of 5 chars. Instead I find it to be a list of
> 8 chars... ;-)

That is expected. The low-level filesystem storage doesn't know about character sets, so non-ASCII filenames must be encoded in e.g. UTF-8. 8 characters is therefore correct, and you must do UTF-8 decoding on input because Haskell does not do so automatically. This will also be true with getdirent() aka getDirectoryContents.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] [EMAIL PROTECTED]
system administrator [openafs,heimdal,too many hats] [EMAIL PROTECTED]
electrical and computer engineering, carnegie mellon university    KF8NH
Re: [Haskell-cafe] getting crazy with character encoding
Andrea Rossato wrote:
> Hi,
>
> Suppose that, in a Linux system, in a UTF-8 locale, you create a file
> with non-ASCII characters. For instance:
>
>     touch abèèè
>
> Now, I would expect that the output of a shell command such as
>
>     ls ab*
>
> would be a string/list of 5 chars. Instead I find it to be a list of
> 8 chars... ;-)

The file name may have five *characters*, but if it's encoded as UTF-8, then it has eight *bytes*. It appears that in spite of the locale definition, hGetContents is treating each byte as a separate character without translating the multi-byte sequences *from* UTF-8, and then putStrLn sends each of those bytes to standard output without translating the non-ASCII characters *to* UTF-8. So the second line of your program's output is correct... but only by accident.

Futzing around a little bit in ghci, I see that I can define a string "\1488", but if I send that string to putStrLn, I get nothing, when I should get א (the Hebrew letter aleph).

I � Unicode.
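To make the five-characters-versus-eight-bytes point concrete, here is a minimal sketch of a UTF-8 encoder that uses only `base`. The names `encodeChar` and `utf8Encode` are hypothetical helpers written for this illustration, not part of any library mentioned in the thread, and the sketch does not reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode a single code point as its UTF-8 byte sequence.
-- (Hypothetical helper for illustration only.)
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    hi, cont :: Int -> Word8
    -- Leading bits of the code point, placed in the first byte.
    hi s = fromIntegral (n `shiftR` s)
    -- A continuation byte: 10xxxxxx carrying six bits of the code point.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Encode a whole String.
utf8Encode :: String -> [Word8]
utf8Encode = concatMap encodeChar

-- The file name from the thread, written with decimal escapes
-- (è is U+00E8, i.e. 232) so the source file itself stays ASCII.
fileName :: String
fileName = "ab\232\232\232"
```

With these definitions, `length fileName` is 5 while `length (utf8Encode fileName)` is 8, because each è becomes the two bytes 195, 168.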
Re: [Haskell-cafe] getting crazy with character encoding
On Wed, Sep 12, 2007 at 10:53:29AM -0400, Brandon S. Allbery KF8NH wrote:
> That is expected. The low-level filesystem storage doesn't know about
> character sets, so non-ASCII filenames must be encoded in e.g. UTF-8.
> 8 characters is therefore correct, and you must do UTF-8 decoding on
> input because Haskell does not do so automatically.

Ahh, now I finally get it!

So, as far as I understand, I'm getting bytes that are automatically translated into an ISO-8859-1 string, if I'm correctly reading this old post by Glynn:
http://tinyurl.com/2fhl43

And so it's my job to convert it into what I need. Luckily I've just discovered (and now I'm reading) some of John Meacham's code on locale. This is going to be very helpful (unfortunately I don't see licenses coming with HsLocale, but if I'm reading correctly there is something like this in Riot - and this was BSD3 released).

Thanks for your kind attention.

Andrea
Re: [Haskell-cafe] getting crazy with character encoding
On 12/09/2007, Seth Gordon [EMAIL PROTECTED] wrote:
> I � Unicode.

Was it intentional that the central character appears as a little '?', even though the aleph on the line above worked? Either way it would be very amusing, but for different reasons...

D
Re: [Haskell-cafe] getting crazy with character encoding
On Wed, Sep 12, 2007 at 11:16:25AM -0400, Seth Gordon wrote:
> It appears that in spite of the locale definition, hGetContents is
> treating each byte as a separate character without translating the
> multi-byte sequences *from* UTF-8, and then putStrLn sends each of
> those bytes to standard output without translating the non-ASCII
> characters *to* UTF-8. So the second line of your program's output is
> correct... but only by accident.

That's it indeed. As I said in the message I've just sent, I've read that the String/CString conversion is automatically done in ISO-8859-1, so èèè, which is 6 bytes in UTF-8, is translated into 6 ISO-8859-1 characters.

What puzzles me is the behavior of putStrLn.

Thanks for your time.

Andrea
Re: [Haskell-cafe] getting crazy with character encoding
Andrea Rossato wrote:
> What puzzles me is the behavior of putStrLn.

putStrLn is sending the following bytes to standard output:

    97, 98, 195, 168, 195, 168, 195, 168, 10

Since the code that renders characters in your terminal emulator is expecting UTF-8[*], each (195, 168) pair of bytes is rendered as è.

The Unix utility od can be very helpful in figuring out problems like this.

[*] At least on my computer, I get the same result *even if* I change LANG from en_US.utf8 to C.
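The "correct by accident" round trip can be sketched in a few lines. This assumes, as the thread describes, that input maps each byte to the Char with the same code (an ISO-8859-1 reading) and that output writes each Char back as a single byte; the truncation to the low 8 bits for larger code points is an assumption about the GHC of that time, and `latin1Decode` and `latin1Encode` are hypothetical names used only here.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Read bytes as ISO-8859-1: one byte becomes one Char with the same code.
latin1Decode :: [Word8] -> String
latin1Decode = map (chr . fromIntegral)

-- Write Chars as single bytes. Converting to Word8 keeps only the low
-- 8 bits (assumed to mirror the old putStrLn behaviour).
latin1Encode :: String -> [Word8]
latin1Encode = map (fromIntegral . ord)

-- The bytes that the filesystem hands back for the file name abèèè.
rawName :: [Word8]
rawName = [97, 98, 195, 168, 195, 168, 195, 168]
```

Here `latin1Decode rawName` is an 8-character String, and `latin1Encode (latin1Decode rawName)` gives back exactly `rawName`, so a UTF-8 terminal still shows abèèè. A real Unicode Char does not survive the same path: `latin1Encode "\1488"` yields the single byte 208, which is not valid UTF-8 on its own, consistent with the aleph printing as nothing.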
Re: [Haskell-cafe] getting crazy with character encoding
Dougal Stanton wrote:
> On 12/09/2007, Seth Gordon [EMAIL PROTECTED] wrote:
>> I � Unicode.
>
> Was it intentional that the central character appears as a little '?',
> even though the aleph on the line above worked?

It was intentional. If I ♡ed Unicode, I would have said so.
Re: [Haskell-cafe] getting crazy with character encoding
On Wed, Sep 12, 2007 at 11:40:11AM -0400, Seth Gordon wrote:
> The Unix utility od can be very helpful in figuring out problems like
> this.

Thanks for pointing me to od, I didn't know it.

> [*] At least on my computer, I get the same result *even if* I change
> LANG from en_US.utf8 to C.

As far as I understand, it is the terminal emulator that is responsible for translating the bytes to characters. If I run it in a console I get abAAA (sort of) no matter what my LANG is - 8 single 8-bit characters.

Cheers,
Andrea
Re: [Haskell-cafe] getting crazy with character encoding
On 9/12/07, Andrea Rossato [EMAIL PROTECTED] wrote:
> If I run it in a console I get abAAA (sort of) no matter what my LANG
> is - 8 single 8-bit characters.

It's possible to set your Linux console to grok UTF-8. I don't remember the details, but I'm sure you can Google for it.

By the way, does anyone know The Right Way to deal with UTF-8 in Haskell? I.e., take that 8-byte UTF-8 string and convert it to a 5-character Unicode string (so it can be manipulated)?
Re: [Haskell-cafe] getting crazy with character encoding
David Benbennick wrote:
> On 9/12/07, Andrea Rossato [EMAIL PROTECTED] wrote:
>> If I run it in a console I get abAAA (sort of) no matter what my LANG
>> is - 8 single 8-bit characters.
>
> It's possible to set your Linux console to grok UTF-8. I don't remember
> the details, but I'm sure you can Google for it.
>
> By the way, does anyone know The Right Way to deal with UTF-8 in
> Haskell? I.e., take that 8-byte UTF-8 string and convert it to a
> 5-character Unicode string (so it can be manipulated)?

There is no UTF-8 decode support in the standard libraries.

There are some contributed libraries which can do it. Data.CompactString is one.

Jules
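For readers who want to see what such a conversion involves, here is a minimal hand-written sketch of turning the 8 bytes into a 5-character String, using only `base`. It is not the API of Data.CompactString or of any other library; `utf8Decode` is a hypothetical name, and the sketch deliberately skips some checks a production decoder would make (it accepts overlong encodings and surrogate code points).

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Decode a UTF-8 byte sequence, returning Nothing on malformed input.
-- (Hypothetical helper for illustration only.)
utf8Decode :: [Word8] -> Maybe String
utf8Decode [] = Just []
utf8Decode (b : bs)
  | b < 0x80           = (chr (fromIntegral b) :) <$> utf8Decode bs
  | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F)  -- 110xxxxx: 2-byte sequence
  | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F)  -- 1110xxxx: 3-byte sequence
  | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07)  -- 11110xxx: 4-byte sequence
  | otherwise          = Nothing               -- stray continuation or invalid lead
  where
    -- Consume k continuation bytes and combine them with the lead bits.
    multi :: Int -> Word8 -> Maybe String
    multi k lead =
      case splitAt k bs of
        (conts, rest)
          | length conts == k, all isCont conts, cp <= 0x10FFFF ->
              (chr cp :) <$> utf8Decode rest
          | otherwise -> Nothing
          where
            cp = foldl step (fromIntegral lead) conts :: Int
    -- Continuation bytes have the form 10xxxxxx.
    isCont c = c .&. 0xC0 == 0x80
    -- Shift in the six payload bits of one continuation byte.
    step acc c = (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F)
```

Applied to the bytes from this thread, `utf8Decode [97, 98, 195, 168, 195, 168, 195, 168]` gives `Just "ab\232\232\232"`, a String of length 5, while a truncated sequence such as `[195]` gives `Nothing`.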
Re: [Haskell-cafe] getting crazy with character encoding
mailing_list:
> On Wed, Sep 12, 2007 at 11:16:25AM -0400, Seth Gordon wrote:
>> It appears that in spite of the locale definition, hGetContents is
>> treating each byte as a separate character without translating the
>> multi-byte sequences *from* UTF-8, and then putStrLn sends each of
>> those bytes to standard output without translating the non-ASCII
>> characters *to* UTF-8. So the second line of your program's output is
>> correct... but only by accident.
>
> That's it indeed. As I said in the message I've just sent, I've read
> that the String/CString conversion is automatically done in
> ISO-8859-1, so èèè, which is 6 bytes in UTF-8, is translated into 6
> ISO-8859-1 characters.
>
> What puzzles me is the behavior of putStrLn.
>
> Thanks for your time.

Have you tried the utf8-string conversion library?

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/utf8-string-0.1

-- Don