Re: [Haskell-cafe] getting crazy with character encoding

2007-09-13 Thread John Meacham
On Wed, Sep 12, 2007 at 05:19:22PM +0200, Andrea Rossato wrote:
 And so it's my job to convert it into what I need. Luckily I've just
 discovered (and am now reading) some of John Meacham's code on
 locale. This is going to be very helpful (unfortunately I don't see
 a license shipped with HsLocale, but if I'm reading correctly there is
 something like this in Riot - and that was released under BSD3).

It is BSD3. In general, pretty much everything I write is BSD3, except
for large projects as a whole, which get GPL>=2. Though I am more than
happy to release under BSD3 any incidental parts of my projects that
others would find useful.

John

-- 
John Meacham - ⑆repetae.net⑆john⑈
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Brandon S. Allbery KF8NH


On Sep 12, 2007, at 10:18 , Andrea Rossato wrote:

Suppose that, on a Linux system, in a UTF-8 locale, you create a file
with non-ASCII characters. For instance:
touch abèèè

Now, I would expect that the output of a shell command such as
ls ab*
would be a string/list of 5 chars. Instead I find it to be a list of 8
chars...;-)


That is expected.  The low level filesystem storage doesn't know  
about character sets, so non-ASCII filenames must be encoded in e.g.  
UTF-8.  8 characters is therefore correct, and you must do UTF-8  
decoding on input because Haskell does not do so automatically.


This will also be true with readdir() aka getDirectoryContents.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] [EMAIL PROTECTED]
system administrator [openafs,heimdal,too many hats] [EMAIL PROTECTED]
electrical and computer engineering, carnegie mellon university    KF8NH




Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Seth Gordon

Andrea Rossato wrote:

Hi,

Suppose that, on a Linux system, in a UTF-8 locale, you create a file
with non-ASCII characters. For instance:
touch abèèè

Now, I would expect that the output of a shell command such as 
ls ab*

would be a string/list of 5 chars. Instead I find it to be a list of 8
chars...;-)


The file name may have five *characters*, but if it's encoded as UTF-8, 
then it has eight *bytes*.
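
The difference between the two counts can be checked directly. The following is a minimal sketch, not code from the thread; the names fileNameBytes, isContinuation and countCodePoints are made up for illustration, and the input is assumed to be well-formed UTF-8.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- The UTF-8 bytes of the file name "abèèè"
-- (è is U+00E8, which UTF-8 encodes as the two bytes 0xC3 0xA8).
fileNameBytes :: [Word8]
fileNameBytes = [0x61, 0x62, 0xC3, 0xA8, 0xC3, 0xA8, 0xC3, 0xA8]

-- A UTF-8 continuation byte has the bit pattern 10xxxxxx.
isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- In well-formed UTF-8, every byte that is not a continuation byte
-- starts a new code point, so counting those bytes counts code points.
countCodePoints :: [Word8] -> Int
countCodePoints = length . filter (not . isContinuation)
```

Here length fileNameBytes is 8 while countCodePoints fileNameBytes is 5, matching the eight bytes and five characters described above.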


It appears that in spite of the locale definition, hGetContents is 
treating each byte as a separate character without translating the 
multi-byte sequences *from* UTF-8, and then putStrLn sends each of those 
bytes to standard output without translating the non-ASCII characters 
*to* UTF-8.  So the second line of your program's output is 
correct...but only by accident.


Futzing around a little bit in ghci, I see that I can define a string 
\1488, but if I send that string to putStrLn, I get nothing, when I 
should get א (the Hebrew letter aleph).
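
For reference, the translation *to* UTF-8 that is missing on output can be written by hand. This is only a sketch under the assumption that the caller writes the resulting bytes itself; encodeChar and encodeUtf8 are hypothetical names, and surrogate code points are not rejected.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one Char as UTF-8 (1 to 4 bytes, depending on the code point).
-- Surrogate code points are not rejected in this sketch.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [byte n]
  | n < 0x800   = [byte (0xC0 .|. shiftR n 6), cont 0]
  | n < 0x10000 = [byte (0xE0 .|. shiftR n 12), cont 6, cont 0]
  | otherwise   = [byte (0xF0 .|. shiftR n 18), cont 12, cont 6, cont 0]
  where
    n = ord c
    byte = fromIntegral :: Int -> Word8
    -- Continuation byte carrying the six bits of n that start at bit s.
    cont s = byte (0x80 .|. (shiftR n s .&. 0x3F))

-- Encode a whole String by encoding each Char in turn.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeChar
```

With these definitions, encodeChar '\1488' gives [215, 144] (0xD7 0x90, the UTF-8 form of aleph), and encodeUtf8 "ab\232\232\232" gives the eight bytes 97, 98, 195, 168, 195, 168, 195, 168.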


I � Unicode.



Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Andrea Rossato
On Wed, Sep 12, 2007 at 10:53:29AM -0400, Brandon S. Allbery KF8NH wrote:
  That is expected.  The low level filesystem storage doesn't know about 
  character sets, so non-ASCII filenames must be encoded in e.g. UTF-8.  8 
  characters is therefore correct, and you must do UTF-8 decoding on input 
  because Haskell does not do so automatically.

Ahh, now I finally get it! So, as far as I understand, I'm getting
bytes that are automatically translated into an ISO-8859-1 string, if
I'm correctly reading this old post by Glynn:
http://tinyurl.com/2fhl43

And so it's my job to convert it into what I need. Luckily I've just
discovered (and am now reading) some of John Meacham's code on
locale. This is going to be very helpful (unfortunately I don't see
a license shipped with HsLocale, but if I'm reading correctly there is
something like this in Riot - and that was released under BSD3).

Thanks for your kind attention.

Andrea



Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Dougal Stanton
On 12/09/2007, Seth Gordon [EMAIL PROTECTED] wrote:

 I � Unicode.

Was it intentional that the central character appears as a little '?',
even though the aleph on the line above worked? Either way it would be
very amusing, but for different reasons...


D


Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Andrea Rossato
On Wed, Sep 12, 2007 at 11:16:25AM -0400, Seth Gordon wrote:
  It appears that in spite of the locale definition, hGetContents is treating 
  each byte as a separate character without translating the multi-byte 
  sequences *from* UTF-8, and then putStrLn sends each of those bytes to 
  standard output without translating the non-ASCII characters *to* UTF-8.  So 
  the second line of your program's output is correct...but only by accident.

That's it indeed. As I said in the message I've just sent, I've read
that the String/CString conversion is automatically done in
ISO-8859-1, so èèè, which are 6 bytes in UTF-8, are translated
into 6 ISO-8859-1 characters.

What puzzles me is the behavior of putStrLn.

Thanks for your time.

Andrea



Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Seth Gordon

Andrea Rossato wrote:

What puzzles me is the behavior of putStrLn.


putStrLn is sending the following bytes to standard output:

97, 98, 195, 168, 195, 168, 195, 168, 10

Since the code that renders characters in your terminal emulator is 
expecting UTF-8[*], each (195, 168) pair of bytes is rendered as è.
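
The round trip can be modelled in a few lines. This sketch assumes the behaviour described in this thread (each input byte becomes one Char, and output keeps only the low 8 bits of each Char); latin1Decode, truncateToBytes and utf8Bytes are illustrative names, not library functions.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Read each byte as one ISO-8859-1 character: code point = byte value.
latin1Decode :: [Word8] -> String
latin1Decode = map (chr . fromIntegral)

-- Write each character as one byte, keeping only the low 8 bits.
-- For characters above U+00FF this loses information.
truncateToBytes :: String -> [Word8]
truncateToBytes = map (fromIntegral . ord)

-- The UTF-8 bytes of "abèèè", as they come from the file system.
utf8Bytes :: [Word8]
utf8Bytes = [97, 98, 195, 168, 195, 168, 195, 168]
```

latin1Decode utf8Bytes is an 8-character String, yet truncateToBytes (latin1Decode utf8Bytes) returns the original bytes unchanged, so a UTF-8 terminal still renders them as abèèè.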


The Unix utility od can be very helpful in figuring out problems like 
this.


[*]At least on my computer, I get the same result *even if* I change 
LANG from en_US.utf8 to C.
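
The same kind of inspection can be done from inside Haskell. The following is a small sketch of a hex dump in the spirit of od -t x1, without the offset column; hexByte and hexDump are hypothetical helper names.

```haskell
import Data.Word (Word8)
import Numeric (showHex)

-- Render one byte as two lowercase hex digits.
hexByte :: Word8 -> String
hexByte b = pad (showHex b "")
  where
    pad [d] = ['0', d]
    pad ds  = ds

-- Render a byte sequence as space-separated hex bytes.
hexDump :: [Word8] -> String
hexDump = unwords . map hexByte
```

Applied to the nine bytes listed above, hexDump [97, 98, 195, 168, 195, 168, 195, 168, 10] yields "61 62 c3 a8 c3 a8 c3 a8 0a".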



Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Seth Gordon

Dougal Stanton wrote:

On 12/09/2007, Seth Gordon [EMAIL PROTECTED] wrote:


I � Unicode.


Was it intentional that the central character appears as a little '?',
even though the aleph on the line above worked?


It was intentional.  If I ♡ed Unicode, I would have said so.



Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Andrea Rossato
On Wed, Sep 12, 2007 at 11:40:11AM -0400, Seth Gordon wrote:
  The Unix utility od can be very helpful in figuring out problems like 
  this.

Thanks for pointing me to od; I didn't know about it.

  [*]At least on my computer, I get the same result *even if* I change LANG 
  from en_US.utf8 to C.

As far as I understand, it is the terminal emulator that is responsible
for translating the bytes to characters. If I run it in a console I get
abAAA (sort of) no matter what my LANG is - 8 single 8-bit
characters.

Cheers,
Andrea



Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread David Benbennick
On 9/12/07, Andrea Rossato [EMAIL PROTECTED] wrote:
 If I run it in a console I get
 abAAA (sort of) no matter what my LANG is - 8 single 8-bit
 characters.

It's possible to set your Linux console to grok UTF-8.  I don't
remember the details, but I'm sure you can Google for it.

By the way, does anyone know The Right Way to deal with UTF-8 in
Haskell?  I.e., take that 8-byte UTF-8 string and convert it to a
5-character Unicode string (so it can be manipulated)?


Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Jules Bean

David Benbennick wrote:

On 9/12/07, Andrea Rossato [EMAIL PROTECTED] wrote:

If I run it in a console I get
abAAA (sort of) no matter what my LANG is - 8 single 8-bit
characters.


It's possible to set your Linux console to grok UTF-8.  I don't
remember the details, but I'm sure you can Google for it.

By the way, does anyone know The Right Way to deal with UTF-8 in
Haskell?  I.e., take that 8-byte UTF-8 string and convert it to a
5-character Unicode string (so it can be manipulated)?



There is no UTF-8 decoding support in the standard libraries.

There are some contributed libraries which can do it. Data.CompactString 
is one.
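
To show what such a decode involves, here is a hand-written sketch. It is not how Data.CompactString or any other library mentioned in this thread is implemented; decodeUtf8 is a name chosen for illustration, malformed input is replaced with U+FFFD, and overlong encodings and surrogates are not rejected.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Decode UTF-8 bytes into a String of Unicode code points.
-- Malformed sequences become U+FFFD. This sketch does not reject
-- overlong encodings or surrogates, which a careful decoder should.
decodeUtf8 :: [Word8] -> String
decodeUtf8 [] = []
decodeUtf8 (b : bs)
  | b < 0x80           = chr (fromIntegral b) : decodeUtf8 bs
  | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F)
  | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F)
  | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07)
  | otherwise          = '\xFFFD' : decodeUtf8 bs
  where
    -- Consume n continuation bytes and fold their low six bits
    -- into the payload bits taken from the lead byte.
    multi n lead
      | length conts == n && all isCont conts && cp <= 0x10FFFF =
          chr cp : decodeUtf8 rest
      | otherwise = '\xFFFD' : decodeUtf8 bs
      where
        (conts, rest) = splitAt n bs
        cp = foldl step (fromIntegral lead) conts
    isCont c = c .&. 0xC0 == 0x80
    step acc c = shiftL acc 6 .|. fromIntegral (c .&. 0x3F)
```

With this, decodeUtf8 [97, 98, 195, 168, 195, 168, 195, 168] is the 5-character String "ab\232\232\232", i.e. abèèè, which can then be manipulated like any other String.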


Jules


Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Don Stewart
mailing_list:
 On Wed, Sep 12, 2007 at 11:16:25AM -0400, Seth Gordon wrote:
   It appears that in spite of the locale definition, hGetContents is treating
   each byte as a separate character without translating the multi-byte
   sequences *from* UTF-8, and then putStrLn sends each of those bytes to
   standard output without translating the non-ASCII characters *to* UTF-8.  So
   the second line of your program's output is correct...but only by accident.
 
 That's it indeed. As I said in the message I've just sent, I've read
 that the String/CString conversion is automatically done in
 ISO-8859-1, so èèè, which are 6 bytes in UTF-8, are translated
 into 6 ISO-8859-1 characters.
 
 What puzzles me is the behavior of putStrLn.
 
 Thanks for your time.

Have you tried the utf8-string conversion library?


http://hackage.haskell.org/cgi-bin/hackage-scripts/package/utf8-string-0.1

-- Don