Re: [Haskell-cafe] Re: Strings and utf-8

2007-11-30 Thread Johan Tibell
 Am I wrong to think that UTF8 should be THE
 standard? I believe it can encode anything
 encoded by other encodings.

All the UTF-* encodings can encode the same set of code points. They
make different trade-offs, though (size, byte order, fixed vs. variable width).
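To make the trade-off concrete, here is a minimal sketch that encodes the same code points three ways and compares the byte counts. It assumes the `text` and `bytestring` packages, which are not mentioned in this thread; `encodedSizes` is a name made up for the illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Bytes needed for the same code points in UTF-8, UTF-16 and UTF-32.
-- (encodedSizes is a hypothetical helper, not part of any library.)
encodedSizes :: T.Text -> (Int, Int, Int)
encodedSizes t =
  ( B.length (TE.encodeUtf8 t)
  , B.length (TE.encodeUtf16LE t)
  , B.length (TE.encodeUtf32LE t)
  )

main :: IO ()
main = do
  -- ASCII text: UTF-8 is the most compact.
  print (encodedSizes (T.pack "hello"))
  -- Six Cyrillic letters: UTF-8 and UTF-16 need the same number of bytes.
  print (encodedSizes (T.pack "\1087\1088\1080\1074\1077\1090"))
```

All three encodings round-trip the same text; only the size and layout of the bytes differ.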

 Can't we consider non-utf8 text as legacy?
 I don't like that word, but I do think it is
 the right way to go for text. If you know
 your text has a different encoding, just use
 'iconv' to convert it, or a special Haskell
 library for conversion.

The important thing (I think) is to have an abstract concept that
encompasses all the necessary characters (i.e. Unicode) and then a few
well-specified encodings with different trade-offs. A Unicode Haskell
library should handle at least a few of them (and, more importantly,
keep track of the encoding).
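One way to read "keep track of the encoding" is to never let raw bytes travel without a tag saying how they were produced. The sketch below is only an illustration of that idea: the `Encoding` and `Encoded` types and the `encodeAs`, `decode` and `transcode` functions are hypothetical names, and the `text` and `bytestring` packages are an assumption, not something this thread refers to.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (lenientDecode)

-- Hypothetical tag for the encodings this sketch knows about.
data Encoding = UTF8 | UTF16LE | UTF32LE
  deriving (Show, Eq)

-- Raw bytes that always carry the encoding that produced them.
data Encoded = Encoded
  { encodingOf :: Encoding
  , payload    :: B.ByteString
  } deriving (Show, Eq)

-- Go from abstract Unicode text to tagged bytes.
encodeAs :: Encoding -> T.Text -> Encoded
encodeAs enc t = Encoded enc (go enc t)
  where
    go UTF8    = TE.encodeUtf8
    go UTF16LE = TE.encodeUtf16LE
    go UTF32LE = TE.encodeUtf32LE

-- Go back to text; invalid input becomes U+FFFD instead of an exception.
decode :: Encoded -> T.Text
decode (Encoded UTF8 bs)    = TE.decodeUtf8With lenientDecode bs
decode (Encoded UTF16LE bs) = TE.decodeUtf16LEWith lenientDecode bs
decode (Encoded UTF32LE bs) = TE.decodeUtf32LEWith lenientDecode bs

-- Converting between encodings always passes through the abstract text.
transcode :: Encoding -> Encoded -> Encoded
transcode target = encodeAs target . decode

main :: IO ()
main = do
  let original = T.pack "Vlad \1042\1083\1072\1076"
      asUtf16  = encodeAs UTF16LE original
      asUtf8   = transcode UTF8 asUtf16
  print (encodingOf asUtf16, B.length (payload asUtf16))
  print (encodingOf asUtf8, B.length (payload asUtf8))
  print (decode asUtf8 == original)
```

Because the tag is part of the value, a function that receives an `Encoded` never has to guess how to interpret the bytes.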

-- Johan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: Strings and utf-8

2007-11-29 Thread Duncan Coutts
On Wed, 2007-11-28 at 17:38 -0200, Maurício wrote:
 (...)  When it's phrased as "truncates to 8
   bits" it sounds so simple, surely all we need
   to do is not truncate to 8 bits right?
  
   The problem is, what encoding should it pick?
   UTF8, 16, 32, EBCDIC? (...)
  
   One sensible suggestion many people have made
   is that H98 file IO should use the locale
   encoding and do Unicode/String <-> locale
   conversion. (...)
 
 I'm really afraid of solutions where the behavior
 of your program changes with an environment
 variable that not everybody has configured
 properly, or even knows exists.

Be afraid of all your standard Unix utils in that case. They are all
locale dependent, not just for encoding but also for sorting order and
the language of messages.

Using the locale is standard Unix behaviour (and these days the locale
usually specifies UTF8 encoding). On OSX the default should be UTF8. On
Windows it's a bit less clear: supposedly text files should use UTF16,
but nobody actually does that as far as I can see.
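For readers who want to see what the locale actually selects on their machine, here is a small sketch. It assumes a GHC whose `System.IO` text handles carry an encoding (`getLocaleEncoding`, `hGetEncoding`, `hSetEncoding`, `utf8`); that API is not described anywhere in this thread, so treat it as background rather than as what was being proposed.

```haskell
import GHC.IO.Encoding (getLocaleEncoding)
import System.IO

main :: IO ()
main = do
  -- The encoding derived from the environment (LANG, LC_ALL, ... on Unix).
  locale <- getLocaleEncoding
  putStrLn ("Locale encoding: " ++ show locale)

  -- The encoding currently attached to stdout; Nothing means binary mode.
  current <- hGetEncoding stdout
  putStrLn ("stdout encoding: " ++ maybe "none (binary)" show current)

  -- A program can opt out of locale dependence for a given handle.
  hSetEncoding stdout utf8
  putStrLn "stdout now writes UTF-8 regardless of the locale"
```

Running it under different `LANG` settings shows the locale dependence being discussed here.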

Duncan



Re: [Haskell-cafe] Re: Strings and utf-8

2007-11-29 Thread Jules Bean

Duncan Coutts wrote:

On Wed, 2007-11-28 at 17:38 -0200, Maurício wrote:

(...)  When it's phrased as "truncates to 8
  bits" it sounds so simple, surely all we need
  to do is not truncate to 8 bits right?
 
  The problem is, what encoding should it pick?
  UTF8, 16, 32, EBCDIC? (...)
 
  One sensible suggestion many people have made
  is that H98 file IO should use the locale
  encoding and do Unicode/String <-> locale
  conversion. (...)

I'm really afraid of solutions where the behavior
of your program changes with an environment
variable that not everybody has configured
properly, or even knows exists.


Be afraid of all your standard Unix utils in that case. They are all
locale dependent, not just for encoding but also for sorting order and
the language of messages.


Language of messages is quite different from language of a file you read.

Suppose I am English, and I have a Russian friend, Vlad.

My default locale is, say, Latin-1, and his is something Cyrillic.

I might well open files including my own files, and his files. The 
locale of the current user is simply no guide to the correct encoding to 
read a file in, and not a particularly reliable guide to writing a file out.


Locale makes perfect sense for messages (you are communicating with the 
user, his locale tells you what language he speaks). It makes much less 
sense for file IO.


Jules


Re: [Haskell-cafe] Re: Strings and utf-8

2007-11-29 Thread Duncan Coutts
On Thu, 2007-11-29 at 13:05 +, Jules Bean wrote:

 Language of messages is quite different from language of a file you read.
 
 Suppose I am English, and I have a Russian friend, Vlad.
 
 My default locale is, say, Latin-1, and his is something Cyrillic.
 
 I might well open files including my own files, and his files. The 
 locale of the current user is simply no guide to the correct encoding to 
 read a file in, and not a particularly reliable guide to writing a file out.
 
 Locale makes perfect sense for messages (you are communicating with the 
 user, his locale tells you what language he speaks). It makes much less 
 sense for file IO.

Yes, it's a fundamental limitation of the Unix locale system and
multi-user systems. However, it's no less wrong than just picking UTF8
all the time. Obviously one needs a text file API that allows one to
specify the encoding for the cases where you happen to know it, but for
the H98 file API, where there is no way of specifying an encoding, what's
better than using the Unix default method (at least on Unix)?
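A text file API that lets the caller name the encoding could look roughly like the sketch below. `readFileWith` and `writeFileWith` are hypothetical helpers invented for this illustration; they are built on `hSetEncoding` from GHC's `System.IO`, which is an assumption about the implementation and not part of H98. The demo file name is arbitrary.

```haskell
import System.IO

-- Write a String to a file using an explicitly chosen encoding.
writeFileWith :: TextEncoding -> FilePath -> String -> IO ()
writeFileWith enc path s =
  withFile path WriteMode $ \h -> do
    hSetEncoding h enc
    hPutStr h s

-- Read a whole file using an explicitly chosen encoding.
readFileWith :: TextEncoding -> FilePath -> IO String
readFileWith enc path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    s <- hGetContents h
    -- Force the lazy contents before withFile closes the handle.
    length s `seq` return s

main :: IO ()
main = do
  let path = "encoding-demo.txt"          -- arbitrary demo file
      vlad = "\1042\1083\1072\1076"        -- four Cyrillic letters
  writeFileWith utf8 path vlad
  right <- readFileWith utf8 path          -- the encoding we happen to know
  wrong <- readFileWith latin1 path        -- a bad guess, e.g. from the locale
  print (right == vlad)
  print (length right, length wrong)
```

Reading the same bytes as Latin-1 does not fail; it silently yields twice as many characters, which is exactly the cross-locale problem in Jules' example with Vlad's files.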

Duncan



Re: [Haskell-cafe] Re: Strings and utf-8

2007-11-29 Thread Thomas Hartman
A translation of

http://www.ahinea.com/en/tech/perl-unicode-struggle.html

from Perl to Haskell would be a very useful piece of documentation, I
think.

That explanation really helped me get to grips with the encoding stuff, in
a Perl context.

thomas.







Re: [Haskell-cafe] Re: Strings and utf-8

2007-11-29 Thread Reinier Lamers

Thomas Hartman wrote:



A translation of

http://www.ahinea.com/en/tech/perl-unicode-struggle.html

from Perl to Haskell would be a very useful piece of documentation, I 
think. 


Perl encodes both Unicode and binary data as the same (dynamic) data 
type. Haskell - at least in theory - has two different types for them, 
namely [Char] for characters and [Word8] or ByteString for sequences of 
bytes. I think the Haskell approach is better, because the programmer in 
most cases knows whether he wants to treat his data as characters or as 
bytes. Perl does it the Perlish "We guess at what the coder means" way, 
which leads to a lot of frustration when Perl guesses wrong.
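A small sketch of that separation, assuming the `text` and `bytestring` packages for the conversion step (they are not named in this thread). `shout`, `checksum` and `toBytes` are made-up names for the illustration.

```haskell
import Data.Char (toUpper)
import Data.Word (Word8)
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- A character-level operation: only makes sense on [Char].
shout :: String -> String
shout = map toUpper

-- A byte-level operation: only makes sense on bytes (sum modulo 256).
checksum :: B.ByteString -> Word8
checksum = B.foldl' (+) 0

-- Crossing from characters to bytes is an explicit step
-- that names the encoding (UTF-8 here).
toBytes :: String -> B.ByteString
toBytes = TE.encodeUtf8 . T.pack

main :: IO ()
main = do
  let word = "na\239ve"                  -- five characters, one non-ASCII
  print (length word)                    -- counts characters
  print (B.length (toBytes word))        -- counts bytes
  putStrLn (map (const '*') (shout word))
  print (checksum (toBytes word))
  -- checksum word                       -- rejected by the type checker
```

The commented-out last line is the point: applying a byte operation to characters is a compile-time error rather than a runtime guess.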


The problems of the Haskeller trying to use Unicode, I think, will be 
different from those of the Perl hacker trying to use Unicode: the 
Haskeller will have to search for third-party modules to do what he 
wants, and finding those modules is the problem. The Perl hacker has all 
the Unicode support built in, but has to fight Perl occasionally to keep 
it from doing byte operations on his Unicode data.


I had a colleague here go all but insane last week trying to use 'split' 
on a Unicode string in Perl on Windows. split would break the string in 
the middle of a multi-byte UTF-8 character, crashing UTF-8 processing later on.
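The same failure can be reproduced in Haskell by splitting at the byte level, and avoided by splitting at the character level. This is a sketch assuming the `text` and `bytestring` packages; `sample` and `brokenHalf` are names invented for the example.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Three Cyrillic letters; each takes two bytes in UTF-8.
sample :: T.Text
sample = T.pack "\1078\1091\1082"

-- Splitting the encoded bytes at offset 3 cuts the second letter in half.
brokenHalf :: B.ByteString
brokenHalf = fst (B.splitAt 3 (TE.encodeUtf8 sample))

main :: IO ()
main = do
  -- Strict decoding reports the truncated sequence as a Left value.
  case TE.decodeUtf8' brokenHalf of
    Left err -> putStrLn ("byte-level split broke the text: " ++ show err)
    Right t  -> putStrLn ("decoded " ++ show (T.length t) ++ " characters")
  -- Splitting the Text itself works on characters and cannot do this.
  let (front, back) = T.splitAt 1 sample
  print (T.length front, T.length back)
```

Since `T.splitAt` counts characters rather than bytes, both halves remain valid text whatever the internal representation is.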


Reinier