Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread ross
On Mon, Sep 13, 2004 at 12:01:58PM +0100, Glynn Clements wrote:
 My view is that, right now, we have the worst of both worlds, and
 taking a short step backwards (i.e. narrow the Char type and leave the
 rest alone) is a lot simpler (and more feasible) than the long journey
 towards real I18N.

This being Haskell, I can't imagine a consensus on a step backwards.
In any case, a Char type distinct from bytes and the rest is the most
valuable part of the current situation.  The rest is just libraries,
and the solution to that is to create other libraries.  (It's true
that the Prelude is harder to work around, but even that can be done,
as with the new exception interface.)  Indeed more than one approach
can proceed concurrently, and that's probably what's going to happen:

The Right Thing proceeds in stages:
1. new byte-based libraries
2. conversions sitting on top of these
3. the ultimate I18N API

The Quick Fix: alter the existing implementation to use the
encoding determined by the current locale at the borders.

When the Right Thing is finished, the Quick Fix can be recast as a
special case.  The Right Thing might take a very long (possibly infinite)
time, because this is the sort of thing people can argue about endlessly.
Still, the first stage would deal with most of the scenarios you raised.
It just needs a group of people who care about it to get together and
do it. 

The Quick Fix is the most logical implementation of the current
definition of Haskell, and entirely consistent with its general
philosophy of presenting the programmer with an idealized (some might
say oversimplified) model of computation.  From the start, Haskell
has supported only character-based I/O, with whatever translations
were required to present a uniform view on all platforms.  And that's
not an entirely bad thing.  It won't work all the time, but it will be
simple, and good enough for most people.  Its existence will not rule
out binary I/O or more sophisticated alternatives.  Those who need more
may be motivated to help finish the Right Thing.
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Graham Klyne
I've not been following this debate, but I think I agree with Ross.
In particular, the idea of narrowing the Char type really seems like a 
bad idea to me (if I understand the intent correctly).  Not so long ago, I 
did a whole load of work on the HaXml parser so that, among other things, 
it would support UTF-8 and UTF-16 Unicode (as required by the XML 
spec).  To do this depends upon having a Char type that can represent the 
full repertoire of Unicode characters.

Other languages have been forced into this (maybe painful) transition;  I 
don't think Haskell can reasonably go backwards if it is to have any hope 
of surviving.

#g
--
At 12:31 15/09/04 +0100, [EMAIL PROTECTED] wrote:
On Mon, Sep 13, 2004 at 12:01:58PM +0100, Glynn Clements wrote:
 My view is that, right now, we have the worst of both worlds, and
 taking a short step backwards (i.e. narrow the Char type and leave the
 rest alone) is a lot simpler (and more feasible) than the long journey
 towards real I18N.
This being Haskell, I can't imagine a consensus on a step backwards.
In any case, a Char type distinct from bytes and the rest is the most
valuable part of the current situation.  The rest is just libraries,
and the solution to that is to create other libraries.  (It's true
that the Prelude is harder to work around, but even that can be done,
as with the new exception interface.)  Indeed more than one approach
can proceed concurrently, and that's probably what's going to happen:
The Right Thing proceeds in stages:
1. new byte-based libraries
2. conversions sitting on top of these
3. the ultimate I18N API
The Quick Fix: alter the existing implementation to use the
encoding determined by the current locale at the borders.
When the Right Thing is finished, the Quick Fix can be recast as a
special case.  The Right Thing might take a very long (possibly infinite)
time, because this is the sort of thing people can argue about endlessly.
Still, the first stage would deal with most of the scenarios you raised.
It just needs a group of people who care about it to get together and
do it.
The Quick Fix is the most logical implementation of the current
definition of Haskell, and entirely consistent with its general
philosophy of presenting the programmer with an idealized (some might
say oversimplified) model of computation.  From the start, Haskell
has supported only character-based I/O, with whatever translations
were required to present a uniform view on all platforms.  And that's
not an entirely bad thing.  It won't work all the time, but it will be
simple, and good enough for most people.  Its existence will not rule
out binary I/O or more sophisticated alternatives.  Those who need more
may be motivated to help finish the Right Thing.

Graham Klyne
For email:
http://www.ninebynine.org/#Contact


Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Marcin 'Qrczak' Kowalczyk
Glynn Clements [EMAIL PROTECTED] writes:

 Unless you are the sole user of a system, you have no control over
 what filenames may occur on it (and even if you are the sole user,
 you may wish to use packages which don't conform to your rules).

For these occasions you may set the encoding to ISO-8859-1. But then
you can't sensibly show them to the user in a GUI, nor in ncurses
using the wide character API, nor can you sensibly store them in a
file which is always encoded in UTF-8 (e.g. an XML file, where you
can't put raw bytes without knowing their encoding).

There are two paradigms: manipulating bytes without knowing their
encoding, and manipulating characters explicitly encoded in various
encodings (possibly UTF-8). The world is slowly migrating from the
first to the second.

  There are limits to the extent to which this can be achieved. E.g. 
  what happens if you set the encoding to UTF-8, then call
  getDirectoryContents for a directory which contains filenames which
  aren't valid UTF-8 strings?
 
 The library fails. Don't do that. This environment is internally
 inconsistent.

 Call it what you like, it's a reality, and one which programs need to
 deal with.

The reality is that filenames are encoded in different encodings
depending on the system. Sometimes it's ISO-8859-1, sometimes
ISO-8859-2, sometimes UTF-8. We should not ignore the possibility
of UTF-8-encoded filenames.

In CLisp it fails silently (undecodable filenames are skipped), which
is bad. It should fail loudly.

 Most programs don't care whether any filenames which they deal with
 are valid in the locale's encoding (or any other encoding). They just
 receive lists (i.e. NUL-terminated arrays) of bytes and pass them
 directly to the OS or to libraries.

And this is why I can't switch my home environment to UTF-8 yet. Too
many programs are broken; almost all terminal programs which use more
than stdin and stdout in default modes, i.e. which use line editing or
work in full screen. How would you display a filename in a full screen
text editor, such that it works in a UTF-8 environment?

 If the assumed encoding is ISO-8859-*, this program will work
 regardless of the filenames which it is passed or the contents of the
 file (modulo the EOL translation on Windows). OTOH, if it were to use
 UTF-8 (e.g. because that was the locale's encoding), it wouldn't work
 correctly if either filename or the file's contents weren't valid
 UTF-8.

A program is not supposed to encounter filenames which are not
representable in the locale's encoding. In your setting it's
impossible to display a filename in a way other than printing
to stdout.

 More accurately, it specifies which encoding to assume when you *need*
 to know the encoding (i.e. ctype.h etc), but you can't obtain that
 information from a more reliable source.

In the case of filenames there is no more reliable source.

 My central point is that the existing API forces the encoding to be
 an issue when it shouldn't be.

It is an unavoidable issue because not every interface in a given
computer system uses the same encoding. Gtk+ uses UTF-8; you must
convert text to UTF-8 in order to display it, and in order to convert
you must know its encoding.

 Well, to an extent it is an implementation issue. Historically, curses
 never cared about encodings. A character is a byte, you draw bytes on
 the screen, curses sends them directly to the terminal.

This is the old API. But newer ncurses API is prepared even for
combining accents. A character is coded with a sequence of wchar_t
values, such that all except the first one are combining characters.

 Furthermore, the curses model relies upon monospaced fonts, and falls
 down once you encounter CJK text (where a monospaced font means one
 whose glyphs are an integer multiple of the cell size, not necessarily
 a single cell).

It doesn't fall down. Characters may span several columns. There is
wcwidth(), and the curses specification in X/Open says how it should
behave for wide CJK characters. I haven't tested it, but I believe
ncurses supports them.

 Extending something like curses to handle encoding issues is far
 from trivial; which is probably why it hasn't been finished yet.

It's almost finished. The API specification was ready in 1997.
It works in ncurses modulo unfixed bugs.

But programs can't use it unless they use Unicode internally.

 Although, if you're going to have implicit String -> [Word8]
 converters, there's no reason why you can't do the reverse, and have
 isAlpha :: Word8 -> IO Bool. Although, like ctype.h, this will only
 work for single-byte encodings.

We should not ignore multibyte encodings like UTF-8, which means that
Haskell should have a Unicoded character type. And it's already
specified in Haskell 98 that Char is such a type!
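That claim is easy to check: Char already covers the full Unicode
range, not just one byte. A minimal illustration:

```haskell
import Data.Char (chr, ord)

-- Haskell 98 defines Char as a Unicode character, so code points well
-- beyond the 0..255 byte range are ordinary values.
main :: IO ()
main = do
  print (ord '\x105')        -- U+0105 LATIN SMALL LETTER A WITH OGONEK: 261
  print (ord (chr 0x3BB))    -- U+03BB GREEK SMALL LETTER LAMDA: 955
```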

What is missing is API for manipulating binary files, and conversion
between byte streams and character streams using particular text
encodings.
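Such a conversion layer can be sketched in a few dozen lines. Below is
a minimal, illustrative UTF-8 codec between [Word8] and String; the
names utf8Encode and utf8Decode are invented here, and a real library
would additionally reject surrogates and overlong forms:

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode one Char as its UTF-8 byte sequence.
utf8Encode :: Char -> [Word8]
utf8Encode c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    hi k   = fromIntegral (n `shiftR` k)
    cont k = 0x80 .|. (fromIntegral (n `shiftR` k) .&. 0x3F)

-- Decode a whole byte string; Nothing on malformed input
-- (the "fail loudly" behaviour discussed above).
utf8Decode :: [Word8] -> Maybe String
utf8Decode [] = Just ""
utf8Decode (w:ws)
  | w < 0x80  = fmap (chr (fromIntegral w) :) (utf8Decode ws)
  | w < 0xC0  = Nothing                        -- stray continuation byte
  | w < 0xE0  = multi 1 (fromIntegral w .&. 0x1F)
  | w < 0xF0  = multi 2 (fromIntegral w .&. 0x0F)
  | w < 0xF8  = multi 3 (fromIntegral w .&. 0x07)
  | otherwise = Nothing
  where
    multi k acc
      | length xs == k && all isCont xs =
          fmap (chr (foldl step acc xs) :) (utf8Decode rest)
      | otherwise = Nothing
      where (xs, rest) = splitAt k ws
    isCont b = 0x80 <= b && b < 0xC0
    step a b = a `shiftL` 6 .|. (fromIntegral b .&. 0x3F)
```

For example, U+00E1 encodes to the bytes 0xC3 0xA1, matching the ext2
example given elsewhere in this thread.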

 A mail client is expected to respect the encoding set in 

Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Glynn Clements

Marcin 'Qrczak' Kowalczyk wrote:

  Unless you are the sole user of a system, you have no control over
  what filenames may occur on it (and even if you are the sole user,
  you may wish to use packages which don't conform to your rules).
 
 For these occasions you may set the encoding to ISO-8859-1. But then
 you can't sensibly show them to the user in a GUI, nor in ncurses
 using the wide character API, nor can you sensibly store them in a
 file which is always encoded in UTF-8 (e.g. an XML file, where you
 can't put raw bytes without knowing their encoding).

If you need to preserve the data exactly, you can use octal escapes
(\337), URL encoding (%DF) or similar. If you don't, you can just
approximate it (e.g. display unrepresentable characters as "?"). But
this is an inevitable consequence of filenames being bytes rather than
chars.
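The URL-style escaping mentioned here is a few lines to sketch;
percentEncode is a hypothetical name, keeping printable ASCII as-is
and escaping everything else as %XX:

```haskell
import Data.Char (chr, intToDigit, toUpper)
import Data.Word (Word8)

-- Render an arbitrary byte string losslessly as printable ASCII:
-- printable characters pass through, everything else becomes %XX.
percentEncode :: [Word8] -> String
percentEncode = concatMap esc
  where
    esc b
      | b >= 0x20 && b < 0x7F && b /= 0x25 = [chr (fromIntegral b)]
      | otherwise                          = '%' : hex b
    hex b = map (toUpper . intToDigit . fromIntegral)
                [b `div` 16, b `mod` 16]
```

Being injective, the encoding can be reversed exactly, unlike the
"display as ?" approximation.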

[Actually, regarding on-screen display, this is also an issue for
Unicode. How many people actually have all of the Unicode glyphs? I
certainly don't.]

 There are two paradigms: manipulating bytes without knowing their
 encoding, and manipulating characters explicitly encoded in various
 encodings (possibly UTF-8). The world is slowly migrating from the
 first to the second.

This migration isn't a process which will ever be complete. There will
always be plenty of cases where bytes really are just bytes.

And even to the extent that it can be done, it will take a long time. 
Outside of the Free Software ghetto, long-term backward compatibility
still means a lot.

[E.g.: EBCDIC has been in existence longer than I have and, in spite
of the fact that it's about the only widely-used encoding in existence
which doesn't have ASCII as a subset, it shows no sign of dying out
any time soon.]

   There are limits to the extent to which this can be achieved. E.g. 
   what happens if you set the encoding to UTF-8, then call
   getDirectoryContents for a directory which contains filenames which
   aren't valid UTF-8 strings?
  
  The library fails. Don't do that. This environment is internally
  inconsistent.
 
  Call it what you like, it's a reality, and one which programs need to
  deal with.
 
 The reality is that filenames are encoded in different encodings
 depending on the system. Sometimes it's ISO-8859-1, sometimes
 ISO-8859-2, sometimes UTF-8. We should not ignore the possibility
 of UTF-8-encoded filenames.

I'm not suggesting we do.

 In CLisp it fails silently (undecodable filenames are skipped), which
 is bad. It should fail loudly.

No, it shouldn't fail at all.

  Most programs don't care whether any filenames which they deal with
  are valid in the locale's encoding (or any other encoding). They just
  receive lists (i.e. NUL-terminated arrays) of bytes and pass them
  directly to the OS or to libraries.
 
 And this is why I can't switch my home environment to UTF-8 yet. Too
 many programs are broken; almost all terminal programs which use more
 than stdin and stdout in default modes, i.e. which use line editing or
 work in full screen. How would you display a filename in a full screen
 text editor, such that it works in a UTF-8 environment?

So, what are you suggesting? That the whole world switches to UTF-8? 
Or that every program should pass everything through iconv() (and
handle the failures)? Or what?

  If the assumed encoding is ISO-8859-*, this program will work
  regardless of the filenames which it is passed or the contents of the
  file (modulo the EOL translation on Windows). OTOH, if it were to use
  UTF-8 (e.g. because that was the locale's encoding), it wouldn't work
  correctly if either filename or the file's contents weren't valid
  UTF-8.
 
 A program is not supposed to encounter filenames which are not
 representable in the locale's encoding.

Huh? What does "supposed to" mean in this context? That everything
would be simpler if reality wasn't how it is?

If that's your position, then my response is essentially: "Yes, but so
what?"

 In your setting it's
 impossible to display a filename in a way other than printing
 to stdout.

Writing to stdout doesn't amount to displaying anything; stdout
doesn't have to be a terminal.

  More accurately, it specifies which encoding to assume when you *need*
  to know the encoding (i.e. ctype.h etc), but you can't obtain that
  information from a more reliable source.
 
 In the case of filenames there is no more reliable source.

Sure; but that doesn't automatically mean that the locale's encoding
is correct for any given filename. The point is that you often don't
need to know the encoding.

Converting a byte string to a character string when you're just going
to be converting it back to the original byte string is pointless. And
it introduces unnecessary errors. If the only difference between
(decode . encode) and the identity function is that the former
sometimes fails, what's the point?
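The ISO-8859-1 case makes this concrete: decoding is total and the
round trip is exactly the identity, which is why it is the safe
default being argued for here (decodeLatin1/encodeLatin1 are
illustrative names):

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- ISO-8859-1 maps byte n to code point n, so decoding cannot fail and
-- encode . decode is the identity on every possible byte string.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Note: only correct for Chars below U+0100; anything larger would
-- need a real encoder (or an error).
encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)
```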

  My central point is that the existing API forces the encoding to be
  an issue when it shouldn't be.
 
 It is an unavoidable 

Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Glynn Clements

Udo Stenzel wrote:

  Note that this needs to include all of the core I/O functions, not
  just reading/writing streams. E.g. FilePath is currently an alias for
  String, but (on Unix, at least) filenames are strings of bytes, not
  characters. Ditto for argv, environment variables, possibly other
  cases which I've overlooked.
 
 I don't think so.  They all are sequences of CChars, and C isn't
 particularly known for keeping bytes and chars apart.

CChar is a C char, which is a byte (not necessarily an octet, and
not necessarily a character either).

 I believe,
 Windows NT has (alternate) filename handling functions that use Unicode
 strings.

Almost all of the Win32 API functions which handle strings exist in
both char and wide-char versions.

 This would strengthen the view that a filename is a sequence
 of characters.

It would be reasonable to make FilePath equivalent to String on
Windows, but not on Unix.

 Ditto for argv, env, whatnot; they are typically entered
 from the shell and therefore are characters in the local encoding.

Both argv and envp are char**, i.e. lists of byte strings. There is no
guarantee that the values can be successfully decoded according to the
locale's encoding.

The environment is typically set on login, and inherited thereafter. 
It's typically limited to ASCII, but this isn't guaranteed. Similarly,
a program may need to access files which it didn't create, and which
have filenames which aren't valid strings in the user's locale.

E.g. a user may choose a locale which uses UTF-8, but the sysadmin has
installed files with ISO-8859-1 filenames. If a Haskell program tries
to coerce everything to String using the user's locale, the program
will be unable to access such files.

   3. The default encoding is settable from Haskell, defaults to
  ISO-8859-1.
  
  Agreed.
 
 Oh no, please don't do that.  A global, settable encoding is, well,
 dys-functional.  Hidden state makes programs hard to understand and
 Haskell imho shouldn't go that route.

There's already plenty of hidden state in the system libraries upon
which a Haskell program depends.

 And please don't introduce the notion of a default encoding.

It isn't an issue of *introducing* it. Many Haskell98 functions (i.e. 
much of IO, System and Directory) accept or return Strings, yet have
to be implemented on top of an OS which accepts or provides char*s. 
There *has* to be an encoding between the two, and currently it's
hardwired to ISO-8859-1.

The alternative to a global encoding is for *all* functions which
interface to the OS to always either accept or return [CChar] or, if
they accept or return Strings, accept an additional argument which
specifies the encoding.

Also, bear in mind that the functions under discussion are all I/O
functions which, by their nature, deal with state (e.g. the state of
the filesystem).

 I'd like to see the following:
 
 - Duplicate the IO library.  The duplicate should work with [Byte]
   everywhere where the old library uses String.  Byte is some suitable
   unsigned integer, on most (all?) platforms this will be Word8

Technically it should be CChar. However, it's fairly safe to assume
that a byte will always be 8 bits; almost nobody writes code which
works on systems where it isn't.

However: if we go this route, I suspect that we will also need a
convenient method for specifying literal byte strings in Haskell
source code.
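As a stopgap for such literals, ordinary string syntax can be reused
with a checked conversion (the name bytes is invented here):

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Treat a string literal as a byte-string literal, rejecting characters
-- that don't fit in a single byte.
bytes :: String -> [Word8]
bytes = map toByte
  where
    toByte c
      | ord c < 256 = fromIntegral (ord c)
      | otherwise   = error ("bytes: not a byte: " ++ show c)
```

E.g. bytes "GIF89a" gives the magic-number bytes of a GIF header
without going through any encoding machinery.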

 - Provide an explicit conversion between encodings.  A simple conversion
   of type [Word8] -> String would suit me, iconv would provide all that
   is needed.

For the general case, you need to allow for stateful encodings (e.g. 
ISO-2022). Actually, even UTF-8 needs to deal with state if you need
to decode byte streams which are split into chunks and the breaks can
occur in the middle of a character (e.g. if you're using non-blocking
I/O).
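Even without full ISO-2022-style state, a chunked UTF-8 decoder must
carry a partial sequence between chunks. A sketch of just that
carry-over step (assuming well-formed input; the names are invented
here):

```haskell
import Data.Word (Word8)

-- Byte length of the UTF-8 sequence introduced by a lead byte.
seqLen :: Word8 -> Int
seqLen b
  | b < 0x80  = 1
  | b < 0xE0  = 2
  | b < 0xF0  = 3
  | otherwise = 4

-- Split a chunk into bytes forming complete UTF-8 sequences and a
-- trailing fragment to prepend to the next chunk (the decoder state).
splitComplete :: [Word8] -> ([Word8], [Word8])
splitComplete = go []
  where
    go acc [] = (reverse acc, [])
    go acc xs@(x:_)
      | length xs < n = (reverse acc, xs)
      | otherwise     = let (s, rest) = splitAt n xs
                        in go (reverse s ++ acc) rest
      where n = seqLen x
```

A stream decoder then decodes the first component and keeps the second
as its state until more input arrives.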

 - iconv takes names of encodings as arguments.  Provide some names as
   constants: one name for the internal encoding (probably UCS4), one
   name for the canonical external encoding (probably locale dependent).
 
 - Then redefine the old IO API in terms of the new API and appropriate
   conversions.

The old API requires an implicit encoding. The OS accepts or
provides bytes, the old API functions accept or return Chars, and the
old API functions don't accept an encoding argument.

This is why we are (or, at least, I am) suggesting a settable current
encoding. Because the existing API *needs* a current encoding, and I'm
assuming that there may be some reluctance to just discarding it
completely.

 While we're at it, do away with the annoying CR/LF problem on Windows;
 this should simply be part of the local encoding.  This way files can
 always be opened as binary, and hSetBinary can be dropped.  (This won't
 work on ancient platforms where text files and binary files are genuinely
 different, but these are probably not interesting anyway.)

Apart from OS-specific issues, it would be useful to treat EOL
conventions 

Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Marcin 'Qrczak' Kowalczyk
Glynn Clements [EMAIL PROTECTED] writes:

 [Actually, regarding on-screen display, this is also an issue for
 Unicode. How many people actually have all of the Unicode glyphs?
 I certainly don't.]

If I don't have a particular character in fonts, I will not create
files with it in filenames. Actually I only use 9 Polish letters in
addition to ASCII, and even them rarely. Usually it's only a subset
of ASCII.

Some programs use UTF-8 in filenames no matter what the locale is. For
example the Evolution mail program which stores mail folders as files
under names the user entered in a GUI. I had to rename some of these
files in order to import them to Gnus, as it choked on filenames with
strange characters, never mind that it didn't display them correctly
(maybe because it tried to map them to virtual newsgroup names, or
maybe because they are control characters in ISO-8859-x).

If all programs consistently used the locale encoding for filenames,
this should have worked.

When I switch my environment to UTF-8, which may happen in a few
years, I will convert filenames to UTF-8 and set up mount options to
translate vfat filenames to/from UTF-8 instead of to ISO-8859-2.

I expect good programs to understand that and display them correctly
no matter what technique they are using for the display. For example
the Epiphany web browser, when I open the file:/home/users/qrczak URL,
displays ISO-8859-2-encoded filenames correctly. The virtual HTML file
it created from the directory listing has &#x105; in its title where
the directory name had 0xB1 in ISO-8859-2. When I run Epiphany with
the locale set to pl_PL.UTF-8, it displays UTF-8 filenames correctly
and ISO-8859-2 filenames are not shown at all.

It's fine for me that it doesn't deal with wrongly encoded filenames,
because it allowed well-encoded filenames to be treated correctly. For a
web page rendered on the screen it makes no sense to display raw
bytes. Epiphany treats filenames as sequences of characters encoded
according to the locale.

 And even to the extent that it can be done, it will take a long time. 
 Outside of the Free Software ghetto, long-term backward compatibility
 still means a lot.

Windows has already switched most of its internals to Unicode, and it
did it faster than Linux.

 In CLisp it fails silently (undecodable filenames are skipped), which
 is bad. It should fail loudly.

 No, it shouldn't fail at all.

Since it uses Unicode as string representation, accepting filenames
not encoded in the locale encoding would imply making garbage from
filenames correctly encoded in the locale encoding. In a UTF-8
environment character U+00E1 in the filename means bytes 0xC3 0xA1
on ext2 filesystem (and 0x00E1 on vfat filesystem), so it can't at
the same time mean 0xE1 on ext2 filesystem.

 And this is why I can't switch my home environment to UTF-8 yet. Too
 many programs are broken; almost all terminal programs which use more
 than stdin and stdout in default modes, i.e. which use line editing or
 work in full screen. How would you display a filename in a full screen
 text editor, such that it works in a UTF-8 environment?

 So, what are you suggesting? That the whole world switches to UTF-8?

No, each computer system decides for itself, and announces it in the
locale setting. I'm suggesting that programs should respect that and
correctly handle all correctly encoded texts, including filenames.

Better programs may offer to choose the encoding explicitly when it
makes sense (e.g. text file editors for opening a file), but if they
don't, they should at least accept the locale encoding.

 Or that every program should pass everything through iconv()
 (and handle the failures)?

If it uses Unicode as internal string representation, yes (because the
OS API on Unix generally uses byte encodings rather than Unicode).

This should be done transparently in libraries of respective languages
instead of in each program independently.

 A program is not supposed to encounter filenames which are not
 representable in the locale's encoding.

 Huh? What does "supposed to" mean in this context? That everything
 would be simpler if reality wasn't how it is?

It means that if it encounters a filename encoded differently, it's
usually not the fault of the program but of whoever caused the
mismatch in the first place.

 In your setting it's impossible to display a filename in a way
 other than printing to stdout.

 Writing to stdout doesn't amount to displaying anything; stdout
 doesn't have to be a terminal.

I know, it's not the point. The point is that other display channels
than stdout connected to a terminal often work in terms of characters
rather than bytes of some implicit encoding. For example various GUI
frameworks, and wide character ncurses.

 Sure; but that doesn't automatically mean that the locale's encoding
 is correct for any given filename. The point is that you often don't
 need to know the encoding.

What if I do need to know the encoding? I must assume 

[Haskell-cafe] Re: Writing binary files?

2004-09-15 Thread Gabriel Ebner
Glynn Clements [EMAIL PROTECTED] writes:

 3. The default encoding is settable from Haskell, defaults to
ISO-8859-1.

 Agreed.

So every Haskell program that does more than just passing raw bytes
from stdin to stdout should decode the appropriate environment
variables, and set the encoding by itself?  IMO that's too much
redundancy; the RTS should actually do that.

 There are limits to the extent to which this can be achieved. E.g.
 what happens if you set the encoding to UTF-8, then call
 getDirectoryContents for a directory which contains filenames which
 aren't valid UTF-8 strings?

Then you _seriously_ messed up.  Your terminal would produce garbage,
Nautilus would break, ...

 5. The default encoding is settable from Haskell, defaults to the
locale encoding.

 I feel that the default encoding should be one whose decoder cannot
 fail, e.g. ISO-8859-1. You should have to explicitly request the use
 of the locale's encoding (analogous to calling setlocale(LC_CTYPE, "")
 at the start of a C program; there's a good reason why C doesn't do
 this without being explicitly told to).

So that any Haskell program that doesn't call setlocale and outputs
anything other than US-ASCII will produce garbage on a UTF-8 system?

 Actually, the more I think about it, the more I think that simple,
 stupid programs probably shouldn't be using Unicode at all.

Care to give any examples?  Everything that has been mentioned until
now would break with a UTF-8 locale:
- ls (sorting would break),
- env (sorting too)

 I.e. Char, String, string literals, and the I/O functions in
 Prelude, IO etc should all be using bytes,

I don't want the same mess as in C, where strings and raw data are
one and the same.  Haskell has a nice type system and nicely defined
types for binary data ([Word8]) and for Strings (String), so why not
use them?

 with a distinct wide-character API available for people who want to
 make the (substantial) effort involved in writing (genuinely)
 internationalised programs.

If you introduce an entirely new i18n-only API, then it'll surely
become difficult. :-)

 Anything that isn't ISO-8859-1 just doesn't work for the most part,
 and anyone who wants to provide real I18N first has to work around
 the pseudo-I18N that's already there (e.g. convert Chars back into
 Word8s so that they can decode them into real Chars).

One more reason to fix the I/O functions to handle encodings and have
a separate/underlying binary I/O API.

 Oh, and because bytes are being stored in Chars, the type system won't
 help if you neglect to decode a string, or if you decode it twice.

Yes, that's the problem with the current approach, i.e. that there's
no easy way to get a list of Word8's out of a handle.
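With GHC's System.IO, this gap can already be papered over via
hGetBuf; a small wrapper (hGetBytes is an invented name) yields Word8s
without ever constructing a String:

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (peekArray)
import System.IO (Handle, IOMode (ReadMode), hGetBuf, openBinaryFile)

-- Read up to n bytes from a handle as raw Word8s.
hGetBytes :: Handle -> Int -> IO [Word8]
hGetBytes h n = allocaBytes n $ \buf -> do
  got <- hGetBuf h buf n        -- how many bytes were actually read
  peekArray got buf

main :: IO ()
main = do
  writeFile "demo.bin" "abc"
  h <- openBinaryFile "demo.bin" ReadMode
  bs <- hGetBytes h 16
  print bs                      -- [97,98,99]
```

It is a workaround rather than a design, but it shows the byte-level
layer is within reach of the existing implementation.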

  The advantage of assuming ISO-8859-* is that the decoder can't fail;
  every possible stream of bytes is valid.
 
 Assuming. But UTF-8 is not ISO-8859-*. When someday I change most of
 my files and filenames from ISO-8859-2 to UTF-8, and change the
 locale, the assumption will be wrong. I can't change that now, because
 too many programs would break.
 
 The current ISO-8859-1 assumption is also wrong. A program written in
 Haskell which sorts strings would break for non-ASCII letters even now
 that they are ISO-8859-2 unless specified otherwise.

 1. In that situation, you can't avoid the encoding issues. It doesn't
 matter what the default is, because you're going to have to set the
 encoding anyhow.

Why do you always want me to set the encoding?  That should be the job
of the RTS.  It's ok to use a different API to get Strings instead of
Word8's out of a handle, but _manually_ having to set the encoding?
IIRC, Haskell is meant to be portable, and locale handling is pretty
platform-dependent.

 2. If you assume ISO-8859-1, you can always convert back to Word8

If I want a list of Word8's, then I should be able to get them without
extracting them from a string.

 then re-decode as UTF-8. If you assume UTF-8, anything which is neither
 UTF-8 nor ASCII will fail far more severely than just getting the
 collation order wrong.

If I use String's to handle binary data, then I should expect things
to break.  If I want to get text, and it's not in the expected
encoding, then the user has messed up.

 Well, my view is essentially that files should be treated as
 containing bytes unless you explicitly choose to decode them, at
 which point you have to specify the encoding.

Why do you always want me to _manually_ specify an encoding?  If I
want bytes, I'll use the (currently being discussed, see beginning of
this thread) binary I/O API, if I want String's (i.e. text), I'll use
the current I/O API (which is pretty text-orientated anyway, see
hPutStrLn, hGetLine, ...).

 completely new wide-character API for those who wish to use it.

Which would make it horrendously difficult to do even basic I18N.

 That gets the failed attempt at I18N out of everyone's way with a
 minimum of effort and with maximum backwards compatibility for

Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Glynn Clements

Marcin 'Qrczak' Kowalczyk wrote:

  [Actually, regarding on-screen display, this is also an issue for
  Unicode. How many people actually have all of the Unicode glyphs?
  I certainly don't.]
 
 If I don't have a particular character in fonts, I will not create
 files with it in filenames. Actually I only use 9 Polish letters in
 addition to ASCII, and even them rarely. Usually it's only a subset
 of ASCII.

But this seems to be assuming a closed world. I.e. the only files
which the program will ever see are those which were created by you,
or by others who are compatible with your conventions.

 Some programs use UTF-8 in filenames no matter what the locale is. For
 example the Evolution mail program which stores mail folders as files
 under names the user entered in a GUI.

This is entirely reasonable for a file which a program creates. If a
filename is just a string of bytes, a program can use whatever
encoding it wants.

 I had to rename some of these
 files in order to import them to Gnus, as it choked on filenames with
 strange characters, never mind that it didn't display them correctly
 (maybe because it tried to map them to virtual newsgroup names, or
 maybe because they are control characters in ISO-8859-x).

If it had just treated them as bytes, rather than trying to interpret
them as characters, there wouldn't have been any problems.

 If all programs consistently used the locale encoding for filenames,
 this should have worked.

But again, for this to work in general, you have to assume a closed
world.

 When I switch my environment to UTF-8, which may happen in a few
 years, I will convert filenames to UTF-8 and set up mount options to
 translate vfat filenames to/from UTF-8 instead of to ISO-8859-2.

But what about files which were created by other people who don't
use UTF-8?

 I expect good programs to understand that and display them correctly
 no matter what technique they are using for the display.

When it comes to display, you have to deal with encoding
issues one way or another. But not all programs deal with display.

 For example
 the Epiphany web browser, when I open the file:/home/users/qrczak URL,
 displays ISO-8859-2-encoded filenames correctly. The virtual HTML file
it created from the directory listing has &#x105; in its title where
 the directory name had 0xB1 in ISO-8859-2. When I run Epiphany with
 the locale set to pl_PL.UTF-8, it displays UTF-8 filenames correctly
 and ISO-8859-2 filenames are not shown at all.

For many (probably most) programs, omitting such files would be an
unacceptable failure.

  And even to the extent that it can be done, it will take a long time. 
  Outside of the Free Software ghetto, long-term backward compatibility
  still means a lot.
 
 Windows has already switched most of its internals to Unicode, and it
 did it faster than Linux.

Microsoft is actively hostile to both backwards compatibility and
cross-platform compatibility.

I consider the fact that some Unix (primarily Linux) developers seem
equally hostile to be a problem.

Having said that, with Linux developers, the issue is usually due to
not being bothered. Assuming that everything is UTF-8 allows a lot of
potential problems to be ignored.

Fortunately, the problem is mostly consigned to the periphery, i.e. 
the desktop, where most programs have to deal with display issues (so
you *have* to decode bytes into characters), and it isn't too critical
if they have limitations.

The core OS and network server applications essentially remain
encoding-agnostic.

  In CLisp it fails silently (undecodable filenames are skipped), which
  is bad. It should fail loudly.
 
  No, it shouldn't fail at all.
 
 Since it uses Unicode as string representation, accepting filenames
 not encoded in the locale encoding would imply making garbage from
 filenames correctly encoded in the locale encoding. In a UTF-8
 environment character U+00E1 in the filename means bytes 0xC3 0xA1
 on ext2 filesystem (and 0x00E1 on vfat filesystem), so it can't at
 the same time mean 0xE1 on ext2 filesystem.
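Marcin's byte arithmetic is easy to check: the standard UTF-8 rules do map
U+00E1 to exactly 0xC3 0xA1. A minimal sketch of the encoding algorithm
(the utf8Bytes name is illustrative; only code points up to U+FFFF are
handled here):

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one code point to UTF-8 octets (sketch: covers U+0000..U+FFFF).
utf8Bytes :: Char -> [Word8]
utf8Bytes c
  | n < 0x80  = [fromIntegral n]
  | n < 0x800 = [ 0xC0 .|. fromIntegral (n `shiftR` 6)
                , 0x80 .|. fromIntegral (n .&. 0x3F) ]
  | otherwise = [ 0xE0 .|. fromIntegral (n `shiftR` 12)
                , 0x80 .|. fromIntegral ((n `shiftR` 6) .&. 0x3F)
                , 0x80 .|. fromIntegral (n .&. 0x3F) ]
  where n = ord c
```

For U+00E1 this yields [0xC3, 0xA1], as claimed.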

But, as I keep pointing out, filenames are byte strings, not character
strings. You shouldn't be converting them to character strings unless
you have to.

  And this is why I can't switch my home environment to UTF-8 yet. Too
  many programs are broken; almost all terminal programs which use more
  than stdin and stdout in default modes, i.e. which use line editing or
  work in full screen. How would you display a filename in a full screen
  text editor, such that it works in a UTF-8 environment?
 
  So, what are you suggesting? That the whole world switches to UTF-8?
 
 No, each computer system decides for itself, and announces it in the
 locale setting. I'm suggesting that programs should respect that and
 correctly handle all correctly encoded texts, including filenames.

1. Actually, each user decides which locale they wish to use. Nothing
forces two users of a system to use the same locale.

2. Even if 

[Haskell-cafe] Layered I/O

2004-09-15 Thread oleg

Binary i/o is not specifically a Haskell problem. Other programming
systems, for example, Scheme, have been struggling with the same
issues. Scheme Binary I/O proposal may therefore be of some interest
in this group.
http://srfi.schemers.org/srfi-56/
It is deliberately made to be the easiest to implement and the least
controversial.

There exist more ambitious proposals. The key feature is a layered, or
stackable i/o. Enclosed is a justification message that I wrote two
years ago for Haskell-Cafe, but somehow did not post. An early draft
of the ambitious proposal, cast in the context of Scheme,
is available here:
http://pobox.com/~oleg/ftp/Scheme/io.txt

More polished drafts exist, and even a prototype
implementation. Unfortunately, once it became clear that the ideas are
working out, the motivation fizzled.


The discussion of i18n i/o highlighted the need for general overlay
streams. We should be able to place a processing layer onto a handle
-- and to peel it off and place another one. The layers can do
character encoding, subranging (limiting the stream to the specified
number of basic units), base64 and other decoding, signature
collecting and verification, etc.

Processing of a response from a web server illustrates a genuine need for
such overlayed processing. Suppose we have established a connection to
a web server, sent a GET or POST request and are now reading the
response. It starts as follows:
HTTP/1.1 200 Have it
Content-type: text/plain; charset=iso-2022-jp
Content-length: 12345
Date: Tuesday, August 13, 2002
empty-line

To read the response line and the content headers, our stream must be
in an ASCII, Latin-1 or UTF-8 encoding (regardless of the current
locale). The body of the message is encoded in iso-2022-jp. This
encoding may have nothing to do with the current locale. Furthermore,
many character encodings cannot be reliably detected
automatically. Therefore, after we read the headers we must forcibly
set our stream to use the iso-2022-jp encoding. ISO-2022-JP is the
encoding for Japanese characters
[http://www.faqs.org/rfcs/rfc1554.html]. It is a variable-length
stateful encoding: the start of the specifically Japanese encoding is
indicated by \e$B.  After that, the reader should read _two octets_
from the input stream (and pass them to the application as they
are). The server has indicated that it was sending 12345 _octets_ of
data. We cannot tell offhand how many _characters_ of data we will
read, because of the variable-length encoding. However, we must not
even attempt to read the 12346-th octet: HTTP/1.1 connections are, in
general, persistent, and we must not read more data than were
sent. Otherwise, we deadlock. Therefore, our stream must be able to
give us Japanese characters and still must count octets. The HTTP
stream will not, in general, give an EOF condition at the end of data.
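The octet-counting constraint described here can be sketched as a tiny
stream layer: a wrapper that hands out at most n octets from an underlying
source and then reports a logical EOF, without ever pulling more from
below. The ByteSource type and limitOctets name are illustrative, not part
of any concrete proposal:

```haskell
import Data.IORef
import Data.Word (Word8)

-- A minimal byte source: returns Nothing at end of available data.
type ByteSource = IO (Maybe Word8)

-- Layer a counting limit on top of an existing source: after n octets it
-- reports EOF, and never reads more than n from the source below, so the
-- underlying (persistent) connection is left untouched.
limitOctets :: Int -> ByteSource -> IO ByteSource
limitOctets n src = do
  left <- newIORef n
  return $ do
    k <- readIORef left
    if k <= 0
      then return Nothing           -- logical EOF; the connection stays open
      else do
        writeIORef left (k - 1)
        src
```

A character decoder stacked on top of such a layer sees EOF after the
Content-Length octets, exactly as required.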

That is not the only complication. Suppose the web server replied:
Content-type: text/plain; charset=iso-2022-jp
Transfer-encoding: chunked

We should expect to read a sequence of chunks of the format:
length CRLF body CRLF
where length is a hexadecimal number, and body is encoded as
indicated in the Content-type. Therefore, after we read the header, we
should keep our stream in the ASCII mode to read the length
field. After that, we should switch the encoding into
ISO-2022-JP. After we have consumed length octets, we should switch
the stream back to ASCII, verify the trailing CRLF, and read the
length of the next chunk. The ISO-2022-JP encoding is stateful and
wide (a character is formed by several octets). It may well happen
that a wide character is split between two chunks: one octet of a
character will be in one chunk and the other octets in following
chunk. Therefore, when we switch from the ISO-2022-JP encoding to
ASCII and back, we must preserve the state of the encoding.
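The chunk framing itself is simple enough to sketch as a pure function.
This hypothetical dechunk models octets as a String for brevity and
ignores chunk extensions and trailers:

```haskell
import Data.Char (digitToInt, isHexDigit)

-- Split a chunked-encoded body into chunk payloads.  Sketch only: octets
-- are modelled as Chars, and chunk extensions/trailers are ignored.
dechunk :: String -> [String]
dechunk s =
  let (hexLen, rest) = span isHexDigit s
      n = foldl (\acc d -> acc * 16 + digitToInt d) 0 hexLen
  in if null hexLen || n == 0
       then []                               -- a zero-length chunk ends the body
       else let body  = take n (drop 2 rest)       -- skip CRLF after the length
                rest' = drop (n + 4) rest          -- also skip CRLF after the body
            in body : dechunk rest'
```

A real implementation would do this over the byte layer and feed each body
to the (stateful) character decoder, preserving its state across chunks.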

This is not the end of the story however. A web server may send us a
multi-part reply: a multi-part MIME entity made of several MIME
entities, each with its own encoding and transfer modes. Neither of
these encoding have anything to do with the current locale. Therefore,
we may need to switch encodings back and forth quite a few times.

Decoding of such complex streams becomes easier if we can overlay
different processing layers on a stream. We start with a TCP handle,
overlay an ASCII stream and read the headers, then overlay a stream
that reads a specified number of units (and returns EOF once it has read
that many). On top of the latter we place an ISO-2022-JP decoder. Or we
choose a base64-decoder overlayed with a PKCS#7 signed entity decoder
and with a signature verification layer.

OpenSSL is one package that offers i/o overlays and stream
composition. Overlaying of parsers, encoders and hash accumulators is
very common in that particular domain. I have implemented such a
facility in two languages, e.g., to overlay an endian stream on the 

Re: [Haskell-cafe] Layered I/O

2004-09-15 Thread MR K P SCHUPKE

My thoughts on I/O, binary and chars can be summarised:

1) Produce a new WordN based IO library.   
2) Character strings cannot be separated from their encodings
   (i.e. they must be encoded somehow - even if that encoding
   is ASCII). I would approach this using parameterised phantom
   types, so for example you would have:

data Ascii
newtype Chr a = Chr Char

   Then the encoding becomes explicit in the type system, but of course
   functions like equality can still be expressed generically:

charEq :: Chr a -> Chr a -> Bool

   but of course, you must be comparing the same encodings (enforced
   by the type system)!

   The phantom type could be used in IO, i.e. to define a new encoding:

data MyEncoding

instance ChrToBinary MyEncoding where
chrToBinary = ...

instance BinaryToChr MyEncoding where
binaryToChr = ...
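
Filled out into compilable form, the sketch above might look like this
(names are illustrative: the Chr constructor, deriving clauses and the
Ascii instance body are assumptions, not part of Keean's proposal):

```haskell
{-# LANGUAGE EmptyDataDecls #-}
import Data.Char (ord)
import Data.Word (Word8)

data Ascii                       -- phantom tag for the encoding; never inhabited

newtype Chr enc = Chr Char
  deriving (Eq, Show)

-- Equality only typechecks for two characters of the *same* encoding:
charEq :: Chr enc -> Chr enc -> Bool
charEq = (==)

-- The encoding tag selects the conversion to bytes:
class ChrToBinary enc where
  chrToBinary :: Chr enc -> [Word8]

instance ChrToBinary Ascii where
  chrToBinary (Chr c) = [fromIntegral (ord c)]   -- valid only below code point 128
```

Comparing a Chr Ascii with a Chr of some other encoding is then a type
error, which is the point of the exercise.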


Keean.
___
Haskell-Cafe mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Ben Rudiak-Gould
I modestly re-propose the I/O model which I first proposed last year:

http://www.haskell.org/pipermail/haskell/2003-July/012312.html
http://www.haskell.org/pipermail/haskell/2003-July/012313.html
http://www.haskell.org/pipermail/haskell/2003-July/012350.html
http://www.haskell.org/pipermail/haskell/2003-July/012352.html
...

-- Ben



RE: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Simon Marlow
On 15 September 2004 12:32, [EMAIL PROTECTED] wrote:

 On Mon, Sep 13, 2004 at 12:01:58PM +0100, Glynn Clements wrote:
 My view is that, right now, we have the worst of both worlds, and
 taking a short step backwards (i.e. narrow the Char type and leave
 the rest alone) is a lot simpler (and more feasible) than the long
 journey towards real I18N.
 
 This being Haskell, I can't imagine a consensus on a step backwards.
 In any case, a Char type distinct from bytes and the rest is the most
 valuable part of the current situation.  The rest is just libraries,
 and the solution to that is to create other libraries.  (It's true
 that the Prelude is harder to work around, but even that can be done,
 as with the new exception interface.)  Indeed more than one approach
 can proceed concurrently, and that's probably what's going to happen:
 
 The Right Thing proceeds in stages:
 1. new byte-based libraries
 2. conversions sitting on top of these
 3. the ultimate I18N API
 
 The Quick Fix: alter the existing implementation to use the
 encoding determined by the current locale at the borders.

I wish I had some more time to work on this, but I implemented a
prototype of an ultimate i18n API recently.  This is derived from the
API that Ben Rudiak-Gould proposed last year.

It is in two layers: the InputStream/OutputStream classes provide raw
byte I/O, and the TextStream class provides a conversion on top of that.
The prototype uses iconv for conversions.  You can make Streams from all
sorts of things: files, sockets, pipes, and even Haskell arrays.
InputStream and OutputStreams are just classes, so you can implement
your own.

IIRC, I managed to get it working with speed comparable to GHC's current
IO library.

Here's a tarball that works with GHC 6.2.1 on a Unix platform, just
--make to build it:

  http://www.haskell.org/~simonmar/new-io.tar.gz

If anyone would like to pick this up and run with it, I'd be delighted.
I'm not likely to get back to it in the short term, at least.

Cheers,
Simon


Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Malcolm Wallace
Simon Marlow [EMAIL PROTECTED] writes:

 Here's a tarball that works with GHC 6.2.1 on a Unix platform, just
 --make to build it:
 
   http://www.haskell.org/~simonmar/new-io.tar.gz

Found a bug already...
In System/IO/Stream.hs, line 183:

streamReadBufrer s 0 buf = return 0
streamReadBuffer s len ptr = ...

Note the different spellings of the function name.
Regards,
Malcolm


Re: [Haskell-cafe] FilePath handling

2004-09-15 Thread Marcin 'Qrczak' Kowalczyk
Henning Thielemann [EMAIL PROTECTED] writes:

 I even plead for an abstract data type FilePath which supports
 operations like 'enter a directory', 'go one level higher' and so
 on.

Beware of Common Lisp history:

http://www.gigamonkeys.com/book/practical-a-portable-pathname-library.html

As we discussed in chapter TK, Common Lisp provides an abstraction,
the pathname, that is supposed to insulate us from the details of how
different OS's and file systems name files. However the gods of
abstraction give and they take away. While, with a bit of care,
pathnames can be used to write code that will be portable between
different OS's it can, ironically, be quite tricky to write pathname
code that will be portable between different Common Lisp
implementations on the *same* OS.

The root of the problem is that the pathname abstraction was designed
to represent file names on a wide variety of file systems. The set of
file systems still in general use at the time Common Lisp was being
standardized was much more variegated than the set of file systems
that are commonly used now. Unfortunately, by making pathnames
abstract enough to account for a wide variety of file systems, Common
Lisp's designers left implementors with a fair number of choices to be
made when mapping the pathname abstraction onto a particular file
system. Consequently different implementors, each implementing the
pathname abstraction for the same file system, by making different
choices at a few key junctions, could end up with conforming
implementations that nonetheless provide different behavior for
several of the main pathname-related functions.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


[Haskell-cafe] Unicoded filenames

2004-09-15 Thread Marcin 'Qrczak' Kowalczyk

Here is what happens when a language provides only narrow-char API for
filenames:

 Start of forwarded message 
Date: Wed, 15 Sep 2004 15:18:00 +0100
From: Peter Jolly [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
CC: caml-list [EMAIL PROTECTED]
Subject: Re: [Caml-list] How do I create files with UTF-8 file names?

Mattias Waldau wrote:
 I have a filename as an UTF-8 encoded string. I need to be able to 
 handle strange chars like accents, Asian chars etc.
 
 Is there any way to create a file with that name? I only need it on Win32.

Windows uses UTF-16 for filenames, but provides a non-Unicode interface 
for legacy applications; the standard open() function that OCaml's 
open_out wraps appears to use the legacy interface.  The precise 
codepage this uses is system-dependent, and AFAIK there's no way for a 
program to determine what it is without calling out to the Win32 API, 
but you can be pretty sure it won't be UTF-8.

In other words, there is no reliable way to use a filename containing 
non-ASCII characters with OCaml's standard library.

 Or should I solve this problem by talking directly to the Win32-api?

This is probably the best solution.  A combination of CreateFileW() and 
MultiByteToWideChar() should do what you want.

---
To unsubscribe, mail [EMAIL PROTECTED] Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners


 End of forwarded message 

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Glynn Clements

Graham Klyne wrote:

 In particular, the idea of narrowing the Char type really seems like a 
 bad idea to me (if I understand the intent correctly).  Not so long ago, I 
 did a whole load of work on the HaXml parser so that, among other things, 
 it would support UTF-8 and UTF-16 Unicode (as required by the XML 
 spec).  To do this depends upon having a Char type that can represent the 
 full repertoire of Unicode characters.

Note: I wasn't proposing doing away with wide character support
altogether. Essentially, I was suggesting making Char a byte and
having e.g. WideChar for wide characters. The reason being that the
existing Haskell98 API uses Char for functions which are actually
dealing with bytes.

In an ideal world, the IO, System and Directory modules (and the
Prelude I/O functions) would have used Byte, leaving Char to represent
a (wide) character. However, that isn't the hand we've been dealt.

-- 
Glynn Clements [EMAIL PROTECTED]


Re: [Haskell-cafe] FilePath handling [Was: Writing binary files?]

2004-09-15 Thread Glynn Clements

Henning Thielemann wrote:

 Udo Stenzel wrote:
 
  The same thoughts apply to filenames.  Make them [Word8] and convert
  explicitly.  By the way, I think a path should be a list of names (that
  is of type [[Word8]]) and the library would be concerned with putting in
  the right path separator.  Add functions to read and show pathnames in
  the local conventions and we'll never need to worry about path
  separators again.
 
 I even plead for an abstract data type FilePath which supports operations
 like 'enter a directory', 'go one level higher' and so on.

Are you referring to pure operations on the FilePath, e.g. appending
and removing entries? That's reasonable enough. But it needs to be
borne in mind that there's a difference between:

setCurrentDirectory ..
and:
dir - getCurrentDirectory
setCurrentDirectory $ parentDirectory dir

[where parentDirectory is a pure FilePath -> FilePath function.]

if the last component in the path is a symlink.
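The distinction can be seen in a purely textual parentDirectory (a
hypothetical helper, Unix-style paths only): it rewrites the path string
without consulting the filesystem, so when the last component is a symlink
its answer differs from what the kernel does with "..":

```haskell
-- Purely textual parent of a Unix-style path (hypothetical helper).
-- Never touches the filesystem, so it cannot follow symlinks.
parentDirectory :: FilePath -> FilePath
parentDirectory p =
  case break (== '/') (reverse p) of
    (_, "")         -> "."          -- no separator: parent of a bare name
    (_, "/")        -> "/"          -- immediate child of the root
    (_, '/' : rest) -> reverse rest
```

For example, if "/a/b" is a symlink to "/x/y", parentDirectory "/a/b/c"
is "/a/b", whereas setCurrentDirectory ".." from inside that directory
lands in "/x/y".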

If you want to make FilePath an instance of Eq, the situation gets
much more complicated.

-- 
Glynn Clements [EMAIL PROTECTED]


Re: [Haskell-cafe] Unicoded filenames

2004-09-15 Thread Glynn Clements

Marcin 'Qrczak' Kowalczyk wrote:

 Here is what happens when a language provides only narrow-char API for
 filenames:

  I have a filename as an UTF-8 encoded string. I need to be able to 
  handle strange chars like accents, Asian chars etc.
  
  Is there any way to create a file with that name? I only need it on Win32.
 
 Windows uses UTF-16 for filenames, but provides a non-Unicode interface 
 for legacy applications; the standard open() function that OCaml's 
 open_out wraps appears to use the legacy interface.  The precise 
 codepage this uses is system-dependent, and AFAIK there's no way for a 
 program to determine what it is without calling out to the Win32 API, 
 but you can be pretty sure it won't be UTF-8.
 
 In other words, there is no reliable way to use a filename containing 
 non-ASCII characters with OCaml's standard library.

No, this is what happens when an API imposes restrictions upon the
filenames which it can handle.

Essentially, it's due to two (or possibly three) factors:

1. The fact that Windows uses wide strings, rather than multi-byte
strings, for filenames.

2. The fact that Windows' compatibility interface is broken, i.e. it
only lets you access filenames which can be represented in the current
codepage (which, to me, is highly analogous to only supporting
filenames which are valid in the current locale).

3. Possibly that OCaml insists upon using UTF-8. [I don't know that
this is the case, but the fact that they specifically mention UTF-8
suggests that it might be.]

IOW, this incident seems to oppose, rather than support, the
filenames-as-characters viewpoint.

-- 
Glynn Clements [EMAIL PROTECTED]


Re: [Haskell-cafe] Re: Writing binary files?

2004-09-15 Thread Gabriel Ebner
Glynn Clements [EMAIL PROTECTED] writes:

 The RTS doesn't know the encoding. Assuming that the data will use the
 locale's encoding will be wrong too often.

If the program wants to get bytes, it should get bytes explicitly, not
some sort of pseudo-Unicode String.

 Like so many other people, you're making an argument based upon
 fiction (specifically, that you have a closed world where everything
 always uses the same encoding) then deeming anyone who is unable to
 maintain the fiction to be wrong.

Everything's fine here with LANG=de_AT.utf8.  And I can't recall
having any problems with it.  But well, YMMV.

 No. If a program just passes bytes around, everything will work so
 long as the inputs use the encoding which the outputs are assumed to
 use. And if the inputs aren't in the correct encoding, then you have
 to deal with encodings manually regardless of the default behaviour.

The only programs that just pass bytes around that come to mind are
basic Unix utilities.  Basically everything else will somehow process
the data.

 Sorting according to codepoints inevitably involves decoding. However,
 getting the order wrong is usually considered less problematic than
 failing outright.

But more difficult to debug.

 Tough. You already have it, and will do for the foreseeable future. 
 Many existing APIs (including the core Unix API), protocols and file
 formats are defined in terms of byte strings with no encoding
 specified or implied.

Guess why I like Haskell (the language; the implementations are not up
to that ideal yet).

 I'd like to. But many of the functions which provide or accept binary
 data (e.g. FilePath) insist on represent it using Strings.

Good point.  Adding functions that accept bytes instead of strings
would be a major undertaking.

 I18N is inherently difficult. Lots of textual data exists in lots of
 different encodings, and the encoding is frequently unspecified.

That's the problem with the current API.  You can neither easily
read/write bytes nor strings in a specified encoding.

 The problem is that we also need to fix them to handle *no encoding*.

That's binary data. (assuming you didn't want to say 'unknown')

 Also, binary data and text aren't disjoint. Everything is binary; some
 of it is *also* text.

Simon's new-io proposal does this very nicely.  Stdin is by default a
binary stream and you can obtain a TextInputStream for it using either
the locale's encoding or a specified encoding.  That's the way I'd
like it to be.

 Or out of getDirectoryContents, getArgs, getEnv etc. Or to pass a list
 of Word8s to a handle, or to openFile, getEnv etc.

That's a real issue.  Adding new functions with a bin- prefix is the
only solution that comes to my mind.

 The point is that, currently, you can't. Nothing in the core Haskell98
 API actually uses Word8, it all uses Char/String.

That's the intent of this thread. :-)

 Because we don't have an oracle which will magically determine the
 encoding for you.

That oracle is called locale setting.  If I want to read text and
can't determine the encoding by other ways (protocol spec, ...), then
it's what the user set his locale setting to.

 If you want text, well, tough; what comes out most system calls and
 core library functions (not just read()) are bytes.

Which need to be interpreted by the program depending on where these
bytes come from.

 There isn't any magic wand which will turn them into characters
 without knowing the encoding.

If I know the encoding, I should be able to set it.  If I don't, it's
the locale setting.

  completely new wide-character API for those who wish to use it.
 
 Which would make it horrendously difficult to do even basic I18N.

 Why?

Having different types for single-byte and multi-byte strings, together
with separate functions to handle them (that's what I assume you mean
by a new wide-character API), with single-byte strings being the
preferred ones (the cause of it being a separate API), would make sorting,
upper/lower case testing etc. not exactly easier.

 I know. That's what I'm saying. The problem is that the broken code
 is the Haskell98 API.

No, it's not broken.  It just has some missing features (i.e. I/O /
env functions accepting bytes instead of strings).

  String's are a list of unicode characters, [Word8] is a
 list of bytes.

 And what comes out of (and goes into) most core library functions is
 the latter.

Strictly speaking, the former comes out with the semantics of the latter. :-)
Maybe bugs should be filed?

  Gabriel.

