Re: [Haskell-cafe] Writing binary files

2006-08-22 Thread Alistair Bayley

On 21/08/06, Udo Stenzel [EMAIL PROTECTED] wrote:

Neil Mitchell wrote:
 I'm trying to write out a binary file, in particular I want the
 following functions:

 hPutInt :: Handle -> Int -> IO ()

 hGetInt :: Handle -> IO Int

 For the purposes of these functions, Int = 32 bits, and it's got to
 round-trip: Put then Get must be the same.

 How would I do this? I see Ptr, Storable and other things, but nothing
 which seems directly useable for me.


hPutInt h = hPutStr h . map chr . map (0xff .&.)
          . take 4 . iterate (`shiftR` 8)

hGetInt h = replicateM 4 (hGetChar h) >>=
            return . foldr (\d i -> i `shiftL` 8 .|. ord d) 0

This of course assumes that a Char is read/written as a single low-order
byte without any conversion.  But you'd have to assume a lot more if you
started messing with pointers.  (Strange, somehow I get the feeling the
above is way too easy to be the answer you wanted.)


Udo.


What's wrong with the following, i.e. what assumptions is it making
(w.r.t. pointers) that I've missed? Is endianness an issue here?

Alistair


hPutInt :: Handle -> Int32 -> IO ()
hGetInt :: Handle -> IO Int32

int32 :: Int32
int32 = 0

hPutInt h i =
  alloca $ \p -> do
    poke p i
    hPutBuf h p (sizeOf i)

hGetInt h =
  alloca $ \p -> do
    bytes <- hGetBuf h p (sizeOf int32)
    when (bytes < sizeOf int32) (error "too few bytes read")
    peek p


Re[2]: [Haskell-cafe] Writing binary files

2006-08-22 Thread Bulat Ziganshin
Hello Alistair,

Tuesday, August 22, 2006, 1:29:22 PM, you wrote:

 What's wrong with the following, i.e. what assumptions is it making
 (w.r.t. pointers) that I've missed? Is endianness an issue here?

Data written by your module on a big-endian machine can't be read by
the same module on a little-endian machine.

   bytes <- hGetBuf h p (sizeOf int32)

or

bytes <- hGetBuf h p (sizeOf (0::Int32))
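
To make the on-disk format independent of host byte order, one could
serialize explicitly, least significant byte first. A minimal sketch
(the names hPutInt32LE and hGetInt32LE are invented here, not from the
thread):

import Control.Monad (when)
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Int (Int32)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Storable (peekByteOff, pokeByteOff)
import System.IO (Handle, hGetBuf, hPutBuf)

-- Write the four bytes of an Int32 least-significant first, so the
-- file reads back identically on big- and little-endian machines.
hPutInt32LE :: Handle -> Int32 -> IO ()
hPutInt32LE h i = allocaBytes 4 $ \p -> do
    mapM_ (\n -> pokeByteOff p n
                   (fromIntegral (i `shiftR` (8 * n)) :: Word8)) [0 .. 3]
    hPutBuf h p 4

-- Read them back in the same fixed order, rebuilding the Int32.
hGetInt32LE :: Handle -> IO Int32
hGetInt32LE h = allocaBytes 4 $ \p -> do
    n <- hGetBuf h p 4
    when (n < 4) $ ioError (userError "too few bytes read")
    bs <- mapM (\off -> peekByteOff p off :: IO Word8) [0 .. 3]
    return (foldr (\b acc -> acc `shiftL` 8 .|. fromIntegral b) 0 bs)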

-- 
Best regards,
 Bulat                            mailto:[EMAIL PROTECTED]



[Haskell-cafe] Writing binary files

2006-08-21 Thread Neil Mitchell

Hi,

I'm trying to write out a binary file, in particular I want the
following functions:

hPutInt :: Handle -> Int -> IO ()

hGetInt :: Handle -> IO Int

For the purposes of these functions, Int = 32 bits, and it's got to
round-trip: Put then Get must be the same.

How would I do this? I see Ptr, Storable and other things, but nothing
which seems directly useable for me.

Thanks

Neil


Re: [Haskell-cafe] Writing binary files

2006-08-21 Thread Udo Stenzel
Neil Mitchell wrote:
 I'm trying to write out a binary file, in particular I want the
 following functions:
 
 hPutInt :: Handle -> Int -> IO ()
 
 hGetInt :: Handle -> IO Int
 
 For the purposes of these functions, Int = 32 bits, and it's got to
 round-trip: Put then Get must be the same.
 
 How would I do this? I see Ptr, Storable and other things, but nothing
 which seems directly useable for me.


hPutInt h = hPutStr h . map chr . map (0xff .&.)
          . take 4 . iterate (`shiftR` 8)

hGetInt h = replicateM 4 (hGetChar h) >>=
            return . foldr (\d i -> i `shiftL` 8 .|. ord d) 0

This of course assumes that a Char is read/written as a single low-order
byte without any conversion.  But you'd have to assume a lot more if you
started messing with pointers.  (Strange, somehow I get the feeling the
above is way too easy to be the answer you wanted.)
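
For completeness, a sketch of a round-trip check (assuming the
hPutInt/hGetInt above are in scope; openBinaryFile matters, since a
text-mode handle on Windows would translate 0x0A/0x0D bytes):

import System.IO

main :: IO ()
main = do
    h <- openBinaryFile "test.bin" WriteMode
    hPutInt h 0x12345678
    hClose h
    h' <- openBinaryFile "test.bin" ReadMode
    n <- hGetInt h'
    hClose h'
    print (n == 0x12345678)   -- should print True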


Udo.
-- 
Worrying is like rocking in a rocking chair -- It gives
you something to do, but it doesn't get you anywhere.




Re: [Haskell-cafe] Writing binary files

2006-08-21 Thread Neil Mitchell

Hi


hPutInt h = hPutStr h . map chr . map (0xff .&.)
          . take 4 . iterate (`shiftR` 8)

hGetInt h = replicateM 4 (hGetChar h) >>=
            return . foldr (\d i -> i `shiftL` 8 .|. ord d) 0

This of course assumes that a Char is read/written as a single low-order
byte without any conversion.  But you'd have to assume a lot more if you
started messing with pointers.  (Strange, somehow I get the feeling the
above is way too easy to be the answer you wanted.)


It's exactly the answer I was hoping for!

Thanks

Neil


Re: [Haskell-cafe] Writing binary files?

2004-09-18 Thread Marcin 'Qrczak' Kowalczyk
Glynn Clements [EMAIL PROTECTED] writes:

 Ok, but let it be in addition to, not instead of, treating them as
 character strings.

 Provided that you know the encoding, nothing stops you converting
 them to strings, should you have a need to do so.

There are already APIs which use Strings for filenames. I meant to
keep them, let them use a program-settable encoding which defaults to
the locale encoding - this is the only sane interpretation of this
interface on Unix I can imagine. And in addition to them we may have
APIs which use byte strings, for those who prefer the ability to
handle all filenames to using a uniform string representation inside
the program.

 Such encodings are not suitable for filenames.

 Regardless of whether they are suitable, they are used.

Usage of ISO-2022 as a filename encoding is a bad and unsupported idea.
The '/' byte does not necessarily mean that the '/' character is
there, so some random subset of characters is excluded. Statefulness
means that the same filename may be interpreted as different
characters depending on context.

There is no need to support ISO-2022 as filename encoding in languages
and tools. The fact that some tool doesn't support ISO-2022 in
filenames is not a flaw in the tool, so there is no need to check
what happens when filenames are represented in ISO-2022. If they are,
someone should fix his system.

 I haven't addressed any of the other stuff about ISO-2022, as it isn't
 really relevant. Whether ISO-2022 is good or bad doesn't matter; what
 matters is that it is likely to remain in use for the foreseeable
 future.

For transportation, not for the locale encoding nor for filenames.
There are no ISO-2022 locales. A program may support it when data it
operates on requests recoding between explicit encodings, e.g. if it's
found in an email, but there is no need to support it as the default
encoding of a program (which e.g. the withCString function should use).

 IMHO it's more important to make them compatible with the
 representation of strings used in other parts of the program.

 Why?

To limit conversion hassle to I/O, instead of scattering it through
the program when filenames and other strings are met.

 But otherwise programs would continuously have bugs in handling text
 which is not ISO-8859-1, especially with multibyte encoding where
 pretending that ISO-8859-2 is ISO-8859-1 too often doesn't work.

 Why?

Because some channels talk in terms of characters, or bytes in a known
encoding, instead of bytes in an implicit encoding. E.g. most display
channels, apart from raw stdin/stdout and narrow character ncurses;
many Internet protocols, apart from irc; .NET and Java; file formats
like XML; some databases.

And the world is slowly shifting to have more such channels, which
replace byte streams in an implicit encoding, because after reaching
a critical mass (where encodingless channels don't get in the way,
losing information about the encoding or losing some characters)
they make multilingual handling more robust.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


Re: [Haskell-cafe] Writing binary files?

2004-09-18 Thread David Roundy
On Sat, Sep 18, 2004 at 10:58:21AM +0200, Marcin 'Qrczak' Kowalczyk wrote:
 Glynn Clements [EMAIL PROTECTED] writes:
 
  Ok, but let it be in addition to, not instead of, treating them as
  character strings.
 
  Provided that you know the encoding, nothing stops you converting them
  to strings, should you have a need to do so.
 
 There are already APIs which use Strings for filenames. I meant to keep
 them, let them use a program-settable encoding which defaults to the
 locale encoding - this is the only sane interpretation of this interface
 on Unix I can imagine. And in addition to them we may have APIs which use
 byte strings, for those who prefer the ability to handle all filenames to
 using a uniform string representation inside the program.

Keep in mind, if you make this change to the IO libraries, you also will
have to simultaneously fix the Foreign.C.String module to use the same
locale as is used by the IO libraries when they deal with FilePaths.
Incidentally, this change will break existing darcs repositories... but
that should at least be repairable.  According to the FFI spec, the
CString functions do a locale-based conversion, but of course in
practice that isn't the case.
-- 
David Roundy
http://www.abridgegame.org


Re: [Haskell-cafe] Writing binary files?

2004-09-17 Thread Marcin 'Qrczak' Kowalczyk
Glynn Clements [EMAIL PROTECTED] writes:

 What I'm suggesting in the above is to sidestep the encoding issue
 by keeping filenames as byte strings wherever possible.

Ok, but let it be in addition to, not instead of, treating them as
character strings.

 And program-generated email notifications frequently include text with
 no known encoding (i.e. binary data).

No, programs don't dump binary data among diagnostic messages. If they
output binary data to stdout, it's their only output and it's redirected
to a file or another process.

 Or are you going to demand that anyone who tries to hack into your
 system only sends it UTF-8 data so that the alert messages are
 displayed correctly in your mail program?

The email protocol is text-only. It may mangle newlines, it has
a maximum line length, and some text may be escaped during transport
(e.g. "From " at the beginning of a line). Arbitrary binary data
should be put in base64-or-otherwise-encoded attachments.

If the cron program embeds the output as email body, the cron job
should not dump arbitrary binary data to stdout. Encoding is not the
only problem.

 Processing data in their original byte encodings makes supporting
 multiple languages harder. Filenames which are inexpressible as
 character strings get in the way of clean APIs. When considering only
 filenames, using bytes would be sufficient, but in overall it's more
 convenient to Unicodize them like other strings.

 It also harms reliability. Depending upon the encoding, two distinct
 byte strings may have the same Unicode representation.

Such encodings are not suitable for filenames.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg00376.html

| ISO-2022-JP will never be a satisfactory terminal encoding (like
| ISO-8859-*, EUC-*, UTF-8, Shift_JIS) because
|
| 1) It is a stateful encoding. What happens when a program starts some
| terminal output and then is interrupted using Ctrl-C or Ctrl-Z? The
| terminal will remain in the shifted state, while other programs start
| doing output. But these programs expect that when they start, the
| terminal is in the initial state. The net result will be garbage on
| the screen.
|
| 2) ISO-2022-JP is not filesystem safe. Therefore filenames will never
| be able to carry Japanese characters in this encoding.
|
| Robert Brady writes:
|  Does ISO-2022 see much/any use as the locale encoding, or is it just used
|  for interchange?
|
| Just for interchange.
|
| Paul Eggert searched for uses of ISO-2022-JP as locale encodings (in
| order to convince me), and only came up with a handful of questionable
| URLs. He didn't convince me. And there are no plans to support
| ISO-2022-JP as a locale encoding in glibc - because of 1) and 2) above.

For me ISO-2022 is a brain-damaged concept and should die. Almost
nothing supports it anyway.

 Such tarballs are not portable across systems using different encodings.

 Well, programs which treat filenames as byte strings to be read from
 argv[] and passed directly to open() won't have any problems with this.

The OS itself may have problems with this; only some filesystems
accept arbitrary bytes apart from '\0' and '/' (and with the special
meaning for '.'). Exotic characters in filenames are not very
portable.

 A Haskell program in my world can do that too. Just set the encoding
 to Latin1.

 But programs should handle this by default, IMHO.

IMHO it's more important to make them compatible with the
representation of strings used in other parts of the program.

 Filenames are, for the most part, just tokens to be passed around.

Filenames are often stored in text files, whose bytes are interpreted
as characters. Applying QP to non-ASCII parts of filenames is suitable
only if humans won't edit these files by hand.

  My specific point is that the Haskell98 API has a very big problem due
  to the assumption that the encoding is always known. Existing
  implementations work around the problem by assuming that the encoding
  is always ISO-8859-1.
 
 The API is incomplete and needs to be enhanced. Programs written using
 the current API will be limited to using the locale encoding.

 That just adds unnecessary failure modes.

But otherwise programs would continuously have bugs in handling text
which is not ISO-8859-1, especially with multibyte encoding where
pretending that ISO-8859-2 is ISO-8859-1 too often doesn't work.

I can't switch my environment to UTF-8 yet precisely because too many
programs were written with the attitude you are promoting: they don't
care about the encoding, they just pass bytes around.

Bugs range from small annoyances like tabular output which doesn't
line up, through mangled characters on a graphical display, to
full-screen interactive programs being unusable on a UTF-8 terminal.

 This encoding would be incompatible with most other texts seen by the
 program. In particular reading a filename from a file would not work
 without manual recoding.

 We already have that problem; you can't read non-Latin1 

Re: [Haskell-cafe] Writing binary files?

2004-09-17 Thread Glynn Clements

Marcin 'Qrczak' Kowalczyk wrote:

  What I'm suggesting in the above is to sidestep the encoding issue
  by keeping filenames as byte strings wherever possible.
 
 Ok, but let it be in addition to, not instead of, treating them as
 character strings.

Provided that you know the encoding, nothing stops you converting them
to strings, should you have a need to do so.

  Processing data in their original byte encodings makes supporting
  multiple languages harder. Filenames which are inexpressible as
  character strings get in the way of clean APIs. When considering only
  filenames, using bytes would be sufficient, but in overall it's more
  convenient to Unicodize them like other strings.
 
  It also harms reliability. Depending upon the encoding, two distinct
  byte strings may have the same Unicode representation.
 
 Such encodings are not suitable for filenames.

Regardless of whether they are suitable, they are used.

 For me ISO-2022 is a brain-damaged concept and should die.

Well, it isn't likely to.

I haven't addressed any of the other stuff about ISO-2022, as it isn't
really relevant. Whether ISO-2022 is good or bad doesn't matter; what
matters is that it is likely to remain in use for the foreseeable
future.

  Such tarballs are not portable across systems using different encodings.
 
  Well, programs which treat filenames as byte strings to be read from
  argv[] and passed directly to open() won't have any problems with this.
 
 The OS itself may have problems with this; only some filesystems
 accept arbitrary bytes apart from '\0' and '/' (and with the special
 meaning for '.'). Exotic characters in filenames are not very
 portable.

No, but most Unix programs manage to handle them without problems.

  A Haskell program in my world can do that too. Just set the encoding
  to Latin1.
 
  But programs should handle this by default, IMHO.
 
 IMHO it's more important to make them compatible with the
 representation of strings used in other parts of the program.

Why?

  Filenames are, for the most part, just tokens to be passed around.
 
 Filenames are often stored in text files,

True.

 whose bytes are interpreted as characters.

Sometimes true, sometimes not.

Where filenames occur in data files, e.g. configuration files, the
program which reads the configuration file typically passes the bytes
directly to the OS without interpretation.

 Applying QP to non-ASCII parts of filenames is suitable
 only if humans won't edit these files by hand.

Who said anything about QP?

   My specific point is that the Haskell98 API has a very big problem due
   to the assumption that the encoding is always known. Existing
   implementations work around the problem by assuming that the encoding
   is always ISO-8859-1.
  
  The API is incomplete and needs to be enhanced. Programs written using
  the current API will be limited to using the locale encoding.
 
  That just adds unnecessary failure modes.
 
 But otherwise programs would continuously have bugs in handling text
 which is not ISO-8859-1, especially with multibyte encoding where
 pretending that ISO-8859-2 is ISO-8859-1 too often doesn't work.

Why?

 I can't switch my environment to UTF-8 yet precisely because too many
 programs were written with the attitude you are promoting: they don't
 care about the encoding, they just pass bytes around.

That's all that many programs should be doing.

 Bugs range from small annoyances like tabular output which doesn't
 line up, through mangled characters on a graphical display, to
 full-screen interactive programs being unusable on a UTF-8 terminal.

IOW:

1. display doesn't work correctly,
2. display doesn't work correctly, and
3. display doesn't work correctly.

You keep citing cases involving graphical display as a reason why all
programs should be working with characters all of the time.

I haven't suggested that programs should never deal with characters,
yet you keep insinuating that is my argument, then proceed to attack
it.

-- 
Glynn Clements [EMAIL PROTECTED]


Re: [Haskell-cafe] Writing binary files?

2004-09-16 Thread Marcin 'Qrczak' Kowalczyk
Glynn Clements [EMAIL PROTECTED] writes:

 But this seems to be assuming a closed world. I.e. the only files
 which the program will ever see are those which were created by you,
 or by others who are compatible with your conventions.

Yes, unless you set the default encoding to Latin1.

 Some programs use UTF-8 in filenames no matter what the locale is. For
 example the Evolution mail program which stores mail folders as files
 under names the user entered in a GUI.

 This is entirely reasonable for a file which a program creates. If a
 filename is just a string of bytes, a program can use whatever
 encoding it wants.

But then they display wrong in any other program.

 If it had just treated them as bytes, rather than trying to interpret
 them as characters, there wouldn't have been any problems.

I suspect it treats some characters in these synthesized newsgroup
names, like dots, specially, so it won't work unless it was designed
differently.

 When I switch my environment to UTF-8, which may happen in a few
 years, I will convert filenames to UTF-8 and set up mount options to
 translate vfat filenames to/from UTF-8 instead of to ISO-8859-2.

 But what about files which were created by other people, who
 don't use UTF-8?

All people sharing a filesystem should use the same encoding.

BTW, when ftping files between Windows and Unix, a good ftp client
should convert filenames to keep the same characters rather than
bytes, so CP-1250 encoded names don't come as garbage in the encoding
used on Unix which is definitely different (ISO-8859-2 or UTF-8) or
vice versa.

 I expect good programs to understand that and display them
 correctly no matter what technique they are using for the display.

 When it comes to display, you have to have to deal with encoding
 issues one way or another. But not all programs deal with display.

So you advocate using multiple encodings internally. This is in
general more complicated than what I advocate: using only Unicode
internally, limiting other encodings to the I/O boundary.

 Assuming that everything is UTF-8 allows a lot of potential problems
 to be ignored.

I don't assume UTF-8 when locale doesn't say this.

 The core OS and network server applications essentially remain
 encoding-agnostic.

Which is a problem when they generate an email, e.g. to send a
non-empty output of a cron job, or report unauthorized use of sudo.
If the data involved is not pure ASCII, it will often be mangled.

It's rarely a problem in practice because filenames, command
arguments, error messages, user full names etc. are usually pure
ASCII. But this is slowly changing.

 But, as I keep pointing out, filenames are byte strings, not
 character strings. You shouldn't be converting them to character
 strings unless you have to.

Processing data in their original byte encodings makes supporting
multiple languages harder. Filenames which are inexpressible as
character strings get in the way of clean APIs. When considering only
filenames, using bytes would be sufficient, but in overall it's more
convenient to Unicodize them like other strings.

 1. Actually, each user decides which locale they wish to use. Nothing
 forces two users of a system to use the same locale.

Locales may be different, but they should use the same encoding when
they share files. This applies to file contents too - various formats
don't have a fixed encoding and don't specify the encoding explicitly,
so these files are assumed to be in the locale encoding.

 2. Even if the locale was constant for all users on a system, there's
 still the (not exactly minor) issue of networking.

Depends on the networking protocols. They might insist that filenames
are represented in UTF-8 for example.

  Or that every program should pass everything through iconv()
  (and handle the failures)?
 
 If it uses Unicode as internal string representation, yes (because the
 OS API on Unix generally uses byte encodings rather than Unicode).

 The problem with that is that you need to *know* the source and
 destination encodings. The program gets to choose one of them, but it
 may not even know the other one.

If it can't know the encoding, it should process the data as a
sequence of bytes, and can output it only to another channel which
accepts raw bytes.

But usually it's either known or can be assumed to be the locale
encoding.

 The term mismatch implies that there have to be at least two things.
 If they don't match, which one is at fault? If I make a tar file
 available for you to download, and it contains non-UTF-8 filenames, is
 that my fault or yours?

Such tarballs are not portable across systems using different encodings.

If I tar a subdirectory stored on ext2 partition, and you untar it on
a vfat partition, whose fault it is that files which differ only in
case are conflated?

 In any case, if a program refuses to deal with a file because it is
 cannot convert the filename to characters, even when it doesn't have
 to, it's the program which 

Re: [Haskell-cafe] Writing binary files?

2004-09-16 Thread Glynn Clements

Marcin 'Qrczak' Kowalczyk wrote:

  When I switch my environment to UTF-8, which may happen in a few
  years, I will convert filenames to UTF-8 and set up mount options to
  translate vfat filenames to/from UTF-8 instead of to ISO-8859-2.
 
  But what about files which were created by other people, who
  don't use UTF-8?
 
 All people sharing a filesystem should use the same encoding.

Again, this is just hand waving the issues away.

 BTW, when ftping files between Windows and Unix, a good ftp client
 should convert filenames to keep the same characters rather than
 bytes, so CP-1250 encoded names don't come as garbage in the encoding
 used on Unix which is definitely different (ISO-8859-2 or UTF-8) or
 vice versa.

Which is fine if the FTP client can figure out which encoding is used
on the remote end. In practice, you have to tell it, i.e. have a list
of which servers (or even which directories on which servers) use
which encoding.

  I expect good programs to understand that and display them
  correctly no matter what technique they are using for the display.
 
  When it comes to display, you have to have to deal with encoding
  issues one way or another. But not all programs deal with display.
 
 So you advocate using multiple encodings internally. This is in
 general more complicated than what I advocate: using only Unicode
 internally, limiting other encodings to the I/O boundary.

How do you draw that conclusion from what I wrote here?

There are cases where it's advantages to use multiple encodings, but I
wasn't suggesting that in the above. What I'm suggesting in the above
is to sidestep the encoding issue by keeping filenames as byte strings
wherever possible.

  The core OS and network server applications essentially remain
  encoding-agnostic.
 
 Which is a problem when they generate an email, e.g. to send a
 non-empty output of a cron job, or report unauthorized use of sudo.
 If the data involved is not pure ASCII, it will often be mangled.

It only gets mangled if you feed it to a program which is making
assumptions about the encoding. Non-MIME messages neither specify nor
imply an encoding. MIME messages can use either
"text/plain; charset=x-unknown" or "application/octet-stream" if they
don't understand the encoding.

And program-generated email notifications frequently include text with
no known encoding (i.e. binary data). Or are you going to demand that
anyone who tries to hack into your system only sends it UTF-8 data so
that the alert messages are displayed correctly in your mail program?

 It's rarely a problem in practice because filenames, command
 arguments, error messages, user full names etc. are usually pure
 ASCII. But this is slowly changing.

To the extent that non-ASCII filenames are used, I've encountered far
more filenames in both Latin1 and ISO-2022 than in UTF-8. Japanese FTP
sites typically use ISO-2022 for everything; even ASCII names may have
\e(B prepended to them.

  But, as I keep pointing out, filenames are byte strings, not
  character strings. You shouldn't be converting them to character
  strings unless you have to.
 
 Processing data in their original byte encodings makes supporting
 multiple languages harder. Filenames which are inexpressible as
 character strings get in the way of clean APIs. When considering only
 filenames, using bytes would be sufficient, but in overall it's more
 convenient to Unicodize them like other strings.

It also harms reliability. Depending upon the encoding, two distinct
byte strings may have the same Unicode representation.

E.g. if you are interfacing to a server which uses ISO-2022 for
filenames, you have to get the escapes correct even when they are
no-ops in terms of the string representation. If you obtain a
directory listing, receive the filename "\e(Bfoo.txt", and convert it
to Unicode, you get "foo.txt". If you then convert it back without the
leading escape, the server is going to say "file not found".

  The term mismatch implies that there have to be at least two things.
  If they don't match, which one is at fault? If I make a tar file
  available for you to download, and it contains non-UTF-8 filenames, is
  that my fault or yours?
 
 Such tarballs are not portable across systems using different encodings.

Well, programs which treat filenames as byte strings to be read from
argv[] and passed directly to open() won't have any problems with
this. It's only a problem if you make it a problem.

 If I tar a subdirectory stored on ext2 partition, and you untar it on
 a vfat partition, whose fault it is that files which differ only in
 case are conflated?

Arguably, it's Microsoft's fault for not considering the problems
caused by multiple encodings when they decided that filenames were
going to be case-folded.

  In any case, if a program refuses to deal with a file because it is
  cannot convert the filename to characters, even when it doesn't have
  to, it's the program which is at fault.
 
 Only if it's a low-level utility, 

Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread ross
On Mon, Sep 13, 2004 at 12:01:58PM +0100, Glynn Clements wrote:
 My view is that, right now, we have the worst of both worlds, and
 taking a short step backwards (i.e. narrow the Char type and leave the
 rest alone) is a lot simpler (and more feasible) than the long journey
 towards real I18N.

This being Haskell, I can't imagine a consensus on a step backwards.
In any case, a Char type distinct from bytes and the rest is the most
valuable part of the current situation.  The rest is just libraries,
and the solution to that is to create other libraries.  (It's true
that the Prelude is harder to work around, but even that can be done,
as with the new exception interface.)  Indeed more than one approach
can proceed concurrently, and that's probably what's going to happen:

The Right Thing proceeds in stages:
1. new byte-based libraries
2. conversions sitting on top of these
3. the ultimate I18N API

The Quick Fix: alter the existing implementation to use the
encoding determined by the current locale at the borders.

When the Right Thing is finished, the Quick Fix can be recast as a
special case.  The Right Thing might take a very long (possibly infinite)
time, because this is the sort of thing people can argue about endlessly.
Still, the first stage would deal with most of the scenarios you raised.
It just needs a group of people who care about it to get together and
do it. 

The Quick Fix is the most logical implementation of the current
definition of Haskell, and entirely consistent with its general
philosophy of presenting the programmer with an idealized (some might
say oversimplified) model of computation.  From the start, Haskell
has supported only character-based I/O, with whatever translations
were required to present a uniform view on all platforms.  And that's
not an entirely bad thing.  It won't work all the time, but it will be
simple, and good enough for most people.  Its existence will not rule
out binary I/O or more sophisticated alternatives.  Those who need more
may be motivated to help finish the Right Thing.


Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Graham Klyne
I've not been following this debate, but I think I agree with Ross.
In particular, the idea of narrowing the Char type really seems like a 
bad idea to me (if I understand the intent correctly).  Not so long ago, I 
did a whole load of work on the HaXml parser so that, among other things, 
it would support UTF-8 and UTF-16 Unicode (as required by the XML 
spec).  To do this depends upon having a Char type that can represent the 
full repertoire of Unicode characters.

Other languages have been forced into this (maybe painful) transition;  I 
don't think Haskell can reasonably go backwards if it is to have any hope 
of surviving.

#g
--
At 12:31 15/09/04 +0100, [EMAIL PROTECTED] wrote:
On Mon, Sep 13, 2004 at 12:01:58PM +0100, Glynn Clements wrote:
 My view is that, right now, we have the worst of both worlds, and
 taking a short step backwards (i.e. narrow the Char type and leave the
 rest alone) is a lot simpler (and more feasible) than the long journey
 towards real I18N.
This being Haskell, I can't imagine a consensus on a step backwards.
In any case, a Char type distinct from bytes and the rest is the most
valuable part of the current situation.  The rest is just libraries,
and the solution to that is to create other libraries.  (It's true
that the Prelude is harder to work around, but even that can be done,
as with the new exception interface.)  Indeed more than one approach
can proceed concurrently, and that's probably what's going to happen:
The Right Thing proceeds in stages:
1. new byte-based libraries
2. conversions sitting on top of these
3. the ultimate I18N API
The Quick Fix: alter the existing implementation to use the
encoding determined by the current locale at the borders.
When the Right Thing is finished, the Quick Fix can be recast as a
special case.  The Right Thing might take a very long (possibly infinite)
time, because this is the sort of thing people can argue about endlessly.
Still, the first stage would deal with most of the scenarios you raised.
It just needs a group of people who care about it to get together and
do it.
The Quick Fix is the most logical implementation of the current
definition of Haskell, and entirely consistent with its general
philosophy of presenting the programmer with an idealized (some might
say oversimplified) model of computation.  From the start, Haskell
has supported only character-based I/O, with whatever translations
were required to present a uniform view on all platforms.  And that's
not an entirely bad thing.  It won't work all the time, but it will be
simple, and good enough for most people.  Its existence will not rule
out binary I/O or more sophisticated alternatives.  Those who need more
may be motivated to help finish the Right Thing.

Graham Klyne
For email:
http://www.ninebynine.org/#Contact


Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Marcin 'Qrczak' Kowalczyk
Glynn Clements [EMAIL PROTECTED] writes:

 Unless you are the sole user of a system, you have no control over
 what filenames may occur on it (and even if you are the sole user,
 you may wish to use packages which don't conform to your rules).

For these occasions you may set the encoding to ISO-8859-1. But then
you can't sensibly show them to the user in a GUI, nor in ncurses
using the wide character API, nor can you sensibly store them in a
file which must always be encoded in UTF-8 (e.g. an XML file, where
you can't put raw bytes without knowing their encoding).

There are two paradigms: manipulate bytes not knowing their encoding,
and manipulating characters explicitly encoded in various encodings
(possibly UTF-8). The world is slowly migrating from the first to the
second.

  There are limits to the extent to which this can be achieved. E.g. 
  what happens if you set the encoding to UTF-8, then call
  getDirectoryContents for a directory which contains filenames which
  aren't valid UTF-8 strings?
 
 The library fails. Don't do that. This environment is internally
 inconsistent.

 Call it what you like, it's a reality, and one which programs need to
 deal with.

The reality is that filenames are encoded in different encodings
depending on the system. Sometimes it's ISO-8859-1, sometimes
ISO-8859-2, sometimes UTF-8. We should not ignore the possibility
of UTF-8-encoded filenames.

In CLisp it fails silently (undecodable filenames are skipped), which
is bad. It should fail loudly.

 Most programs don't care whether any filenames which they deal with
 are valid in the locale's encoding (or any other encoding). They just
 receive lists (i.e. NUL-terminated arrays) of bytes and pass them
 directly to the OS or to libraries.

And this is why I can't switch my home environment to UTF-8 yet. Too
many programs are broken; almost all terminal programs which use more
than stdin and stdout in default modes, i.e. which use line editing or
work in full screen. How would you display a filename in a full screen
text editor, such that it works in a UTF-8 environment?

 If the assumed encoding is ISO-8859-*, this program will work
 regardless of the filenames which it is passed or the contents of the
 file (modulo the EOL translation on Windows). OTOH, if it were to use
 UTF-8 (e.g. because that was the locale's encoding), it wouldn't work
 correctly if either filename or the file's contents weren't valid
 UTF-8.

A program is not supposed to encounter filenames which are not
representable in the locale's encoding. In your setting it's
impossible to display a filename in a way other than printing
to stdout.

 More accurately, it specifies which encoding to assume when you *need*
 to know the encoding (i.e. ctype.h etc), but you can't obtain that
 information from a more reliable source.

In the case of filenames there is no more reliable source.

 My central point is that the existing API forces the encoding to be
 an issue when it shouldn't be.

It is an unavoidable issue because not every interface in a given
computer system uses the same encoding. Gtk+ uses UTF-8; you must
convert text to UTF-8 in order to display it, and in order to convert
you must know its encoding.

 Well, to an extent it is an implementation issue. Historically, curses
 never cared about encodings. A character is a byte, you draw bytes on
 the screen, curses sends them directly to the terminal.

This is the old API. But newer ncurses API is prepared even for
combining accents. A character is coded with a sequence of wchar_t
values, such that all except the first one are combining characters.

 Furthermore, the curses model relies upon monospaced fonts, and falls
 down once you encounter CJK text (where a monospaced font means one
 whose glyphs are an integer multiple of the cell size, not necessarily
 a single cell).

It doesn't fall down. Characters may span several columns. There is
wcwidth(), and the curses specification in X/Open says how it should
behave for wide CJK characters. I haven't tested it but I believe
ncurses supports them.

 Extending something like curses to handle encoding issues is far
 from trivial; which is probably why it hasn't been finished yet.

It's almost finished. The API specification was ready in 1997.
It works in ncurses modulo unfixed bugs.

But programs can't use it unless they use Unicode internally.

 Although, if you're going to have implicit String - [Word8]
 converters, there's no reason why you can't do the reverse, and have
 isAlpha :: Word8 - IO Bool. Although, like ctype.h, this will only
 work for single-byte encodings.

We should not ignore multibyte encodings like UTF-8, which means that
Haskell should have a Unicoded character type. And it's already
specified in Haskell 98 that Char is such a type!

What is missing is API for manipulating binary files, and conversion
between byte streams and character streams using particular text
encodings.

 A mail client is expected to respect the encoding set in 

Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Glynn Clements

Marcin 'Qrczak' Kowalczyk wrote:

  Unless you are the sole user of a system, you have no control over
  what filenames may occur on it (and even if you are the sole user,
  you may wish to use packages which don't conform to your rules).
 
 For these occasions you may set the encoding to ISO-8859-1. But then
 you can't sensibly show them to the user in a GUI, nor in ncurses
 using the wide character API, nor can you sensibly store them in a
 file which must always be encoded in UTF-8 (e.g. an XML file, where
 you can't put raw bytes without knowing their encoding).

If you need to preserve the data exactly, you can use octal escapes
(\337), URL encoding (%DF) or similar. If you don't, you can just
approximate it (e.g. display unrepresentable characters as '?'). But
this is an inevitable consequence of filenames being bytes rather than
chars.
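
For instance, a small sketch of the URL-encoding approach (escapeByte
and escapeBytes are names made up for this example; '%' itself is
escaped so the mapping stays invertible):

import Data.Char (chr, toUpper)
import Data.Word (Word8)
import Numeric (showHex)

-- Keep printable ASCII as-is; render everything else (and '%') as %XX,
-- so no information in the original bytes is lost.
escapeByte :: Word8 -> String
escapeByte b
  | b >= 0x20 && b < 0x7F && b /= 0x25 = [chr (fromIntegral b)]
  | otherwise = '%' : pad (map toUpper (showHex b ""))
  where pad s = if length s < 2 then '0' : s else s

escapeBytes :: [Word8] -> String
escapeBytes = concatMap escapeByte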

[Actually, regarding on-screen display, this is also an issue for
Unicode. How many people actually have all of the Unicode glyphs? I
certainly don't.]

 There are two paradigms: manipulate bytes not knowing their encoding,
 and manipulating characters explicitly encoded in various encodings
 (possibly UTF-8). The world is slowly migrating from the first to the
 second.

This migration isn't a process which will ever be complete. There will
always be plenty of cases where bytes really are just bytes.

And even to the extent that it can be done, it will take a long time. 
Outside of the Free Software ghetto, long-term backward compatibility
still means a lot.

[E.g.: EBCDIC has been in existence longer than I have and, in spite
of the fact that it's about the only widely-used encoding in existence
which doesn't have ASCII as a subset, it shows no sign of dying out
any time soon.]

   There are limits to the extent to which this can be achieved. E.g. 
   what happens if you set the encoding to UTF-8, then call
   getDirectoryContents for a directory which contains filenames which
   aren't valid UTF-8 strings?
  
  The library fails. Don't do that. This environment is internally
  inconsistent.
 
  Call it what you like, it's a reality, and one which programs need to
  deal with.
 
 The reality is that filenames are encoded in different encodings
 depending on the system. Sometimes it's ISO-8859-1, sometimes
 ISO-8859-2, sometimes UTF-8. We should not ignore the possibility
 of UTF-8-encoded filenames.

I'm not suggesting we do.

 In CLisp it fails silently (undecodable filenames are skipped), which
 is bad. It should fail loudly.

No, it shouldn't fail at all.

  Most programs don't care whether any filenames which they deal with
  are valid in the locale's encoding (or any other encoding). They just
  receive lists (i.e. NUL-terminated arrays) of bytes and pass them
  directly to the OS or to libraries.
 
 And this is why I can't switch my home environment to UTF-8 yet. Too
 many programs are broken; almost all terminal programs which use more
 than stdin and stdout in default modes, i.e. which use line editing or
 work in full screen. How would you display a filename in a full screen
 text editor, such that it works in a UTF-8 environment?

So, what are you suggesting? That the whole world switches to UTF-8? 
Or that every program should pass everything through iconv() (and
handle the failures)? Or what?

  If the assumed encoding is ISO-8859-*, this program will work
  regardless of the filenames which it is passed or the contents of the
  file (modulo the EOL translation on Windows). OTOH, if it were to use
  UTF-8 (e.g. because that was the locale's encoding), it wouldn't work
  correctly if either filename or the file's contents weren't valid
  UTF-8.
 
 A program is not supposed to encounter filenames which are not
 representable in the locale's encoding.

Huh? What does "supposed to" mean in this context? That everything
would be simpler if reality wasn't how it is?

If that's your position, then my response is essentially: "Yes, but so
what?"

 In your setting it's
 impossible to display a filename in a way other than printing
 to stdout.

Writing to stdout doesn't amount to displaying anything; stdout
doesn't have to be a terminal.

  More accurately, it specifies which encoding to assume when you *need*
  to know the encoding (i.e. ctype.h etc), but you can't obtain that
  information from a more reliable source.
 
 In the case of filenames there is no more reliable source.

Sure; but that doesn't automatically mean that the locale's encoding
is correct for any given filename. The point is that you often don't
need to know the encoding.

Converting a byte string to a character string when you're just going
to be converting it back to the original byte string is pointless. And
it introduces unnecessary errors. If the only difference between
(decode . encode) and the identity function is that the former
sometimes fails, what's the point?
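
The point can be made concrete: Latin-1 is a total encoding, so
decoding bytes to Chars and re-encoding is the identity, while a
partial encoding like UTF-8 cannot round-trip arbitrary bytes. A
sketch (decodeLatin1/encodeLatin1 are illustrative names):

import Data.Word (Word8)

-- Every byte is a valid Latin-1 character, so this never fails and
-- encodeLatin1 . decodeLatin1 == id on any filename bytes.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (toEnum . fromIntegral)

-- Truncates silently for code points above 0xFF.
encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . fromEnum)

-- A UTF-8 decoder, by contrast, has no valid result for e.g. [0xFF],
-- so a (decode . encode)-style round trip can fail exactly as described.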

  My central point is that the existing API forces the encoding to be
  an issue when it shouldn't be.
 
 It is an unavoidable 

Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Glynn Clements

Udo Stenzel wrote:

  Note that this needs to include all of the core I/O functions, not
  just reading/writing streams. E.g. FilePath is currently an alias for
  String, but (on Unix, at least) filenames are strings of bytes, not
  characters. Ditto for argv, environment variables, possibly other
  cases which I've overlooked.
 
 I don't think so.  They all are sequences of CChars, and C isn't
 particularly known for keeping bytes and chars apart.

CChar is a C char, which is a byte (not necessarily an octet, and
not necessarily a character either).

 I believe,
 Windows NT has (alternate) filename handling functions that use
 Unicode strings.

Almost all of the Win32 API functions which handle strings exist in
both char and wide-char versions.

 This would strengthen the view that a filename is a sequence
 of characters.

It would be reasonable to make FilePath equivalent to String on
Windows, but not on Unix.

 Ditto for argv, env, whatnot; they are typically entered
 from the shell and therefore are characters in the local encoding.

Both argv and envp are char**, i.e. lists of byte strings. There is no
guarantee that the values can be successfully decoded according to the
locale's encoding.

The environment is typically set on login, and inherited thereafter.
It's typically limited to ASCII, but this isn't guaranteed. Similarly,
a program may need to access files which its user didn't create, and
which have filenames which aren't valid strings in the user's locale.

E.g. a user may choose a locale which uses UTF-8, but the sysadmin has
installed files with ISO-8859-1 filenames. If a Haskell program tries
to coerce everything to String using the user's locale, the program
will be unable to access such files.

   3. The default encoding is settable from Haskell, defaults to
  ISO-8859-1.
  
  Agreed.
 
 Oh no, please don't do that.  A global, settable encoding is, well,
 dys-functional.  Hidden state makes programs hard to understand and
 Haskell imho shouldn't go that route.

There's already plenty of hidden state in the system libraries upon
which a Haskell program depends.

 And please don't introduce the notion of a default encoding.

It isn't an issue of *introducing* it. Many Haskell98 functions (i.e. 
much of IO, System and Directory) accept or return Strings, yet have
to be implemented on top of an OS which accepts or provides char*s. 
There *has* to be an encoding between the two, and currently it's
hardwired to ISO-8859-1.

The alternative to a global encoding is for *all* functions which
interface to the OS to always either accept or return [CChar] or, if
they accept or return Strings, accept an additional argument which
specifies the encoding.

Also, bear in mind that the functions under discussion are all I/O
functions which, by their nature, deal with state (e.g. the state of
the filesystem).

 I'd like to see the following:
 
 - Duplicate the IO library.  The duplicate should work with [Byte]
   everywhere where the old library uses String.  Byte is some suitable
   unsigned integer, on most (all?) platforms this will be Word8

Technically it should be CChar. However, it's fairly safe to assume
that a byte will always be 8 bits; almost nobody writes code which
works on systems where it isn't.

However: if we go this route, I suspect that we will also need a
convenient method for specifying literal byte strings in Haskell
source code.

 - Provide an explicit conversion between encodings.  A simple conversion
   of type [Word8] - String would suit me, iconv would provide all that
   is needed.

For the general case, you need to allow for stateful encodings (e.g. 
ISO-2022). Actually, even UTF-8 needs to deal with state if you need
to decode byte streams which are split into chunks and the breaks can
occur in the middle of a character (e.g. if you're using non-blocking
I/O).
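
One way to accommodate that is to let the decoder itself carry state
across chunks. A hypothetical shape (the names are invented; this is a
sketch of the idea, not a proposed library):

import Data.Word (Word8)

-- Feeding a chunk of bytes yields the characters decoded so far plus a
-- new decoder whose state holds any incomplete multi-byte sequence (or
-- ISO-2022 shift state), ready for the next chunk.
newtype Decoder = Decoder { feed :: [Word8] -> Either String (String, Decoder) }

-- A trivially stateless instance: Latin-1, where each byte is one char.
latin1 :: Decoder
latin1 = Decoder (\bs -> Right (map (toEnum . fromIntegral) bs, latin1))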

 - iconv takes names of encodings as arguments.  Provide some names as
   constants: one name for the internal encoding (probably UCS4), one
   name for the canonical external encoding (probably locale dependent).
 
 - Then redefine the old IO API in terms of the new API and appropriate
   conversions.

The old API requires an implicit encoding. The OS gives accepts or
provides bytes, the old API functions accept or return Chars, and the
old API functions don't accept an encoding argument.

This is why we are (or, at least, I am) suggesting a settable current
encoding. Because the existing API *needs* a current encoding, and I'm
assuming that there may be some reluctance to just discarding it
completely.

 While we're at it, do away with the annoying CR/LF problem on Windows,
 this should simply be part of the local encoding.  This way file can
 always be opened as binary, hSetBinary can be dropped.  (This won't wont
 on ancient platforms where text files and binary files are genuinely
 different, but these are probably not interesting anyway.)

Apart from OS-specific issues, it would be useful to treat EOL
conventions 

Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Marcin 'Qrczak' Kowalczyk
Glynn Clements [EMAIL PROTECTED] writes:

 [Actually, regarding on-screen display, this is also an issue for
 Unicode. How many people actually have all of the Unicode glyphs?
 I certainly don't.]

If I don't have a particular character in fonts, I will not create
files with it in filenames. Actually I only use 9 Polish letters in
addition to ASCII, and even them rarely. Usually it's only a subset
of ASCII.

Some programs use UTF-8 in filenames no matter what the locale is. For
example the Evolution mail program which stores mail folders as files
under names the user entered in a GUI. I had to rename some of these
files in order to import them to Gnus, as it choked on filenames with
strange characters, never mind that it didn't display them correctly
(maybe because it tried to map them to virtual newsgroup names, or
maybe because they are control characters in ISO-8859-x).

If all programs consistently used the locale encoding for filenames,
this should have worked.

When I switch my environment to UTF-8, which may happen in a few
years, I will convert filenames to UTF-8 and set up mount options to
translate vfat filenames to/from UTF-8 instead of to ISO-8859-2.

I expect good programs to understand that and display them correctly
no matter what technique they are using for the display. For example
the Epiphany web browser, when I open the file:/home/users/qrczak URL,
displays ISO-8859-2-encoded filenames correctly. The virtual HTML file
it created from the directory listing has &#x105; in its title where
the directory name had 0xB1 in ISO-8859-2. When I run Epiphany with
the locale set to pl_PL.UTF-8, it displays UTF-8 filenames correctly
and ISO-8859-2 filenames are not shown at all.

It's fine for me that it doesn't deal with wrongly encoded filenames,
because it allowed to treat well encoded filenames correctly. For a
web page rendered on the screen it makes no sense to display raw
bytes. Epiphany treats filenames as sequences of characters encoded
according to the locale.

 And even to the extent that it can be done, it will take a long time. 
 Outside of the Free Software ghetto, long-term backward compatibility
 still means a lot.

Windows has already switched most of its internals to Unicode, and it
did it faster than Linux.

 In CLisp it fails silently (undecodable filenames are skipped), which
 is bad. It should fail loudly.

 No, it shouldn't fail at all.

Since it uses Unicode as string representation, accepting filenames
not encoded in the locale encoding would imply making garbage from
filenames correctly encoded in the locale encoding. In a UTF-8
environment character U+00E1 in the filename means bytes 0xC3 0xA1
on ext2 filesystem (and 0x00E1 on vfat filesystem), so it can't at
the same time mean 0xE1 on ext2 filesystem.

 And this is why I can't switch my home environment to UTF-8 yet. Too
 many programs are broken; almost all terminal programs which use more
 than stdin and stdout in default modes, i.e. which use line editing or
 work in full screen. How would you display a filename in a full screen
 text editor, such that it works in a UTF-8 environment?

 So, what are you suggesting? That the whole world switches to UTF-8?

No, each computer system decides for itself, and announces it in the
locale setting. I'm suggesting that programs should respect that and
correctly handle all correctly encoded texts, including filenames.

Better programs may offer to choose the encoding explicitly when it
makes sense (e.g. text file editors for opening a file), but if they
don't, they should at least accept the locale encoding.

 Or that every program should pass everything through iconv()
 (and handle the failures)?

If it uses Unicode as internal string representation, yes (because the
OS API on Unix generally uses byte encodings rather than Unicode).

This should be done transparently in libraries of respective languages
instead of in each program independently.

 A program is not supposed to encounter filenames which are not
 representable in the locale's encoding.

 Huh? What does supposed to mean in this context? That everything
 would be simpler if reality wasn't how it is?

It means that if it encounters a filename encoded differently, it's
usually not the fault of the program but of whoever caused the
mismatch in the first place.

 In your setting it's impossible to display a filename in a way
 other than printing to stdout.

 Writing to stdout doesn't amount to displaying anything; stdout
 doesn't have to be a terminal.

I know, it's not the point. The point is that other display channels
than stdout connected to a terminal often work in terms of characters
rather than bytes of some implicit encoding. For example various GUI
frameworks, and wide character ncurses.

 Sure; but that doesn't automatically mean that the locale's encoding
 is correct for any given filename. The point is that you often don't
 need to know the encoding.

What if I do need to know the encoding? I must assume 

Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Glynn Clements

Marcin 'Qrczak' Kowalczyk wrote:

  [Actually, regarding on-screen display, this is also an issue for
  Unicode. How many people actually have all of the Unicode glyphs?
  I certainly don't.]
 
 If I don't have a particular character in fonts, I will not create
 files with it in filenames. Actually I only use 9 Polish letters in
 addition to ASCII, and even them rarely. Usually it's only a subset
 of ASCII.

But this seems to be assuming a closed world. I.e. the only files
which the program will ever see are those which were created by you,
or by others who are compatible with your conventions.

 Some programs use UTF-8 in filenames no matter what the locale is. For
 example the Evolution mail program which stores mail folders as files
 under names the user entered in a GUI.

This is entirely reasonable for a file which a program creates. If a
filename is just a string of bytes, a program can use whatever
encoding it wants.

 I had to rename some of these
 files in order to import them to Gnus, as it choked on filenames with
 strange characters, never mind that it didn't display them correctly
 (maybe because it tried to map them to virtual newsgroup names, or
 maybe because they are control characters in ISO-8859-x).

If it had just treated them as bytes, rather than trying to interpret
them as characters, there wouldn't have been any problems.

 If all programs consistently used the locale encoding for filenames,
 this should have worked.

But again, for this to work in general, you have to assume a closed
world.

 When I switch my environment to UTF-8, which may happen in a few
 years, I will convert filenames to UTF-8 and set up mount options to
 translate vfat filenames to/from UTF-8 instead of to ISO-8859-2.

But what about files which were created by other people, who
don't use UTF-8?

 I expect good programs to understand that and display them correctly
 no matter what technique they are using for the display.

When it comes to display, you have to have to deal with encoding
issues one way or another. But not all programs deal with display.

 For example
 the Epiphany web browser, when I open the file:/home/users/qrczak URL,
 displays ISO-8859-2-encoded filenames correctly. The virtual HTML file
 it created from the directory listing has &#x105; in its title where
 the directory name had 0xB1 in ISO-8859-2. When I run Epiphany with
 the locale set to pl_PL.UTF-8, it displays UTF-8 filenames correctly
 and ISO-8859-2 filenames are not shown at all.

For many (probably most) programs, omitting such files would be an
unacceptable failure.

  And even to the extent that it can be done, it will take a long time. 
  Outside of the Free Software ghetto, long-term backward compatibility
  still means a lot.
 
 Windows has already switched most of its internals to Unicode, and it
 did it faster than Linux.

Microsoft is actively hostile to both backwards compatibility and
cross-platform compatibility.

I consider the fact that some Unix (primarily Linux) developers seem
equally hostile to be a problem.

Having said that, with Linux developers, the issue is usually due to not
being bothered. Assuming that everything is UTF-8 allows a lot of
potential problems to be ignored.

Fortunately, the problem is mostly consigned to the periphery, i.e. 
the desktop, where most programs have to deal with display issues (so
you *have* to decode bytes into characters), and it isn't too critical
if they have limitations.

The core OS and network server applications essentially remain
encoding-agnostic.

  In CLisp it fails silently (undecodable filenames are skipped), which
  is bad. It should fail loudly.
 
  No, it shouldn't fail at all.
 
 Since it uses Unicode as string representation, accepting filenames
 not encoded in the locale encoding would imply making garbage from
 filenames correctly encoded in the locale encoding. In a UTF-8
 environment character U+00E1 in the filename means bytes 0xC3 0xA1
 on ext2 filesystem (and 0x00E1 on vfat filesystem), so it can't at
 the same time mean 0xE1 on ext2 filesystem.

But, as I keep pointing out, filenames are byte strings, not character
strings. You shouldn't be converting them to character strings unless
you have to.
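
To make the byte-level difference concrete, here is a minimal sketch
(illustrative, not code from either side of this exchange) of UTF-8
encoding for code points below U+0800, which is where the 0xC3 0xA1
pair above comes from:

import Data.Char (ord)
import Data.Bits ((.|.), (.&.), shiftR)
import Data.Word (Word8)

-- Two-byte UTF-8: 110xxxxx 10xxxxxx covers U+0080..U+07FF.
encodeUtf8 :: Char -> [Word8]
encodeUtf8 c
    | n < 0x80  = [fromIntegral n]
    | n < 0x800 = [ fromIntegral (0xC0 .|. (n `shiftR` 6))
                  , fromIntegral (0x80 .|. (n .&. 0x3F)) ]
    | otherwise = error "sketch: only code points below U+0800"
    where n = ord c

-- encodeUtf8 '\xE1' == [0xC3, 0xA1]; the ISO-8859-1 encoding of the
-- same character is the single byte 0xE1.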

  And this is why I can't switch my home environment to UTF-8 yet. Too
  many programs are broken; almost all terminal programs which use more
  than stdin and stdout in default modes, i.e. which use line editing or
  work in full screen. How would you display a filename in a full screen
  text editor, such that it works in a UTF-8 environment?
 
  So, what are you suggesting? That the whole world switches to UTF-8?
 
 No, each computer system decides for itself, and announces it in the
 locale setting. I'm suggesting that programs should respect that and
 correctly handle all correctly encoded texts, including filenames.

1. Actually, each user decides which locale they wish to use. Nothing
forces two users of a system to use the same locale.

2. Even if 

Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Ben Rudiak-Gould
I modestly re-propose the I/O model which I first proposed last year:

http://www.haskell.org/pipermail/haskell/2003-July/012312.html
http://www.haskell.org/pipermail/haskell/2003-July/012313.html
http://www.haskell.org/pipermail/haskell/2003-July/012350.html
http://www.haskell.org/pipermail/haskell/2003-July/012352.html
...

-- Ben



RE: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Simon Marlow
On 15 September 2004 12:32, [EMAIL PROTECTED] wrote:

 On Mon, Sep 13, 2004 at 12:01:58PM +0100, Glynn Clements wrote:
 My view is that, right now, we have the worst of both worlds, and
 taking a short step backwards (i.e. narrow the Char type and leave
 the rest alone) is a lot simpler (and more feasible) than the long
 journey towards real I18N.
 
 This being Haskell, I can't imagine a consensus on a step backwards.
 In any case, a Char type distinct from bytes and the rest is the most
 valuable part of the current situation.  The rest is just libraries,
 and the solution to that is to create other libraries.  (It's true
 that the Prelude is harder to work around, but even that can be done,
 as with the new exception interface.)  Indeed more than one approach
 can proceed concurrently, and that's probably what's going to happen:
 
 The Right Thing proceeds in stages:
 1. new byte-based libraries
 2. conversions sitting on top of these
 3. the ultimate I18N API
 
 The Quick Fix: alter the existing implementation to use the
 encoding determined by the current locale at the borders.

I wish I had some more time to work on this, but I implemented a
prototype of an ultimate i18n API recently.  This is derived from the
API that Ben Rudiak-Gould proposed last year.

It is in two layers: the InputStream/OutputStream classes provide raw
byte I/O, and the TextStream class provides a conversion on top of that.
The prototype uses iconv for conversions.  You can make Streams from all
sorts of things: files, sockets, pipes, and even Haskell arrays.
InputStream and OutputStreams are just classes, so you can implement
your own.
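
A rough sketch of that two-layer shape (the names here are
illustrative only, not the actual new-io API):

import Data.Word (Word8)

-- Layer 1: raw byte I/O as classes, so files, sockets, pipes and
-- in-memory arrays can all be stream instances.
class InputStream s where
    readBytes :: s -> Int -> IO [Word8]

class OutputStream s where
    writeBytes :: s -> [Word8] -> IO ()

-- Layer 2: text as a conversion stacked on a byte stream; the codec
-- functions would be backed by something like iconv.  (A real
-- implementation must also buffer incomplete multi-byte sequences.)
data TextStream s = TextStream
    { byteStream :: s
    , decoder    :: [Word8] -> String
    , encoder    :: String -> [Word8]
    }

readText :: InputStream s => TextStream s -> Int -> IO String
readText ts n = fmap (decoder ts) (readBytes (byteStream ts) n)

writeText :: OutputStream s => TextStream s -> String -> IO ()
writeText ts = writeBytes (byteStream ts) . encoder ts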

IIRC, I managed to get it working with speed comparable to GHC's current
IO library.

Here's a tarball that works with GHC 6.2.1 on a Unix platform, just
--make to build it:

  http://www.haskell.org/~simonmar/new-io.tar.gz

If anyone would like to pick this up and run with it, I'd be delighted.
I'm not likely to get back to it in the short term, at least.

Cheers,
Simon


Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Malcolm Wallace
Simon Marlow [EMAIL PROTECTED] writes:

 Here's a tarball that works with GHC 6.2.1 on a Unix platform, just
 --make to build it:
 
   http://www.haskell.org/~simonmar/new-io.tar.gz

Found a bug already...
In System/IO/Stream.hs, line 183:

streamReadBufrer s 0 buf = return 0
streamReadBuffer s len ptr = ...

Note the different spellings of the function name.
Regards,
Malcolm


Re: [Haskell-cafe] Writing binary files?

2004-09-15 Thread Glynn Clements

Graham Klyne wrote:

 In particular, the idea of narrowing the Char type really seems like a 
 bad idea to me (if I understand the intent correctly).  Not so long ago, I 
 did a whole load of work on the HaXml parser so that, among other things, 
 it would support UTF-8 and UTF-16 Unicode (as required by the XML 
 spec).  To do this depends upon having a Char type that can represent the 
 full repertoire of Unicode characters.

Note: I wasn't proposing doing away with wide character support
altogether. Essentially, I was suggesting making Char a byte and
having e.g. WideChar for wide characters. The reason being that the
existing Haskell98 API uses Char for functions which are actually
dealing with bytes.

In an ideal world, the IO, System and Directory modules (and the
Prelude I/O functions) would have used Byte, leaving Char to represent
a (wide) character. However, that isn't the hand we've been dealt.

-- 
Glynn Clements [EMAIL PROTECTED]


Re: [Haskell-cafe] Writing binary files?

2004-09-14 Thread Udo Stenzel
Glynn Clements wrote:
 Marcin 'Qrczak' Kowalczyk wrote:
  [...]
 Note that this needs to include all of the core I/O functions, not
 just reading/writing streams. E.g. FilePath is currently an alias for
 String, but (on Unix, at least) filenames are strings of bytes, not
 characters. Ditto for argv, environment variables, possibly other
 cases which I've overlooked.

I don't think so.  They all are sequences of CChars, and C isn't
particularly known for keeping bytes and chars apart.  I believe,
Windows NT has (alternate) filename handling functions that use unicode
stringsr.  This would strengthen the view that a filename is a sequence
of characters.  Ditto for argv, env, whatnot; they are typically entered
from the shell and therefore are characters in the local encoding.


  3. The default encoding is settable from Haskell, defaults to
 ISO-8859-1.
 
 Agreed.

Oh no, please don't do that.  A global, settable encoding is, well,
dys-functional.  Hidden state makes programs hard to understand and
Haskell imho shouldn't go that route.  And please don't introduce the
notion of a default encoding.


I'd like to see the following:

- Duplicate the IO library.  The duplicate should work with [Byte]
  everywhere where the old library uses String.  Byte is some suitable
  unsigned integer, on most (all?) platforms this will be Word8

- Provide an explicit conversion between encodings.  A simple conversion
  of type [Word8] -> String would suit me, iconv would provide all that
  is needed.

- iconv takes names of encodings as arguments.  Provide some names as
  constants: one name for the internal encoding (probably UCS4), one
  name for the canonical external encoding (probably locale dependent).

- Then redefine the old IO API in terms of the new API and appropriate
  conversions.

While we're at it, do away with the annoying CR/LF problem on Windows,
this should simply be part of the local encoding.  This way files can
always be opened as binary, and hSetBinary can be dropped.  (This won't work
on ancient platforms where text files and binary files are genuinely
different, but these are probably not interesting anyway.)

The same thoughts apply to filenames.  Make them [Word8] and convert
explicitly.  By the way, I think a path should be a list of names (that
is of type [[Word8]]) and the library would be concerned with putting in
the right path separator.  Add functions to read and show pathnames in
the local conventions and we'll never need to worry about path
separators again.
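
Under the assumption discussed elsewhere in this thread - that the
implementation passes the low 8 bits of a Char through unchanged - a
toy version of the proposed byte layer can even be built on today's
handles.  The names and the Latin-1-only converter below are
hypothetical stand-ins; a real version would call iconv:

import Data.Char (chr, ord)
import Data.Word (Word8)
import System.IO (Handle, hGetChar, hPutChar)
import Control.Monad (replicateM)

type Byte = Word8

-- Byte I/O expressed via the existing Char-based handle operations.
hGetBytes :: Handle -> Int -> IO [Byte]
hGetBytes h n = fmap (map (fromIntegral . ord)) (replicateM n (hGetChar h))

hPutBytes :: Handle -> [Byte] -> IO ()
hPutBytes h = mapM_ (hPutChar h . chr . fromIntegral)

-- Conversion selected by encoding name, as iconv would take it.
decode :: String -> [Byte] -> String
decode "ISO-8859-1" = map (chr . fromIntegral)
decode enc          = error ("sketch: no converter for " ++ enc)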

 
 There are limits to the extent to which this can be achieved. E.g. 
 what happens if you set the encoding to UTF-8, then call
 getDirectoryContents for a directory which contains filenames which
 aren't valid UTF-8 strings?

Well, then you did something stupid, didn't you?  If you don't know the
encoding you shouldn't decode anything.  That's a strong point against
any implicit decoding, I think.


Also, if efficiency is a concern, lists probably shouldn't be passed
between filesystem operations and iconv.  I think, we need a better
representation here (like PackedString for Word8), not a convoluted API.

Regards,

Udo.
-- 
If Perl is the solution, you're solving the wrong problem. -- Erik Naggum




Re: [Haskell-cafe] Writing binary files?

2004-09-14 Thread Glynn Clements

David Menendez wrote:

  I'd like to see the following:
  
  - Duplicate the IO library.  The duplicate should work with [Byte]
everywhere where the old library uses String.  Byte is some suitable
unsigned integer, on most (all?) platforms this will be Word8
  
  - Provide an explicit conversion between encodings.  A simple
conversion of type [Word8] -> String would suit me, iconv would
provide all that is needed.
 
 I like this idea, but I say there should be a bit-oriented layer beneath
 everything.

The byte stream is inherent, as that's (usually) what the OS gives
you. Everything else is synthesised.

-- 
Glynn Clements [EMAIL PROTECTED]


Re: [Haskell-cafe] Writing binary files?

2004-09-14 Thread David Menendez
Glynn Clements writes:

 
 David Menendez wrote:
 
   I'd like to see the following:
   
   - Duplicate the IO library.  The duplicate should work with [Byte]
 everywhere where the old library uses String.  Byte is some 
 suitable unsigned integer, on most (all?) platforms this will be
 Word8
   
   - Provide an explicit conversion between encodings.  A simple
 conversion of type [Word8] -> String would suit me, iconv would
 provide all that is needed.
  
  I like this idea, but I say there should be a bit-oriented layer
  beneath everything.
 
 The byte stream is inherent, as that's (usually) what the OS gives
 you. Everything else is synthesised.

I was unclear. I meant the bit layer would be beneath everything
conceptually. On today's machines, it would be implemented in terms of a
byte stream and the conversion to the byte stream type would get
compiled away.
-- 
David Menendez [EMAIL PROTECTED] http://www.eyrie.org/~zednenem/


Re: [Haskell-cafe] Writing binary files?

2004-09-13 Thread Ketil Malde
Glynn Clements [EMAIL PROTECTED] writes:

 Right now, the attempt at providing I18N for free, by defining Char
 to mean Unicode, has essentially backfired, IMHO. Anything that isn't
 ISO-8859-1 just doesn't work for the most part, and anyone who wants

Basically, I'm inclined to agree with what you say.  A minor point is
that the other ISO-8859 encodings (or really, any single-byte
encoding) works equally well, as long as you don't want to mix them.
So I guess you really want to say "Anything that isn't a single-byte
encoding..."

(Except for string constants, I guess, but perhaps you could just use
its byte representation in the source?  The length could be slightly
surprising, though.)

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants


Re: [Haskell-cafe] Writing binary files?

2004-09-13 Thread Glynn Clements

Marcin 'Qrczak' Kowalczyk wrote:

  1. API for manipulating byte sequences in I/O (without representing
 them in String type).
 
  Note that this needs to include all of the core I/O functions, not
  just reading/writing streams. E.g. FilePath is currently an alias for
  String, but (on Unix, at least) filenames are strings of bytes, not
  characters. Ditto for argv, environment variables, possibly other
  cases which I've overlooked.
 
 They don't hold binary data; they hold data intended to be interpreted
 as text.

No. They frequently hold data intended to be passed to system
functions which interpret them simply as bytes, without regard to
encoding.

 If the encoding of the text doesn't agree with the locale,
 the environment setup is broken and 'ls' and 'env' misbehave on an
 UTF-8 terminal.

ls and env just write bytes to stdout (which may or may not refer to
the terminal). A particular terminal may not display them correctly,
but that's a separate issue.

Unless you are the sole user of a system, you have no control over
what filenames may occur on it (and even if you are the sole user, you
may wish to use packages which don't conform to your rules). As
environment variables frequently contain pathnames, this fact may get
propagated to the environment (however, system directories are usually
restricted to ASCII, so this aspect is less likely to be an issue).

 A program can explicitly set the default encoding to ISO-8859-1 if it
 wishes to do something in a broken environment.
 
  4. Libraries are reviewed to ensure that they work with various
 encoding settings.
 
  There are limits to the extent to which this can be achieved. E.g. 
  what happens if you set the encoding to UTF-8, then call
  getDirectoryContents for a directory which contains filenames which
  aren't valid UTF-8 strings?
 
 The library fails. Don't do that. This environment is internally
 inconsistent.

Call it what you like, it's a reality, and one which programs need to
deal with.

Most programs don't care whether any filenames which they deal with
are valid in the locale's encoding (or any other encoding). They just
receive lists (i.e. NUL-terminated arrays) of bytes and pass them
directly to the OS or to libraries.

  I feel that the default encoding should be one whose decoder cannot
  fail, e.g. ISO-8859-1.
 
 But filenames on my filesystem and most file contents are *not*
 encoded in ISO-8859-1. Assuming that they are ISO-8859-1 is plainly
 wrong.

For the most part, assuming that they are encoded in *any* coding
system is wrong.

However, If you treat them as ISO-8859-* (it doesn't matter which one,
so long as you're consistent), the Haskell I/O functions will at least
pass them through unmodified. Consider a trivial cp program:

import System (getArgs)

main = do
    [src, dst] <- getArgs
    text <- readFile src
    writeFile dst text

If the assumed encoding is ISO-8859-*, this program will work
regardless of the filenames which it is passed or the contents of the
file (modulo the EOL translation on Windows). OTOH, if it were to use
UTF-8 (e.g. because that was the locale's encoding), it wouldn't work
correctly if either filename or the file's contents weren't valid
UTF-8.

  You should have to explicitly request the use of the locale's
  encoding (analogous to calling setlocale(LC_CTYPE, ) at the start
  of a C program; there's a good reason why C doesn't do this without
  being explicitly told to).
 
 C usually uses the paradigm of representing text in their original
 8-bit encodings. This is why getting C programs to work in a UTF-8
 locale is such a pain. Only some programs use wchar_t internally.

Many C programs don't care about encodings. It's only if you actually
have to interpret the bytes (e.g. ctype.h, strcoll) that encodings
start to matter. At which point, you have to know the encoding.

 Java and C# use the paradigm of representing text in Unicode
 internally, recoding it on boundaries with the external world.
 
 The second paradigm has a cost that you must be aware what encodings
 are used in texts you manipulate.

And that cost can be a pretty high; e.g. gratuitously failing in the
case where you have no idea which encoding is used but where you
shouldn't actually need to know.

 Locale gives a reasonable default
 for simple programs which aren't supposed to work with multiple
 encodings, and it specifies the encoding of texts which don't have an
 encoding specified elsewhere (terminal I/O, filenames, environment
 variables).

More accurately, it specifies which encoding to assume when you *need*
to know the encoding (i.e. ctype.h etc), but you can't obtain that
information from a more reliable source.

My central point is that the existing API forces the encoding to be an
issue when it shouldn't be.

 ncurses wide character API is still broken. I reported bugs, the
 author acknowledged them, but hasn't fixed them. (Attributes are
 ignored on add_wch; get_wch is wrong for non-ASCII keys pressed if
 the locale is different from ISO-8859-1 and UTF-8.)

Re: [Haskell-cafe] Writing binary files?

2004-09-13 Thread Marcin 'Qrczak' Kowalczyk
Glynn Clements [EMAIL PROTECTED] writes:

 1. API for manipulating byte sequences in I/O (without representing
them in String type).

 Note that this needs to include all of the core I/O functions, not
 just reading/writing streams. E.g. FilePath is currently an alias for
 String, but (on Unix, at least) filenames are strings of bytes, not
 characters. Ditto for argv, environment variables, possibly other
 cases which I've overlooked.

They don't hold binary data; they hold data intended to be interpreted
as text. If the encoding of the text doesn't agree with the locale,
the environment setup is broken and 'ls' and 'env' misbehave on an
UTF-8 terminal.

A program can explicitly set the default encoding to ISO-8859-1 if it
wishes to do something in a broken environment.

 4. Libraries are reviewed to ensure that they work with various
encoding settings.

 There are limits to the extent to which this can be achieved. E.g. 
 what happens if you set the encoding to UTF-8, then call
 getDirectoryContents for a directory which contains filenames which
 aren't valid UTF-8 strings?

The library fails. Don't do that. This environment is internally
inconsistent.

 I feel that the default encoding should be one whose decoder cannot
 fail, e.g. ISO-8859-1.

But filenames on my filesystem and most file contents are *not*
encoded in ISO-8859-1. Assuming that they are ISO-8859-1 is plainly
wrong.

 You should have to explicitly request the use of the locale's
 encoding (analogous to calling setlocale(LC_CTYPE, ) at the start
 of a C program; there's a good reason why C doesn't do this without
 being explicitly told to).

C usually uses the paradigm of representing text in their original
8-bit encodings. This is why getting C programs to work in a UTF-8
locale is such a pain. Only some programs use wchar_t internally.

Java and C# use the paradigm of representing text in Unicode
internally, recoding it on boundaries with the external world.

The second paradigm has a cost that you must be aware what encodings
are used in texts you manipulate. Locale gives a reasonable default
for simple programs which aren't supposed to work with multiple
encodings, and it specifies the encoding of texts which don't have an
encoding specified elsewhere (terminal I/O, filenames, environment
variables).

It also has benefits:

1. It's easier to work with multiple encodings, because the internal
   representation can represent text decoded from any of them and is
   the same in all places of the program.

2. It's much easier to work in a UTF-8 environment, and to work with
   libraries which use Unicode internally (e.g. Gtk+ or Qt).

3. isAlpha, toUpper etc. are true pure functions. (Haskell API is
   broken in a different way here: toUpper should be defined in terms
   of strings, not characters.)
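
One well-known example of why case mapping is a string operation: the
German sharp s has no single-character uppercase form, so a
Char -> Char toUpper must leave it unchanged, while a String version
can do the right thing.  A minimal sketch:

import qualified Data.Char as C

upcase :: String -> String
upcase = concatMap up
    where up '\xDF' = "SS"          -- U+00DF LATIN SMALL LETTER SHARP S
          up c      = [C.toUpper c]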

 Actually, the more I think about it, the more I think that simple,
 stupid programs probably shouldn't be using Unicode at all.

This attitude causes them to break in a UTF-8 environment, which is
why I can't use it as a default yet.

ncurses wide character API is still broken. I reported bugs, the
author acknowledged them, but hasn't fixed them. (Attributes are
ignored on add_wch; get_wch is wrong for non-ASCII keys pressed if
the locale is different from ISO-8859-1 and UTF-8.) It seems people
don't use that API yet, because C traditionally uses the model of
representing texts in byte sequences. But the narrow character API
of ncurses is unusable with UTF-8 - this is not an implementation
limitation but an inherent limitation of the interface.

 I.e. Char, String, string literals, and the I/O functions in Prelude,
 IO etc should all be using bytes, with a distinct wide-character API
 available for people who want to make the (substantial) effort
 involved in writing (genuinely) internationalised programs.

This would cause excessive duplication of APIs. Look, Java and C#
don't do that. Only file contents handling needs a byte API, because
many files don't contain text.

This would imply isAlpha :: Char -> IO Bool.

 Right now, the attempt at providing I18N for free, by defining Char
 to mean Unicode, has essentially backfired, IMHO.

Because it needs to be accompanied with character recoders, both
invoked explicitly (also lazily) and attached to file handles, and
with a way to obtain recoders for various encodings.

Assuming that the same encoding is used everywhere and programs can
just copy bytes without interpreting them no longer works today.
A mail client is expected to respect the encoding set in headers.

 Oh, and because bytes are being stored in Chars, the type system won't
 help if you neglect to decode a string, or if you decode it twice.

This is why I said 1. API for manipulating byte sequences in I/O
(without representing them in String type).

 2. If you assume ISO-8859-1, you can always convert back to Word8 then
 re-decode as UTF-8. If you assume UTF-8, anything which is neither
 UTF-8 nor ASCII will fail far more severely than just getting the
 collation order wrong.

Re: [Haskell-cafe] Writing binary files?

2004-09-13 Thread Jan-Willem Maessen - Sun Labs East
Glynn Clements wrote:
Actually, the more I think about it, the more I think that simple,
stupid programs probably shouldn't be using Unicode at all.
I.e. Char, String, string literals, and the I/O functions in Prelude,
IO etc should all be using bytes, with a distinct wide-character API
available for people who want to make the (substantial) effort
involved in writing (genuinely) internationalised programs.
I have become very sympathetic to this viewpoint.  In particular, it 
is nigh-impossible to just move bits into and out of a Haskell 
program.  There certainly isn't a simple, portable way to do so.  In 
the absence of such a mechanism, there are *no* *portable* *workarounds*.

The absence of real internationalization is galling, but not nearly as 
galling as the absence of simple, stupid, reproducible I/O facilities.
Forget even ISO-8859-1.  Just give me bytes.

Personally, I would take the C approach: redefine Char to mean a byte
(i.e. CChar), treat string literals as bytes, keep the existing type
signatures on all of the existing Haskell98 functions, and provide a
completely new wide-character API for those who wish to use it.
Much as we may hate to admit it, this would probably just work in 99% 
of all cases, and would make it a simple exercise to build working 
solutions.

Given the frequency with which this issue crops up, and the associated
lack of action to date, I'd rather not have to wait until someone
finally gets around to designing the new, improved,
genuinely-I18N-ised API before we can read/write arbitrary files
without too much effort.
This is simple stuff, and far more basic than the details of text 
representation.  Why must it be so hard?  And why must we drag 
students through the muck of allocating and managing explicit byte 
buffers and (God forbid) understanding the FFI just to get a few bytes 
on and off disk without the system monkeying with them en route?

Simplify.  Please.
-Jan-Willem Maessen


Re: [Haskell-cafe] Writing binary files?

2004-09-13 Thread David Menendez
Glynn Clements writes:

  The problem is that the API for that is not yet designed, so
  programs can't be written such that they will work after the
  default encoding change.
 
 Personally, I would take the C approach: redefine Char to mean a byte
 (i.e. CChar), treat string literals as bytes, keep the existing type
 signatures on all of the existing Haskell98 functions, and provide a
 completely new wide-character API for those who wish to use it.

I really don't like having a type called Char that's defined to be a
byte, rather than a character. On the other hand, I don't know how much
of a pain it would be to replace Char with Word8 in the file IO library,
which would be my preferred temporary solution.

 Given the frequency with which this issue crops up, and the associated
 lack of action to date, I'd rather not have to wait until someone
 finally gets around to designing the new, improved,
 genuinely-I18N-ised API before we can read/write arbitrary files
 without too much effort.

Any I18N-ized API would need a bit-level layer underneath, right? In
fact, a good low-level IO library could support multiple higher-level
APIs. Has there been any progress on something like that?
-- 
David Menendez [EMAIL PROTECTED] | In this house, we obey the laws
http://www.eyrie.org/~zednenem  |of thermodynamics!


Re: [Haskell-cafe] Writing binary files?

2004-09-12 Thread Sven Panne
Glynn Clements wrote:
[...]
main :: IO ()
main = do
    h <- openBinaryFile "out.dat" WriteMode
    hPutStr h $ map (octetToChar . bitsToOctet) bits
    hClose h
Hmmm, using string I/O when one really wants to do binary I/O gives me a bad
feeling. Haskell characters are defined to be Unicode characters, so the
above only works because current Haskell implementations usually get this wrong
(either no Unicode support at all and/or ignoring any encodings and doing I/O
only with the lower 8 bits of the characters)... hGetBuf/hPutBuf plus their
non-blocking variants are the only way to *really* do binary I/O currently.
Cheers,
   S.


Re: [Haskell-cafe] Writing binary files?

2004-09-12 Thread Glynn Clements

Abraham Egnor wrote:

 Passing a Ptr isn't that onerous; it's easy enough to make functions
 that have the signature you'd like:
 
 import System.IO
 import Data.Word (Word8)
 import Foreign.Marshal.Array
 
 hPutBytes :: Handle -> [Word8] -> IO ()
 hPutBytes h ws = withArray ws $ \p -> hPutBuf h p $ length ws
 
 hGetBytes :: Handle -> Int -> IO [Word8]
 hGetBytes h c = allocaArray c $ \p ->
     do c' <- hGetBuf h p c
        peekArray c' p

The problem with this approach is that the entire array has to be held
in memory, which could be an issue if the amount of data involved is
large.
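
A sketch of one way to bound the memory use (not from the original
message): marshal and write the list in fixed-size chunks, so only
one chunk is ever held as an array at a time.

import System.IO (Handle, hPutBuf)
import Data.Word (Word8)
import Foreign.Marshal.Array (withArray)

hPutBytesChunked :: Handle -> [Word8] -> IO ()
hPutBytesChunked h = go
    where
      chunkSize = 4096
      go [] = return ()
      go ws = do
          let (now, rest) = splitAt chunkSize ws
          withArray now $ \p -> hPutBuf h p (length now)
          go rest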

-- 
Glynn Clements [EMAIL PROTECTED]


Re: [Haskell-cafe] Writing binary files?

2004-09-12 Thread Sven Panne
Glynn Clements wrote:
The problem with this approach is that the entire array has to be held
in memory, which could be an issue if the amount of data involved is
large.
Simple reasoning: If the amount of data is large, you don't want the overhead
of lists because it kills performance. If the amount of data is small, you
can easily use similar code to read/write a single byte. :-)
Of course things are a bit different when you are in the blissful position
where lazy I/O is what you want. This implies that you expect a stream of
data of a single type. How often is this really the case? And I'm not sure
if this is the correct way of doing things even when the data involved wouldn't
fit into memory all at once. I'd prefer something mmap-like then...
Cheers,
   S.


Re: [Haskell-cafe] Writing binary files?

2004-09-12 Thread Glynn Clements

Sven Panne wrote:

  Also, changing the existing functions to deal with encodings is likely
  to break a lot of things (i.e. anything which reads or writes data
  which is in neither UTF-8 nor the locale-specified encoding).
 
 Hmmm, the Unicode tables start with ISO-Latin-1, so what would exactly break
 when we stipulate that the standard encoding for string I/O in Haskell is
 ISO-Latin-1?

That would essentially be formally specifying the existing behaviour,
which wouldn't break anything, including the mechanism for
reading/writing binary data which I suggested (and which is the only
choice if your Haskell implementation doesn't have h{Get,Put}Buf).

The problems would come if it was decided to change the existing
behaviour, i.e. use something other than Latin1.

-- 
Glynn Clements [EMAIL PROTECTED]


Re: [Haskell-cafe] Writing binary files?

2004-09-12 Thread Marcin 'Qrczak' Kowalczyk
Sven Panne [EMAIL PROTECTED] writes:

 Hmmm, the Unicode tables start with ISO-Latin-1, so what would exactly break
 when we stipulate that the standard encoding for string I/O in Haskell is
 ISO-Latin-1? Additional encodings could be specified e.g. via a new open
 variant.

That the encoding of most file contents is not ISO-Latin-1 in practice.
The locale mechanism specifies a default.

It's also a default for other things: filenames (on Unix), program
invocation arguments, environment variables etc. Some other places
have an encoding hardwired (e.g. Gtk+ uses UTF-8 and Qt uses UTF-16),
and yet others have it specified as a part of the protocol (email,
usenet, WWW).

Unfortunately changing a Haskell implementation to actually convert
between the external encodings and Unicode must be done in all those
places at once, otherwise there will be mismatches and e.g. printing
program invocation arguments to a file will have a wrong effect.

Most Haskell programs currently work because they misuse Chars to
represent characters in the implicit default encoding - as long as
they don't use isAlpha or toUpper on non-ASCII characters, and as
long as they don't try to support several encodings at once.

These two paradigms:
A. Represent strings using their original encoding.
B. Use Unicode internally, convert it at the boundaries.
should not be mixed in one string type, or confusion will arise.

For at least some of these places, e.g. file contents or socket data,
a program must have a way to specify a different encoding, and also to
manipulate raw bytes without recoding. But the default encoding should
come from the locale instead of being ISO-8859-1. A Char value should
always mean a Unicode code point and not e.g. an ISO-8859-2-coded value.
This is the B paradigm and it must be applied consistently.
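
A sketch of how the two paradigms can be kept apart by type (the
names are hypothetical): if undecoded data gets its own type, the
type checker rejects both forgetting to decode and decoding twice.

import Data.Char (chr)
import Data.Word (Word8)

newtype Bytes = Bytes [Word8]      -- paradigm A: raw external data

decodeWith :: (Word8 -> Char) -> Bytes -> String
decodeWith dec (Bytes bs) = map dec bs

-- A single-byte codec as a stand-in for a real converter:
latin1 :: Word8 -> Char
latin1 = chr . fromIntegral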

I did this for my language http://kokogut.sourceforge.net/ and it
works. Only some things are hard, e.g. reading a file whose encoding
is specified inside it (trying to apply the default encoding might
fail, even if the text before the encoding name is all ASCII, because
of buffering); it's possible but needs care.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


Re: [Haskell-cafe] Writing binary files?

2004-09-12 Thread Glynn Clements

Marcin 'Qrczak' Kowalczyk wrote:

 But the default encoding should
 come from the locale instead of being ISO-8859-1.

The problem with that is that, if the locale's encoding is UTF-8, a
lot of stuff is going to break (i.e. anything in ISO-8859-* which
isn't limited to the 7-bit ASCII subset).

The advantage of assuming ISO-8859-* is that the decoder can't fail;
every possible stream of bytes is valid. This isn't the case for
UTF-8. The advantage of ISO-8859-1 in particular is that it's trivial
to convert the string back into the bytes which were actually read.
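
That round trip is short enough to state in full: every byte decodes,
and re-encoding recovers exactly the bytes that were read.  A sketch,
assuming the argument to encodeLatin1 only contains characters below
U+0100 (i.e. it came from decodeLatin1):

import Data.Char (chr, ord)
import Data.Word (Word8)

decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)

-- encodeLatin1 (decodeLatin1 bs) == bs for every bs :: [Word8].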

The key problem with using the locale is that you frequently encounter
files which aren't in the locale's encoding, and for which the
encoding can't easily be deduced.

If you assume ISO-8859-*, you can at least read them in, manipulate
the contents (in any way that doesn't require interpreting any
non-ASCII characters), and write out the results. OTOH, if you assume
UTF-8 (e.g. because that happens to be the locale's encoding), the
decoder is likely to abort shortly after the first non-ASCII character
it finds (either that, or it will just silently drop characters).

-- 
Glynn Clements [EMAIL PROTECTED]


Re: [Haskell-cafe] Writing binary files?

2004-09-12 Thread Marcin 'Qrczak' Kowalczyk
Glynn Clements [EMAIL PROTECTED] writes:

 But the default encoding should
 come from the locale instead of being ISO-8859-1.

 The problem with that is that, if the locale's encoding is UTF-8, a
 lot of stuff is going to break (i.e. anything in ISO-8859-* which
 isn't limited to the 7-bit ASCII subset).

What about this transition path:

1. API for manipulating byte sequences in I/O (without representing
   them in String type).

2. API for conversion between explicitly specified encodings and byte
   sequences, including attaching converters to Handles. There is also
   a way to obtain the locale encoding.

3. The default encoding is settable from Haskell, defaults to
   ISO-8859-1.

4. Libraries are reviewed to ensure that they work with various
   encoding settings.

5. The default encoding is settable from Haskell, defaults to the
   locale encoding.

Points 1-3 don't change the behavior of existing programs, but they
allow to start writing libraries and programs which manipulate
something other than texts in the default encoding and will work
in future.

After relevant libraries work with the default encoding changed,
programs which use them may begin their main function with setting
the default encoding to the locale encoding.

Finally, when we consider libraries and programs which break in this
setting obsolete, the default is changed.

 The advantage of assuming ISO-8859-* is that the decoder can't fail;
 every possible stream of bytes is valid.

Assuming. But UTF-8 is not ISO-8859-*. When someday I change most of
my files and filenames from ISO-8859-2 to UTF-8, and change the
locale, the assumption will be wrong. I can't change that now, because
too many programs would break.

The current ISO-8859-1 assumption is also wrong. A program written in
Haskell which sorts strings would break for non-ASCII letters even now
that they are ISO-8859-2 unless specified otherwise.

 The key problem with using the locale is that you frequently encounter
 files which aren't in the locale's encoding, and for which the
 encoding can't easily be deduced.

Programs should either explicitly set the encoding for I/O on these
files to ISO-8859-1, or manipulate them as binary data.

The problem is that the API for that is not yet designed, so programs
can't be written such that they will work after the default encoding
change.

 OTOH, if you assume UTF-8 (e.g. because that happens to be the
 locale's encoding), the decoder is likely to abort shortly after the
 first non-ASCII character it finds (either that, or it will just
 silently drop characters).

Detectable errors should not be automatically silenced, so it would
fail. So the change to the default encoding must be done some time
after it's possible to write programs which would not fail.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


Re: [Haskell-cafe] Writing binary files?

2004-09-12 Thread Glynn Clements

Marcin 'Qrczak' Kowalczyk wrote:

  But the default encoding should
  come from the locale instead of being ISO-8859-1.
 
  The problem with that is that, if the locale's encoding is UTF-8, a
  lot of stuff is going to break (i.e. anything in ISO-8859-* which
  isn't limited to the 7-bit ASCII subset).
 
 What about this transition path:
 
 1. API for manipulating byte sequences in I/O (without representing
them in String type).

Note that this needs to include all of the core I/O functions, not
just reading/writing streams. E.g. FilePath is currently an alias for
String, but (on Unix, at least) filenames are strings of bytes, not
characters. Ditto for argv, environment variables, possibly other
cases which I've overlooked.

 2. API for conversion between explicitly specified encodings and byte
sequences, including attaching converters to Handles. There is also
a way to obtain the locale encoding.
 
 3. The default encoding is settable from Haskell, defaults to
ISO-8859-1.

Agreed.

 4. Libraries are reviewed to ensure that they work with various
encoding settings.

There are limits to the extent to which this can be achieved. E.g. 
what happens if you set the encoding to UTF-8, then call
getDirectoryContents for a directory which contains filenames which
aren't valid UTF-8 strings?

 5. The default encoding is settable from Haskell, defaults to the
locale encoding.

I feel that the default encoding should be one whose decoder cannot
fail, e.g. ISO-8859-1. You should have to explicitly request the use
of the locale's encoding (analogous to calling setlocale(LC_CTYPE, )
at the start of a C program; there's a good reason why C doesn't do
this without being explicitly told to).

Actually, the more I think about it, the more I think that simple,
stupid programs probably shouldn't be using Unicode at all.

I.e. Char, String, string literals, and the I/O functions in Prelude,
IO etc should all be using bytes, with a distinct wide-character API
available for people who want to make the (substantial) effort
involved in writing (genuinely) internationalised programs.

Right now, the attempt at providing I18N for free, by defining Char
to mean Unicode, has essentially backfired, IMHO. Anything that isn't
ISO-8859-1 just doesn't work for the most part, and anyone who wants
to provide real I18N first has to work around the pseudo-I18N that's
already there (e.g. convert Chars back into Word8s so that they can
decode them into real Chars).

Oh, and because bytes are being stored in Chars, the type system won't
help if you neglect to decode a string, or if you decode it twice.

  The advantage of assuming ISO-8859-* is that the decoder can't fail;
  every possible stream of bytes is valid.
 
 Assuming. But UTF-8 is not ISO-8859-*. When someday I change most of
 my files and filenames from ISO-8859-2 to UTF-8, and change the
 locale, the assumption will be wrong. I can't change that now, because
 too many programs would break.
 
 The current ISO-8859-1 assumption is also wrong. A program written in
 Haskell which sorts strings would break for non-ASCII letters even now
 that they are ISO-8859-2 unless specified otherwise.

1. In that situation, you can't avoid the encoding issues. It doesn't
matter what the default is, because you're going to have to set the
encoding anyhow.

2. If you assume ISO-8859-1, you can always convert back to Word8 then
re-decode as UTF-8. If you assume UTF-8, anything which is neither
UTF-8 nor ASCII will fail far more severely than just getting the
collation order wrong.

  The key problem with using the locale is that you frequently encounter
  files which aren't in the locale's encoding, and for which the
  encoding can't easily be deduced.
 
 Programs should either explicitly set the encoding for I/O on these
 files to ISO-8859-1, or manipulate them as binary data.

Well, my view is essentially that files should be treated as
containing bytes unless you explicitly choose to decode them, at which
point you have to specify the encoding.

 The problem is that the API for that is not yet designed, so programs
 can't be written such that they will work after the default encoding
 change.

Personally, I would take the C approach: redefine Char to mean a byte
(i.e. CChar), treat string literals as bytes, keep the existing type
signatures on all of the existing Haskell98 functions, and provide a
completely new wide-character API for those who wish to use it.

That gets the failed attempt at I18N out of everyone's way with a
minimum of effort and with maximum backwards compatibility for
existing code.

Given the frequency with which this issue crops up, and the associated
lack of action to date, I'd rather not have to wait until someone
finally gets around to designing the new, improved,
genuinely-I18N-ised API before we can read/write arbitrary files
without too much effort.

My main concern is that someone will get sick of waiting and make the
wrong fix, i.e. keep the 

[Haskell-cafe] Writing binary files?

2004-09-11 Thread Ron de Bruijn
Hi,

I would like to write and read binary files in
Haskell. I saw the System.IO module, but you need a
(Ptr a) value for using that, and I don't need
positions. I only want to read a complete binary file
and write another binary file. 

In 2001 somebody else came up with the same subject,
but then there wasn't a real solution. Now, 3 years
later, I can imagine there's *something*. 

What's that *something*?

Regards, 
  Ron






Re: [Haskell-cafe] Writing binary files?

2004-09-11 Thread Hal Daume III
There's a Binary module that comes with GHC that you can get somewhere (I 
believe Simon M wrote it).  I have hacked it up a bit and added support 
for bit-based writing, to bring it more in line with the NHC module.  
Mine, with various information, etc., is available at:

  http://www.isi.edu/~hdaume/haskell/NewBinary/

On Sat, 11 Sep 2004, Ron de Bruijn wrote:

 Hi,
 
 I would like to write and read binary files in
 Haskell. I saw the System.IO module, but you need a
 (Ptr a) value for using that, and I don't need
 positions. I only want to read a complete binary file
 and write another binary file. 
 
 In 2001 somebody else came up with the same subject,
 but then there wasn't a real solution. Now, 3 years
 later, I can imagine there's *something*. 
 
 What's that *something*?
 
 Regards, 
   Ron
 
 
 
   
 

-- 
 Hal Daume III   | [EMAIL PROTECTED]
 Arrest this man, he talks in maths.   | www.isi.edu/~hdaume



Re: [Haskell-cafe] Writing binary files?

2004-09-11 Thread Glynn Clements

Ron de Bruijn wrote:

 I would like to write and read binary files in
 Haskell. I saw the System.IO module, but you need a
 (Ptr a) value for using that, and I don't need
 positions. I only want to read a complete binary file
 and write another binary file. 

You just need to open the files with System.IO.openBinaryFile instead
of openFile (files opened with the latter will have automatic LF/CRLF
translation on Windows).

Converting between Chars and octets (i.e. Word8) is just a matter of:

import Char (ord, chr)
import Word (Word8)

charToOctet :: Char -> Word8
charToOctet = fromIntegral . ord

octetToChar :: Word8 -> Char
octetToChar = chr . fromIntegral

-- 
Glynn Clements [EMAIL PROTECTED]


Re: [Haskell-cafe] Writing binary files?

2004-09-11 Thread Sven Panne
Hal Daume III wrote:
There's a Binary module that comes with GHC that you can get somewhere (I 
believe Simon M wrote it).  I have hacked it up a bit and added support 
for bit-based writing, to bring it more in line with the NHC module.  
Mine, with various information, etc., is available at:

  http://www.isi.edu/~hdaume/haskell/NewBinary/
Hmmm, I'm not sure if that is what Ron asked for. What I guess is needed is
support for things like:
   read the next 4 bytes as a little-endian unsigned integer
   read the next 8 bytes as a big-endian IEEE 754 double
   write the Int16 as a little-endian signed integer
   write the (StorableArray Int Int32) as big-endian signed integers
   ...
plus perhaps some String I/O with a few encodings. Alas, we do *not* have
something in our standard libs, although there were a few discussions about
it. I know that one can debate ages about byte orders, external representation
of arrays and floats, etc. Nevertheless, I guess covering only low-/big-endian
systems, IEEE 754 floats, and arrays as a simple 0-based sequence of its
elements (with an explicit length stated somehow) would make at least 90% of all
users happy and would be sufficient for most real world file formats. Currently
one is bound to hGetBuf/hPutBuf, which is not really a comfortable way of doing
binary I/O.
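
As an illustration, the first item on that wish list can be written
on top of hGetBuf in a few lines.  This is a sketch, not a library
proposal; note that the byte order is fixed by the arithmetic rather
than by the host machine:

import System.IO (Handle, hGetBuf)
import Control.Monad (when)
import Data.Bits (shiftL, (.|.))
import Data.Word (Word8, Word32)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Storable (peekByteOff)

-- Read the next 4 bytes as a little-endian unsigned integer.
hGetWord32le :: Handle -> IO Word32
hGetWord32le h =
    allocaBytes 4 $ \p -> do
        n <- hGetBuf h p 4
        when (n < 4) $ ioError (userError "hGetWord32le: short read")
        bs <- mapM (peekByteOff p) [0 .. 3] :: IO [Word8]
        return (foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0 bs)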
Cheers,
   S.


Re: [Haskell-cafe] Writing binary files?

2004-09-11 Thread Ron de Bruijn
--- Sven Panne [EMAIL PROTECTED] wrote:

 Hal Daume III wrote:
  There's a Binary module that comes with GHC that you can get
  somewhere (I believe Simon M wrote it).  I have hacked it up a bit
  and added support for bit-based writing, to bring it more in line
  with the NHC module.  Mine, with various information, etc., is
  available at:
 
 http://www.isi.edu/~hdaume/haskell/NewBinary/
 
 Hmmm, I'm not sure if that is what Ron asked for. What I guess is
 needed is support for things like:
 
 read the next 4 bytes as a little-endian unsigned integer
 read the next 8 bytes as a big-endian IEEE 754 double
 write the Int16 as a little-endian signed integer
 write the (StorableArray Int Int32) as big-endian signed integers
 ...
 
 plus perhaps some String I/O with a few encodings. Alas, we do *not*
 have something in our standard libs, although there were a few
 discussions about it. I know that one can debate ages about byte
 orders, external representation of arrays and floats, etc.
 Nevertheless, I guess covering only little-/big-endian systems,
 IEEE 754 floats, and arrays as a simple 0-based sequence of its
 elements (with an explicit length stated somehow) would make at
 least 90% of all users happy and would be sufficient for most real
 world file formats. Currently one is bound to hGetBuf/hPutBuf,
 which is not really a comfortable way of doing binary I/O.
 
 Cheers,
 S.
 
Basically, I just want to have a function that
converts a list of zeros and ones to a binary file,
and the other way around.

If I write a string of eight '0' and '1' characters to a file,
it would take 8 bytes. But I want it to take 8 bits.





Re: [Haskell-cafe] Writing binary files?

2004-09-11 Thread Glynn Clements

Ron de Bruijn wrote:

 Basically, I just want to have a function that
 converts a list of zeros and ones to a binary file,
 and the other way around.
 
 If I write a string of eight '0' and '1' characters to a file,
 it would take 8 bytes. But I want it to take 8 bits.

import Char (digitToInt, chr)
import Word (Word8)
import System.IO (openBinaryFile)
import IO (IOMode(..), hPutStr, hClose)

bitsToOctet :: [Char] -> Word8
bitsToOctet ds = fromIntegral $ sum $ zipWith (*) powers digits
    where powers = [2^n | n <- [7,6..0]]
          digits = map digitToInt ds

octetToChar :: Word8 -> Char
octetToChar = chr . fromIntegral

bits :: [[Char]]
bits =  [ "01101000" -- 0x68 'h'
        , "01100101" -- 0x65 'e'
        , "01101100" -- 0x6c 'l'
        , "01101100" -- 0x6c 'l'
        , "01101111" -- 0x6f 'o'
        , "00001010" -- 0x0a '\n'
        ]

main :: IO ()
main = do
    h <- openBinaryFile "out.dat" WriteMode
    hPutStr h $ map (octetToChar . bitsToOctet) bits
    hClose h
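
Going the other way - reading the file back into bit strings - is a
matter of expanding each octet again.  A sketch, under the same
assumption that a Char read from a binary handle carries one byte in
its low 8 bits (the handle is left open until the lazy contents have
been consumed):

import Data.Bits (testBit)
import Data.Char (ord)
import Data.Word (Word8)
import System.IO (IOMode(..), hGetContents, openBinaryFile)

octetToBits :: Word8 -> [Char]
octetToBits w = [ if testBit w n then '1' else '0' | n <- [7,6..0] ]

charToOctet :: Char -> Word8
charToOctet = fromIntegral . ord

readBits :: FilePath -> IO [[Char]]
readBits path = do
    h <- openBinaryFile path ReadMode
    s <- hGetContents h
    return (map (octetToBits . charToOctet) s)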

-- 
Glynn Clements [EMAIL PROTECTED]