Re: Output character encoding for ghc on OpenBSD

2010-04-19 Thread Simon Marlow

On 18/04/2010 19:22, Matthias Kilian wrote:

> Hi,
>
> On Sun, Apr 18, 2010 at 10:53:22AM -0700, Judah Jacobson wrote:
>>> Anyway, the short story is that I have to either hard-code the
>>> character set to something like utf-8, or ghc will start to behave
>>> really strangely (for example, ghci would terminate immediately if
>>> you just *type* a non-ASCII character).
>>
>> That sounds like it might be something to do with the haskeline
>> package, which ghci uses for user interaction.  Haskeline makes its
>> own FFI calls to translate raw input bytes into Unicode Chars.
>
> Oh, this may indeed be a second problem. However, the encoding
> problem itself also manifests in the `openTempFile001' test of the
> testsuite.  For example, with an unpatched ghc-6.12, the test fails
> with the following output:
>
> =====> openTempFile001(normal) 1048 of 2375 [0, 38, 0]
> cd ./lib/IO && '/usr/obj/ports/ghc-6.12.2/ghc-6.12.2/inplace/bin/ghc-stage2'
> -fforce-recomp -dcore-lint -dcmm-lint -no-user-package-conf -dno-debug-output
> -o openTempFile001 openTempFile001.hs >openTempFile001.comp.stderr 2>&1
> cd ./lib/IO && ./openTempFile001 </dev/null >openTempFile001.run.stdout
> 2>openTempFile001.run.stderr
> Wrong exit code (expected 0 , actual 1 )
> Stdout:
>
> Stderr:
> openTempFile001: ./test22236.txt: hClose: invalid argument (Illegal byte
> sequence)
>
> *** unexpected failure for openTempFile001(normal)


A few of the tests in the test suite assume a UTF-8 locale, so you're 
probably falling foul of that.  We could fix the tests - but we do want 
to test that the locale encoding is being respected in some way, so just 
adding hSetEncoding to those tests would be wrong.


Or you could just make those tests an expected failure on OpenBSD for 
the time being.


For the IO library, I expect you should default the encoding to Latin-1 
on OpenBSD.
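
To illustrate why Latin-1 is a forgiving default, here is a minimal sketch in C. Latin-1 byte values coincide with the code points U+0000 to U+00FF, so decoding can never hit an illegal byte sequence; only encoding can fail, and only for characters above U+00FF. The function names are hypothetical and are not taken from GHC's base library.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode ISO-8859-1 bytes into Unicode code points.
 * Every byte value is a valid Latin-1 character, so this direction
 * cannot fail. Returns the number of code points written (always n). */
size_t latin1_decode(const unsigned char *in, size_t n, uint32_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i];
    return n;
}

/* Encode Unicode code points as ISO-8859-1.
 * Returns the number of bytes written, or (size_t)-1 at the first
 * code point above U+00FF, which Latin-1 cannot represent. */
size_t latin1_encode(const uint32_t *in, size_t n, unsigned char *out)
{
    for (size_t i = 0; i < n; i++) {
        if (in[i] > 0xFF)
            return (size_t)-1;
        out[i] = (unsigned char)in[i];
    }
    return n;
}
```

With such a default, reading arbitrary bytes always succeeds, while writing a character such as U+20AC (the euro sign) would still be reported as an error.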


Cheers,
Simon
___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: Output character encoding for ghc on OpenBSD

2010-04-19 Thread Matthias Kilian
On Mon, Apr 19, 2010 at 02:57:00PM +0100, Simon Marlow wrote:
> A few of the tests in the test suite assume a UTF-8 locale, so you're
> probably falling foul of that.  We could fix the tests - but we do want
> to test that the locale encoding is being respected in some way, so just
> adding hSetEncoding to those tests would be wrong.

Nah, don't touch the tests because of this.

> For the IO library, I expect you should default the encoding to Latin-1
> on OpenBSD.

I have a (rather horrible) patch that tries to make sense of
LC_ALL or LC_CTYPE if set. And if neither is set, I'm currently
defaulting to 646//TRANSLIT (which is ASCII with translation of
some non-ASCII characters to ASCII art, like `(c)' for \xa9). But
Latin-1 may be a more usable default. Thanks for the suggestion.

(No, I'm not going to send this patch to cvs-ghc, it's really too
horrid).
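
The fallback logic described above might look roughly like the following C sketch. It is not the patch itself: the helper name, the buffer handling, and the choice to take the fallback as a parameter are all assumptions made for illustration. It relies only on the usual locale-name shape language[_territory][.codeset][@modifier].

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: pick a codeset name from LC_ALL or LC_CTYPE.
 * The part of the locale name between '.' and an optional '@' is
 * copied into buf. Returns `fallback` when neither variable is set
 * or the value carries no codeset (e.g. "C", "POSIX", "de_DE"). */
const char *codeset_from_env(char *buf, size_t bufsz, const char *fallback)
{
    const char *loc = getenv("LC_ALL");
    if (loc == NULL || *loc == '\0')
        loc = getenv("LC_CTYPE");
    if (loc == NULL || *loc == '\0')
        return fallback;

    const char *dot = strchr(loc, '.');
    if (dot == NULL)
        return fallback;

    size_t len = strcspn(dot + 1, "@");   /* stop at an @modifier */
    if (len == 0 || len >= bufsz)
        return fallback;
    memcpy(buf, dot + 1, len);
    buf[len] = '\0';
    return buf;
}
```

The extracted name is passed through unchanged, so a spelling that iconv_open does not recognise would still need to be normalised or rejected by the caller.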

Ciao,
Kili


Output character encoding for ghc on OpenBSD

2010-04-18 Thread Matthias Kilian
Hi,

as some of you may know, I'm working on an update of OpenBSD's ghc
port to 6.12.2, currently chasing down the last remaining testsuite
failures. Yesterday, I ran into a problem which I have a fix for,
but only a really ugly fix, and I need some opinions on what users
would prefer.

The problem is that Haskell uses Unicode characters internally (ghc
itself uses UTF-32 internally, where the endianness depends on the
architecture it's running on), and that any Haskell program (including
ghc and ghci) has to convert between the internal representation
and the actual locale settings of the system it's running on.
Unfortunately, OpenBSD is really bad when it comes to locale support;
the only supported locales are the C and the POSIX locales, so even
if you set LC_ALL or LC_CTYPE to something like, for example,
de_DE.iso88591, this would have no effect on OpenBSD.

Anyway, the short story is that I have to either hard-code the
character set to something like utf-8, or ghc will start to behave
really strangely (for example, ghci would terminate immediately if
you just *type* a non-ASCII character).

So what would you prefer?

- Use utf-8 and only utf-8 (i.e. hardcoded)?

- Use something like iso-8859-15 (hardcoded)?

- Make it configurable via some non-standard environment variable
  (GHC_CODESET, for example). If so, what should be the default if
  the environment variable isn't set? Back to 7 bit (ASCII)? utf-8?
  Some of the latin variants?
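
As a rough sketch of the third option, written in C for brevity: GHC_CODESET is only the variable name floated above, not an existing GHC feature, and the function name is hypothetical. The default is left to the caller, since that is exactly the open question.

```c
#include <stdlib.h>

/* Hypothetical lookup for the proposed GHC_CODESET variable.
 * Returns its value, or `fallback` when it is unset or empty. */
const char *ghc_codeset(const char *fallback)
{
    const char *cs = getenv("GHC_CODESET");
    return (cs != NULL && *cs != '\0') ? cs : fallback;
}
```

A caller would then pass the result to iconv_open, for example ghc_codeset("UTF-8") or ghc_codeset("ISO-8859-15"), depending on which default is preferred.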

Your suggestions are appreciated.

Thanks in advance.

Ciao,
Kili


Re: Output character encoding for ghc on OpenBSD

2010-04-18 Thread Judah Jacobson
On Sun, Apr 18, 2010 at 7:01 AM, Matthias Kilian <k...@outback.escape.de> wrote:
> Hi,
>
> as some of you may know, I'm working on an update of OpenBSD's ghc
> port to 6.12.2, currently chasing down the last remaining testsuite
> failures. Yesterday, I ran into a problem which I have a fix for,
> but only a really ugly fix, and I need some opinions on what users
> would prefer.
>
> The problem is that Haskell uses Unicode characters internally (ghc
> itself uses UTF-32 internally, where the endianness depends on the
> architecture it's running on), and that any Haskell program (including
> ghc and ghci) has to convert between the internal representation
> and the actual locale settings of the system it's running on.
> Unfortunately, OpenBSD is really bad when it comes to locale support;
> the only supported locales are the C and the POSIX locales, so even
> if you set LC_ALL or LC_CTYPE to something like, for example,
> de_DE.iso88591, this would have no effect on OpenBSD.
>
> Anyway, the short story is that I have to either hard-code the
> character set to something like utf-8, or ghc will start to behave
> really strangely (for example, ghci would terminate immediately if
> you just *type* a non-ASCII character).

That sounds like it might be something to do with the haskeline
package, which ghci uses for user interaction.  Haskeline makes its
own FFI calls to translate raw input bytes into Unicode Chars.  Can
you elaborate further on what exactly the issue is with OpenBSD's
locale support?  In particular, there are several components used by
Haskeline:
 - call setlocale(LC_CTYPE)
 - call nl_langinfo(CODESET)
 - pass the resulting string (which should be, e.g., $LANG) to iconv_open
 - call iconv on user input (which may be malformed)

Is the problem that setting $LC_ALL or $LANG has no effect on the
string returned by nl_langinfo, so the translation fails?  If so,
haskeline is supposed to output ?s in that case, so there might be a
bug in the package.
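
The four steps listed above can be sketched in C as follows. This is an illustration under stated assumptions, not Haskeline's actual code: the function name is hypothetical, the target is native-endian UTF-32, and the second argument of iconv is taken to be `char **` as on glibc (some iconv implementations declare it `const char **`).

```c
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <stdint.h>

/* Name of the UTF-32 variant matching this machine's byte order. */
static const char *utf32_native(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe ? "UTF-32LE" : "UTF-32BE";
}

/* Decode `inlen` bytes of locale-encoded input into code points.
 * Returns the number of code points stored in `out` (capacity
 * `outcap`), or (size_t)-1 if any step fails. */
size_t decode_locale_input(const char *in, size_t inlen,
                           uint32_t *out, size_t outcap)
{
    /* 1. Adopt the locale named by the environment. On failure the
     *    call returns NULL and the "C" locale stays active. */
    setlocale(LC_CTYPE, "");

    /* 2. Ask for the codeset of the active locale. */
    const char *codeset = nl_langinfo(CODESET);

    /* 3. Open a converter from that codeset to UTF-32. */
    iconv_t cd = iconv_open(utf32_native(), codeset);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* 4. Convert; malformed input makes iconv fail (EILSEQ or EINVAL). */
    char *inp = (char *)in;
    char *outp = (char *)out;
    size_t inleft = inlen;
    size_t outleft = outcap * sizeof *out;
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;
    return outcap - outleft / sizeof *out;
}
```

If step 2 only ever reports an ASCII codeset, step 4 fails for every non-ASCII input byte regardless of what $LC_ALL or $LANG say, which is the situation the question above is probing for.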

Finally, when you say you have to hard-code the character set, are
you talking about ghc, haskeline, the base library, or somewhere else?

Best,
-Judah


Re: Output character encoding for ghc on OpenBSD

2010-04-18 Thread Matthias Kilian
Hi,

On Sun, Apr 18, 2010 at 10:53:22AM -0700, Judah Jacobson wrote:
>> Anyway, the short story is that I have to either hard-code the
>> character set to something like utf-8, or ghc will start to behave
>> really strangely (for example, ghci would terminate immediately if
>> you just *type* a non-ASCII character).
>
> That sounds like it might be something to do with the haskeline
> package, which ghci uses for user interaction.  Haskeline makes its
> own FFI calls to translate raw input bytes into Unicode Chars.

Oh, this may indeed be a second problem. However, the encoding
problem itself also manifests in the `openTempFile001' test of the
testsuite.  For example, with an unpatched ghc-6.12, the test fails
with the following output:

=====> openTempFile001(normal) 1048 of 2375 [0, 38, 0]
cd ./lib/IO && '/usr/obj/ports/ghc-6.12.2/ghc-6.12.2/inplace/bin/ghc-stage2'
-fforce-recomp -dcore-lint -dcmm-lint -no-user-package-conf -dno-debug-output
-o openTempFile001 openTempFile001.hs >openTempFile001.comp.stderr 2>&1
cd ./lib/IO && ./openTempFile001 </dev/null >openTempFile001.run.stdout
2>openTempFile001.run.stderr
Wrong exit code (expected 0 , actual 1 )
Stdout:

Stderr:
openTempFile001: ./test22236.txt: hClose: invalid argument (Illegal byte 
sequence)

*** unexpected failure for openTempFile001(normal)


> Can
> you elaborate further on what exactly the issue is with OpenBSD's
> locale support?  In particular, there are several components used by
> Haskeline:
>  - call setlocale(LC_CTYPE)

Problem number 1: setlocale(LC_CTYPE) fails (i.e. returns NULL)
for any locale except `C' or `POSIX'. Did I mention that OpenBSD
is really bad with locales? ;-)

>  - call nl_langinfo(CODESET)

Always returns `646' (ASCII). Duh.

>  - pass the resulting string (which should be, e.g., $LANG) to iconv_open

iconv_open appears to need the *codeset* name, not a complete locale
name. Note that OpenBSD uses GNU libiconv-1.13, which AFAIK differs
from the one included in glibc. Even worse, I have to pass something
like "UTF-8", whereas "UTF8" doesn't work.
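
Since the accepted spellings differ between iconv implementations, one way to find out what a given system takes is to probe iconv_open directly. A minimal sketch, with a hypothetical function name:

```c
#include <iconv.h>

/* Returns 1 if this system's iconv can convert from `codeset` to
 * UTF-32LE, 0 if iconv_open rejects the name. */
int codeset_supported(const char *codeset)
{
    iconv_t cd = iconv_open("UTF-32LE", codeset);
    if (cd == (iconv_t)-1)
        return 0;
    iconv_close(cd);
    return 1;
}
```

Calling it with "UTF-8", "UTF8", and a full locale name in turn shows which of them the local library accepts; the results are expected to vary from system to system.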

>  - call iconv on user input (which may be malformed)

I wrote a little C program that does the following (some error
checks omitted here):

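char *inp, *outp;
size_t insz, outsz;
iconv_t ic;
/* U+00A9 (copyright sign) encoded as UTF-32LE */
unsigned char in[] = {0xa9, 0, 0, 0};
char out[512];

inp = (char *)in;
outp = out;
insz = sizeof(in);
outsz = sizeof(out) - 1;
setlocale(LC_CTYPE, "");
/* iconv_open(tocode, fromcode); in GNU libiconv, "" means the
   locale's own encoding */
ic = iconv_open("", "UTF-32LE");
if (iconv(ic, &inp, &insz, &outp, &outsz) == (size_t)-1) {
	... bail out (perror() etc.) ...
}
iconv_close(ic);
*outp = 0;
puts(out);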

And it just doesn't work, regardless of what I set LC_CTYPE to. The
only way to get it to print the copyright symbol is to explicitly
use "UTF-8" (or "ISO-8859-1" or something else that knows about
that symbol) as the first argument to iconv_open().

> Is the problem that setting $LC_ALL or $LANG has no effect on the
> string returned by nl_langinfo, so the translation fails?

Yes, see above.

> If so,
> haskeline is supposed to output ?s in that case, so there might be a
> bug in the package.

It fails (or rather: ghci fails, since I didn't yet do any separate
haskeline tests) with the same error as the test mentioned above,
with the difference that it fails on hPutChar instead of hClose for
obvious reasons.
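
"Illegal byte sequence" is the message BSD systems typically print for EILSEQ, the error iconv reports when a character has no representation in the target codeset. The sketch below reproduces that condition in isolation; it is an illustration with a hypothetical function name, not the code path inside ghc's IO library, and it assumes an iconv whose second argument is `char **`.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Try to encode U+00A9 (copyright sign) in the codeset `tocode`.
 * Returns 0 on a clean conversion, the errno value if iconv_open or
 * iconv fails, and -1 if the converter silently substituted a
 * replacement character (some implementations do this). */
int encode_copyright(const char *tocode)
{
    char in[] = {(char)0xa9, 0, 0, 0};   /* U+00A9 as UTF-32LE */
    char out[16];
    char *inp = in, *outp = out;
    size_t inleft = sizeof in, outleft = sizeof out;

    iconv_t cd = iconv_open(tocode, "UTF-32LE");
    if (cd == (iconv_t)-1)
        return errno;
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    int err = (rc == (size_t)-1) ? errno : (rc > 0 ? -1 : 0);
    iconv_close(cd);
    return err;
}
```

With a target of "UTF-8" the call is expected to return 0, while an ASCII-only target cannot hold the character, so the result is nonzero (EILSEQ on glibc and GNU libiconv).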

> Finally, when you say you have to hard-code the character set, are
> you talking about ghc, haskeline, the base library, or somewhere else?

I'm talking about libraries/base/GHC/IO/Encoding/Iconv.hs

See? There just is no non-hackerish way to fix this (except of
course improving locale support on OpenBSD, but that's beyond my
scope currently).

Ciao,
Kili