#3307: System.IO and System.Directory functions not Unicode-aware under Unix
----------------------------------+-----------------------------------------
Reporter: YitzGale | Owner:
Type: bug | Status: new
Priority: normal | Milestone: 7.2.1
Component: libraries/base | Version: 6.11
Keywords: directory unicode | Testcase:
Blockedby: | Difficulty: Unknown
Os: Unknown/Multiple | Blocking:
Architecture: Unknown/Multiple | Failure: None/Unknown
----------------------------------+-----------------------------------------
Changes (by batterseapower):
* failure: => None/Unknown
Comment:
I have been investigating this issue and would like to add some
observations.
* Python 2 does what Haskell does at the moment: it reads command line
arguments in as byte strings, and exposes them to the programmer as byte
strings. This is consistent with the fact that Python 2 strings aren't
"really" a text type, and unicode strings are a separate type.
* Python 3 changed the behaviour to match its str type being a "real"
string type. Now, command line arguments are decoded into Unicode strings
according to the current locale for internal consumption. See the relevant
issue at http://bugs.python.org/issue2128
* Passing command line arguments in any encoding other than that of the
current locale is weird and fragile. Here is some weirdness I discovered.
First, we create a file with a Big5 encoded name. Set your terminal to
decode using Big5 and then:
{{{
LC_ALL=zh_TW.big5 bash
# Be careful here that your IME actually outputs Big5 bytes into a Big5
# terminal. It did for me on Ubuntu but not on OS X.
touch zw你好
}}}
As expected, this name will work nicely if we ls. This reflects the fact
that Unix stores the file with exactly the Big5 encoded name that we gave
it, so when we ls it decodes perfectly in the Big5 terminal.
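To make the byte-level picture concrete, here is a minimal Python 3 sketch of what ends up on disk for the file name created above. Python is my choice for illustration and the variable names are made up; nothing here is taken from GHC or from the original experiment.

```python
# Sketch: the bytes of the example file name under two encodings.
# Variable names are illustrative only.
name = "zw你好"

big5_bytes = name.encode("big5")    # what a Big5 terminal hands to touch
utf8_bytes = name.encode("utf-8")   # what a UTF-8 terminal would hand over

print(list(big5_bytes), len(big5_bytes))
print(list(utf8_bytes), len(utf8_bytes))

# The Big5 bytes are not valid UTF-8, so a UTF-8 reader cannot decode them.
try:
    big5_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
print("valid UTF-8:", decodes_as_utf8)
```

The Big5 form is two ASCII bytes plus two bytes per Chinese character, while the UTF-8 form needs three bytes per Chinese character, so the same name is a different byte string depending on the terminal that created it.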
Now open another terminal set for UTF-8. Assuming your default locale is
UTF-8 as well, we can try some fun experiments. First, I wrote a program
called encoding.c that lets me observe the bytes of the first command line
argument. Compile this file to ./bytes:
{{{
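#include <stdio.h>

/* Print each byte of the first argument as a decimal number, then the
   total byte count. Needs C99 for the declaration inside the for loop. */
int main(int argc, char **argv)
{
    if (argc < 2) {
        printf("Not enough arguments\n");
        return 1;
    }
    int len = 0;
    for (char *c = argv[1]; *c; c++, len++) {
        /* Cast via unsigned char so bytes >= 128 print as 128..255
           rather than as negative numbers where char is signed. */
        printf("%d ", (int)(unsigned char)(*c));
    }
    printf("\nLength: %d\n", len);
    return 0;
}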
}}}
Now for the fun:
1. Run ls. You should see some gibberish for the "zw" file because the
Big5 bytes don't decode cleanly as UTF-8 in your terminal. I saw the
literal string "zw?A?n" printed.
2. Type "./bytes zw" and then press tab, then enter. You will get 6
bytes printed because 你好 is 4 bytes long in Big5.
3. Type "./bytes zw?A?n". Use literal question marks. This is where it
gets really weird. The output is *exactly the same as before*. Bash has
somehow detected that I "meant" to refer to the file in the current
working directory and decided to substitute my 6 bytes of ASCII text (all
characters <128) with the Big5 from before (which contains some characters
>= 128). I have no idea what happens if the choice of filename is
ambiguous. If you rm the file this stops happening, obviously.
4. Type "./bytes foo=zw" and then press tab and enter. You get 10 bytes:
4 bytes for the Chinese and 6 bytes for the ASCII
5. Type "./bytes foo=zw?A?n", with literal question marks. It shows *10
bytes of ASCII*. So Bash's weird encoding-fixing heuristic fails if
command line arguments are more complex than just a file name by itself.
In my opinion this is absolutely bonkers behaviour :-).
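One possible explanation for steps 3 and 5, offered as a hypothesis and not checked against Bash's source: an unquoted `?` is a shell wildcard that matches any single character, and by default a pattern that matches no file is passed through unchanged. The sketch below uses Python's `fnmatch` module as a stand-in for the shell's matcher (an assumption, since the two are not identical), and `expand` and `cwd` are hypothetical names for illustration.

```python
import fnmatch

# Bytes of "zw你好" in Big5; the two non-ASCII bytes are what ls showed as "?".
filename = "zw你好".encode("big5")

def expand(pattern: bytes, directory: list) -> list:
    """Mimic default shell globbing: return matches, else the pattern itself."""
    matches = [f for f in directory if fnmatch.fnmatchcase(f, pattern)]
    return matches or [pattern]

cwd = [filename]  # hypothetical directory listing holding just the one file

step3 = expand(b"zw?A?n", cwd)       # matches the file, so is replaced by it
step5 = expand(b"foo=zw?A?n", cwd)   # matches nothing, so stays as typed

print(step3, len(step3[0]))
print(step5, len(step5[0]))
```

Under this reading the 6 bytes in step 3 and the 10 ASCII bytes in step 5 both fall out of ordinary pathname expansion, and quoting the argument (for example './bytes "zw?A?n"') should suppress the substitution.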
IMHO C programs should be able to assume all of their command line
arguments are in the same encoding - that of the current locale. But with
this bash behaviour, some arguments will be in the locale encoding and
some of them will be in another encoding (this happens when the user
tab-completes a filename in a non-locale encoding, or when Bash's
heuristics automatically rewrite something the user typed into a
filename). The user can't even necessarily predict in advance which ones
will be which, because Bash's heuristic depends on at least the contents
of the CWD!
I would like to argue that we should follow the Python 3 behaviour, and
not support file names passed to the command line in any encoding other
than the current locale. The reasons are:
1. Support for this scenario is sort-of-but-not-quite there in other
tools, including wildly-popular ones such as bash. So if it doesn't really
work at the moment, we aren't causing much trouble by having Haskell not
support it.
2. The very popular language Python 3 has exactly the behaviour I
propose and (apparently) no one has complained yet.
3. Most importantly, supporting this scenario would mean that we can't do
natural things like use the current locale to decode command line
arguments. That would penalise users of modern systems (i.e. those with
UTF-8 everywhere) who expect international text to work seamlessly, for
the sake of supporting a very small group of legacy users (those who use
non-UTF-8 encodings on non-Windows, non-OS X systems).
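As a rough illustration of the policy argued for above, the following sketch decodes raw argument bytes strictly with a single encoding and reports the first argument that does not fit, instead of guessing. It is not GHC's or Python's actual implementation; `decode_args` and its parameters are hypothetical names, and the caller is assumed to supply the locale's encoding.

```python
def decode_args(raw_args, encoding):
    """Decode every raw argument with one encoding; reject anything else.

    raw_args: list of bytes objects, as the OS hands them to the process.
    encoding: name of the locale's encoding, e.g. "utf-8" (caller-supplied).
    """
    decoded = []
    for index, raw in enumerate(raw_args):
        try:
            decoded.append(raw.decode(encoding))  # strict: no guessing
        except UnicodeDecodeError as err:
            raise ValueError(
                f"argument {index} is not valid {encoding}: {raw!r}"
            ) from err
    return decoded

# A UTF-8 argument decodes cleanly when the locale encoding is UTF-8.
ok = decode_args(["foo=你好".encode("utf-8")], "utf-8")
print(len(ok), "argument(s) decoded")

# The Big5-named file from the experiment is rejected under the same locale.
try:
    decode_args(["zw你好".encode("big5")], "utf-8")
except ValueError as err:
    print("rejected:", err)
```

The design choice here is to fail loudly on the first undecodable argument, which keeps every decoded string in one known encoding at the cost of refusing file names created under a different locale.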
--
Ticket URL: <http://hackage.haskell.org/trac/ghc/ticket/3307#comment:8>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler