Re: [GHC] #3307: System.IO and System.Directory functions not Unicode-aware under Unix

GHC Fri, 25 Mar 2011 09:20:42 -0700

#3307: System.IO and System.Directory functions not Unicode-aware under Unix
----------------------------------+-----------------------------------------
    Reporter:  YitzGale           |        Owner:              
        Type:  bug                |       Status:  new         
    Priority:  normal             |    Milestone:  7.2.1       
   Component:  libraries/base     |      Version:  6.11        
    Keywords:  directory unicode  |     Testcase:              
   Blockedby:                     |   Difficulty:  Unknown     
          Os:  Unknown/Multiple   |     Blocking:              
Architecture:  Unknown/Multiple   |      Failure:  None/Unknown
----------------------------------+-----------------------------------------
Changes (by batterseapower):


  * failure:  => None/Unknown


Comment:

 I have been investigating this issue and would like to add some
 observations.

   * Python 2 does what Haskell does at the moment: it reads command line
 arguments in as byte strings, and exposes them to the programmer as byte
 strings. This is consistent with the fact that Python 2 strings aren't
 "really" a text type, and unicode strings are a separate type.

   * Python 3 changed the behaviour to match its string type being a "real"
 text type. Now, command line arguments are decoded into Unicode strings
 using the encoding of the current locale. See the relevant issue at
 http://bugs.python.org/issue2128

   * Passing command line arguments in any encoding other than that of the
 current locale is weird and fragile. Here is some weirdness I discovered.

 First, we create a file with a Big5-encoded name. Set your terminal to
 decode using Big5 and then:

 {{{
 LC_ALL=zh_TW.big5 bash
 # Be careful here that your IME actually outputs Big5 bytes into a Big5
 # terminal. It did for me on Ubuntu but not on OS X.
 touch zw你好
 }}}

 As expected, this name will work nicely if we ls. This reflects the fact
 that Unix stores the file with exactly the Big5 encoded name that we gave
 it, so when we ls it decodes perfectly in the Big5 terminal.
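
 As a rough illustration of what "exactly the Big5 encoded name" means in
 bytes, the sketch below encodes the same name in Big5 and in UTF-8 using
 Python's standard codecs. The helper name `name_bytes` is made up for this
 example and is not part of the original experiment.

```python
# Hypothetical helper: show the bytes a file name occupies under a given
# encoding. Unix stores whatever bytes it is handed, verbatim.
NAME = "zw你好"


def name_bytes(name: str, encoding: str) -> bytes:
    """Encode a file name the way a terminal using that encoding would."""
    return name.encode(encoding)


big5 = name_bytes(NAME, "big5")
utf8 = name_bytes(NAME, "utf-8")

print(len(big5), list(big5))  # 2 ASCII bytes + 2 bytes per Chinese character
print(len(utf8), list(utf8))  # 2 ASCII bytes + 3 bytes per Chinese character
```

 In Big5 the second byte of each of these two characters happens to fall in
 the ASCII range (`A` and `n`), so only two of the six bytes are >= 128.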

 Now open another terminal set for UTF-8. Assuming your default locale is
 UTF-8 as well, we can try some fun experiments. First, I wrote a program
 called encoding.c that lets me observe the bytes of its first command line
 argument. Compile this file to ./bytes:

 {{{
 #include <stdio.h>

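 /* Print each byte of the first command line argument, then the byte count. */
 int main(int argc, char **argv)
 {
     if (argc < 2) {
         printf("Not enough arguments\n");
         return 1;
     }

     int len = 0;
     for (char *c = argv[1]; *c; c++, len++) {
         /* Cast through unsigned char so bytes >= 128 print as 128..255
            rather than as negative numbers where char is signed. */
         printf("%d ", (int)(unsigned char)(*c));
     }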

     printf("\nLength: %d\n", len);

     return 0;
 }
 }}}

 Now for the fun:

   1. ls. You should see some gibberish for the "zw" file because the Big5
 doesn't get decoded cleanly as UTF-8 by your terminal. I saw the literal
 string "zw?A?n" printed.

   2. Type "./bytes zw" and then press tab, then enter. You will get 6
 bytes printed: 2 for "zw" plus 4 for 你好, which is 4 bytes long in Big5.

   3. Type "./bytes zw?A?n". Use literal question marks. This is where it
 gets really weird. The output is *exactly the same as before*. Bash has
 somehow detected that I "meant" to refer to the file in the current
 working directory and decided to substitute my 6 bytes of ASCII text (all
 characters <128) with the Big5 from before (which contains some characters
 >= 128). I have no idea what happens if the choice of filename is
 ambiguous. If you rm the file this stops happening, obviously.

   4. Type "./bytes foo=zw" and then press tab and enter. You get 10 bytes:
 4 bytes for the Chinese and 6 bytes for the ASCII ("foo=zw").

   5. Type "./bytes foo=zw?A?n", with literal question marks. It shows *10
 bytes of ASCII*. So Bash's weird encoding-fixing heuristic fails if
 command line arguments are more complex than just a file name by itself.
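
 One possible reading of steps 3 and 5, offered as an assumption rather
 than something the experiment above proves, is ordinary pathname
 expansion: an unquoted `?` is the shell's single-character wildcard, and a
 word that matches no file is passed through unchanged. The sketch below
 models that with Python's `fnmatch` on raw bytes. `expand_word` is a
 hypothetical helper, not Bash's actual implementation, and the byte-wise
 matching is a simplification.

```python
from fnmatch import fnmatchcase
from typing import List

# Bytes of the Big5-encoded name "zw你好" as stored in the directory.
BIG5_NAME = "zw你好".encode("big5")


def expand_word(word: bytes, directory: List[bytes]) -> List[bytes]:
    """Model unquoted-word expansion: matching names, else the word itself."""
    matches = sorted(name for name in directory if fnmatchcase(name, word))
    return matches if matches else [word]


cwd = [BIG5_NAME]

# Step 3: the pattern matches the file, so the program sees the Big5 bytes.
print(expand_word(b"zw?A?n", cwd))
# Step 5: no file name starts with "foo=", so the ASCII word is left alone.
print(expand_word(b"foo=zw?A?n", cwd))
# With the file removed, step 3 also falls back to the literal word.
print(expand_word(b"zw?A?n", []))
```

 If this reading is right, quoting the argument (for example
 ./bytes 'zw?A?n') should suppress the substitution, and an ambiguous
 pattern would expand to every matching name as separate arguments.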

 In my opinion this is absolutely bonkers behaviour :-).

 IMHO C programs should be able to assume all of their command line
 arguments are in the same encoding - that of the current locale. But with
 this bash behaviour, some arguments will be in the locale encoding and
 some of them will be in another encoding (this happens when tab-completing
 a filename in a non-locale encoding, or when Bash's heuristics
 automatically rewrite something the user typed into a filename). The user
 can't even necessarily predict in advance which ones will be which, because
 Bash's heuristic depends on at least the contents of the CWD!
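
 To make the mixed-encoding problem concrete, here is a minimal sketch that
 tries to decode every argument with a single locale encoding, assumed here
 to be UTF-8. The argument values and the `decode_args` helper are
 illustrative assumptions; they are not taken from Bash or from any library.

```python
from typing import List, Optional


def decode_args(raw_args: List[bytes], encoding: str = "utf-8") -> List[Optional[str]]:
    """Decode each argument strictly; None marks one that is not valid."""
    decoded: List[Optional[str]] = []
    for raw in raw_args:
        try:
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError:
            decoded.append(None)
    return decoded


args = [
    "héllo".encode("utf-8"),  # typed directly in a UTF-8 terminal
    "zw你好".encode("big5"),   # tab-completed from a Big5-named file
]

# The first argument decodes; the second is not valid UTF-8.
print(decode_args(args))
# Lossy fallback: each undecodable byte becomes U+FFFD.
print(args[1].decode("utf-8", errors="replace"))
```

 The program receiving these arguments has no in-band way to tell which
 encoding each one was written in; it can only notice that strict decoding
 fails.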

 I would like to argue that we should follow the Python 3 behaviour, and
 not support file names passed to the command line in any encoding other
 than the current locale. The reasons are:

   1. Support for this scenario is sort-of-but-not-quite there in other
 tools, including wildly-popular ones such as bash. So if it doesn't really
 work at the moment, we aren't causing much trouble by having Haskell not
 support it.

   2. The very popular language Python 3 has exactly the behaviour I
 propose and (apparently) no one has complained yet.

   3. Most importantly, supporting this scenario would mean that we can't
 do natural things like using the current locale to decode command line
 arguments. That would penalise users of modern systems (i.e. those with
 UTF-8 everywhere) who expect international text to work seamlessly, for
 the sake of supporting a very small group of legacy users (those who use
 non-UTF-8 encodings on non-Windows, non-OS X systems).

-- 
Ticket URL: <http://hackage.haskell.org/trac/ghc/ticket/3307#comment:8>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler

