Re: behaviour change in getDirectoryContents in GHC 7.2?

Simon Marlow Wed, 09 Nov 2011 03:04:06 -0800

On 09/11/2011 10:39, Max Bolingbroke wrote:

On 8 November 2011 11:43, Simon Marlow<marlo...@gmail.com>  wrote:

Don't you mean 1 is what we have?


Yes, sorry!

Failing to roundtrip in some cases, and doing so silently, seems highly
suboptimal to me.  I'm sorry I didn't pick up on this at the time (Unicode
is a swamp :).


I *can* change the implementation back to using lone surrogates. This
gives us guaranteed roundtripping but it means that the user might see
lone-surrogate Char values in Strings from the filesystem/command
line. IIRC this does break some software -- e.g. Brian's "text"
library explicitly checks for such characters and fails if it detects
them.

So whatever happens we are going to end up making some group of users unhappy!
   * No PEP383: Haskellers using non-ASCII get upset when their command
line argument [String]s aren't in fact sequences of characters, but
sequences of bytes in some arbitrary encoding
   * PEP383(surrogates): Unicoders get upset by lone surrogates (which
can actually occur at the moment, independent of PEP383 -- e.g. as
character literals or from FFI)
   * PEP383(private chars): Unixers get upset that we can't roundtrip
byte sequences that look like the codepoint 0xEFXX encoded in the
current locale. In practice, 0xEFXX is only decodable from a UTF
encoding, so we fail to roundtrip byte sequences like the one Ian
posted.

I'm happy to implement any behaviour, I would just like to know that
whatever it is is accepted as the correct tradeoff :-)

I would be happy with the surrogate approach, I think. Arguably, if you
try to treat a string with lone surrogates as Unicode and it fails, then
that is a feature: the original string wasn't Unicode. All you can do
with an invalid Unicode string is use it as a FilePath again, and the
right thing will happen.
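
For illustration only, here is how the surrogate approach behaves in Python, whose "surrogateescape" error handler implements PEP 383. This is a sketch of the idea, not of GHC's implementation; the file name and the helper is_valid_unicode are made up for the example.

```python
# Bytes as they might come back from the filesystem: Latin-1, not valid UTF-8.
raw = b"caf\xe9.txt"

# PEP 383 decoding: the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")

def is_valid_unicode(s):
    """Hypothetical helper: True if s can be encoded as strict UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

# Treating the string as real Unicode fails, which is the "feature":
# the original bytes were never Unicode in the first place.
print(is_valid_unicode(name))

# Handing the string back to the filesystem layer restores the original bytes.
roundtripped = name.encode("utf-8", errors="surrogateescape")
print(roundtripped == raw)
```

A strict consumer rejects the string, while the encode step used for file paths still recovers exactly the bytes that were read.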

Alternatively if we stick with the private char approach, it should be
possible to have an escaping scheme for 0xEFxx characters in the input
that would enable us to roundtrip correctly. That is, escape 0xEFxx
into a sequence 0xYYEF 0xYYxx for some suitable YY. But perhaps that
would be too expensive - an extra translation pass over the buffer after
iconv (well, we do this for newline translation, so maybe it's not too bad).
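
A rough sketch of what such an escaping pass could look like, written in Python for brevity. It is a simplified variant and not the exact 0xYYEF 0xYYxx layout: it assumes UTF-8 input (so every undecodable byte is >= 0x80 and U+EF00 is never produced by byte escaping) and uses U+EF00 as a hypothetical marker meaning "the next character is a genuine U+EFxx". The names decode_private and encode_private are invented for the example.

```python
PRIVATE_BASE = 0xEF00  # an undecodable byte b is represented as chr(0xEF00 + b)
ESCAPE = "\uef00"      # assumed marker: the next char is a literal U+EFxx

def _escape_literals(text):
    # Extra pass over decoded text: prefix every genuine U+EFxx with the marker.
    return "".join(ESCAPE + c if 0xEF00 <= ord(c) <= 0xEFFF else c for c in text)

def decode_private(raw, encoding="utf-8"):
    out = []
    pos = 0
    while pos < len(raw):
        try:
            out.append(_escape_literals(raw[pos:].decode(encoding)))
            break
        except UnicodeDecodeError as err:
            # err.start/err.end are offsets into raw[pos:].
            good = raw[pos:pos + err.start].decode(encoding)
            out.append(_escape_literals(good))
            bad = raw[pos + err.start:pos + err.end]
            out.extend(chr(PRIVATE_BASE + b) for b in bad)
            pos += err.end
    return "".join(out)

def encode_private(text, encoding="utf-8"):
    out = bytearray()
    chars = iter(text)
    for c in chars:
        if c == ESCAPE:
            # Marker: the following character is literal, encode it normally.
            out += next(chars).encode(encoding)
        elif 0xEF80 <= ord(c) <= 0xEFFF:
            # Escaped raw byte: emit it unchanged.
            out.append(ord(c) - PRIVATE_BASE)
        else:
            out += c.encode(encoding)
    return bytes(out)

# A genuine U+EF80 followed by a stray 0x80 byte: without the marker both
# would decode to the same character and the roundtrip would be lossy.
sample = "\uef80".encode("utf-8") + b"\x80"
print(encode_private(decode_private(sample)) == sample)
```

The cost is the one mentioned above: every decoded chunk is scanned once more to escape literal private-use characters.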

RE exposing a ByteString based interface to the IO library from
base/unix/whatever: AFAIK Python doesn't do this, and just tells
people to use the (x.encode(sys.getfilesystemencoding(),
"surrogateescape")) escape hatch, which is what I've been
recommending. I think this would be more satisfying to John if it were
actually guaranteed to work on arbitrary byte sequences, not just
*highly likely* to work :-)

The performance overhead of all this worries me. withCString has taken
a huge performance hit, and I think there are people who want to know
that there aren't several complex encoding/decoding passes between their
Haskell code and the POSIX API. We ought to be able to program to POSIX
directly, and the same goes for Win32.


Cheers,
        Simon



_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
