On 09/11/2011 10:39, Max Bolingbroke wrote:
On 8 November 2011 11:43, Simon Marlow <marlo...@gmail.com> wrote:
Don't you mean 1 is what we have?
Yes, sorry!
Failing to roundtrip in some cases, and doing so silently, seems highly
suboptimal to me. I'm sorry I didn't pick up on this at the time (Unicode
is a swamp :).
I *can* change the implementation back to using lone surrogates. This
gives us guaranteed roundtripping but it means that the user might see
lone-surrogate Char values in Strings from the filesystem/command
line. IIRC this does break some software -- e.g. Brian's "text"
library explicitly checks for such characters and fails if it detects
them.
So whatever happens we are going to end up making some group of users unhappy!
* No PEP383: Haskellers using non-ASCII get upset when their command
line argument [String]s aren't in fact sequences of characters, but
sequences of bytes in some arbitrary encoding
* PEP383(surrogates): Unicoders get upset by lone surrogates (which
can actually occur at the moment, independent of PEP383 -- e.g. as
character literals or from FFI)
* PEP383(private chars): Unixers get upset that we can't roundtrip
byte sequences that look like the codepoint 0xEFXX encoded in the
current locale. In practice, 0xEFXX is only decodable from a UTF
encoding, so we fail to roundtrip byte sequences like the one Ian
posted.
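
To make the third bullet concrete, here is a small Python model of the two escaping variants. It is only a sketch: `decode_private` and `encode_private` are hypothetical helpers, and the mapping of an undecodable byte 0xXX to the code point 0xEFXX is assumed from the description above, not taken from the GHC source. The locale encoding is taken to be UTF-8.

```python
def decode_private(raw: bytes) -> str:
    """Decode UTF-8, mapping each undecodable byte 0xXX to the private char U+EFXX."""
    # Python's surrogateescape maps an undecodable byte 0xXX to U+DCXX,
    # so we shift those escapes into the assumed private-use block.
    text = raw.decode("utf-8", errors="surrogateescape")
    return "".join(
        chr(ord(c) - 0xDC00 + 0xEF00) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in text
    )


def encode_private(text: str) -> bytes:
    """Inverse of decode_private: U+EF80..U+EFFF become the raw bytes 0x80..0xFF."""
    out = bytearray()
    for c in text:
        cp = ord(c)
        if 0xEF80 <= cp <= 0xEFFF:
            out.append(cp - 0xEF00)   # treated as an escaped byte
        else:
            out += c.encode("utf-8")  # ordinary character
    return bytes(out)


# An invalid byte roundtrips under the private-char scheme ...
invalid = b"caf\xe9"
private_ok = encode_private(decode_private(invalid)) == invalid

# ... but the valid UTF-8 encoding of U+EF80 does not: it decodes to a
# genuine U+EF80, which the encoder then mistakes for the escaped byte 0x80.
clash = "\uef80".encode("utf-8")
private_clash = encode_private(decode_private(clash))

# The lone-surrogate scheme roundtrips the same input, because a strict
# UTF-8 decoder never produces U+DCXX from well-formed input.
surrogate_clash = clash.decode("utf-8", "surrogateescape").encode(
    "utf-8", "surrogateescape"
)
```

In this model `private_ok` is true, `private_clash` is the single byte `b"\x80"` instead of the original three bytes, and `surrogate_clash` equals `clash`.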
I'm happy to implement any behaviour, I would just like to know that
whatever it is is accepted as the correct tradeoff :-)
I would be happy with the surrogate approach, I think. Arguably, if you
try to treat a string with lone surrogates as Unicode and it fails, then
that is a feature: the original string wasn't Unicode. All you can do
with an invalid Unicode string is use it as a FilePath again, and the
right thing will happen.
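
Python's PEP 383 handling behaves in the way described here, so it can serve as an illustration (the file name below is made up). A name containing a non-UTF-8 byte decodes to a string with a lone surrogate; treating that string as Unicode fails loudly, while converting it back for use as a path recovers the original bytes.

```python
raw = b"report-\xff.txt"  # hypothetical file name; 0xFF is never valid in UTF-8

# Decoding with surrogateescape turns the bad byte into the lone surrogate U+DCFF.
name = raw.decode("utf-8", errors="surrogateescape")

# Treating the result as real Unicode text fails, which flags the bad input.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Using it as a path again (same error handler) restores the original bytes.
back = name.encode("utf-8", errors="surrogateescape")
```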
Alternatively if we stick with the private char approach, it should be
possible to have an escaping scheme for 0xEFxx characters in the input
that would enable us to roundtrip correctly. That is, escape 0xEFxx
into a sequence 0xYYEF 0xYYxx for some suitable YY. But perhaps that
would be too expensive - an extra translation pass over the buffer after
iconv (well, we do this for newline translation, so maybe it's not too bad).
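
A rough sketch of such an escaping pass, in Python for brevity. The choice YY = 0xEE and the names `escape_text` and `unescape_text` are hypothetical. The sketch also makes one assumption that goes beyond the proposal above: genuine 0xYYxx characters are escaped in the same way, since otherwise a real 0xYYEF 0xYYxx pair in the input would be indistinguishable from an escaped 0xEFxx.

```python
ESC = 0xEE  # hypothetical choice of YY; U+EE00..U+EEFF is also private-use


def escape_text(s: str) -> str:
    """Rewrite genuine U+EFxx (and U+EExx) characters as an escape pair."""
    out = []
    for c in s:
        cp = ord(c)
        if (cp >> 8) in (0xEF, ESC):
            out.append(chr((ESC << 8) | (cp >> 8)))    # 0xYYEF or 0xYYYY
            out.append(chr((ESC << 8) | (cp & 0xFF)))  # 0xYYxx
        else:
            out.append(c)
    return "".join(out)


def unescape_text(s: str) -> str:
    """Inverse of escape_text: collapse each escape pair to one character."""
    out = []
    i = 0
    while i < len(s):
        cp = ord(s[i])
        if (cp >> 8) == ESC:
            if i + 1 >= len(s) or (ord(s[i + 1]) >> 8) != ESC:
                raise ValueError("malformed escape pair")
            high = cp & 0xFF
            low = ord(s[i + 1]) & 0xFF
            out.append(chr((high << 8) | low))
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


sample = "a\uef41b\uee10c"
escaped = escape_text(sample)
```

After the pass no 0xEFxx character remains in `escaped`, so that block is free to represent undecodable bytes, and `unescape_text(escaped)` gives back `sample`. The cost is the extra linear pass over the buffer mentioned above.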
RE exposing a ByteString based interface to the IO library from
base/unix/whatever: AFAIK Python doesn't do this, and just tells
people to use the (x.encode(sys.getfilesystemencoding(),
"surrogateescape")) escape hatch, which is what I've been
recommending. I think this would be more satisfying to John if it were
actually guaranteed to work on arbitrary byte sequences, not just
*highly likely* to work :-)
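
For reference, the Python escape hatch looks roughly like this. The helper names and the sample bytes are made up for illustration; only `sys.getfilesystemencoding()` and the `"surrogateescape"` error handler come from the recommendation above.

```python
import sys

FS_ENCODING = sys.getfilesystemencoding()


def path_to_bytes(path: str) -> bytes:
    """Recover the raw bytes behind a path string (hypothetical helper)."""
    return path.encode(FS_ENCODING, "surrogateescape")


def bytes_to_path(raw: bytes) -> str:
    """Decode raw bytes to a path string, escaping undecodable bytes."""
    return raw.decode(FS_ENCODING, "surrogateescape")


raw_name = b"data-\xff\xfe.bin"  # not valid UTF-8
as_text = bytes_to_path(raw_name)
restored = path_to_bytes(as_text)
```

With an ASCII-compatible filesystem encoding such as UTF-8, `restored` equals `raw_name` for any input bytes, which is the guarantee at issue here. On POSIX systems `os.fsdecode` and `os.fsencode` wrap the same conversion.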
The performance overhead of all this worries me. withCString has taken
a huge performance hit, and I think there are people who want to know
that there aren't several complex encoding/decoding passes between their
Haskell code and the POSIX API. We ought to be able to program to POSIX
directly, and the same goes for Win32.
Cheers,
Simon
_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users