Simon Marlow wrote:
> I've been working on adding proper Unicode support to Handle I/O in GHC,
> and I finally have something that's ready for testing. I've put a patchset
> here:

Yay! Comments below.

> Comments/discussion please!

Do you expect Hugs will be able to pick up all of this?

> The only change to the existing behaviour is that by default, text IO
> is done in the prevailing encoding of the system. Handles created by
> openBinaryFile use the Latin-1 encoding, as do Handles placed in
> binary mode using hSetBinaryMode.

Sounds very good and reasonable.

> We provide a way to change the encoding for an existing Handle:
>
>   hSetEncoding :: Handle -> TextEncoding -> IO ()
>
> and various encodings:
>
>   latin1,
>   utf8,
>   utf16, utf16le, utf16be,
>   utf32, utf32le, utf32be,
>   localeEncoding,

Will there also be something to handle the UTF-16 byte order mark (BOM)?
I'm not sure what the best API for that is, since the BOM may or may not
be present, but it should be considered, and it could perhaps help
autodetect the encoding.

> Thanks to suggestions from Duncan Coutts, it's possible to call
> hSetEncoding even on buffered read Handles, and the right thing
> happens. So we can read from text streams that include multiple
> encodings, such as an HTTP response or email message, without having
> to turn buffering off (though there is a penalty for switching
> encodings on a buffered Handle, as the IO system has to do some
> re-decoding to figure out where it should start reading from again).

Sounds useful, but is this the bit that causes the 30% performance hit?

> Performance is about 30% slower on "hGetContents >>= putStr" than
> before. I've profiled it, and about 25% of this is in doing the
> actual encoding/decoding, the rest is accounted for by the fact that
> we're shuffling around 32-bit chars rather than bytes in the Handle
> buffer, so there's not much we can do to improve this.

Does this mean that if we set the encoding to latin1, we should see
performance only about 5% worse than at present? 30% slower is a big deal,
especially since we're not all that speedy now.

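As a minimal sketch of the hSetEncoding API quoted above: the code below assumes the names from the patch description (hSetEncoding, utf8, latin1) are exported from System.IO, as they are in GHC releases that include this work. The helpers writeWith, readWith and readBytes are hypothetical names chosen for illustration and are not part of the patch.

```haskell
import System.IO

-- Hypothetical helper: write a String to a file using an explicit encoding.
writeWith :: TextEncoding -> FilePath -> String -> IO ()
writeWith enc path str =
  withFile path WriteMode $ \h -> do
    hSetEncoding h enc          -- override the default (locale) encoding
    hPutStr h str

-- Hypothetical helper: read a whole file, decoding with an explicit encoding.
readWith :: TextEncoding -> FilePath -> IO String
readWith enc path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    s <- hGetContents h
    -- Force the lazy contents before withFile closes the Handle.
    length s `seq` return s

-- Hypothetical helper: read the raw bytes of a file through a binary-mode
-- Handle, so each byte comes back as a Char in the range 0..255.
readBytes :: FilePath -> IO String
readBytes path =
  withBinaryFile path ReadMode $ \h -> do
    s <- hGetContents h
    length s `seq` return s
```

Writing "caf\233" with utf8 and then calling readBytes should show the final character stored as the two bytes \195\169, whereas writing the same string with latin1 should store it as the single byte \233. The test strings avoid newlines so that newline translation on the text-mode Handles does not affect the byte counts.
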
> IO library restructuring
> ~~~~~~~~~~~~~~~~~~~~~~~~
>
> The major change here is that the implementation of the Handle
> operations is separated from the underlying IO device, using type
> classes. File descriptors are just one IO provider; I have also
> implemented memory-mapped files (good for random-access read/write)
> and a Handle that pipes output to a Chan (useful for testing code that
> writes to a Handle). New kinds of Handle can be implemented outside
> the base package, for instance someone could write bytestringToHandle.
> A Handle is made using mkFileHandle:

Very nice. That means I can eliminate all the HVIO stuff I have in
MissingH, which does roughly the same thing.

> with making new kinds of Handle. We could split up the layers further
> later.

Would it now be possible to make the Socket an instance of this
typeclass, so we can work with it directly rather than having to convert
it to a Handle first?

Thanks,

-- John