2008/12/1 Eric Kow <[EMAIL PROTECTED]>:
> On Sun, Nov 30, 2008 at 16:34:43 -0500, Gwern Branwen wrote:
>> But as I said before, so far as I know,
>> we shouldn't have to care about possibly erroneous strings because the
>> only function in UTF8.lhs is 'encode :: String -> [Word8]', and the
>> Haskell runtime will never give us a malformed String. (If we do get a
>> bad string, certainly nothing encode can do will make the situation
>> worse.)
>
> So, I was just about to apply this last night before getting attacked by
> another fit of paranoia. I'm going to need just a little bit more
> patient coaxing here:
>
> 1. Looking at Wikipedia, obviously the most reliable and stable place to
>    learn things, I see that UTF-8 has a payload of up to 21 bits.
> 2. Haskell Char are 32 bit
> 3. I presume that by 'malformed' String, you mean "things which cannot
>    be represented in 21 bits" or "not valid Unicode". I'm sorry if my
>    language is sloppy here, but I hope what I'm saying makes sense
>
> So... how do we know that we will never get such Char? Ian assures us
> that "You can only put valid unicode values in a Char". But to what
> extent is that true? For example, can we do something nasty and low
> level to inadvertently produce those chars? Is there a scenario where
> this kind of thing may happen? Let's say we're reading in filenames.
> What if for some reason or another, the filenames use some crazy
> encoding with lots of non-Unicode things?

(a few days late...)

From http://darcs.haskell.org/libraries/base/GHC/Base.lhs :

    chr :: Int -> Char
    chr (I# i#)
      | int2Word# i# `leWord#` int2Word# 0x10FFFF# = C# (chr# i#)
      | otherwise = error "Prelude.chr: bad argument"

i.e. unless you bypass chr by using the raw C# constructor with the
chr# primitive from GHC.Prim, you're pretty safe. I'm not sure what
the Haskell standard says about the implementation size of Char (does
it _have_ to be 32 bits?) but it is, AFAIR, specified to cover only the
current Unicode codepoint range 0-0x10FFFF.

Yes, you can probably do nasty things to create UTF8 sequences that
contain codepoints > 0x10FFFF (I'm just guessing here, but poking bytes
into the data part of a ByteString might do it), but in normal
operations, including reading filenames, Haskell Strings should be
quite safe.

The filename scenario: the filename would have to be marshaled first
into a String. At this point, it should be sanitised. If it has an
exotic encoding, then whatever is doing the marshaling must deal with
it.

Alistair
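
To make the range argument concrete, here is a minimal sketch of the two points above: chr refuses anything outside 0-0x10FFFF, so every Char fits in UTF-8's 21-bit payload and a four-byte sequence is always enough. The names maxCodePoint, safeChr and encodeChar are made up for illustration, and this encode is only a stand-in with the same type as the one in UTF8.lhs, not the darcs code itself.

```haskell
import Control.Exception (ErrorCall, evaluate, try)
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Largest code point a Char can hold (0x10FFFF in GHC).
maxCodePoint :: Int
maxCodePoint = ord (maxBound :: Char)

-- | Hypothetical helper: build a Char from an arbitrary Int, turning
-- the "Prelude.chr: bad argument" error into Nothing.
safeChr :: Int -> IO (Maybe Char)
safeChr n = do
  r <- try (evaluate (chr n)) :: IO (Either ErrorCall Char)
  return (either (const Nothing) Just r)

-- | Illustrative UTF-8 encoder for one Char (an assumption, not the
-- darcs implementation). Since ord c <= 0x10FFFF, four bytes suffice.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- leading bits that go into the first byte
    hi s   = fromIntegral (n `shiftR` s)
    -- continuation byte: 10xxxxxx carrying six payload bits
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- | Same type as the function discussed in the thread.
encode :: String -> [Word8]
encode = concatMap encodeChar
```

Loaded into GHCi, safeChr 0x110000 and safeChr (-1) should both come back as Nothing, while encode "\x10FFFF" exercises the four-byte branch, the widest case an ordinary Char can reach.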
