Hello cafe, (Following up on my own optimisation question, and Duncan's advice to look at http://darcs.haskell.org/ghc/compiler/utils/Encoding.hs)
If you want to look at some existing optimised UTF8 encoding/decoding code then take a look at the code used in GHC: http://darcs.haskell.org/ghc/compiler/utils/Encoding.hs Duncan
I took a look at the UTF8 decoder in GHC. This inspired me to write one that also used unboxed types directly. Pleasingly, it goes like a cut cat, and uses far less space than the naive version, but it's not portable, which is a bummer. (The docs tell me that using GHC.Exts is the "approved" way of accessing GHC-specific extensions, but all of the useful stuff seems to be in GHC.Prim.) After some expriments with the simplifier, I think I have a portable version of a direct-from-buffer decoder which seems to perform nearly as well as one written directly against GHC primitive unboxed functions. I'm wondering if there's anything further I can do to improve performance. The "portable" unboxed version is within about 15% of the unboxed version in terms of time and allocation. Changes I made: - added strictness "annotations" in the form of a strictness guards that are always False - unrolled loops (they were always short loops anyway, with a maximum of 3 or 4 iterations) - replaced shiftL with multiplication, because multiplication unboxes, while shiftL doesn't. Some things I've noticed in the simplifier output: - the shiftL call hasn't unboxed or inlined into a call to uncheckedShiftL#, which I would prefer. Would this be possible if we added unchecked versions of the shiftL/R functions to Data.Bits? - Ptrs don't get unboxed. Why is this? Some IO monad thing? - the chr function tests that its Int argument is less than 1114111, before constructing the Char. It'd be nice to avoid this test. - why does this code: | x <= 0xF7 = remaining 3 (bAND x 0x07) xs | otherwise = err x turn into this i.e. the <= turns into two identical case-branches, using eqword# and ltword#, rather than one case-branch using leword# ? case GHC.Prim.eqWord# a11_a2PJ __word 247 of wild25_X2SU { GHC.Base.False -> case GHC.Prim.ltWord# a11_a2PJ __word 247 of wild6_Xcw { GHC.Base.False -> <error call> GHC.Base.True -> $wremaining_r3dD 3 (__scc {fromUTF8 main:Foreign.C.UTF8 !} GHC.Base.I# (GHC.Prim.word2Int# (GHC.Prim.and# a11_a2PJ __word 7))) xs_aVm }; GHC.Base.True -> $wremaining_r3dD 3 (__scc {fromUTF8 main:Foreign.C.UTF8 !} GHC.Base.I# (GHC.Prim.word2Int# (GHC.Prim.and# a11_a2PJ __word 7))) xs_aVm }; BTW, what's the difference between the indexXxxxOffAddr# and readXxxxOffAddr# functions in GHC.Prim? AFAICT they are equivalent, except that the read* functions take an extra State# s parameter. Presumably this is to thread the IO monad's RealWorld value through, to create some sort of data dependency between the functions (and so to ensure ordered evaluation?) Alistair
{-# OPTIONS_GHC -fglasgow-exts #-} -- | -- Module : Foreign.C.UTF8 -- Copyright : (c) 2004 John Meacham, Alistair Bayley -- License : BSD-style -- Maintainer : [EMAIL PROTECTED] -- Stability : experimental -- Portability : portable -- -- Marshall Haskell Strings to and from UTF8-encoded CStrings. -- This module's code is inspired by John Meacham's UTF8 en- & de-coders, -- and also those found in the HXT library (module Text.XML.HXT.DOM.Unicode). -- -- Note that the -Len functions all return the length in bytes, -- not Chars (this is more useful, as you are most likely to want -- to pass the length to an FFI function, which is most likely -- expecting the length in bytes). If you want the length in Chars, -- well, you have the original String, so... module Foreign.C.UTF8 ( peekUTF8String, peekUTF8StringLen , peekUTF8StringB, peekUTF8StringLenB , newUTF8String, withUTF8String, withUTF8StringLen , toUTF8String, fromUTF8String , lengthUTF8, fromUTF8, toUTF8 ) where import Control.Monad (when, liftM) import Data.Bits import Data.Char import Data.Word (Word8) import Foreign.C.Types import Foreign.C.String import Foreign.Ptr import Foreign.Marshal.Array import Foreign.Storable import GHC.Base import GHC.Ptr (Ptr(..)) nullCChar :: CChar nullCChar = 0 nullByte :: Word8 nullByte = 0 -- | Analogous to peekCString. Converts UTF8 CString to String. peekUTF8String :: CString -> IO String peekUTF8String cs = peekArray0 nullByte (castPtr cs) >>= return . fromUTF8 peekUTF8StringB :: CString -> IO String peekUTF8StringB cs = fromUTF8Ptr0 (castPtr cs) -- | Analogous to peekCStringLen. Converts UTF8 CString to String. -- The resulting String will end either when @len@ bytes -- have been converted, or when a NULL is found. peekUTF8StringLen :: CStringLen -> IO String peekUTF8StringLen (cs, len) = peekArray len (castPtr cs) >>= return . fromUTF8 peekUTF8StringLenB :: CStringLen -> IO String peekUTF8StringLenB (cs, len) = fromUTF8Ptr (len-1) (castPtr cs) "" -- | Analogous to newCString. Creates UTF8 encoded CString. newUTF8String :: String -> IO CString newUTF8String hs = do p <- newArray0 nullByte (toUTF8 hs) return (castPtr p) -- | Analogous to newCStringLen. -- The length returned in bytes, not chars. -- This is more useful for when you need to pass the -- UTF8-encoded CString to an FFI function that expects the -- size in bytes. -- If you need the length in chars, just ask for -- 'length' @[EMAIL PROTECTED] newUTF8StringLen :: String -> IO CStringLen newUTF8StringLen hs = do let utf8 = toUTF8 hs p <- newArray0 nullByte utf8 return (castPtr p, length utf8) -- | Analogous to withCString. Creates UTF8 encoded CString. withUTF8String :: String -> (CString -> IO a) -> IO a withUTF8String s action = withUTF8StringLen s (\(cstr, _) -> action cstr) -- | Analogous to withCStringLen. -- The length returned in bytes, not chars. -- This is more useful for when you need to pass the -- UTF8-encoded CString to an FFI function that expects the -- size in bytes. -- If you need the length in chars, just ask for -- 'length' @[EMAIL PROTECTED] withUTF8StringLen :: String -> (CStringLen -> IO a) -> IO a withUTF8StringLen s action = do let utf8 = toUTF8 s withArray0 nullByte utf8 (\arr -> action (castPtr arr, length utf8) ) -- | Convert a String that was marshalled from a CString without -- any decoder applied. This might be useful if the client encoding -- is unknown, and the user code must convert. -- We assume that the UTF8 CString was marshalled as if Latin-1 -- i.e. all chars are in the range 0-255. fromUTF8String :: String -> String fromUTF8String = fromUTF8 . map charToWord8 -- | Convert a Haskell String into a UTF8 String, where each UTF8 byte -- is represented by its Char equivalent i.e. only chars 0-255 are used. -- The resulting String can be marshalled to CString directly i.e. with -- a Latin-1 encoding. toUTF8String :: String -> String toUTF8String = map word8ToChar . toUTF8 charToWord8 :: Char -> Word8 charToWord8 = fromIntegral . fromEnum word8ToChar :: Word8 -> Char word8ToChar = toEnum . fromIntegral lengthUTF8 :: String -> Int lengthUTF8 s = length (toUTF8 s) {- The codepoint-to-UTF8 rules: 0x00 - 0x7f: 7 bits: as is 0x80 - 0x07ff: 11 bits byte 1: 0xC0 OR ((x << 6) AND 0x1F) i.e. 0xC0 + bits 7-11 (bits 12-up are 0) byte 2: 0x80 OR (x AND 0x3F) i.e. 0x80 + bits 1-6 0x0800 - 0xFFFF: 16 bits byte 1: 0xE0 OR ((x << 12) AND 0x0F) i.e. 0xE0 + bits 13-16 byte 2: 0x80 OR ((x << 6) AND 0x3F) i.e. 0x80 + bits 7-12 byte 3: 0x80 OR (x AND 0x3F) i.e. 0x80 + bits 1-6 0x00010000 - 0x001FFFFF: 21 bits byte 1: 0xF0 OR ((x << 18) AND 0x07) i.e. 0xF0 + bits 19-21 byte 2: 0x80 OR ((x << 12) AND 0x3F) i.e. 0x80 + bits 13-18 byte 3: 0x80 OR ((x << 6) AND 0x3F) i.e. 0x80 + bits 7-12 byte 4: 0x80 OR (x AND 0x3F) i.e. 0x80 + bits 1-6 0x00200000 - 0x03FFFFFF: 26 bits byte 1: 0xF8 OR ((x << 24) AND 0x03) i.e. 0xF8 + bits 25-26 byte 2: 0x80 OR ((x << 18) AND 0x3F) i.e. 0x80 + bits 19-24 byte 3: 0x80 OR ((x << 12) AND 0x3F) i.e. 0x80 + bits 13-18 byte 4: 0x80 OR ((x << 6) AND 0x3F) i.e. 0x80 + bits 7-12 byte 5: 0x80 OR (x AND 0x3F) i.e. 0x80 + bits 1-6 0x04000000 - 0x7FFFFFFF: 31 bits byte 1: 0xFC OR ((x << 30) AND 0x01) i.e. 0xFC + bit 31 byte 2: 0x80 OR ((x << 24) AND 0x3F) i.e. 0x80 + bits 25-30 byte 3: 0x80 OR ((x << 18) AND 0x3F) i.e. 0x80 + bits 19-24 byte 4: 0x80 OR ((x << 12) AND 0x3F) i.e. 0x80 + bits 13-18 byte 5: 0x80 OR ((x << 6) AND 0x3F) i.e. 0x80 + bits 7-12 byte 6: 0x80 OR (x AND 0x3F) i.e. 0x80 + bits 1-6 -} -- | Convert Unicode characters to UTF-8. toUTF8 :: String -> [Word8] toUTF8 [] = [] toUTF8 (x:xs) = toUTF8' (ord x) where toUTF8' x | x <= 0x0000007F = fromIntegral x : more | x <= 0x000007FF = w8 0xC0 6 : w8 0x80 0 : more | x <= 0x0000FFFF = w8 0xE0 12 : w8 0x80 6 : w8 0x80 0 : more | x <= 0x0010FFFF -- should be 0x001FFFFF = w8 0xF0 18 : w8 0x80 12 : w8 0x80 6 : w8 0x80 0 : more | otherwise = error ("toUTF8: codepoint " ++ show x ++ " is greater than the largest allowed (decimal 1114111, hex 0x10FFFF).") {- -- Potentially useful code, if Haskell ever supports codepoints > 0x0010FFFF. -- There are no tests for this, because we can't create Strings containing -- chars > 0x0010FFFF. | x <= 0x03FFFFFF = w8 0xF8 24 : w8 0x80 18 : w8 0x80 12 : w8 0x80 6 : w8 0x80 0 : more | x <= 0x7FFFFFFF = w8 0xFC 30 : w8 0x80 24 : w8 0x80 18 : w8 0x80 12 : w8 0x80 6 : w8 0x80 0 : more | otherwise = error ("toUTF8: codepoint " ++ show x ++ " is greater " ++ "then the largest that can be represented by UTF8 encoding" ++ "(decimal 2147483647, hex 0x7FFFFFFF).") -} where more = toUTF8 xs w8 :: Word8 -> Int -> Word8 w8 base rshift = base .|. (fromIntegral (shiftR x rshift) .&. mask) where mask | base == 0x80 = 0x3F | base == 0xC0 = 0x1F | base == 0xE0 = 0x0F | base == 0xF0 = 0x07 | base == 0xF8 = 0x03 | base == 0xFC = 0x01 {- And the rules for UTF8-to-codepoint: examine first byte: 0x00-0x7F: 1 byte: as-is 0x80-0xBF: error (surrogate) 0xC0-0xDF: 2 bytes: b1 AND 0x1F + remaining 0xE0-0xEF: 3 bytes: b1 AND 0x0F + remaining 0xF0-0xF7: 4 bytes: b1 AND 0x07 + remaining 0xF8-0xFB: 5 bytes: b1 AND 0x03 + remaining 0xFC-0xFD: 6 bytes: b1 AND 0x01 + remaining 0xFE-0xFF: error (byte-order-mark indicators: UTF8 - EFBBBF, UTF16 - FEFF or FFFE) remaining = lower 6 bits of each byte, concatenated -} -- | Convert UTF-8 to Unicode. fromUTF8 :: [Word8] -> String fromUTF8 [] = "" fromUTF8 (x:xs) | x <= 0x7F = remaining 0 (fromIntegral x) xs | x <= 0xBF = err x | x <= 0xDF = remaining 1 (bAND x 0x1F) xs | x <= 0xEF = remaining 2 (bAND x 0x0F) xs | x <= 0xF7 = remaining 3 (bAND x 0x07) xs | otherwise = err x {- -- Again, only works for chars > 0x0010FFFF, which we can't test. | x <= 0xFB = remaining 4 (bAND x 0x03) xs | x <= 0xFD = remaining 5 (bAND x 0x01) xs | otherwise = err x -} where err x = error ("fromUTF8: illegal UTF-8 character " ++ show x) bAND :: Word8 -> Word8 -> Int bAND x m = fromIntegral (x .&. m) remaining :: Int -> Int -> [Word8] -> String remaining 0 x xs = chr x : fromUTF8 xs remaining n x [] = error "fromUTF8: incomplete UTF8 sequence" remaining n x (b:xs) | b == 0 = err x | otherwise = remaining (n-1) ((shiftL x 6) .|. (bAND b 0x3F)) xs {- We should be able to write a version of fromUTF8Ptr which starts at the end of the array and works backwards. This will allow us to create the result String with constant space usage. Contrast this with creating the String by processing the array from start to end: in this case we would probably use an accumulating parameter, and reverse the list when we reach the end of the array. This isn't so bad, if we expect reverse to work in constant space (more or less). The trick is to find the next character boundary. The rules are: If the top 2 bits are 10xxxxxx then it's a surrogate If the top 3 bits are 110xxxxx then it's a 2-byte char If the top 4 bits are 1110xxxx then it's a 3-byte char If the top 5 bits are 11110xxx then it's a 4-byte char so... 0x00-0x7F: 1 byte: as-is 0x80-0xBF: skip (surrogate) 0xC0-0xDF: 2 bytes: b1 AND 0x1F + remaining 0xE0-0xEF: 3 bytes: b1 AND 0x0F + remaining 0xF0-0xF7: 4 bytes: b1 AND 0x07 + remaining remaining = lower 6 bits of each byte, concatenated -} -- | Convert UTF-8 to Unicode, from a null-terminated C array of bytes. -- This function is useful, in addition to 'fromUTF8' above, -- because it doesn't create an intermediate @[Word8]@ list. fromUTF8Ptr0 :: Ptr Word8 -> IO String fromUTF8Ptr0 p = do len <- lengthArray0 nullByte p fromUTF8PtrUnboxed (len-1) p "" --fromUTF8Ptr (len-1) p "" -- | The bytes parameter should be len-1 -- i.e. if the CString has length 2, then you should pass -- bytes=1. fromUTF8PtrUnboxed :: Int -> Ptr Word8 -> String -> IO String fromUTF8PtrUnboxed (I# bytes) (Ptr addr) acc = fromUTF8Ptr# bytes addr acc fromUTF8Ptr :: Int -> Ptr Word8 -> String -> IO String fromUTF8Ptr bytes p acc | () !bytes !p !acc !False = undefined | bytes < 0 = return acc | otherwise = do x <- liftM fromIntegral (peek p') --putStrLn ("fromUTF8Ptr: byte " ++ show bytes ++ " = " ++ show x ++ ", " ++ acc) case () of _ | x == 0 -> error ("fromUTF8Ptr: zero byte found in string as position " ++ show bytes) | x <= 0x7F -> fromUTF8Ptr (bytes-1) p (chr x:acc) | x <= 0xBF && bytes == 0 -> error "fromUTF8Ptr: surrogate at start of string" | x <= 0xBF -> fromUTF8Ptr (bytes-1) p acc | otherwise -> do c <- readUTF8Char x p' fromUTF8Ptr (bytes-1) p (c:acc) where p' = advancePtr p bytes x ! y = seq x y readUTF8Char :: Int -> Ptr Word8 -> IO Char readUTF8Char x p | () !x !False = undefined -- | False = undefined | otherwise = case () of _ | x == 0 -> err x | x <= 0x7F -> return (chr x) | x <= 0xBF -> err x | x <= 0xDF -> do x1 <- liftM fromIntegral (peekElemOff p 1) return (chr ( -- (shiftL (x - 0xC0) 6) ((x - 0xC0) * 64) + (x1 - 0x80) )) | x <= 0xEF -> do x1 <- liftM fromIntegral (peekElemOff p 1) x2 <- liftM fromIntegral (peekElemOff p 2) return (chr ( -- (shiftL (x - 0xE0) 12) -- + (shiftL (x1 - 0x80) 6) ((x - 0xE0) * 4096) + ((x1 - 0x80) * 64) + (x2 - 0x80) )) | x <= 0xF7 -> do x1 <- liftM fromIntegral (peekElemOff p 1) x2 <- liftM fromIntegral (peekElemOff p 2) x3 <- liftM fromIntegral (peekElemOff p 3) return (chr ( -- (shiftL (x - 0xF0) 18) -- + (shiftL (x1 - 0x80) 12) -- + (shiftL (x2 - 0x80) 6) ((x - 0xF0) * 262144) + ((x1 - 0x80) * 4096) + ((x2 - 0x80) * 64) + (x3 - 0x80) )) | otherwise -> err x where x ! y = seq x y err x = error ("readUTF8Char: illegal UTF-8 character " ++ show x) --------------------------------------------------------------------- -- Unboxed versions - GHC only? fromUTF8Ptr# :: Int# -> Addr# -> String -> IO String fromUTF8Ptr# bytes p acc | bytes <# 0# = return acc | otherwise = do case () of _ | x ==# 0# -> error ("fromUTF8Ptr#: zero byte in string at position " ++ show (I# x)) | x <=# 0x7F# -> fromUTF8Ptr# (bytes -# 1#) p ((C# (chr# x)):acc) | x <=# 0xBF# && bytes ==# 0# -> error "fromUTF8Ptr#: surrogate at start of string" | x <=# 0xBF# -> fromUTF8Ptr# (bytes -# 1#) p acc | otherwise -> do c <- readUTF8Char# x p' fromUTF8Ptr# (bytes -# 1#) p (c:acc) where p' = plusAddr# p bytes x = word2Int# (indexWord8OffAddr# p' 0#) readUTF8Char# :: Int# -> Addr# -> IO Char readUTF8Char# x p = do case () of _ | x ==# 0# -> err x | x <=# 0x7F# -> return (C# (chr# x)) | x <=# 0xBF# -> err x | x <=# 0xDF# -> do let x1 = word2Int# (indexWord8OffAddr# p 1#) return (C# (chr# ( (uncheckedIShiftL# (x -# 0xC0#) 6#) +# (x1 -# 0x80#) ))) | x <=# 0xEF# -> do let x1 = word2Int# (indexWord8OffAddr# p 1#) let x2 = word2Int# (indexWord8OffAddr# p 2#) return (C# (chr# ( (uncheckedIShiftL# (x -# 0xE0#) 12#) +# (uncheckedIShiftL# (x1 -# 0x80#) 6#) +# (x2 -# 0x80#) ))) | x <=# 0xF7# -> do let x1 = word2Int# (indexWord8OffAddr# p 1#) let x2 = word2Int# (indexWord8OffAddr# p 2#) let x3 = word2Int# (indexWord8OffAddr# p 3#) return (C# (chr# ( (uncheckedIShiftL# (x -# 0xF0#) 18#) +# (uncheckedIShiftL# (x1 -# 0x80#) 12#) +# (uncheckedIShiftL# (x2 -# 0x80#) 6#) +# (x3 -# 0x80#) ))) | otherwise -> err x where err x = error ("readUTF8Char: illegal UTF-8 character " ++ show (I# x))
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe