UTF-8 encode/decode
I have implemented UTF8-encode/decode. Unlike the code someone has already posted it handles all UTF8 sequences, including those longer than 3 bytes. It also catches all illegal UTF8 sequences (such as characters encoded with a longer sequence than necessary). Here is the code. -- cut here --- {- This module contains functions for converting to and from the UTF8 representations for Strings. -} module UTF8( toUTF8, -- :: String - String -- Converts a String (whose characters must all have codes 2^31) into -- its UTF8 representation. fromUTF8WE, -- :: Monad m = String - m String -- Converts a UTF8 representation of a String back into the String, -- catching all possible format errors. -- -- Example: With the Haskell module Control.Monad.Error, you can -- instance this as -- (fromUTF8WE :: String - Either String String) -- to get a conversion function which either succeeds (Right) or -- returns an error message (Left). ) where import Char import List import Data.Bits import Computation -- -- -- Encoding -- -- -- | Converts a String into its UTF8 representation. toUTF8 :: String - String toUTF8 [] = [] toUTF8 (x:xs) = let xs1 = toUTF8 xs ox = ord x mkUTF8 :: Int - String - Int - Int - String mkUTF8 x0 xs0 xmask0 xmax0 = let xbot = 0x80 .|. (x0 .. 0x3f) x1 = x0 `shiftR` 6 xs1 = chr xbot : xs0 in if x1 xmax0 then chr (xmask0 .|. x1) : xs1 else let xmask1 = xmask0 .|. xmax0 xmax1 = xmax0 `shiftR` 1 in mkUTF8 x1 xs1 xmask1 xmax1 in if ox = 0x7f then x : xs1 else if ox `shiftR` 31 /= 0 then error (Huge character with code ++ show ox ++ detected in string being converted to UTF8.) else mkUTF8 ox xs1 0xc0 0x20 -- | Converts a UTF8 representation of a String back into the String, -- catching all possible format errors. -- -- Example: With the Haskell module Control.Monad.Error, you can -- instance this as -- (fromUTF8WE :: String - Either String String) -- to get a conversion function which either succeeds (Right) or -- returns an error message (Left). fromUTF8WE :: Monad m = String - m String fromUTF8WE [] = return [] fromUTF8WE (x0 : xs0) = let ox = ord x0 in case topZero8 ox of 7 - do xs1 - fromUTF8WE xs0 return (x0 : xs1) 6 - fail UTF8 escape sequence starts 10xx 0 - fail UTF8 escape sequence starts 1110 -1 - fail UTF8 escape sequence starts n - let r = 6 - n -- number of 6-bit pieces xtop = ox .. ones n minx = bit ( if r == 1 then 7 else 5*r + 1 ) mkx [] _ _ = fail UTF8 string ends in middle of escape sequence mkx (ch : xs1) x0 count0 = do let och = ord ch if och .. 0x80 /= 0x80 then fail (UTF8 escape sequence contains continuing ++ character not of form 10xx) else return () let xbot = och .. 0x3f x1 = (x0 `shiftL` 6) .|. xbot count1 = count0 - 1 if count1 == 0 then return (x1,xs1) else mkx xs1 x1 count1 in do (x,xs1) - mkx xs0 xtop r if x minx then fail (UTF8 escape sequence contains character not ++ optimally encoded) else do xs2 - fromUTF8WE xs1 return (chr x : xs2) -- -- -- Binary utilities -- -- -- | return the number of the top bit which is zero, or -1 if they -- are all zero, for a number between 0 and 255. topZero8 :: Int - Int
Re: UTF-8 encode/decode
On Tue, Apr 27, 2004 at 10:55:57AM +0200, George Russell wrote: I have implemented UTF8-encode/decode. Unlike the code someone has already posted it handles all UTF8 sequences, including those longer than 3 bytes. It also catches all illegal UTF8 sequences (such as characters encoded with a longer sequence than necessary). Here is the code. What license is your code covered under? As it stands now, it is an informative example, but cannot be used by anybody. Thanks, Dave Brown ___ Glasgow-haskell-users mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Re: UTF-8 encode/decode
David Brown wrote (snipped): What license is your code covered under? As it stands now, it is an informative example, but cannot be used by anybody. As author, I am quite happy for it to be used and modified by other people for non-commercial purposes. As far as I know my employers wouldn't any problem with that either. If it is important to your lawyers that you have a definite licence, say so, and I'll see if I can do something. ___ Glasgow-haskell-users mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
UTF-8 encode/decode libraries.
I am writing some utilities to deal with UTF-8 encoded text files (not source). Currently, I'm just reading in the UTF-8 directly, and things work reasonably well, since my parse tokens are ASCII, they are easy to parse. However, the character type seems perfectly happy with larger values for each character. Is anyone aware of any Haskell libraries for doing UTF-8 decoding and encoding? If not, I'll write something simple. Thanks, Dave Brown ___ Glasgow-haskell-users mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Re: UTF-8 encode/decode libraries.
On Mon, 2004-04-26 at 18:49, David Brown wrote: Is anyone aware of any Haskell libraries for doing UTF-8 decoding and encoding? If not, I'll write something simple. The gtk2hs library uses the following functions internally. Credit to Axel Simon I believe unless he swiped them from somewhere too. -- Convert Unicode characters to UTF-8. -- toUTF :: String - String toUTF [] = [] toUTF (x:xs) | ord x=0x007F = x:toUTF xs | ord x=0x07FF = chr (0xC0 .|. ((ord x `shift` (-6)) .. 0x1F)): chr (0x80 .|. (ord x .. 0x3F)): toUTF xs | otherwise = chr (0xE0 .|. ((ord x `shift` (-12)) .. 0x0F)): chr (0x80 .|. ((ord x `shift` (-6)) .. 0x3F)): chr (0x80 .|. (ord x .. 0x3F)): toUTF xs -- Convert UTF-8 to Unicode. -- fromUTF :: String - String fromUTF [] = [] fromUTF (all@(x:xs)) | ord x=0x7F = x:fromUTF xs | ord x=0xBF = err | ord x=0xDF = twoBytes all | ord x=0xEF = threeBytes all | otherwise = err where twoBytes (x1:x2:xs) = chr (((ord x1 .. 0x1F) `shift` 6) .|. (ord x2 .. 0x3F)):fromUTF xs twoBytes _ = error fromUTF: illegal two byte sequence threeBytes (x1:x2:x3:xs) = chr (((ord x1 .. 0x0F) `shift` 12) .|. ((ord x2 .. 0x3F) `shift` 6) .|. (ord x3 .. 0x3F)):fromUTF xs threeBytes _ = error fromUTF: illegal three byte sequence err = error fromUTF: illegal UTF-8 character Duncan ___ Glasgow-haskell-users mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Re: UTF-8 encode/decode libraries.
Duncan Coutts wrote: On Mon, 2004-04-26 at 18:49, David Brown wrote: [...] toUTF :: String - String Hmmm, String - [Word8] would be nicer... fromUTF :: String - String ... and here: [Word8] - String or [Word8] - Maybe String. Furthermore, UTF-8 is not restricted to a maximum of 3 bytes per character, here an excerpt from man utf8 on my SuSE Linux: * UTF-8 encoded UCS characters may be up to six bytes long, however the Unicode standard specifies no characters above 0x10, so Unicode characters can only be up to four bytes long in UTF-8. IIRC we discussed encoders/decoders quite some time ago on the libraries mailing list, but nothing really happened, which is a pity. We should strive for something more general than UTF-8 - UCS/Unicode, there are quite a few more widely used encodings, e.g. GSM 03.38, etc. Any takers? Cheers, S. ___ Glasgow-haskell-users mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Re: UTF-8 encode/decode libraries.
On Mon, Apr 26, 2004 at 08:33:38PM +0200, Sven Panne wrote: Duncan Coutts wrote: On Mon, 2004-04-26 at 18:49, David Brown wrote: [...] toUTF :: String - String Hmmm, String - [Word8] would be nicer... fromUTF :: String - String ... and here: [Word8] - String or [Word8] - Maybe String. Except that I would then have to come up with my own IO routines to read and write UTF data. With both sides as string, it is easy to just filter input and output of files. Dave ___ Glasgow-haskell-users mailing list [EMAIL PROTECTED] http://www.haskell.org/mailman/listinfo/glasgow-haskell-users