UTF-8 encode/decode

2004-04-27 Thread George Russell
I have implemented UTF8-encode/decode.  Unlike the code someone has already
posted it handles all UTF8 sequences, including those longer than 3 bytes.
It also catches all illegal UTF8 sequences (such as characters encoded
with a longer sequence than necessary).  Here is the code.
-- cut here ---

{- This module contains functions for converting to and from the UTF8
   representations for Strings.
   -}
module UTF8(
   toUTF8,
  -- :: String - String
  -- Converts a String (whose characters must all have codes 2^31) into
  -- its UTF8 representation.
   fromUTF8WE,
  -- :: Monad m = String - m String
  -- Converts a UTF8 representation of a String back into the String,
  -- catching all possible format errors.
  --
  -- Example: With the Haskell module Control.Monad.Error, you can
  -- instance this as
  -- (fromUTF8WE :: String - Either String String)
  -- to get a conversion function which either succeeds (Right) or
  -- returns an error message (Left).
   ) where
import Char
import List
import Data.Bits

import Computation

-- --
-- Encoding
-- --
-- | Converts a String into its UTF8 representation.
toUTF8 :: String - String
toUTF8 [] = []
toUTF8 (x:xs) =
   let
  xs1 = toUTF8 xs
  ox = ord x
  mkUTF8 :: Int - String - Int - Int - String
  mkUTF8 x0 xs0 xmask0 xmax0 =
 let
xbot = 0x80 .|. (x0 .. 0x3f)
x1 = x0 `shiftR` 6
xs1 = chr xbot : xs0
 in
if x1  xmax0
  then
 chr (xmask0 .|. x1) : xs1
  else
 let
xmask1 = xmask0 .|. xmax0
xmax1 = xmax0 `shiftR` 1
 in
mkUTF8 x1 xs1 xmask1 xmax1
   in
  if ox = 0x7f
 then
x : xs1
 else
   if ox `shiftR` 31 /= 0
  then
 error (Huge character with code  ++ show ox ++
 detected in string being converted to UTF8.)
  else
 mkUTF8 ox xs1 0xc0 0x20
-- | Converts a UTF8 representation of a String back into the String,
-- catching all possible format errors.
--
-- Example: With the Haskell module Control.Monad.Error, you can
-- instance this as
-- (fromUTF8WE :: String - Either String String)
-- to get a conversion function which either succeeds (Right) or
-- returns an error message (Left).
fromUTF8WE :: Monad m = String - m String
fromUTF8WE [] = return []
fromUTF8WE (x0 : xs0) =
   let
  ox = ord x0
   in
  case topZero8 ox of
 7 -
do
   xs1 - fromUTF8WE xs0
   return (x0 : xs1)
 6 -
fail UTF8 escape sequence starts 10xx
 0 -
fail UTF8 escape sequence starts 1110
 -1 -
fail UTF8 escape sequence starts 
 n -
let
   r = 6 - n -- number of 6-bit pieces
   xtop = ox .. ones n
   minx =
  bit (
 if r == 1
then
   7
else
   5*r + 1
 )
   mkx [] _ _ =
  fail UTF8 string ends in middle of escape sequence
   mkx (ch : xs1) x0 count0 =
  do
 let
och = ord ch
 if och .. 0x80 /= 0x80
then
   fail (UTF8 escape sequence contains continuing 
  ++ character not of form 10xx)
else
   return ()
 let
xbot = och .. 0x3f
x1 = (x0 `shiftL` 6) .|. xbot
count1 = count0 - 1
 if count1 == 0
then
   return (x1,xs1)
else
   mkx xs1 x1 count1
in
   do
  (x,xs1) - mkx xs0 xtop r
  if x  minx
 then
fail (UTF8 escape sequence contains character not 
   ++ optimally encoded)
 else
do
   xs2 - fromUTF8WE xs1
   return (chr x : xs2)
-- --
-- Binary utilities
-- --
-- | return the number of the top bit which is zero, or -1 if they
-- are all zero, for a number between 0 and 255.
topZero8 :: Int - Int

Re: UTF-8 encode/decode

2004-04-27 Thread David Brown
On Tue, Apr 27, 2004 at 10:55:57AM +0200, George Russell wrote:

 I have implemented UTF8-encode/decode.  Unlike the code someone has already
 posted it handles all UTF8 sequences, including those longer than 3 bytes.
 It also catches all illegal UTF8 sequences (such as characters encoded
 with a longer sequence than necessary).  Here is the code.

What license is your code covered under?  As it stands now, it is an
informative example, but cannot be used by anybody.

Thanks,
Dave Brown
___
Glasgow-haskell-users mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: UTF-8 encode/decode

2004-04-27 Thread George Russell
David Brown wrote (snipped):
What license is your code covered under?  As it stands now, it is an
informative example, but cannot be used by anybody.
As author, I am quite happy for it to be used and modified by other people
for non-commercial purposes.  As far as I know my employers wouldn't
any problem with that either.  If it is important to your lawyers that
you have a definite licence, say so, and I'll see if I can do something.
___
Glasgow-haskell-users mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


UTF-8 encode/decode libraries.

2004-04-26 Thread David Brown
I am writing some utilities to deal with UTF-8 encoded text files (not
source).  Currently, I'm just reading in the UTF-8 directly, and things
work reasonably well, since my parse tokens are ASCII, they are easy to
parse.

However, the character type seems perfectly happy with larger values for
each character.

Is anyone aware of any Haskell libraries for doing UTF-8 decoding and
encoding?  If not, I'll write something simple.

Thanks,
Dave Brown
___
Glasgow-haskell-users mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: UTF-8 encode/decode libraries.

2004-04-26 Thread Duncan Coutts
On Mon, 2004-04-26 at 18:49, David Brown wrote:
 Is anyone aware of any Haskell libraries for doing UTF-8 decoding and
 encoding?  If not, I'll write something simple.

The gtk2hs library uses the following functions internally.
Credit to Axel Simon I believe unless he swiped them from somewhere too.

-- Convert Unicode characters to UTF-8.
--
toUTF :: String - String
toUTF [] = []
toUTF (x:xs) | ord x=0x007F = x:toUTF xs
 | ord x=0x07FF = chr (0xC0 .|. ((ord x `shift` (-6)) .. 0x1F)):
   chr (0x80 .|. (ord x .. 0x3F)):
   toUTF xs
 | otherwise = chr (0xE0 .|. ((ord x `shift` (-12)) .. 0x0F)):
   chr (0x80 .|. ((ord x `shift` (-6)) .. 0x3F)):
   chr (0x80 .|. (ord x .. 0x3F)):
   toUTF xs

-- Convert UTF-8 to Unicode.
--
fromUTF :: String - String
fromUTF [] = []
fromUTF (all@(x:xs)) | ord x=0x7F = x:fromUTF xs
 | ord x=0xBF = err
 | ord x=0xDF = twoBytes all
 | ord x=0xEF = threeBytes all
 | otherwise   = err
  where
twoBytes (x1:x2:xs) = chr (((ord x1 .. 0x1F) `shift` 6) .|.
   (ord x2 .. 0x3F)):fromUTF xs
twoBytes _ = error fromUTF: illegal two byte sequence

threeBytes (x1:x2:x3:xs) = chr (((ord x1 .. 0x0F) `shift` 12) .|.
((ord x2 .. 0x3F) `shift` 6) .|.
(ord x3 .. 0x3F)):fromUTF xs
threeBytes _ = error fromUTF: illegal three byte sequence

err = error fromUTF: illegal UTF-8 character

Duncan

___
Glasgow-haskell-users mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: UTF-8 encode/decode libraries.

2004-04-26 Thread Sven Panne
Duncan Coutts wrote:
On Mon, 2004-04-26 at 18:49, David Brown wrote: [...]
toUTF :: String - String
Hmmm, String - [Word8] would be nicer...

fromUTF :: String - String
... and here: [Word8] - String or [Word8] - Maybe String.
Furthermore, UTF-8 is not restricted to a maximum of 3 bytes per character,
here an excerpt from man utf8 on my SuSE Linux:
   * UTF-8  encoded  UCS  characters  may  be up to six bytes
 long, however the Unicode standard specifies no  characters­
 above  0x10, so Unicode characters can only be up to
 four bytes long in UTF-8.
IIRC we discussed encoders/decoders quite some time ago on the libraries
mailing list, but nothing really happened, which is a pity. We should
strive for something more general than UTF-8 - UCS/Unicode, there are
quite a few more widely used encodings, e.g. GSM 03.38, etc. Any takers?
Cheers,
   S.
___
Glasgow-haskell-users mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: UTF-8 encode/decode libraries.

2004-04-26 Thread David Brown
On Mon, Apr 26, 2004 at 08:33:38PM +0200, Sven Panne wrote:
 Duncan Coutts wrote:
 On Mon, 2004-04-26 at 18:49, David Brown wrote: [...]
 toUTF :: String - String
 
 Hmmm, String - [Word8] would be nicer...
 
 fromUTF :: String - String
 
 ... and here: [Word8] - String or [Word8] - Maybe String.

Except that I would then have to come up with my own IO routines to read
and write UTF data.  With both sides as string, it is easy to just
filter input and output of files.

Dave
___
Glasgow-haskell-users mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users