[Haskell-cafe] regex-pcre and ghc-7.4.2 is not working with UTF-8

2012-08-23 Thread José Romildo Malaquias
Hello.

I think I have an explanation for the problem with regex-pcre, ghc-7.4.2
and UTF Strings.

The Text.Regex.PCRE.String module uses the withCString and
withCStringLen from the module Foreign.C.String to pass a Haskell string
to the C library pcre functions that compile regular expressions, and
execute regular expressions to match some text.

Recent versions of ghc have withCString and withCStringLen definitions
that uses the current system locale to define the marshalling of a
Haskell string into a NUL terminated C string using temporary storage.

With a UTF-8 locale the length of the C string will be greater than the
length of the corresponding Haskell string in the presence with
characters outside of the ASCII range. Therefore positions of
corresponding characters in both strings do not match.

In order to compute matching positions, regex-pcre functions use C
strings. But to compute matching strings they use those positions with
Haskell strings.

That gives the mismatch shown earlier and repeated here with the
attached program run on a system with a UTF-8 locale:


   $ LANG=en_US.UTF-8  ./test1
   getForeignEncoding: UTF-8

   regex: país:(.*):(.*)
   text : país:Brasília:Brasil
   String matchOnce : Just (array (0,2) [(0,(0,22)),(1,(6,9)),(2,(16,6))])
   String match : [[pa\237s:Bras\237lia:Brasil,ras\237lia:B,asil]]

   $ LANG=en_US.ISO-8859-1  ./test1
   getForeignEncoding: ISO-8859-1

   regex: pa�s:(.*):(.*)
   text : pa�s:Bras�lia:Brasil
   String matchOnce : Just (array (0,2) [(0,(0,20)),(1,(5,8)),(2,(14,6))])
   String match : [[pa\237s:Bras\237lia:Brasil,Bras\237lia,Brasil]]


I see two ways of fixing this bug:

1. make the matching functions compute the text using the C string and
   the positions calculated by the C function, and convert the text back
   to a Haskell string.

2. map the positions in the C string (if possible) to the corresponding
   positions in the Haskell string; this way the current definitions of
   the matching functions returning text will just work.

I hope this would help fixing the issue.


Regards,

Romildo
module Main where

import GHC.IO.Encoding (getForeignEncoding)
import Data.Bits (Bits((..)))
import Text.Regex.PCRE

testpcre re text = do putStrLn (regex:  ++ re)
  putStrLn (text :  ++ text)
  putStrLn (String matchOnce :  ++ show mo)
  putStrLn (String match :  ++ show m)
  where
c = defaultCompOpt .. compUTF8
e = defaultExecOpt
regex = makeRegexOpts c e re :: Regex
mo = matchOnce regex text
m = match regex text :: [[String]]

main = do enc - getForeignEncoding
  putStrLn (getForeignEncoding:  ++ show enc)
  putStrLn 
  testpcre país:(.*):(.*) país:Brasília:Brasil

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] regex-pcre and ghc-7.4.2 is not working with UTF-8

2012-08-23 Thread José Romildo Malaquias
On Thu, Aug 23, 2012 at 08:59:52AM -0300, José Romildo Malaquias wrote:
 Hello.
 
 I think I have an explanation for the problem with regex-pcre, ghc-7.4.2
 and UTF Strings.
 
 The Text.Regex.PCRE.String module uses the withCString and
 withCStringLen from the module Foreign.C.String to pass a Haskell string
 to the C library pcre functions that compile regular expressions, and
 execute regular expressions to match some text.
 
 Recent versions of ghc have withCString and withCStringLen definitions
 that uses the current system locale to define the marshalling of a
 Haskell string into a NUL terminated C string using temporary storage.
 
 With a UTF-8 locale the length of the C string will be greater than the
 length of the corresponding Haskell string in the presence with
 characters outside of the ASCII range. Therefore positions of
 corresponding characters in both strings do not match.
 
 In order to compute matching positions, regex-pcre functions use C
 strings. But to compute matching strings they use those positions with
 Haskell strings.
 
 That gives the mismatch shown earlier and repeated here with the
 attached program run on a system with a UTF-8 locale:
 
 
$ LANG=en_US.UTF-8  ./test1
getForeignEncoding: UTF-8
 
regex: país:(.*):(.*)
text : país:Brasília:Brasil
String matchOnce : Just (array (0,2) [(0,(0,22)),(1,(6,9)),(2,(16,6))])
String match : [[pa\237s:Bras\237lia:Brasil,ras\237lia:B,asil]]
 
$ LANG=en_US.ISO-8859-1  ./test1
getForeignEncoding: ISO-8859-1
 
regex: pa�s:(.*):(.*)
text : pa�s:Bras�lia:Brasil
String matchOnce : Just (array (0,2) [(0,(0,20)),(1,(5,8)),(2,(14,6))])
String match : [[pa\237s:Bras\237lia:Brasil,Bras\237lia,Brasil]]
 
 
 I see two ways of fixing this bug:
 
 1. make the matching functions compute the text using the C string and
the positions calculated by the C function, and convert the text back
to a Haskell string.
 
 2. map the positions in the C string (if possible) to the corresponding
positions in the Haskell string; this way the current definitions of
the matching functions returning text will just work.
 
 I hope this would help fixing the issue.


I have a fix for this bug and it would be nice if others take a look at
it and see if it is ok. It is based on the second way presented above.

Romildo
diff -ur regex-pcre-0.94.4.orig/Text/Regex/PCRE/String.hs 
regex-pcre-0.94.4/Text/Regex/PCRE/String.hs
--- regex-pcre-0.94.4.orig/Text/Regex/PCRE/String.hs2012-05-30 
18:44:14.0 -0300
+++ regex-pcre-0.94.4/Text/Regex/PCRE/String.hs 2012-08-23 17:22:14.114641657 
-0300
@@ -46,11 +46,16 @@
   ) where
 
 import Text.Regex.PCRE.Wrap -- all
-import Foreign.C.String(withCStringLen,withCString)
-import Data.Array(Array,listArray)
+import Foreign.C.String(CStringLen,withCStringLen,withCString)
+import Foreign.Storable(peekByteOff)
+import Data.Word(Word8)
+import Data.Array.IO(IOUArray,newArray,readArray,writeArray)
+import Data.Array(Array,listArray,bounds,elems)
 import System.IO.Unsafe(unsafePerformIO)
-import 
Text.Regex.Base.RegexLike(RegexMaker(..),RegexLike(..),RegexContext(..),MatchLength,MatchOffset)
+import 
Text.Regex.Base.RegexLike(RegexMaker(..),RegexLike(..),RegexContext(..),MatchLength,MatchOffset,MatchArray)
 import Text.Regex.Base.Impl(polymatch,polymatchM)
+import GHC.IO.Encoding(getForeignEncoding,textEncodingName)
+import Control.Monad(forM)
 
 instance RegexContext Regex String String where
   match = polymatch
@@ -72,7 +77,7 @@
   matchOnce regex str = unsafePerformIO $
 execute regex str = unwrap
   matchAll regex str = unsafePerformIO $ 
-withCStringLen str (wrapMatchAll regex) = unwrap
+withCStringLen str (wrapMatchAllFixPos regex) = unwrap
   matchCount regex str = unsafePerformIO $ 
 withCStringLen str (wrapCount regex) = unwrap
 
@@ -91,7 +96,7 @@
 -- string, or:
 --   'Just' an array of (offset,length) pairs where index 0 is 
whole match, and the rest are the captured subexpressions.
 execute regex str = do
-  maybeStartEnd - withCStringLen str (wrapMatch 0 regex)
+  maybeStartEnd - withCStringLen str (wrapMatchFixPos 0 regex)
   case maybeStartEnd of
 Right Nothing - return (Right Nothing)
 --  Right (Just []) - fail got [] back! -- should never happen
@@ -115,9 +120,94 @@
 ,getSub matchedStartStop
 ,drop stop str
 ,map getSub subStartStop)
-  maybeStartEnd - withCStringLen str (wrapMatch 0 regex)
+  maybeStartEnd - withCStringLen str (wrapMatchFixPos 0 regex)
   case maybeStartEnd of
 Right Nothing - return (Right Nothing)
 --  Right (Just []) - fail got [] back! -- should never happen
 Right (Just parts) - return . Right . Just . matchedParts $ parts
 Left err - return (Left err)
+
+
+
+-- | wrapMatchFixPos calls wrapMatch and fixes the string offsets
+-- in the result so that they are valid in the original Haskell string