On Thu, Aug 23, 2012 at 08:59:52AM -0300, José Romildo Malaquias wrote:
Hello.
I think I have an explanation for the problem with regex-pcre, ghc-7.4.2
and UTF Strings.
The Text.Regex.PCRE.String module uses the withCString and
withCStringLen from the module Foreign.C.String to pass a Haskell string
to the C library pcre functions that compile regular expressions, and
execute regular expressions to match some text.
Recent versions of ghc have withCString and withCStringLen definitions
that uses the current system locale to define the marshalling of a
Haskell string into a NUL terminated C string using temporary storage.
With a UTF-8 locale the length of the C string will be greater than the
length of the corresponding Haskell string in the presence with
characters outside of the ASCII range. Therefore positions of
corresponding characters in both strings do not match.
In order to compute matching positions, regex-pcre functions use C
strings. But to compute matching strings they use those positions with
Haskell strings.
That gives the mismatch shown earlier and repeated here with the
attached program run on a system with a UTF-8 locale:
$ LANG=en_US.UTF-8 ./test1
getForeignEncoding: UTF-8
regex: país:(.*):(.*)
text : país:Brasília:Brasil
String matchOnce : Just (array (0,2) [(0,(0,22)),(1,(6,9)),(2,(16,6))])
String match : [[pa\237s:Bras\237lia:Brasil,ras\237lia:B,asil]]
$ LANG=en_US.ISO-8859-1 ./test1
getForeignEncoding: ISO-8859-1
regex: pa�s:(.*):(.*)
text : pa�s:Bras�lia:Brasil
String matchOnce : Just (array (0,2) [(0,(0,20)),(1,(5,8)),(2,(14,6))])
String match : [[pa\237s:Bras\237lia:Brasil,Bras\237lia,Brasil]]
I see two ways of fixing this bug:
1. make the matching functions compute the text using the C string and
the positions calculated by the C function, and convert the text back
to a Haskell string.
2. map the positions in the C string (if possible) to the corresponding
positions in the Haskell string; this way the current definitions of
the matching functions returning text will just work.
I hope this would help fixing the issue.
I have a fix for this bug and it would be nice if others take a look at
it and see if it is ok. It is based on the second way presented above.
Romildo
diff -ur regex-pcre-0.94.4.orig/Text/Regex/PCRE/String.hs
regex-pcre-0.94.4/Text/Regex/PCRE/String.hs
--- regex-pcre-0.94.4.orig/Text/Regex/PCRE/String.hs2012-05-30
18:44:14.0 -0300
+++ regex-pcre-0.94.4/Text/Regex/PCRE/String.hs 2012-08-23 17:22:14.114641657
-0300
@@ -46,11 +46,16 @@
) where
import Text.Regex.PCRE.Wrap -- all
-import Foreign.C.String(withCStringLen,withCString)
-import Data.Array(Array,listArray)
+import Foreign.C.String(CStringLen,withCStringLen,withCString)
+import Foreign.Storable(peekByteOff)
+import Data.Word(Word8)
+import Data.Array.IO(IOUArray,newArray,readArray,writeArray)
+import Data.Array(Array,listArray,bounds,elems)
import System.IO.Unsafe(unsafePerformIO)
-import
Text.Regex.Base.RegexLike(RegexMaker(..),RegexLike(..),RegexContext(..),MatchLength,MatchOffset)
+import
Text.Regex.Base.RegexLike(RegexMaker(..),RegexLike(..),RegexContext(..),MatchLength,MatchOffset,MatchArray)
import Text.Regex.Base.Impl(polymatch,polymatchM)
+import GHC.IO.Encoding(getForeignEncoding,textEncodingName)
+import Control.Monad(forM)
instance RegexContext Regex String String where
match = polymatch
@@ -72,7 +77,7 @@
matchOnce regex str = unsafePerformIO $
execute regex str = unwrap
matchAll regex str = unsafePerformIO $
-withCStringLen str (wrapMatchAll regex) = unwrap
+withCStringLen str (wrapMatchAllFixPos regex) = unwrap
matchCount regex str = unsafePerformIO $
withCStringLen str (wrapCount regex) = unwrap
@@ -91,7 +96,7 @@
-- string, or:
-- 'Just' an array of (offset,length) pairs where index 0 is
whole match, and the rest are the captured subexpressions.
execute regex str = do
- maybeStartEnd - withCStringLen str (wrapMatch 0 regex)
+ maybeStartEnd - withCStringLen str (wrapMatchFixPos 0 regex)
case maybeStartEnd of
Right Nothing - return (Right Nothing)
-- Right (Just []) - fail got [] back! -- should never happen
@@ -115,9 +120,94 @@
,getSub matchedStartStop
,drop stop str
,map getSub subStartStop)
- maybeStartEnd - withCStringLen str (wrapMatch 0 regex)
+ maybeStartEnd - withCStringLen str (wrapMatchFixPos 0 regex)
case maybeStartEnd of
Right Nothing - return (Right Nothing)
-- Right (Just []) - fail got [] back! -- should never happen
Right (Just parts) - return . Right . Just . matchedParts $ parts
Left err - return (Left err)
+
+
+
+-- | wrapMatchFixPos calls wrapMatch and fixes the string offsets
+-- in the result so that they are valid in the original Haskell string