Hello.

I think I have an explanation for the problem with regex-pcre, ghc-7.4.2
and UTF-8 strings.

The Text.Regex.PCRE.String module uses the functions withCString and
withCStringLen from the module Foreign.C.String to pass a Haskell string
to the C library pcre functions that compile regular expressions and
execute them to match some text.

Recent versions of ghc have withCString and withCStringLen definitions
that use the current system locale to define the marshalling of a
Haskell string into a NUL terminated C string using temporary storage.

With a UTF-8 locale the length of the C string will be greater than the
length of the corresponding Haskell string in the presence of
characters outside of the ASCII range. Therefore the positions of
corresponding characters in the two strings do not match.
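To make the length difference concrete, here is a small sketch. The
names utf8Width, utf8Length and marshalledLength are made up for
illustration and are not part of regex-pcre. The first two count bytes
using the standard UTF-8 ranges and do not touch ghc's marshalling code;
marshalledLength uses withCStringLen from Foreign.C.String, so its
result depends on the current locale encoding.

```haskell
import Data.Char (ord)
import Foreign.C.String (withCStringLen)

-- Number of bytes that UTF-8 uses to encode one character
-- (illustrative helper, based on the standard UTF-8 code point ranges).
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Length in bytes of the UTF-8 encoding of a Haskell string.
utf8Length :: String -> Int
utf8Length = sum . map utf8Width

-- Length in bytes of the C string that withCStringLen actually produces.
-- This depends on the locale encoding in effect when the program runs.
marshalledLength :: String -> IO Int
marshalledLength s = withCStringLen s (return . snd)
```

For the text used in this message, "pa\237s:Bras\237lia:Brasil", length
gives 20 and utf8Length gives 22, which agrees with the overall match
lengths reported in the two runs.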

In order to compute matching positions, regex-pcre functions use C
strings, so the positions are byte offsets into the C string. But to
compute matching strings they use those positions with Haskell strings,
where positions count characters.

That gives the mismatch shown earlier and repeated here with the
attached program run on a system with a UTF-8 locale:


   $ LANG=en_US.UTF-8 && ./test1
   getForeignEncoding: UTF-8

   regex            : país:(.*):(.*)
   text             : país:Brasília:Brasil
   String matchOnce : Just (array (0,2) [(0,(0,22)),(1,(6,9)),(2,(16,6))])
   String match     : [["pa\237s:Bras\237lia:Brasil","ras\237lia:B","asil"]]

   $ LANG=en_US.ISO-8859-1 && ./test1
   getForeignEncoding: ISO-8859-1

   regex            : país:(.*):(.*)
   text             : país:Brasília:Brasil
   String matchOnce : Just (array (0,2) [(0,(0,20)),(1,(5,8)),(2,(14,6))])
   String match     : [["pa\237s:Bras\237lia:Brasil","Bras\237lia","Brasil"]]
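
The wrong substrings in the first run can be reproduced by hand. The
sketch below is only an illustration: extract is a hypothetical helper
that cuts a substring by (offset, length) counting characters, which I
assume is roughly what happens when the positions are applied to the
Haskell string. It is not the actual regex-pcre code.

```haskell
-- Hypothetical helper: substring by (offset, length), counting characters.
extract :: String -> (Int, Int) -> String
extract s (off, len) = take len (drop off s)

-- The text from the test program; \237 is the character 'í'.
text :: String
text = "pa\237s:Bras\237lia:Brasil"

-- Applying the byte positions reported in the UTF-8 run, (6,9) and (16,6),
-- directly to the Haskell string gives the shifted substrings.
wrongGroup1, wrongGroup2 :: String
wrongGroup1 = extract text (6, 9)    -- "ras\237lia:B"
wrongGroup2 = extract text (16, 6)   -- "asil"

-- The character positions reported in the ISO-8859-1 run give the
-- expected substrings.
rightGroup1, rightGroup2 :: String
rightGroup1 = extract text (5, 8)    -- "Bras\237lia"
rightGroup2 = extract text (14, 6)   -- "Brasil"
```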


I see two ways of fixing this bug:

1. make the matching functions compute the text using the C string and
   the positions calculated by the C function, and convert the text back
   to a Haskell string.

2. map the positions in the C string (if possible) to the corresponding
   positions in the Haskell string; this way the current definitions of
   the matching functions returning text will just work.
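
A minimal sketch of the second approach, assuming the C string is
UTF-8 encoded. All names here (utf8Width, byteToCharOffset,
byteSpanToCharSpan, extract) are invented for illustration and are not
part of regex-pcre; a real fix would have to follow whatever encoding
the marshalling actually used.

```haskell
import Data.Char (ord)

-- Number of bytes that UTF-8 uses to encode one character.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Convert a byte offset into the UTF-8 encoding of a string to the
-- corresponding character offset. Returns Nothing when the byte offset
-- is negative, past the end, or falls inside a multi-byte character.
byteToCharOffset :: String -> Int -> Maybe Int
byteToCharOffset s target = go 0 0 s
  where
    go chars bytes rest
      | bytes == target = Just chars
      | bytes > target  = Nothing
      | otherwise       =
          case rest of
            []     -> Nothing
            (c:cs) -> go (chars + 1) (bytes + utf8Width c) cs

-- Convert a (byte offset, byte length) pair to (char offset, char length).
byteSpanToCharSpan :: String -> (Int, Int) -> Maybe (Int, Int)
byteSpanToCharSpan s (off, len) = do
  start <- byteToCharOffset s off
  end   <- byteToCharOffset s (off + len)
  return (start, end - start)

-- Substring by (offset, length), counting characters.
extract :: String -> (Int, Int) -> String
extract s (off, len) = take len (drop off s)
```

With the text from the test program, byteSpanToCharSpan maps the byte
positions (6,9) and (16,6) from the UTF-8 run to (5,8) and (14,6), the
same positions that the ISO-8859-1 run reports. The walk over the string
is linear for each position, so a real implementation would probably
want to convert all positions of a match in a single pass.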

I hope this helps in fixing the issue.


Regards,

Romildo
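-- Attached test program (built as ./test1 for the runs above).
module Main where

import GHC.IO.Encoding (getForeignEncoding)
import Data.Bits (Bits((.&.)))
import Text.Regex.PCRE

-- Compile the regex, match it against the text, and print both the raw
-- match positions (matchOnce) and the extracted substrings (match).
testpcre :: String -> String -> IO ()
testpcre re text = do putStrLn ("regex            : " ++ re)
                      putStrLn ("text             : " ++ text)
                      putStrLn ("String matchOnce : " ++ show mo)
                      putStrLn ("String match     : " ++ show m)
  where
    -- Note: (.&.) is a bitwise AND, so it keeps only the option bits
    -- common to both values. Adding compUTF8 to the default options
    -- would need (.|.) instead.
    c = defaultCompOpt .&. compUTF8
    e = defaultExecOpt
    regex = makeRegexOpts c e re :: Regex
    mo = matchOnce regex text             -- positions as (offset, length)
    m = match regex text :: [[String]]    -- matched text for each group

-- Print the encoding used for marshalling C strings, then run the test.
main :: IO ()
main = do enc <- getForeignEncoding
          putStrLn ("getForeignEncoding: " ++ show enc)
          putStrLn ""
          testpcre "país:(.*):(.*)" "país:Brasília:Brasil"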
