[PHP-I18N] Difficulties using preg_replace with Latin9 and Unicode characters (Resolved)

Erik Norvelle Mon, 10 Sep 2007 04:14:34 -0700

Greetings,

This was originally a question, but since I ended up solving the issue on my own, I thought I would post my solution.

I have written a program for aiding in translating documents from Spanish to English, which relies heavily on regular expressions. Mostly it works, but there are a few characters which cause problems for the regular expression engine. For instance, the following regular expressions do not match correctly:


preg_replace("Dña\. ", "Ms\. ", $text); [Matches as /D.*/]
preg_replace("[Ff]útbol", "Soccer", $text); [Matches as /[Ff].*/]
preg_replace("1º", "1st", $text); [Matches as /1.*/]

Plus a few others. It appears that upon hitting one of these troublesome characters, the preg engine stops parsing and uses whatever "legal" characters it has found up to that point as the "real" regex, ignoring whatever comes after.
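
For comparison, here is a minimal sketch of the same three substitutions written with explicit pattern delimiters and the PCRE "u" (UTF-8) modifier. This is not the approach this post ends up using, and it assumes that both the patterns and $text are valid UTF-8. The helper name translate_terms is made up for illustration.

```php
<?php
// Hypothetical helper: applies the three example substitutions to UTF-8 text.
// Assumption: $text is valid UTF-8. Non-ASCII bytes are written as escapes so
// the sketch does not depend on the encoding this file is saved in.
function translate_terms(string $text): string
{
    $rules = [
        "/D\xC3\xB1a\\. /u"   => 'Ms. ',   // "Dña. "
        "/[Ff]\xC3\xBAtbol/u" => 'Soccer', // "fútbol" or "Fútbol"
        "/1\xC2\xBA/u"        => '1st',    // "1º"
    ];
    foreach ($rules as $pattern => $replacement) {
        // With /u, preg_replace returns null when the subject is not valid UTF-8.
        $result = preg_replace($pattern, $replacement, $text);
        if ($result === null) {
            throw new RuntimeException('preg_replace failed, code ' . preg_last_error());
        }
        $text = $result;
    }
    return $text;
}
```

If the text is Latin1 or Latin9 rather than UTF-8, the "u" modifier makes the call fail instead of matching, so the encoding of the data has to be settled first either way.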

I have tried saving the files in various encodings, in particular UTF-8, as well as the native Latin9 encoding, to see if PHP would pick up the encoding and respond correctly. No luck, alas. The regexes are stored in a MySQL database, with the collation "utf8_unicode_ci", so in theory the function iconv should work to change the encoding. I have tried the following:


$regex = iconv("UTF-8", "ISO-8859-1", $trans['patron']);

This should, in theory, change the pattern (stored in UTF-8 in the DB) into a nice Latin1 pattern. However, it truncates the pattern, much as PHP does automatically. For instance, "Sociedad Española de Cardiología" becomes "Sociedad Espa", and "Dña." becomes "D", etc.
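
One possible explanation, which this post does not confirm, is that the string reaching iconv was not valid UTF-8 to begin with (for example, because the connection had already delivered it in Latin1). A minimal sketch of a guarded conversion follows. The helper name to_latin1 is hypothetical, and the fallback branch rests on the assumption that anything that is not valid UTF-8 is already Latin1.

```php
<?php
// Hypothetical helper: convert to ISO-8859-1 only when the input really is UTF-8.
function to_latin1(string $s): string
{
    // preg_match with the /u modifier returns false if $s is not valid UTF-8.
    if (preg_match('//u', $s) !== 1) {
        // Assumption: input that is not valid UTF-8 is already Latin1.
        return $s;
    }
    $converted = iconv('UTF-8', 'ISO-8859-1', $s);
    if ($converted === false) {
        // Valid UTF-8, but it contains a character Latin1 cannot represent.
        throw new RuntimeException('Cannot represent input in ISO-8859-1');
    }
    return $converted;
}
```

Checking validity before converting makes the failure visible instead of silently producing a shortened pattern.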

The solution was to tell MySQL to perform the conversion to Latin1 prior to executing the SELECT query that retrieves the regexes. It would appear that MySQL does a better job than PHP of translating between character sets:


mysql_query("SET character_set_results=latin1");

This has fixed the problems that I had.

HTH,
Erik Norvelle

--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
