Greetings,

This was originally a question, but since I ended up solving the issue on my own, I thought I would post my solution.

I have written a program to help translate documents from Spanish to English, which relies heavily on regular expressions. Mostly it works, but a few characters cause problems for the regular expression engine. For instance, the following regular expressions do not match correctly:

$text = preg_replace("/Dña\. /", "Ms. ", $text);        // matches as /D.*/
$text = preg_replace("/[Ff]útbol/", "Soccer", $text);   // matches as /[Ff].*/
$text = preg_replace("/1º/", "1st", $text);             // matches as /1.*/

Plus a few others. It appears that upon hitting one of these troublesome characters, the preg engine stops parsing and uses whatever "legal" characters it has found up to that point as the "real" regex, ignoring whatever comes after.
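
For comparison, here is a rough sketch of how such rules might be applied when both the patterns and the text are known to be valid UTF-8, using PCRE's "u" modifier. This is not what I ended up doing (see the rest of this message); the function name translate_text and the sample rule list are made up for illustration, and the file itself is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: apply an ordered list of pattern => replacement rules.
// Assumption: patterns and $text are valid UTF-8; the "u" modifier makes
// PCRE treat both as UTF-8 rather than as raw bytes.
function translate_text(string $text, array $rules): string
{
    foreach ($rules as $pattern => $replacement) {
        $result = preg_replace($pattern, $replacement, $text);
        if ($result === null) {
            // preg_replace() returns null on error, e.g. malformed UTF-8 input.
            throw new RuntimeException(
                "preg_replace failed for $pattern (error code " . preg_last_error() . ")"
            );
        }
        $text = $result;
    }
    return $text;
}

// Illustrative rules only, taken from the examples above.
$rules = [
    '/Dña\. /u'    => 'Ms. ',
    '/[Ff]útbol/u' => 'Soccer',
    '/1º/u'        => '1st',
];
```

Checking for a null return is worthwhile here, since otherwise a bad byte sequence silently wipes out the text being translated.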

I have tried saving the files in various encodings, in particular UTF-8 as well as the native Latin9 encoding, to see if PHP would pick up the encoding and respond correctly. No luck, alas. The regexes are stored in a MySQL database with the collation "utf8_unicode_ci" (that is, the utf8 character set), so in theory the function iconv should work to change the encoding. I have tried the following:

$regex = iconv("UTF-8", "ISO-8859-1", $trans['patron']);

This should, in theory, change the pattern (stored in UTF-8 in the DB) into a nice Latin1 pattern. However, it truncates the pattern at the first non-ASCII character, much as the preg functions do. For instance, "Sociedad Española de Cardiología" becomes "Sociedad Espa", and "Dña." becomes "D", etc.
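
One way to investigate truncation like this is to check whether the string PHP receives really is valid UTF-8 before converting it. The following is only a sketch under assumptions: it needs the mbstring extension, the helper name to_latin1 is invented, and treating anything that fails the UTF-8 check as already single-byte is a guess about the data, not something I have verified.

```php
<?php
// Hypothetical helper: convert to Latin1 only when the input is valid UTF-8.
// Assumption: input that fails the UTF-8 check is already in a single-byte
// encoding and can be passed through unchanged.
function to_latin1(string $s): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
    }
    return $s;
}

// Byte escapes are used so the example does not depend on the file's encoding.
$utf8   = "D\xC3\xB1a.";  // "Dña." encoded as UTF-8
$latin1 = "D\xF1a.";      // "Dña." encoded as Latin1
```

If the check fails for strings coming straight from the database, that would suggest the connection is not delivering UTF-8 in the first place.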

The solution was to tell MySQL to perform the conversion to Latin1 prior to executing the SELECT query to retrieve the Regexes. MySQL does a better job than PHP in translating between character sets, it would appear:

mysql_query("SET character_set_results=latin1");

This has fixed the problems that I had.

HTH,
Erik Norvelle

--
PHP Unicode & I18N Mailing List (http://www.php.net/)