On Mon, 29 Jun 2009 06:52:23 +0100, Nicholas Robinson wrote:
> I have a mysql database and use a php application that captures, stores,
> retrieves and displays data correctly - including French language words
> with accents. It has been running for around five years. I've recently
> written an extension that creates an openoffice writer document using
> this data. Everything works apart from these wretched French
> characters!!! If I unzip the odt package and examine content.xml, then
> the characters are wrong - but simply cutting and pasting correct ones
> in gives me a working document, so the error is definitely in the way I
> am creating the content using php.
>
> An example of the problem is Côte. As I've just typed it, the o has a
> circumflex accent or 'hat' on it. Within the odt file, the o-circumflex
> is shown as ô. Piping this to od -c gives 303 203 302 264. If I take
> the o-circumflex character from gnome charmap and od -c this, then I get
> 303 264. If I copy the character from my php/web app then it is correct.
> Where are these two middle bytes coming from? I've tried various
> combinations of mbstring functions and ini file settings but without
> joy.
Hexadecimal is easier on my eyes, so:
    303 203 302 264 == c3 83 c2 b4
    303 264         == c3 b4
These are UTF-8 encodings:
    <c3 83><c2 b4> == U+00C3 (LATIN CAPITAL LETTER A WITH TILDE),
                      U+00B4 (ACUTE ACCENT)
    <c3 b4>        == U+00F4 (LATIN SMALL LETTER O WITH CIRCUMFLEX)
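If you want to check that table yourself, here is a minimal sketch that turns the od -c octal values into hex and then into code points. It is in Python only because that is compact; the helper name octal_dump_to_bytes is made up for illustration and nothing here comes from your application.

```python
import unicodedata


def octal_dump_to_bytes(dump: str) -> bytes:
    """Turn space-separated octal values (as printed by od -c) into bytes.

    Hypothetical helper, for illustration only.
    """
    return bytes(int(token, 8) for token in dump.split())


for dump in ("303 203 302 264", "303 264"):
    raw = octal_dump_to_bytes(dump)
    # Decode the bytes as UTF-8 and name every resulting code point.
    text = raw.decode("utf-8")
    names = ", ".join(
        f"U+{ord(ch):04X} ({unicodedata.name(ch)})" for ch in text
    )
    print(raw.hex(" "), "==", names)
```

The first dump should come out as two code points and the second as one, matching the table above.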
In other words, somewhere in the process, a perfectly fine
UTF-8 encoded character:
<c3 b4> (U+00F4)
has been (incorrectly) treated as ISO-8859-1 (or similar) and
re-encoded to UTF-8, one byte at a time, resulting in:
<c3 83><c2 b4> (U+00C3, U+00B4)
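The mix-up can be reproduced, and undone, in a few lines. This is only a sketch of the byte-level effect, written in Python for brevity, with made-up function names. Which step in your PHP code does the equivalent is not visible from here; any call that converts ISO-8859-1 to UTF-8 (PHP's utf8_encode() is one such function) applied to data that is already UTF-8 would be a candidate worth checking, but that is a guess.

```python
def double_encode(utf8_bytes: bytes) -> bytes:
    """Reproduce the fault: read UTF-8 bytes as ISO-8859-1, re-encode as UTF-8.

    Hypothetical helper, for illustration only.
    """
    return utf8_bytes.decode("iso-8859-1").encode("utf-8")


def undo_double_encode(mangled: bytes) -> bytes:
    """Reverse the fault: decode as UTF-8, then map each character back to one byte."""
    return mangled.decode("utf-8").encode("iso-8859-1")


good = "\u00f4".encode("utf-8")        # o-circumflex, bytes c3 b4
mangled = double_encode(good)          # bytes c3 83 c2 b4
repaired = undo_double_encode(mangled)

print(good.hex(" "), "->", mangled.hex(" "), "->", repaired.hex(" "))
```

The reverse step is a diagnostic aid rather than a fix: it is better to find and remove the extra conversion than to undo it afterwards.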
Perhaps this gives you some idea of what's going wrong.
/Nisse
--
PHP Unicode & I18N Mailing List (http://www.php.net/)