On Mon, 29 Jun 2009 06:52:23 +0100, Nicholas Robinson wrote:

> I have a mysql database and use a php application that captures, stores,
> retrieves and displays data correctly - including French language words
> with accents. It has been running for around five years. I've recently
> written an extension that creates an openoffice writer document using
> this data. Everything works apart from these wretched French
> characters!!! If I unzip the odt package and examine content.xml, then
> the characters are wrong - but simply cutting and pasting correct ones
> in gives me a working document, so the error is definitely in the way I
> am creating the content using php.
> 
> An example of the problem is Côte. As I've just typed it, the o has a
> circumflex accent or 'hat' on it. Within the odt file, the o-circumflex
> is shown as Ã´. Piping this to od -c gives 303 203 302 264. If I take
> the o-circumflex character from gnome charmap and od -c this, then I get
> 303 264. If I copy the character from my php/web app then it is correct.
> Where are these two middle bytes coming from? I've tried various
> combinations of mbstring functions and ini file settings but without
> joy.

Hexadecimal is easier on my eyes, so:

  303 203 302 264  ==  c3 83 c2 b4
  303 264          ==  c3 b4
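
If you want to check the octal-to-hex step yourself, here is a
minimal sketch in Python (my choice for illustration only; the
helper name octal_to_hex is made up, and the input values are the
ones from your od -c output):

```python
# Octal byte values exactly as printed by `od -c` in the question.
mangled_octal = ["303", "203", "302", "264"]
correct_octal = ["303", "264"]


def octal_to_hex(values):
    """Convert octal byte strings (as printed by od) to two-digit hex strings."""
    return [format(int(v, 8), "02x") for v in values]


print(octal_to_hex(mangled_octal))  # ['c3', '83', 'c2', 'b4']
print(octal_to_hex(correct_octal))  # ['c3', 'b4']
```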

These are UTF-8 encodings:

  <c3 83><c2 b4>  == U+00C3 (LATIN CAPITAL LETTER A WITH TILDE),
                     U+00B4 (ACUTE ACCENT)
  <c3 b4>         == U+00F4 (LATIN SMALL LETTER O WITH CIRCUMFLEX)
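
The same decoding can be reproduced with a few lines of Python.
This is only a sketch for checking the table above; the function
name describe is hypothetical, and the character names come from
Python's standard unicodedata module:

```python
import unicodedata


def describe(raw):
    """Decode raw bytes as UTF-8 and return (code point, Unicode name) pairs."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in raw.decode("utf-8")]


# The four bytes found in content.xml decode to two characters.
print(describe(bytes.fromhex("c383c2b4")))
# The two bytes from gnome charmap decode to the single intended character.
print(describe(bytes.fromhex("c3b4")))
```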


In other words, somewhere in the process, a perfectly fine
UTF-8 encoded character:

  <c3 b4> (U+00F4)

has been treated as if its two bytes were two ISO-8859-1 (or
similar) characters, and each of those has then been
(incorrectly) converted to UTF-8 a second time, resulting in:

  <c3 83><c2 b4> (U+00C3, U+00B4)
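
A minimal Python sketch of that mis-conversion, and of undoing it,
assuming the wrong source encoding really was ISO-8859-1 (it could
equally be a close relative such as Windows-1252). In PHP, calling
utf8_encode() on a string that is already UTF-8 would have the same
effect, though whether that is what your code does is only a guess:

```python
# The correctly encoded character: U+00F4 as UTF-8 is the bytes c3 b4.
good = "\u00f4".encode("utf-8")

# The mistake: read the UTF-8 bytes as ISO-8859-1, then encode to UTF-8 again.
mangled = good.decode("iso-8859-1").encode("utf-8")
print(mangled.hex())  # c383c2b4

# Undoing it: decode the UTF-8, then map each character back to a single byte.
repaired = mangled.decode("utf-8").encode("iso-8859-1")
print(repaired.hex())  # c3b4
```

Reversing the damage like this is only safe if every string went
through exactly one extra conversion; fixing the step that does the
extra conversion is the better cure.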


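Perhaps this gives you some idea of what's going wrong: look for
a step in the code that builds content.xml which converts to
UTF-8 (or assumes ISO-8859-1 input) when the data coming out of
the database is already UTF-8.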


/Nisse

-- 
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
