On Mon, 29 Jun 2009 06:52:23 +0100, Nicholas Robinson wrote:

> I have a mysql database and use a php application that captures, stores,
> retrieves and displays data correctly - including French language words
> with accents. It has been running for around five years. I've recently
> written an extension that creates an openoffice writer document using
> this data. Everything works apart from these wretched French
> characters!!! If I unzip the odt package and examine content.xml, then
> the characters are wrong - but simply cutting and pasting correct ones
> in gives me a working document, so the error is definitely in the way I
> am creating the content using php.
>
> An example of the problem is Côte. As I've just typed it, the o has a
> circumflex accent or 'hat' on it. Within the odt file, the o-circumflex
> is shown as ô. Piping this to od -c gives 303 203 302 264. If I take
> the o-circumflex character from gnome charmap and od -c this, then I get
> 303 264. If I copy the character from my php/web app then it is correct.
> Where are these two middle bytes coming from? I've tried various
> combinations of mbstring functions and ini file settings but without
> joy.
Hexadecimal is easier on my eyes, so:

  303 203 302 264 == c3 83 c2 b4
  303 264         == c3 b4

These are UTF-8 encodings:

  <c3 83><c2 b4> == U+00C3 (LATIN CAPITAL LETTER A WITH TILDE),
                    U+00B4 (ACUTE ACCENT)
  <c3 b4>        == U+00F4 (LATIN SMALL LETTER O WITH CIRCUMFLEX)

In other words, somewhere in the process, a perfectly fine UTF-8
encoded character:

  <c3 b4> (U+00F4)

has been (incorrectly) treated as ISO-8859-1 (or similar) and converted
to UTF-8 a second time, so that each of its two bytes became a
character of its own:

  <c3 83><c2 b4> (U+00C3, U+00B4)

Perhaps this gives you some idea of what's going wrong.

/Nisse
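
PS: The double encoding is easy to reproduce. The following is a minimal sketch in Python (not PHP, purely for illustration; the variable names are made up). It takes the correct UTF-8 bytes, misreads them as ISO-8859-1, and encodes the result to UTF-8 again. In PHP the same effect would come from something like utf8_encode() or mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1') being applied to data that is already UTF-8, but where that happens in your code is a guess on my part.

```python
# The correct UTF-8 encoding of U+00F4 (o with circumflex).
good = "\u00f4".encode("utf-8")
print(good.hex(" "))      # c3 b4

# The mistake: read those two bytes as ISO-8859-1 (one character per
# byte), then encode the resulting two characters to UTF-8 again.
double = good.decode("iso-8859-1").encode("utf-8")
print(double.hex(" "))    # c3 83 c2 b4

# Reversing the mistake recovers the original bytes.
repaired = double.decode("utf-8").encode("iso-8859-1")
print(repaired.hex(" "))  # c3 b4
```

The last step only repairs data that has already been damaged. The cleaner fix is to find the conversion that should not be there and remove it.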