From:             mikx at mikx dot de
Operating system: Linux, Windows
PHP version:      5.0.5
PHP Bug Type:     WDDX related
Bug description:  wddx deserialization problems with utf-8 data

Description:
------------
It seems the behavior of wddx_deserialize is inconsistent, or at least
unpredictable, given the documentation. It differs not only between PHP 4
and 5 but also depending on the packet data. I am not sure if this is a
bug or expected behavior. I am aware of bug #34928, so please don't just
treat this as bogus.

The following script behaves as described on PHP 5.0.5 on Windows,
5.0.4 on Linux (I currently have no 5.0.5 Linux test case available) and
PHP 4.3.9 on Linux. At least the Windows version is a complete default
installation.

Please clarify what wddx serialize and deserialize do exactly (encoding),
why the documentation encourages adding an additional utf8_encode for
non-ASCII characters on serialize, and how the entire process can be
influenced (e.g. which config settings get used). setlocale() and
putenv("locale=xyz") have no effect.

Currently wddx_serialize adds no character set information and keeps
whatever you supply as a string inside the resulting wddx file. So if you
send an extended character in ISO-8859-1 or UTF-8 it will be the same in
the resulting wddx packet.

The deserializer seems to always convert the packet to ISO-8859-1 unless
you explicitly set information in the XML file that it is already
ISO-8859-1 (even if there is UTF-8 content in it). 

If the documentation's advice to always utf8_encode a string before
sending it to serialize is correct, it would mean you have to double
encode a UTF-8 string. But that seems like a dirty workaround.
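
To make the double encoding concrete, here is a minimal sketch that does
not touch wddx at all. It assumes the mbstring extension is available and
uses mb_convert_encoding() as a stand-in for utf8_encode()/utf8_decode();
the variable names are only for illustration.

```php
<?php
// Sketch only: mb_convert_encoding() stands in for utf8_encode()/utf8_decode().
// Assumes the mbstring extension is loaded.

// "abc-äöü" written as raw bytes, so the result does not depend on the
// encoding this file happens to be saved in.
$latin1 = "abc-\xE4\xF6\xFC";                                   // ISO-8859-1
$utf8   = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');  // proper UTF-8

// Encoding a second time treats every UTF-8 byte as one Latin-1 character,
// which is the "double encoded" form.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// A single decode (what the deserializer appears to apply implicitly)
// strips exactly one layer again.
$decodedOnce = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');

echo strlen($latin1), ' ', strlen($utf8), ' ', strlen($double), "\n";
echo $decodedOnce === $utf8 ? "one decode gives UTF-8 back\n" : "mismatch\n";
```

So the double encoded string only comes out as UTF-8 because one decode is
applied somewhere on the way, which is exactly the part that is not
documented.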

From my perspective both wddx_serialize and wddx_deserialize should
add/respect the encoding information in the XML file and get an
additional parameter to enforce an input or output encoding or override
the default behavior.

I am currently trying to deserialize wddx packets in PHP5 that were
produced with PHP4. They are stored in a database, first in MySQL4
(latin1 encoded) and now migrated to MySQL5 (utf8 encoded). What is the
proper way to handle that? Running utf8_encode on the packet (producing a
double encoded packet) before sending it to wddx_deserialize (which
implicitly applies a utf8_decode to that data) seems like an evil hack in
an undocumented area.
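
For the migration case, one thing that could at least be done outside of
wddx is to bring all stored packets to one known encoding first. The
following is only a sketch: normalize_packet_to_utf8() is a hypothetical
helper (not part of the wddx extension), it needs mbstring, and it assumes
that anything which is not valid UTF-8 was stored as ISO-8859-1. It says
nothing about what wddx_deserialize does with the result afterwards.

```php
<?php
// Hypothetical helper, not part of the wddx extension. Assumes mbstring.
// Assumption: a packet that is not valid UTF-8 was stored as ISO-8859-1.
function normalize_packet_to_utf8($packet)
{
    if (mb_check_encoding($packet, 'UTF-8')) {
        return $packet; // already valid UTF-8 (pure ASCII also ends up here)
    }
    return mb_convert_encoding($packet, 'UTF-8', 'ISO-8859-1');
}

// The same string fragment in both encodings, written as raw bytes.
$latin1Packet = "<string>abc-\xE4\xF6\xFC</string>";
$utf8Packet   = "<string>abc-\xC3\xA4\xC3\xB6\xC3\xBC</string>";

var_dump(normalize_packet_to_utf8($latin1Packet) === $utf8Packet);
var_dump(normalize_packet_to_utf8($utf8Packet) === $utf8Packet);
```

The check is a heuristic: an ISO-8859-1 byte sequence that happens to be
valid UTF-8 would be left untouched, so it cannot replace a clear statement
of which encoding the deserializer expects.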

This seems like a common migration path to me, so please specify clearly
what to expect and what to do.

Reproduce code:
---------------
<?php

// Assumption: this file is saved as ISO-8859-1, so the literal "abc-äöü"
// is ISO-8859-1 data and utf8_encode() turns it into UTF-8 data.

header("Content-type: text/html; charset=UTF-8");

echo "ISO-8859-1 specified, ISO-8859-1 data<br>";
echo "produces ISO-8859-1 output [php5]<br>";
echo "produces ISO-8859-1 output [php4]<br>";
echo wddx_deserialize("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><wddxPacket version='1.0'><header/><data><string>abc-äöü</string></data></wddxPacket>")."<hr>";

echo "UTF-8 specified, ISO-8859-1 data<br>";
echo "non-ascii characters get stripped [php5]<br>";
echo "produces ISO-8859-1 output [php4]<br>";
echo wddx_deserialize("<?xml version=\"1.0\" encoding=\"UTF-8\"?><wddxPacket version='1.0'><header/><data><string>abc-äöü</string></data></wddxPacket>")."<hr>";

echo "Nothing specified, ISO-8859-1 data<br>";
echo "non-ascii characters get stripped [php5]<br>";
echo "produces ISO-8859-1 output [php4]<br>";
echo wddx_deserialize("<wddxPacket version='1.0'><header/><data><string>abc-äöü</string></data></wddxPacket>")."<hr>";

echo "ISO-8859-1 specified, UTF-8 data<br>";
echo "produces UTF-8 output [php5]<br>";
echo "produces UTF-8 output [php4]<br>";
echo wddx_deserialize("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><wddxPacket version='1.0'><header/><data><string>".utf8_encode("abc-äöü")."</string></data></wddxPacket>")."<hr>";

echo "UTF-8 specified, UTF-8 data<br>";
echo "produces ISO-8859-1 output [php5]<br>";
echo "produces UTF-8 output [php4]<br>";
echo wddx_deserialize("<?xml version=\"1.0\" encoding=\"UTF-8\"?><wddxPacket version='1.0'><header/><data><string>".utf8_encode("abc-äöü")."</string></data></wddxPacket>")."<hr>";

echo "Nothing specified, UTF-8 data<br>";
echo "produces ISO-8859-1 output [php5]<br>";
echo "produces UTF-8 output [php4]<br>";
echo wddx_deserialize("<wddxPacket version='1.0'><header/><data><string>".utf8_encode("abc-äöü")."</string></data></wddxPacket>")."<hr>";

?>


-- 
Edit bug report at http://bugs.php.net/?id=35241&edit=1