ID: 37611
Comment by: wiktor at eworld dot hu
Reported By: jdolecek at NetBSD dot org
Status: Open
Bug Type: WDDX related
Operating System: Any
PHP Version: 5.1.5CVS
New Comment:
After upgrading PHP from 5.0 to 5.1, the Hungarian letters with accents
disappeared in our JavaScript framework (which uses the JS library from
www.wddx.org). We had to solve it quickly, so here is our patch to work
around this incompatibility.
$foo = wddx_serialize_value($bar);
$foo = preg_replace(
    "/(<char code='(..)'\/>)/e",
    "('\\2'<'80' ? '\\1' : chr(hexdec('\\2')))",
    $foo
);
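On newer PHP versions the /e modifier is no longer available (it was
deprecated in 5.5 and removed in 7.0), so the same substitution could be
sketched with preg_replace_callback(). This is only an illustration: the
helper name decode_high_char_codes() and the sample packet fragment are
assumptions, not part of the patch above, and the pattern is narrowed to
two hex digits instead of any two characters.

```php
<?php
// Hypothetical helper (not part of the original patch): turns
// <char code='XX'/> elements with XX >= 80 (hex) back into raw bytes
// and leaves elements with codes below 80 untouched.
function decode_high_char_codes($packet)
{
    return preg_replace_callback(
        "/<char code='([0-9A-Fa-f]{2})'\/>/",
        function ($m) {
            $byte = hexdec($m[1]);
            // $m[0] is the whole element, $m[1] the two hex digits.
            return $byte < 0x80 ? $m[0] : chr($byte);
        },
        $packet
    );
}

// Hand-written fragment for illustration only, not real serializer output.
$sample = "<string>a<char code='E9'/>b<char code='0A'/>c</string>";
echo bin2hex(decode_high_char_codes($sample)), "\n";
```

As in the one-liner, the result contains raw single bytes, so whatever
reads the packet still has to agree on the character set.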
Previous Comments:
------------------------------------------------------------------------
[2006-06-05 20:03:20] jdolecek at NetBSD dot org
chr(127) serializes and deserializes just fine on my system even without
your change; test script:
$str = wddx_deserialize(wddx_serialize_value(chr(127)));
echo ord($str[0])."\n";
wddx_deserialize() expects UTF-8 input and gives iso-8859-1 output.
There are ways around this, but this is the default way.
wddx_serialize_value() doesn't particularly care; it accepts both UTF-8
and iso-8859-1.
So the right way to use the API is to UTF-8-encode text before
serializing, so that we get proper output after deserializing.
I'd also point out that points 1) and 2) both still hold, and both are
very painful for non-English speakers. _Please_ back the change out.
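The "UTF-8-encode before serializing" step can be sketched as follows.
The helper latin1_to_utf8() is a hypothetical name used only for this
illustration; it does by hand what utf8_encode() does for iso-8859-1
input, so the sketch does not depend on any extension being loaded.

```php
<?php
// Hypothetical helper: converts an ISO-8859-1 string to UTF-8 byte by byte.
function latin1_to_utf8($s)
{
    $out = '';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $b = ord($s[$i]);
        if ($b < 0x80) {
            $out .= $s[$i];                    // ASCII: one byte, unchanged
        } else {
            $out .= chr(0xC0 | ($b >> 6))      // lead byte: 110000xx
                  . chr(0x80 | ($b & 0x3F));   // continuation byte: 10xxxxxx
        }
    }
    return $out;
}

$utf8 = latin1_to_utf8(chr(200));
echo bin2hex($utf8), "\n";   // prints c388
```

Where the wddx extension is available, $utf8 is the value that would be
passed to wddx_serialize_value().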
------------------------------------------------------------------------
[2006-05-31 22:22:04] [EMAIL PROTECTED]
Without the 127 check, chr(128), for example, gets translated
to 0, causing irreversible data loss.
As for chr(200), you don't need to UTF-8 encode it.
------------------------------------------------------------------------
[2006-05-30 15:59:24] jdolecek at NetBSD dot org
Yes, it is a bug.
1) It breaks current code using UTF-8 and expecting to get an iso-8859-1
result from wddx_deserialize(), i.e.
$str = chr(200);
$str_u8 = utf8_encode($str);
$result = wddx_deserialize(wddx_serialize_value($str_u8));
When run with PHP 5.1.4, or when the data was serialized with
the older version, $result == $str.
The new version gives $result == $str_u8.
So _all_ old serialized UTF-8 data (e.g. stored
in a database) deserializes to a different encoding
than newly serialized data. This is a major
backward incompatibility, and a problem for any
current application that serializes
UTF-8 input.
(Arguably, serializing UTF-8 strings wasn't really
very usable before due to Bug #37571, but you get
the idea.)
2) It explodes the size of the packet, and it's not clear
what the reason for the change was. This is a serious
problem when storing the resulting serialized data,
and totally unnecessary. XML is designed to be 8-bit
clean, so encoding high-bit characters this
way doesn't make sense.
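A rough sketch of the size effect, under a stated assumption: every byte
>= 127 is written as a <char code='XX'/> element (the form matched by the
workaround regex in this report) and all other bytes pass through
unchanged. The helper escaped_length() is hypothetical and ignores the
escaping of XML special characters that a real serializer also performs.

```php
<?php
// Hypothetical helper: length of a string after replacing each byte >= 127
// with a <char code='XX'/> element (17 bytes per element).
function escaped_length($s)
{
    $len = 0;
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $b = ord($s[$i]);
        $len += $b >= 127
            ? strlen(sprintf("<char code='%02X'/>", $b))
            : 1;
    }
    return $len;
}

// "Arviz" with accented A and i, written as UTF-8 bytes: 7 bytes in total.
$text = "\xC3\x81rv\xC3\xADz";
printf("raw: %d bytes, escaped: %d bytes\n", strlen($text), escaped_length($text));
// prints: raw: 7 bytes, escaped: 71 bytes
```

Under that assumption each accented letter stored as UTF-8 costs 34 bytes
instead of 2.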
Please explain why encoding characters >= 127 is right. Please revert
this part of the patch.
If you want to fix wddx so that the encoding on input is the same as
the encoding on output, that's fine, but it must be done in a
backward-compatible way, such as adding some extra parameters to either
wddx_serialize_value() or wddx_deserialize().
------------------------------------------------------------------------
[2006-05-28 15:13:29] [EMAIL PROTECTED]
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php
This is definitely not leftover debug code; it is needed on
some systems to ensure proper encoding of non-ASCII characters.
------------------------------------------------------------------------
[2006-05-27 09:58:51] jdolecek at NetBSD dot org
It seems the bug submission system turns non-ASCII characters into
entities; the Č should be the character with ordinal value 200 (i.e. the
result of chr(200)).
------------------------------------------------------------------------
The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
http://bugs.php.net/37611