ID:          41067
 Updated by:  [EMAIL PROTECTED]
 Reported By: jp at df5ea dot net
-Status:      Open
+Status:      Closed
 Bug Type:    *Unicode Issues
 PHP Version: 5CVS-2007-04-12 (CVS)
 New Comment:

This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.




Previous Comments:
------------------------------------------------------------------------

[2007-04-16 21:15:13] jp at df5ea dot net

http://anna.df5ea.net/~jp/JSON_parser.c.2.patch

This patch only adds code to utf16_to_utf8(). When it encounters a high
surrogate it looks in the string buffer for a low surrogate. If it has
found a pair then it replaces the pair with the proper UTF-8 sequence.

utf16_to_utf8() will still emit incorrect UTF-8 when you encode
surrogate characters outside of pairs. But UTF-16 containing such
unpaired surrogate code units is incorrect too.
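
For illustration, here is a minimal sketch of the buffer-lookup idea. It is
not the code from the patch: the names encode_code_point() and
utf16_to_utf8_lookahead() are hypothetical, and it assumes the UTF-16 code
units are available as an array. A high surrogate followed by a low surrogate
becomes one four-byte sequence; an unpaired surrogate is still passed through
as an (invalid) three-byte sequence, as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 (1 to 4 bytes). */
static size_t encode_code_point(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Hypothetical sketch: convert UTF-16 code units to UTF-8, looking one
 * unit ahead whenever a high surrogate is seen. The caller must provide
 * room for up to 3 bytes per input unit. Returns bytes written. */
size_t utf16_to_utf8_lookahead(const uint16_t *units, size_t count,
                               unsigned char *out)
{
    size_t written = 0;
    assert(units != NULL && out != NULL);
    for (size_t i = 0; i < count; i++) {
        uint32_t cp = units[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < count &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            /* Valid pair: combine both units into one code point. */
            uint32_t low = units[i + 1];
            cp = 0x10000 + (((cp - 0xD800) << 10) | (low - 0xDC00));
            i++; /* the low surrogate has been consumed */
        }
        written += encode_code_point(cp, out + written);
    }
    return written;
}
```

Called on the units 0xD834 0xDD00 from the report, this sketch writes the
four bytes f0 9d 84 80.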

------------------------------------------------------------------------

[2007-04-15 14:39:30] [EMAIL PROTECTED]

Can you please provide an optimized version of the patch?

------------------------------------------------------------------------

[2007-04-12 20:07:52] jp at df5ea dot net

http://anna.df5ea.net/~jp/JSON_parser.c.patch

An extra parameter is added to utf16_to_utf8(): prev_utf16. This
parameter is used to store the previously decoded UTF-16 code unit. When
the function encounters a high surrogate this value is used to look for
a low surrogate. From this pair it builds the correct UTF-8 sequence.

When it encounters a surrogate code point that is not part of a pair, it
is ignored. The prev_utf16 variable in JSON_parser() is reset between
different strings.
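
As a rough sketch of this stateful approach (not the patch itself), the
function below is fed one code unit at a time. The name utf16_unit_to_utf8()
and its signature are hypothetical, and as a simplifying assumption
prev_utf16 holds only a pending high surrogate (0 when nothing is pending)
rather than every previous unit. The caller would reset *prev_utf16 to 0 at
the start of each string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: feed one UTF-16 code unit.
 * *prev_utf16 is 0 or a pending high surrogate from the previous call.
 * Returns the number of UTF-8 bytes written to out (0 to 4). */
size_t utf16_unit_to_utf8(uint16_t unit, uint16_t *prev_utf16,
                          unsigned char *out)
{
    uint16_t prev;
    assert(prev_utf16 != NULL && out != NULL);
    prev = *prev_utf16;
    *prev_utf16 = 0;

    if (unit >= 0xD800 && unit <= 0xDBFF) {
        /* High surrogate: remember it and emit nothing yet. A pending
         * high surrogate that was never completed is dropped here. */
        *prev_utf16 = unit;
        return 0;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
        uint32_t cp;
        if (prev == 0) {
            return 0; /* low surrogate without a partner: ignored */
        }
        cp = 0x10000 + (((uint32_t)(prev - 0xD800) << 10) |
                        (uint32_t)(unit - 0xDC00));
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    /* Ordinary BMP code unit; an uncompleted high surrogate is dropped. */
    if (unit < 0x80) {
        out[0] = (unsigned char)unit;
        return 1;
    }
    if (unit < 0x800) {
        out[0] = (unsigned char)(0xC0 | (unit >> 6));
        out[1] = (unsigned char)(0x80 | (unit & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (unit >> 12));
    out[1] = (unsigned char)(0x80 | ((unit >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (unit & 0x3F));
    return 3;
}
```

The cost of this design is one extra word of state carried through the
parser, which is what the speed concern below refers to.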

If there is a speed concern regarding the parser it is also possible to
drop the prev_utf16 part. The decoder function could then search the
decoding buffer for the low surrogate. If needed I can submit a patch
that makes the function operate this way.

------------------------------------------------------------------------

[2007-04-12 19:41:17] [EMAIL PROTECTED]

Can you post a link to the patch?

------------------------------------------------------------------------

[2007-04-12 18:12:28] jp at df5ea dot net

Description:
------------
When decoding a string with surrogate pairs in it, JSON_decode()
produces incorrect UTF-8. Instead of encoding the two surrogate
characters as one UTF-8 sequence it encodes them as two sequences which
represent the two surrogate code points.

The decoded string is actually CESU-8. The JSON_encode() function
cannot encode such a string.
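
To make the CESU-8 result concrete, here is a small sketch of what happens
when each UTF-16 code unit is encoded on its own, with no surrogate handling.
The function names are hypothetical and this is not the code in the parser;
it only shows the arithmetic that turns \ud834\udd00 into two three-byte
sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: encode one 16-bit unit as UTF-8 without checking for
 * surrogates. Returns the number of bytes written (1 to 3). */
size_t encode_unit_naive(uint16_t unit, unsigned char *out)
{
    assert(out != NULL);
    if (unit < 0x80) {
        out[0] = (unsigned char)unit;
        return 1;
    }
    if (unit < 0x800) {
        out[0] = (unsigned char)(0xC0 | (unit >> 6));
        out[1] = (unsigned char)(0x80 | (unit & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (unit >> 12));
    out[1] = (unsigned char)(0x80 | ((unit >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (unit & 0x3F));
    return 3;
}

/* Hypothetical: encode a run of units one by one. Each half of a
 * surrogate pair gets its own three-byte sequence (CESU-8 style). */
size_t encode_units_naive(const uint16_t *units, size_t count,
                          unsigned char *out)
{
    size_t written = 0;
    for (size_t i = 0; i < count; i++) {
        written += encode_unit_naive(units[i], out + written);
    }
    return written;
}
```

For the units 0xD834 and 0xDD00 this writes ed a0 b4 ed b4 80, six bytes
instead of the four-byte sequence f0 9d 84 80 for U+1D100.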

I have a patch to JSON_parser.c that transcodes the UTF-16 properly to
UTF-8.

Reproduce code:
---------------
<?php
// U+1D100 MUSICAL SYMBOL SINGLE BARLINE as UTF-8 (bytes f0 9d 84 80)
$single_barline = "\360\235\204\200";
$array = array($single_barline);
print bin2hex($single_barline) . "\n";
// print $single_barline . "\n\n";
$json = json_encode($array);
print $json . "\n\n";
$json_decoded = json_decode($json, true);
// print $json_decoded[0] . "\n";
print bin2hex($json_decoded[0]) . "\n";
print "END\n";
?>


Expected result:
----------------
The output from the two bin2hex() calls should be the same:

f09d8480

["\ud834\udd00"]

f09d8480
END


Actual result:
--------------
The second string is different from the input string and is illegal
UTF-8.

f09d8480

["\ud834\udd00"]

eda0b4edb480
END



------------------------------------------------------------------------

