ID:               41067
Updated by:       [EMAIL PROTECTED]
Reported By:      jp at df5ea dot net
-Status:          Open
+Status:          Closed
Bug Type:         *Unicode Issues
PHP Version:      5CVS-2007-04-12 (CVS)

New Comment:

This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.

Thank you for the report, and for helping us make PHP better.


Previous Comments:
------------------------------------------------------------------------

[2007-04-16 21:15:13] jp at df5ea dot net

http://anna.df5ea.net/~jp/JSON_parser.c.2.patch

This patch only adds code to utf16_to_utf8(). When it encounters a high
surrogate, it looks in the string buffer for a low surrogate. If it
finds a pair, it replaces the pair with the proper UTF-8 sequence.

utf16_to_utf8() will still emit incorrect UTF-8 when you encode
surrogate characters outside of pairs. But UTF-16 containing such
unpaired surrogate code units is incorrect too.

------------------------------------------------------------------------

[2007-04-15 14:39:30] [EMAIL PROTECTED]

Can you please provide an optimized version of the patch?

------------------------------------------------------------------------

[2007-04-12 20:07:52] jp at df5ea dot net

http://anna.df5ea.net/~jp/JSON_parser.c.patch

An extra parameter is added to utf16_to_utf8(): prev_utf16. This
parameter is used to store the previously decoded UTF-16 code unit.
When the function encounters a high surrogate, this value is used to
look for a low surrogate. From this pair it builds the correct UTF-8
sequence. When it encounters a surrogate code point that is not part of
a pair, it is ignored. The prev_utf16 variable in JSON_parser() is reset
between different strings.

If there is a speed concern regarding the parser, it is also possible to
drop the prev_utf16 part. The decoder function could then look in the
decoding buffer for the low surrogate. If needed, I can submit a patch
to get the function operating in this way.

------------------------------------------------------------------------

[2007-04-12 19:41:17] [EMAIL PROTECTED]

Can you post a link to the patch?

------------------------------------------------------------------------

[2007-04-12 18:12:28] jp at df5ea dot net

Description:
------------
When decoding a string with surrogate pairs in it, json_decode()
produces incorrect UTF-8. Instead of encoding the two surrogate code
units as one UTF-8 sequence, it encodes them as two sequences which
represent the two surrogate code points. The decoded string is actually
CESU-8. The json_encode() function cannot encode such a string.

I have a patch to JSON_parser.c that transcodes the UTF-16 properly to
UTF-8.

Reproduce code:
---------------
<?php
$single_barline = "\360\235\204\200";

$array = array($single_barline);

print bin2hex($single_barline) . "\n";
// print $single_barline . "\n\n";

$json = json_encode($array);
print $json . "\n\n";

$json_decoded = json_decode($json, true);
// print $json_decoded[0] . "\n";
print bin2hex($json_decoded[0]) . "\n";

print "END\n";
?>

Expected result:
----------------
The output from the two bin2hex() calls should be the same:

f09d8480
["\ud834\udd00"]

f09d8480
END

Actual result:
--------------
The second string is different from the input string and is illegal
UTF-8.

f09d8480
["\ud834\udd00"]

eda0b4edb480
END

------------------------------------------------------------------------
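
For illustration, here is a minimal C sketch of the transcoding step the report describes: combining a high and a low surrogate into one code point before emitting UTF-8. It is not the code from either patch. The names encode_utf8() and utf16_units_to_utf8() are hypothetical, and skipping unpaired surrogates is an assumption borrowed from the behaviour described for the first patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write code point cp as UTF-8 into out (room for
 * at least 4 bytes) and return the number of bytes written. Returns 0
 * for values above U+10FFFF. It does NOT reject surrogates, so calling
 * it on each half of a pair reproduces the CESU-8 bytes from the report. */
size_t encode_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical decoder: convert n UTF-16 code units to UTF-8.
 * out must have room for 3 * n bytes. A high surrogate followed by a
 * low surrogate is combined into one supplementary code point.
 * Unpaired surrogates are skipped (an assumption, see the lead-in). */
size_t utf16_units_to_utf8(const uint16_t *in, size_t n, unsigned char *out)
{
    size_t len = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: look ahead for the matching low surrogate. */
            if (i + 1 < n && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                i++; /* consume the low surrogate as well */
            } else {
                continue; /* unpaired high surrogate: skip */
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            continue; /* unpaired low surrogate: skip */
        }

        len += encode_utf8(cp, out + len);
    }
    return len;
}
```

Feeding the pair 0xD834, 0xDD00 from the reproduce code through utf16_units_to_utf8() gives the four bytes f0 9d 84 80, whereas calling encode_utf8() on each surrogate separately gives ed a0 b4 ed b4 80, the CESU-8 form shown under "Actual result".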