Edit report at https://bugs.php.net/bug.php?id=62010&edit=1

 ID:                 62010
 Comment by:         masakielastic at gmail dot com
 Reported by:        tklingenberg at lastflood dot net
 Summary:            json_decode produces invalid byte-sequences
 Status:             Open
 Type:               Bug
 Package:            JSON related
 Operating System:   Windows
 PHP Version:        5.3.13
 Block user comment: N
 Private report:     N

 New Comment:

Here is RFC 3629's description of the UTF-8 definition.

  The definition of UTF-8 prohibits encoding character numbers between
  U+D800 and U+DFFF, which are reserved for use with the UTF-16
  encoding form (as surrogate pairs) and do not directly represent
  characters.

http://tools.ietf.org/html/rfc3629

The following patch solves part of the problem. Isolated low surrogates
(U+DC00 - U+DFFF) are replaced with U+FFFD. An improvement for isolated high
surrogates (U+D800 - U+DBFF) is still needed.

https://gist.github.com/masakielastic/5985383

var_dump(
    "\xef\xbf\xbd" === json_decode('"\udc00"'),
    "\xef\xbf\xbd"."\xed\xa0\x80" === json_decode('"\ud800\ud800"'),
    "\xed\xa0\x80" === json_decode('"\ud800"')
);

Consistency with the following option (under discussion) is needed too.

json_encode's option for replacing ill-formed byte sequences with substitute characters
https://bugs.php.net/bug.php?id=65082


Previous Comments:
------------------------------------------------------------------------
[2013-01-11 09:44:55] votefordevnull at gmail dot com

Successfully reproduced on Linux

------------------------------------------------------------------------
[2012-05-11 22:46:34] tklingenberg at lastflood dot net

Looks like #41067 https://bugs.php.net/bug.php?id=41067 was not fully fixed.

------------------------------------------------------------------------
[2012-05-11 22:12:42] tklingenberg at lastflood dot net

Description:
------------
It's a typical case the JSON *and* UTF-16 specifications warn about: decoding of non-existing UTF-16 code-points:

json_decode('"\ud834"')

should give NULL because \ud834 is *invalid*: it is a high surrogate with no low surrogate following it.

But instead it starts some party, gets boozed and offers this as a UTF-8 byte-sequence:

1110 1101 1010 0000 1011 0100
1110 xxxx 10xx xxxx 10xx xxxx
1101 1000 0011 0100
D8 34

U+D834 is not a valid Unicode character.

Test script:
---------------
if (NULL !== json_decode('"\ud834"')) {
    echo "json_decode is still broken.";
}

Expected result:
----------------
NULL because the JSON is invalid.

Actual result:
--------------
PHP tries to create UTF-8 out of it and fails by creating invalid UTF-8 byte-sequences.

------------------------------------------------------------------------
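
The behaviour described in this report, and the result the reporter expects, can be sketched in userland PHP. This is only an illustration under stated assumptions, not the patch from the gist and not how ext/json is implemented. All three function names (naive_utf8_3byte, json_has_lone_surrogate, json_decode_strict) are hypothetical helpers invented for this example, and the code assumes PHP 7 or later for the type declarations and the ?? operator.

```php
<?php
// Hypothetical helper: applies the generic three-byte UTF-8 layout
// (1110xxxx 10xxxxxx 10xxxxxx) to a code point WITHOUT rejecting the
// surrogate range, which reproduces the byte sequence from the report.
function naive_utf8_3byte(int $cp): string
{
    return chr(0xE0 | ($cp >> 12))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: true if the JSON text contains a \uXXXX escape that
// is an unpaired UTF-16 surrogate.
function json_has_lone_surrogate(string $json): bool
{
    // Match every backslash escape in order, so that "\\ud834" (an escaped
    // backslash followed by plain text) is not mistaken for a \u escape.
    preg_match_all(
        '/\\\\(?:u([0-9a-fA-F]{4})|.)/s',
        $json,
        $matches,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE
    );

    $expectLowAt = null; // byte offset where the paired low surrogate must start
    foreach ($matches as $m) {
        $offset = $m[0][1];
        $hex    = $m[1][0] ?? '';              // empty for non-\u escapes
        $code   = $hex === '' ? -1 : hexdec($hex);
        $isHigh = $code >= 0xD800 && $code <= 0xDBFF;
        $isLow  = $code >= 0xDC00 && $code <= 0xDFFF;

        if ($expectLowAt !== null) {
            // A high surrogate is pending: the next escape must be a low
            // surrogate that starts immediately after it.
            if (!$isLow || $offset !== $expectLowAt) {
                return true;
            }
            $expectLowAt = null;
            continue;
        }
        if ($isLow) {
            return true;                        // low surrogate with no high before it
        }
        if ($isHigh) {
            $expectLowAt = $offset + 6;         // strlen('\uXXXX') === 6
        }
    }
    return $expectLowAt !== null;               // trailing high surrogate
}

// Hypothetical wrapper: returns NULL for unpaired surrogates, as the
// "Expected result" section asks, and defers to json_decode() otherwise.
function json_decode_strict(string $json)
{
    return json_has_lone_surrogate($json) ? null : json_decode($json);
}

echo bin2hex(naive_utf8_3byte(0xD834)), "\n";   // eda0b4
var_dump(json_decode_strict('"\ud834"'));       // NULL
```

The wrapper rejects the whole document instead of substituting U+FFFD. That matches the reporter's expectation, whereas the patch discussed in the comment takes the replacement approach for isolated low surrogates.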