Edit report at https://bugs.php.net/bug.php?id=62010&edit=1

 ID:                 62010
 Comment by:         masakielastic at gmail dot com
 Reported by:        tklingenberg at lastflood dot net
 Summary:            json_decode produces invalid byte-sequences
 Status:             Open
 Type:               Bug
 Package:            JSON related
 Operating System:   Windows
 PHP Version:        5.3.13
 Block user comment: N
 Private report:     N

 New Comment:

Here is how RFC 3629 describes surrogates in its definition of UTF-8.

The definition of UTF-8 prohibits encoding character numbers
between U+D800 and U+DFFF, which are reserved for use with the 
UTF-16 encoding form (as surrogate pairs) and do not directly
represent characters.

http://tools.ietf.org/html/rfc3629
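
As a rough illustration of the range the RFC excludes, the following sketch
checks whether a code point is a surrogate. The function name
is_surrogate_code_point is made up for this example and is not part of PHP.

```php
<?php
// Hypothetical helper (not a PHP built-in): true if $cp lies in the
// surrogate range U+D800..U+DFFF that RFC 3629 forbids in UTF-8.
function is_surrogate_code_point($cp)
{
    return $cp >= 0xD800 && $cp <= 0xDFFF;
}

var_dump(
  is_surrogate_code_point(0xD834), // high surrogate
  is_surrogate_code_point(0xDC00), // low surrogate
  is_surrogate_code_point(0xFFFD)  // replacement character, allowed
);
```

A decoder could apply a check like this to each \uXXXX escape that is not
part of a valid high/low pair before emitting any bytes.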

The following patch solves part of the problem.
Isolated low surrogates (U+DC00 - U+DFFF) are replaced with U+FFFD.
Handling of isolated high surrogates (U+D800 - U+DBFF) still needs improvement.

https://gist.github.com/masakielastic/5985383

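// Behavior with the patch applied:
var_dump(
  // isolated low surrogate: replaced with U+FFFD
  "\xef\xbf\xbd" === json_decode('"\udc00"'),
  // two high surrogates: U+FFFD followed by the raw bytes ED A0 80
  "\xef\xbf\xbd"."\xed\xa0\x80" === json_decode('"\ud800\ud800"'),
  // isolated high surrogate: still emitted as the raw bytes ED A0 80
  "\xed\xa0\x80" === json_decode('"\ud800"')
);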
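
Until the decoder itself handles both cases, a userland workaround could
scrub the decoded string. The sketch below is not the patch from the gist;
scrub_encoded_surrogates is a hypothetical helper that replaces every
3-byte sequence encoding U+D800 - U+DFFF with U+FFFD.

```php
<?php
// Hypothetical userland workaround, not the patch from the gist.
// Replaces each 3-byte sequence ED A0..BF 80..BF (an encoded surrogate)
// with EF BF BD (U+FFFD). The pattern has no /u modifier, so it is
// matched byte by byte.
function scrub_encoded_surrogates($s)
{
    return preg_replace('/\xED[\xA0-\xBF][\x80-\xBF]/', "\xEF\xBF\xBD", $s);
}

var_dump(
  "\xef\xbf\xbd" === scrub_encoded_surrogates("\xed\xa0\x80"),     // U+D800
  "a\xef\xbf\xbdb" === scrub_encoded_surrogates("a\xed\xb0\x80b"), // U+DC00
  "\xed\x9f\xbf" === scrub_encoded_surrogates("\xed\x9f\xbf")      // U+D7FF, untouched
);
```

Since 0xED can only appear as a lead byte in UTF-8, the pattern should not
match across character boundaries in otherwise well-formed input.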

Consistency with the following option
(still under discussion) is needed too.

json_encode's option for replacing ill-formed byte sequences
with substitute characters
https://bugs.php.net/bug.php?id=65082


Previous Comments:
------------------------------------------------------------------------
[2013-01-11 09:44:55] votefordevnull at gmail dot com

Successfully reproduced on Linux

------------------------------------------------------------------------
[2012-05-11 22:46:34] tklingenberg at lastflood dot net

It looks like #41067 (https://bugs.php.net/bug.php?id=41067) was not fully
fixed.

------------------------------------------------------------------------
[2012-05-11 22:12:42] tklingenberg at lastflood dot net

Description:
------------
It's a typical case the JSON *and* UTF-16 specifications warn about: decoding
of an unpaired UTF-16 surrogate:

    json_decode('"\ud834"')

should give NULL because \ud834 is *invalid*. But instead it starts some party,
gets boozed and offers this as a UTF-8 byte sequence:

    1110 1101  1010 0000  1011 0100
    1110 xxxx  10xx xxxx  10xx xxxx
               1101 1000  0011 0100
               D8         34

U+D834 is a surrogate code point, not a valid Unicode character.
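
The bit layout above can be reproduced with a short sketch. The helper
naive_utf8_3byte is hypothetical; it applies the 3-byte pattern without
rejecting surrogates, which is the behavior this report complains about.

```php
<?php
// Hypothetical helper: apply the 3-byte UTF-8 pattern
// 1110xxxx 10xxxxxx 10xxxxxx to a code point in U+0800..U+FFFF
// without checking for the surrogate range.
function naive_utf8_3byte($cp)
{
    return chr(0xE0 | ($cp >> 12))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(naive_utf8_3byte(0xD834)), "\n"; // eda0b4
```

The bytes ED A0 B4 match the first row of the diagram.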



Test script:
---------------
if (NULL !== json_decode('"\ud834"')) {
    echo "json_decode is still broken.";
}

Expected result:
----------------
NULL because the json is invalid.

Actual result:
--------------
PHP tries to create UTF-8 out of it and fails, producing an ill-formed UTF-8
byte sequence.
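
One way to confirm that a decoded string is ill-formed is to run it through
PCRE in UTF-8 mode. This is only a sketch; is_well_formed_utf8 is a
hypothetical helper name, and it relies on preg_match() failing on
subjects that are not valid UTF-8 when the /u modifier is set.

```php
<?php
// Hypothetical helper: report whether $s is well-formed UTF-8.
// With the /u modifier, preg_match() does not return 1 when the subject
// is not valid UTF-8, and PCRE treats encoded surrogates as invalid.
function is_well_formed_utf8($s)
{
    return preg_match('//u', $s) === 1;
}

var_dump(
  is_well_formed_utf8("\xed\xa0\xb4"), // encoded U+D834
  is_well_formed_utf8("\xef\xbf\xbd")  // U+FFFD
);
```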


------------------------------------------------------------------------


