ID:              41067
 User updated by: jp at df5ea dot net
-Summary:         Patch that only adds one extra if in utf16_to_utf8()
 Reported By:     jp at df5ea dot net
-Status:          Closed
+Status:          Open
 Bug Type:        *Unicode Issues
-PHP Version:     5CVS-2007-04-12 (CVS)
+PHP Version:     6CVS-2007-04-17 (snap)
 New Comment:

I did not check this beforehand, but judging from the sources of the
latest CVS snapshot, this bug also applies to PHP 6.x.

It should probably be fixed there too.


Previous Comments:
------------------------------------------------------------------------

[2007-04-16 22:31:13] [EMAIL PROTECTED]

This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.



------------------------------------------------------------------------

[2007-04-16 21:15:13] jp at df5ea dot net

http://anna.df5ea.net/~jp/JSON_parser.c.2.patch

This patch only adds code to utf16_to_utf8(). When it encounters a high
surrogate it looks in the string buffer for a low surrogate. If it has
found a pair then it replaces the pair with the proper UTF-8 sequence.

utf16_to_utf8() will still emit incorrect UTF-8 when you encode
surrogate characters outside of pairs. But UTF-16 containing such
unpaired surrogate code units is invalid as well.
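
To illustrate the buffer look-back idea, here is a minimal sketch; it is
not the patch itself. The names out_buf, put() and append_utf16() are
hypothetical stand-ins for the parser's string buffer and
utf16_to_utf8(). Since the high surrogate precedes the low one in
UTF-16, the sketch assumes the high surrogate has already been written
as a three-byte sequence (0xED 0xA0..0xAF 0x80..0xBF) and merges the
pair when the low surrogate arrives.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-size buffer standing in for the parser's string buffer. */
typedef struct {
    unsigned char data[64];
    size_t len;
} out_buf;

static void put(out_buf *b, unsigned long byte)
{
    assert(b->len < sizeof b->data);   /* sketch only: no growth logic */
    b->data[b->len++] = (unsigned char)byte;
}

/* Append one UTF-16 code unit as UTF-8. If the unit is a low surrogate and
 * the last three bytes in the buffer encode a high surrogate, replace those
 * bytes with the four-byte sequence for the combined code point. */
static void append_utf16(out_buf *b, unsigned int u)
{
    if (u < 0x80) {
        put(b, u);
    } else if (u < 0x800) {
        put(b, 0xC0 | (u >> 6));
        put(b, 0x80 | (u & 0x3F));
    } else if ((u & 0xFC00) == 0xDC00 && b->len >= 3
               && b->data[b->len - 3] == 0xED
               && (b->data[b->len - 2] & 0xF0) == 0xA0
               && (b->data[b->len - 1] & 0xC0) == 0x80) {
        /* Recover the low 10 bits of the high surrogate from the buffer. */
        unsigned long hi = ((unsigned long)(b->data[b->len - 2] & 0x0F) << 6)
                         | (unsigned long)(b->data[b->len - 1] & 0x3F);
        unsigned long cp = 0x10000UL + (hi << 10) + (u & 0x3FF);

        b->len -= 3;                   /* drop the lone high surrogate */
        put(b, 0xF0 | (cp >> 18));
        put(b, 0x80 | ((cp >> 12) & 0x3F));
        put(b, 0x80 | ((cp >> 6) & 0x3F));
        put(b, 0x80 | (cp & 0x3F));
    } else {
        /* Everything else, including unpaired surrogates, gets three bytes. */
        put(b, 0xE0 | (u >> 12));
        put(b, 0x80 | ((u >> 6) & 0x3F));
        put(b, 0x80 | (u & 0x3F));
    }
}
```

Feeding the code units 0xD83D and 0xDE00 (U+1F600) into an empty buffer
should leave the four bytes F0 9F 98 80. As noted above, an unpaired
surrogate still ends up as an invalid three-byte sequence.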

------------------------------------------------------------------------

[2007-04-15 14:39:30] [EMAIL PROTECTED]

Can you please provide an optimized version of the patch?

------------------------------------------------------------------------

[2007-04-12 20:07:52] jp at df5ea dot net

http://anna.df5ea.net/~jp/JSON_parser.c.patch

An extra parameter, prev_utf16, is added to utf16_to_utf8(). It stores
the previously decoded UTF-16 code unit. When the function encounters a
high surrogate, this value is used to look for a low surrogate. From
this pair it builds the correct UTF-8 sequence.

A surrogate code point that is not part of a pair is ignored. The
prev_utf16 variable in JSON_parser() is reset between different
strings.

If parser speed is a concern, the prev_utf16 part can be dropped. The
decoder function could then search the decoding buffer for the low
surrogate. If needed I can submit a patch that makes the function
operate this way.
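
For comparison, the prev_utf16 variant described in this comment could
look roughly like the sketch below. It is an illustration under
assumptions, not the submitted patch: byte_buf, push(), encode_cp() and
append_utf16_prev() are hypothetical names, and the sketch assumes the
high surrogate is held back in *prev until the following code unit is
seen. Unpaired surrogates are dropped, as the comment describes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-size buffer standing in for the parser's string buffer. */
typedef struct {
    unsigned char data[64];
    size_t len;
} byte_buf;

static void push(byte_buf *b, unsigned long byte)
{
    assert(b->len < sizeof b->data);   /* sketch only: no growth logic */
    b->data[b->len++] = (unsigned char)byte;
}

/* Encode one code point (not a surrogate) as UTF-8. */
static void encode_cp(byte_buf *b, unsigned long cp)
{
    if (cp < 0x80) {
        push(b, cp);
    } else if (cp < 0x800) {
        push(b, 0xC0 | (cp >> 6));
        push(b, 0x80 | (cp & 0x3F));
    } else if (cp < 0x10000UL) {
        push(b, 0xE0 | (cp >> 12));
        push(b, 0x80 | ((cp >> 6) & 0x3F));
        push(b, 0x80 | (cp & 0x3F));
    } else {
        push(b, 0xF0 | (cp >> 18));
        push(b, 0x80 | ((cp >> 12) & 0x3F));
        push(b, 0x80 | ((cp >> 6) & 0x3F));
        push(b, 0x80 | (cp & 0x3F));
    }
}

/* *prev holds a pending high surrogate, or 0 if there is none.
 * The caller resets it to 0 between strings. */
static void append_utf16_prev(byte_buf *b, unsigned int u, unsigned int *prev)
{
    unsigned int hi = *prev;
    *prev = 0;

    if ((u & 0xFC00) == 0xD800) {
        *prev = u;                     /* high surrogate: wait for its partner */
    } else if ((u & 0xFC00) == 0xDC00) {
        if (hi != 0)                   /* complete pair: emit one code point */
            encode_cp(b, 0x10000UL + ((unsigned long)(hi & 0x3FF) << 10)
                                   + (u & 0x3FF));
        /* otherwise: unpaired low surrogate, ignored */
    } else {
        encode_cp(b, u);               /* a pending unpaired high is dropped */
    }
}
```

The trade-off against the buffer look-back approach is the extra state
that has to be threaded through the parser, in exchange for never
writing bytes that later need to be replaced.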

------------------------------------------------------------------------

[2007-04-12 19:41:17] [EMAIL PROTECTED]

Can you post a link to the patch?

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/41067

