Edit report at https://bugs.php.net/bug.php?id=65323&edit=1
ID:                 65323
Updated by:         yohg...@php.net
Reported by:        masakielastic at gmail dot com
Summary:            improvement for counting ill-formed byte sequences
Status:             Open
Type:               Feature/Change Request
Package:            Strings related
PHP Version:        5.5.1
Block user comment: N
Private report:     N

New Comment:

Thank you for the report. This seems good.

We are also discussing mb_scrub() as an mb_convert_encoding() alias,
i.e. calling the converter internally like mb_convert_encoding().

On the master branch, a fix for the mbfl converter has been committed.
We would appreciate it if you could check the current implementation.

Previous Comments:
------------------------------------------------------------------------
[2013-07-24 11:20:38] masakielastic at gmail dot com

"Table 3-8. Use of U+FFFD in UTF-8 Conversion" of The Unicode Standard
http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf

------------------------------------------------------------------------
[2013-07-24 10:59:34] masakielastic at gmail dot com

Description:
------------
Consider the number of substitute characters (U+FFFD) produced when the
valid range of the second byte of a UTF-8 sequence is narrow
(such as 0xA0 - 0xBF).

// Code Points           First Byte  Second Byte  Third Byte  Fourth Byte
// U+0800   - U+0FFF     E0          A0 - BF      80 - BF
// U+D000   - U+D7FF     ED          80 - 9F      80 - BF
// U+10000  - U+3FFFF    F0          90 - BF      80 - BF     80 - BF
// U+100000 - U+10FFFF   F4          80 - 8F      80 - BF     80 - BF

If you follow the recommended policy described in "Table 3-8. Use of
U+FFFD in UTF-8 Conversion" of The Unicode Standard, "\xE0\x80" should
be converted to "\xEF\xBF\xBD"."\xEF\xBF\xBD". The actual result is
"\xEF\xBF\xBD".

One solution is to introduce a macro that checks the second byte
against the range allowed by the first byte.

https://github.com/masakielastic/patches/blob/master/php_htmlspecialchars/html.patch
https://github.com/masakielastic/patches/blob/master/php_htmlspecialchars/test.php

Test script:
---------------
// https://bugs.php.net/bug.php?id=65081
function str_scrub($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}

$ufffd_x2 = "\xEF\xBF\xBD"."\xEF\xBF\xBD";
$ufffd_x3 = $ufffd_x2."\xEF\xBF\xBD";

var_dump(
    $ufffd_x2 === str_scrub("\xE0\x80"),
    $ufffd_x3 === str_scrub("\xE0\x80\x80")
);

Expected result:
----------------
bool(true)
bool(true)

Actual result:
--------------
bool(false)
bool(false)

------------------------------------------------------------------------
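
For illustration only, the second-byte check proposed in the description could be sketched in C roughly as follows. This is not the linked patch and not PHP's actual html.c code; the names second_byte_range() and count_ufffd() are hypothetical, and the sketch assumes the "one U+FFFD per maximal subpart" practice that Table 3-8 illustrates.

```c
#include <stddef.h>

/*
 * Hypothetical helper: for a lead byte, report the allowed range of the
 * second byte and return the total sequence length (2, 3 or 4).
 * Returns 0 if the byte cannot start a multi-byte sequence.
 */
static int second_byte_range(unsigned char lead,
                             unsigned char *lo, unsigned char *hi)
{
    *lo = 0x80;
    *hi = 0xBF;
    if (lead >= 0xC2 && lead <= 0xDF) { return 2; }
    if (lead == 0xE0) { *lo = 0xA0; return 3; }   /* excludes overlong forms */
    if (lead == 0xED) { *hi = 0x9F; return 3; }   /* excludes surrogates */
    if (lead >= 0xE1 && lead <= 0xEF) { return 3; }
    if (lead == 0xF0) { *lo = 0x90; return 4; }   /* excludes overlong forms */
    if (lead >= 0xF1 && lead <= 0xF3) { return 4; }
    if (lead == 0xF4) { *hi = 0x8F; return 4; }   /* excludes > U+10FFFF */
    return 0;                                     /* 80..C1 and F5..FF */
}

/*
 * Hypothetical function: count how many U+FFFD a converter would emit
 * for s[0..len), replacing each maximal ill-formed subpart with one U+FFFD.
 */
size_t count_ufffd(const unsigned char *s, size_t len)
{
    size_t i = 0, count = 0;

    while (i < len) {
        unsigned char lo, hi;
        size_t j;
        int need, have;

        if (s[i] < 0x80) {          /* ASCII is always well-formed */
            i++;
            continue;
        }
        need = second_byte_range(s[i], &lo, &hi);
        if (need == 0) {            /* stray trail byte or invalid lead */
            count++;
            i++;
            continue;
        }
        /* Accept trail bytes; only the second byte has a narrowed range. */
        j = i + 1;
        have = 1;
        while (have < need && j < len && s[j] >= lo && s[j] <= hi) {
            lo = 0x80;
            hi = 0xBF;
            j++;
            have++;
        }
        if (have < need) {          /* truncated or ill-formed sequence */
            count++;
        }
        i = j;                      /* resume at the first rejected byte */
    }
    return count;
}
```

With this sketch, count_ufffd() returns 2 for "\xE0\x80" and 3 for "\xE0\x80\x80", which matches the expected result in the test script: 0x80 is outside the A0 - BF range allowed after E0, so E0 is replaced on its own and each following 0x80 is replaced separately.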