 ID:               41147
 Comment by:       mike at silverorange dot com
 Reported By:      teracci2002 at yahoo dot co dot jp
 Status:           Assigned
 Bug Type:         mbstring related
 Operating System: Linux
 PHP Version:      5.2.1
 Assigned To:      hirokawa
 New Comment:

0x00, 0xe3 is a valid byte sequence in UTF-8, but by itself it is not a valid UTF-8 string (it's missing two bytes). The function is documented as checking the validity of a string, so it should return false in this case. If the function is only supposed to validate byte streams, then the documentation should be fixed.


Previous Comments:
------------------------------------------------------------------------

[2007-09-16 08:56:57] [EMAIL PROTECTED]

Sorry for the delayed response.

0x00,0x81 is also a valid byte sequence in Shift_JIS, because 0x81 is a valid first byte of a double-byte JIS X 0208 character.
See: http://en.wikipedia.org/wiki/Shift_jis

We cannot decide whether the byte stream is valid or invalid, because the last byte of the stream (0x81) is a valid first byte of a double-byte character. In this case, true (valid) is returned.

A byte stream containing a valid first byte followed by an invalid second byte returns false. For example,

    var_dump(mb_check_encoding("\x81\x00", "Shift_JIS"));

returns false (invalid), because 0x81 is a valid first byte of a double-byte JIS X 0208 character, but 0x00 is an invalid second byte.

Likewise, 0x00, 0xe3 in UTF-8 is also a valid byte sequence (a null byte followed by the first byte of a three-byte UTF-8 character).
See: http://en.wikipedia.org/wiki/UTF-8

------------------------------------------------------------------------

[2007-09-04 22:38:26] [EMAIL PROTECTED]

Did you read it, Rui? (Why do your reports end up as 'Analyzed' all the time? :)

------------------------------------------------------------------------

[2007-09-04 14:55:58] teracci2002 at yahoo dot co dot jp

> 0x00+0xa1 is valid byte sequence in Shift_JIS sequence.

I know that. But 0x00+0x81 is an invalid sequence in Shift_JIS. So why does the statement below return "bool(true)"?

    var_dump(mb_check_encoding("\x00\x81", "Shift_JIS"));

Please read the bug report again.

------------------------------------------------------------------------

[2007-09-04 14:30:06] [EMAIL PROTECTED]

> No one says 0x00,0xa1 is invalid character in ShiftJIS.

I didn't say that. 0x00+0xa1 is a valid byte sequence in Shift_JIS. A character in the Shift_JIS encoding is encoded as either a single byte or a double byte. In this case, the byte stream is recognized as two characters: a null byte and a half-width Katakana punctuation character (0xa1).
See: http://hp.vector.co.jp/authors/VA013241/misc/shiftjis.html

------------------------------------------------------------------------

[2007-08-19 20:10:06] [EMAIL PROTECTED]

Someone disagrees, Rui.. :)

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view the rest of the comments, please view the bug report online at
    http://bugs.php.net/41147
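
The disagreement in this thread comes down to how a string that ends on a lead byte should be treated: as a valid prefix of a byte stream, or as an invalid complete string. The sketch below is only an illustration of that distinction for UTF-8. The helper name utf8_classify is hypothetical and is not part of mbstring, and the check is deliberately simplified (it does not reject overlong forms or surrogate code points).

```php
<?php
// Hypothetical helper, NOT part of mbstring. Classifies a byte string as:
//   'complete'  - every character is fully present
//   'truncated' - well-formed so far, but ends inside a multi-byte character
//   'invalid'   - contains a byte that can never be correct at its position
// Simplified assumption: overlong encodings and surrogates are not checked.
function utf8_classify($bytes)
{
    $len = strlen($bytes);
    $i = 0;
    while ($i < $len) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            $need = 0;                 // single-byte (ASCII) character
        } elseif ($b >= 0xC2 && $b <= 0xDF) {
            $need = 1;                 // lead byte of a two-byte character
        } elseif ($b >= 0xE0 && $b <= 0xEF) {
            $need = 2;                 // lead byte of a three-byte character
        } elseif ($b >= 0xF0 && $b <= 0xF4) {
            $need = 3;                 // lead byte of a four-byte character
        } else {
            return 'invalid';          // stray continuation or illegal lead byte
        }
        for ($k = 1; $k <= $need; $k++) {
            if ($i + $k >= $len) {
                return 'truncated';    // input ended before the character did
            }
            $c = ord($bytes[$i + $k]);
            if ($c < 0x80 || $c > 0xBF) {
                return 'invalid';      // not a continuation byte
            }
        }
        $i += $need + 1;
    }
    return 'complete';
}

var_dump(utf8_classify("\x00\xe3")); // string(9) "truncated"
var_dump(utf8_classify("\xe3\x00")); // string(7) "invalid"
```

Under the reading in the newest comment, only 'complete' would map to true. Under the maintainer's byte-stream reading, 'truncated' would map to true as well, which matches the behaviour reported here for "\x00\xe3".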