ID:               41147
 Comment by:       mike at silverorange dot com
 Reported By:      teracci2002 at yahoo dot co dot jp
 Status:           Assigned
 Bug Type:         mbstring related
 Operating System: Linux
 PHP Version:      5.2.1
 Assigned To:      hirokawa
 New Comment:

0x00, 0xe3 is a valid start of a byte sequence in UTF-8, but by itself
it is not a valid UTF-8 string (the three-byte character begun by 0xe3
is missing its two continuation bytes).

The function is documented as checking the validity of a string so it
should return false for this case. If the function is only supposed to
validate byte-streams then the documentation should be fixed.
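
To make the distinction concrete, here is a minimal sketch of a check
that separates a complete string from one that merely ends inside a
multi-byte character. This is not how mbstring is implemented; the
function name utf8_status and its three return values are made up for
illustration, and the sketch does not reject overlong forms or
surrogates.

```php
<?php
// Hypothetical helper, not part of mbstring. Classifies a byte string as
// 'valid', 'truncated' (ends inside a multi-byte character) or 'invalid'.
// Simplified: overlong encodings and surrogates are not rejected.
function utf8_status($s)
{
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $b = ord($s[$i]);
        if ($b < 0x80) {
            $need = 0;          // single byte (ASCII, including 0x00)
        } elseif ($b >= 0xC2 && $b <= 0xDF) {
            $need = 1;          // lead byte of a two-byte character
        } elseif ($b >= 0xE0 && $b <= 0xEF) {
            $need = 2;          // lead byte of a three-byte character
        } elseif ($b >= 0xF0 && $b <= 0xF4) {
            $need = 3;          // lead byte of a four-byte character
        } else {
            return 'invalid';   // stray continuation byte or illegal lead byte
        }
        for ($k = 1; $k <= $need; $k++) {
            if ($i + $k >= $len) {
                return 'truncated';   // input ended before the character did
            }
            if ((ord($s[$i + $k]) & 0xC0) !== 0x80) {
                return 'invalid';     // not a 10xxxxxx continuation byte
            }
        }
        $i += $need + 1;
    }
    return 'valid';
}

var_dump(utf8_status("\x00\xe3"));          // the case from this report
var_dump(utf8_status("\x00\xe3\x81\x82"));  // null byte + complete character
```

Under this sketch the input from the report is classified as truncated
rather than valid, which is the result I would expect from a function
documented as validating a string.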


Previous Comments:
------------------------------------------------------------------------

[2007-09-16 08:56:57] [EMAIL PROTECTED]


Sorry for the delayed response.

0x00,0x81 is also a valid byte sequence in Shift_JIS
because 0x81 is a valid first byte of a double-byte
JIS X 0208 character.

See: http://en.wikipedia.org/wiki/Shift_jis

We cannot decide whether the byte stream is valid or
invalid, because the last byte of the stream (0x81)
is a valid first byte of a double-byte character.
In this case, true (valid) is returned.

A byte stream containing a valid first byte followed by
an invalid second byte returns false.

For example,

var_dump(mb_check_encoding("\x81\x00", "Shift_JIS"));

returns false (invalid).

This is because 0x81 is a valid first byte of a double-byte
JIS X 0208 character, but 0x00 is an invalid second byte of
a double-byte JIS X 0208 character.
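
The rule above can be sketched roughly as follows. This is only an
illustration, not the actual mbstring code: the function name
sjis_status is hypothetical, and the byte ranges are the commonly cited
Shift_JIS ones (first bytes 0x81-0x9F and 0xE0-0xEF, second bytes
0x40-0x7E and 0x80-0xFC), ignoring vendor extensions. The sketch reports
a trailing first byte as 'incomplete'; as described above,
mb_check_encoding currently returns true for that case.

```php
<?php
// Hypothetical helper for illustration only (not the mbstring implementation).
// Returns 'valid', 'invalid', or 'incomplete' (stream ends on a first byte).
function sjis_status($s)
{
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($s[$i]);
        $isLead = ($b >= 0x81 && $b <= 0x9F) || ($b >= 0xE0 && $b <= 0xEF);
        if (!$isLead) {
            // Single-byte character: ASCII range or half-width katakana.
            if ($b <= 0x7F || ($b >= 0xA1 && $b <= 0xDF)) {
                continue;
            }
            return 'invalid';
        }
        if ($i + 1 >= $len) {
            return 'incomplete';    // first byte with nothing after it
        }
        $t = ord($s[$i + 1]);
        $isTrail = ($t >= 0x40 && $t <= 0x7E) || ($t >= 0x80 && $t <= 0xFC);
        if (!$isTrail) {
            return 'invalid';       // e.g. 0x81 followed by 0x00
        }
        $i++;                       // skip the second byte
    }
    return 'valid';
}

var_dump(sjis_status("\x81\x00"));  // valid first byte + invalid second byte
var_dump(sjis_status("\x00\x81"));  // null byte + dangling first byte
```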

Likewise, 0x00, 0xe3 in UTF-8 is also a
valid byte sequence (a null byte + the first byte of
a three-byte UTF-8 character).

See: http://en.wikipedia.org/wiki/UTF-8

------------------------------------------------------------------------

[2007-09-04 22:38:26] [EMAIL PROTECTED]

Did you read it Rui? (why do your reports end up as 'Analyzed' all the
time? :)

------------------------------------------------------------------------

[2007-09-04 14:55:58] teracci2002 at yahoo dot co dot jp

> 0x00+0xa1 is valid byte sequence in Shift_JIS sequence.

I know that.
But 0x00+0x81 is an invalid sequence in Shift_JIS.
So why does the statement below return "bool(true)"?

var_dump(mb_check_encoding("\x00\x81", "Shift_JIS"));

Please read the bug report again.

------------------------------------------------------------------------

[2007-09-04 14:30:06] [EMAIL PROTECTED]


> No one says 0x00,0xa1 is invalid character in ShiftJIS.
I didn't say that.

0x00+0xa1 is a valid byte sequence in Shift_JIS.
A character in Shift_JIS encoding is encoded as either a single byte
or a double byte.
In this case, the byte stream is recognized as two characters:
a null byte and a half-width katakana full stop (0xa1).
 
see: http://hp.vector.co.jp/authors/VA013241/misc/shiftjis.html



------------------------------------------------------------------------

[2007-08-19 20:10:06] [EMAIL PROTECTED]

Someone disagrees, Rui.. :)

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/41147
