ID:               35711
 Updated by:       [EMAIL PROTECTED]
 Reported By:      matteo at beccati dot com
-Status:           Feedback
+Status:           Closed
 Bug Type:         mbstring related
 Operating System: Debian GNU/Linux
 PHP Version:      5.1.1
 Assigned To:      hirokawa
 New Comment:

This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change
will be in the next snapshot. You can grab the snapshot at
http://snaps.php.net/.
 
Thank you for the report, and for helping us make PHP better.

Detection of a string that ends in the middle of a multibyte
character has been added to strict mode
(mb_detect_encoding($s, $list, TRUE)).
Please try strict mode.
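
For illustration, a minimal sketch of such a call. The sample byte
and the candidate list are assumptions taken from the discussion in
this report, not from the fix itself, and the result depends on the
PHP version in use. mb_check_encoding() is shown as a separate check
of one specific encoding.

```php
<?php
// A lone 0xE8 byte: "è" in ISO-8859-1, but in UTF-8 only the lead
// byte of a three-byte sequence whose continuation bytes are missing.
$s = "\xe8";

// Candidate encodings, tried in this order (illustrative choice).
$list = array('UTF-8', 'ISO-8859-1');

// Passing TRUE as the third argument enables strict mode.
$strict = mb_detect_encoding($s, $list, true);

// Validate the string against one specific encoding.
$validUtf8 = mb_check_encoding($s, 'UTF-8');

var_dump($strict, $validUtf8);
```

With strict mode on, a truncated sequence should no longer be
reported as UTF-8 when another candidate in the list fits.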

Previous Comments:
------------------------------------------------------------------------

[2005-12-24 01:03:21] [EMAIL PROTECTED]

Have you tried strict mode (the strict argument defaults to FALSE)?

string mb_detect_encoding ( string str [, mixed encoding_list [, bool
strict]] )


------------------------------------------------------------------------

[2005-12-20 17:10:56] matteo at beccati dot com

Of course, I agree that 0xe8 is valid if taken as part of a multibyte
character, but I don't think it can be considered valid if the next
bytes are missing (because the string ends prematurely). The iconv
extension raises notices when it finds illegal or incomplete multibyte
characters, so I don't see why mbstring should accept as valid UTF-8 a
string which in fact isn't.

The same should apply to other multibyte encodings.

------------------------------------------------------------------------

[2005-12-20 15:44:31] [EMAIL PROTECTED]

Please note that encoding detection is not always perfect. In
particular, when the string is too short, the detection may return
the wrong encoding.
In your case this is not a bug; it is the specified behaviour.
UTF-8 is a variable-length multibyte encoding: a character is
encoded in one to six bytes (RFC 3629 restricts this to four).
Please look at ext/mbstring/libmbfl/filter/mbfilter_utf8.c, around
line 249.
0xe8 is a valid first byte of a three-byte sequence.
We cannot tell whether 0xe8 is ISO-8859-1 or UTF-8, because this
byte is valid in both encodings.
In this case the result is chosen according to the order defined by
mb_detect_order().
I suggest using a string of sufficient length for reliable encoding
detection.
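
As a rough sketch of the lead-byte reasoning above: the two helper
functions below are hypothetical (they are not part of mbstring or
libmbfl) and use the RFC 3629 lead-byte ranges, which is an assumption
about what should count as valid. They only test whether a string
stops inside a multibyte sequence; they do not validate the string as
a whole.

```php
<?php
// Hypothetical helper: number of bytes a UTF-8 sequence should have,
// judged only from its lead byte, or 0 if the byte cannot start one.
function utf8_expected_length($lead)
{
    $b = ord($lead);
    if ($b <= 0x7f) return 1;               // ASCII
    if ($b >= 0xc2 && $b <= 0xdf) return 2;
    if ($b >= 0xe0 && $b <= 0xef) return 3; // 0xE8 falls in this range
    if ($b >= 0xf0 && $b <= 0xf4) return 4;
    return 0;                               // continuation or invalid lead
}

// Hypothetical helper: true if the string ends before the sequence
// announced by its last lead byte is complete.
function utf8_ends_prematurely($s)
{
    $n = strlen($s);
    // Walk back over at most three continuation bytes (10xxxxxx).
    for ($i = $n - 1; $i >= 0 && $i >= $n - 4; $i--) {
        if ((ord($s[$i]) & 0xc0) !== 0x80) {
            $need = utf8_expected_length($s[$i]);
            return $need > 0 && $need > $n - $i;
        }
    }
    return false;
}

var_dump(utf8_ends_prematurely("\xe8"));         // lone lead byte
var_dump(utf8_ends_prematurely("\xe8\xaa\x9e")); // complete 3-byte sequence
```

A single 0xe8 announces three bytes but supplies one, which is the
case that non-strict detection cannot tell apart from ISO-8859-1.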

------------------------------------------------------------------------

[2005-12-19 09:03:36] [EMAIL PROTECTED]

Rui, can you check this out please?

------------------------------------------------------------------------

[2005-12-19 09:00:50] matteo at beccati dot com

Oops, I just realized that I forgot the -u flag :)

Here is the downloadable patch:

http://beccati.com/download/mbstring-patch-20051219.txt

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/35711

-- 
Edit this bug report at http://bugs.php.net/?id=35711&edit=1