ID:               35711
Updated by:       [EMAIL PROTECTED]
Reported By:      matteo at beccati dot com
-Status:          Feedback
+Status:          Closed
Bug Type:         mbstring related
Operating System: Debian GNU/Linux
PHP Version:      5.1.1
Assigned To:      hirokawa
New Comment:
This bug has been fixed in CVS.

Snapshots of the sources are packaged every three hours; this change will be in the next snapshot. You can grab the snapshot at http://snaps.php.net/.

Thank you for the report, and for helping us make PHP better.

Character-end detection was introduced in strict mode (mb_detect_encoding($s, $list, TRUE)). Please try strict mode.


Previous Comments:
------------------------------------------------------------------------

[2005-12-24 01:03:21] [EMAIL PROTECTED]

Have you tried strict mode (default: FALSE)?

string mb_detect_encoding ( string str [, mixed encoding_list [, bool strict]] )

------------------------------------------------------------------------

[2005-12-20 17:10:56] matteo at beccati dot com

Of course, I agree that 0xe8 is valid when taken as part of a multibyte character, but I don't think it can be considered valid if the following bytes are missing (because the string ends prematurely).

The iconv extension raises notices when it finds illegal or incomplete multibyte characters; I don't see why mbstring should accept as valid UTF-8 a string which isn't. The same should apply to other multibyte encodings.

------------------------------------------------------------------------

[2005-12-20 15:44:31] [EMAIL PROTECTED]

Please note that encoding detection is not always perfect. In particular, when the string is too short, detection can give the wrong result.

In your case this is not a bug; it is the specified behavior. UTF-8 is a variable-length multibyte encoding; the length of a character in UTF-8 is from one to six bytes. Please look at ext/mbstring/libmbfl/filter/mbfilter_utf8.c, around line 249.

0xe8 is valid as the first byte of a three-byte sequence. We cannot tell whether 0xe8 is ISO-8859-1 or UTF-8, because this byte is valid in both encodings. In this case, the result is chosen according to the order defined by mb_detect_order().

I suggest using a string of sufficient length for reliable encoding detection.

------------------------------------------------------------------------

[2005-12-19 09:03:36] [EMAIL PROTECTED]

Rui, can you check this out please?

------------------------------------------------------------------------

[2005-12-19 09:00:50] matteo at beccati dot com

Oops, I just realized that I forgot the -u flag :)

Here is the downloadable patch:

http://beccati.com/download/mbstring-patch-20051219.txt

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view the rest of the comments, please view the bug report online at http://bugs.php.net/35711
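
For anyone reproducing the case discussed in this report, the following is a minimal sketch of the strict-mode call, not the test script from the report itself. The sample string and the candidate list are assumptions chosen to match the 0xe8 example from the comments; it assumes a build with mbstring enabled that includes this fix.

```php
<?php
// Assumed sample input: the single byte 0xE8. In ISO-8859-1 this is a
// complete character; in UTF-8 it is only the lead byte of a three-byte
// sequence, so the string ends in the middle of a character.
$truncated  = "\xe8";
$candidates = array('UTF-8', 'ISO-8859-1');

// Non-strict detection (third argument omitted, default FALSE).
// The result here depends on the PHP version and the detection order,
// so no particular value is claimed for it.
$loose = mb_detect_encoding($truncated, $candidates);

// Strict detection (third argument TRUE): the incomplete UTF-8
// character is rejected, so UTF-8 is ruled out as a candidate.
$strict = mb_detect_encoding($truncated, $candidates, true);

var_dump($loose);
var_dump($strict);
```

With the fix in place, the strict call should report ISO-8859-1 for this input, since that is the only candidate in the list for which the byte forms a complete character.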
