ID:               22041
User updated by:  [EMAIL PROTECTED]
Reported By:      [EMAIL PROTECTED]
Status:           Open
Bug Type:         mbstring related
Operating System: Red Hat Linux 7.2
PHP Version:      4.3.0
New Comment:

Wow, thanks for the long answer! I didn't realize that EUC-JP was not a single character set ... I tried what you suggested and that fixed the problem ... But now it makes me wonder what character set my data is in? And I set my PostgreSQL database to be EUC-JP, but since you say that could mean more than one thing, I wonder which one PostgreSQL uses?

Since I am so confused as to what format my data is in, I ended up using the database's substr() function instead of PHP's ... I figure that is safer ...

So I guess there is no problem with mb_substr() then ... just that even though the DB says the data is in EUC-JP, it really is in eucJP-win?

Thanks!

PS You can close the report if you agree that there is no error in mb_substr()

PPS I love PHP's mb functions, thanks for your work. I just wish the world would agree on ONE Japanese encoding =) It would save me a lot of headaches ...


Previous Comments:
------------------------------------------------------------------------

[2003-02-05 07:47:27] [EMAIL PROTECTED]

Since mb_substr() internally converts input strings to their Unicode representation, if it finds an "illegal" character that is not a member of the input character set, it simply ends up returning wrong results.

eucJP-win is provided for convenience so that users can handle strings whose characters come from the CP932 character set but are encoded in EUC-JP.

In practice, there are several EUC-JP variants because EUC-JP itself originally denotes just an encoding rather than a whole character set. I think this practice is quite confusing too, but please keep in mind that an encoding doesn't always have a single corresponding character set, even when the names are the same.

In this context, EUC-JP is really the name of an encoding that is often mistaken for a character set name. The character sets that EUC-JP _can_ represent include ISO646, JISX0201-1976, JISX0208-1990, JISX0212-1990, JISX0213-2000, and so on.

Anyway, did you try it out?

------------------------------------------------------------------------

[2003-02-04 21:20:49] [EMAIL PROTECTED]

Glad you could see the funny side of this bug report :) I did try very hard to find a better example ... but couldn't get mb_substr to break on anything else.

Why set the internal encoding to eucJP-win? The data is from a database and is in EUC-JP ... When I entered the data into the DB, if there were any illegal EUC-JP characters it should have complained ... And as you can see, I can display the whole string as EUC-JP perfectly. It's only *after* I use mb_substr() that the string becomes mojibake ...

Thanks!

------------------------------------------------------------------------

[2003-02-04 07:36:19] [EMAIL PROTECTED]

LOL! It's indeed so OFFENSIVE I have no idea how to translate those words to English. But perhaps you know what that means?

Ehm, first try setting the internal encoding to "eucJP-win".

------------------------------------------------------------------------

[2003-02-04 05:28:48] [EMAIL PROTECTED]

First, sorry for any offensive Japanese words. I can't read/write Japanese very well, and the error in mb_substr occurs on data from a list of video titles ... I tried to find another, less offensive example but couldn't. I'm just posting this bug report in order to help ...

I am trying to use mb_substr on data I get from a PostgreSQL DB, and in some cases mb_substr seems to cut the string in the middle of a multibyte char ... which turns the "cut" char into mojibake ...

The DB is in EUC-JP and my internal encoding is set to EUC-JP in my php.ini file ...

As you can see, the last character of the string has been improperly cut ...

Here is my test program and output:

CODE:
<?php
require_once("db_functions/sql_query.inc");

// Fetch the EUC-JP encoded comment from the PostgreSQL database.
$sql = "select maker_comment from products where id=12802";
$res = sql_query($sql);
$dat = pg_fetch_object($res);
$c = $dat->maker_comment;

echo "String: <BR>";
echo $c ."<BR>";

// Keep the first 80 characters (uses the internal encoding from php.ini).
$c = mb_substr($c, 0, 80);
echo "<BR> After cutting it ... <BR>";
echo $c ."<BR>";
?>

OUTPUT:
COMMENT2:
アングルの「超-股間のアングル」シリーズDX、続々登場!ただの再編ものではありません!余分な画がない
AFTER cutting it ...
アングルの「超-股間のアングル」シリーズDX、続々登場!ただの再編ものではありま�

------------------------------------------------------------------------
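
For anyone hitting the same symptom, here is a minimal sketch of the advice in this thread: pass the encoding to the mb_* functions explicitly, and check which EUC-JP variant the bytes are actually valid in before cutting. It is only an illustration under some assumptions. The sample text is a short fragment of the string in the report, typed here as a UTF-8 literal and converted to EUC-JP (the real data came from the database). It also uses mb_check_encoding(), which needs a newer PHP than the 4.3.0 in this report. The fragment contains only plain JIS X 0208 characters, so it is valid under both names; data containing CP932-only characters would be expected to differ between the two checks.

```php
<?php
// Assumption: this file is saved as UTF-8; the fragment is taken from the report.
$utf8 = 'ただの再編ものではありません';

// Build EUC-JP bytes to stand in for the value fetched from the database.
$euc = mb_convert_encoding($utf8, 'EUC-JP', 'UTF-8');

// Bytes and characters differ: each of these characters is 2 bytes in EUC-JP.
$byteLen = strlen($euc);
$charLen = mb_strlen($euc, 'EUC-JP');

// Check the bytes against both variants before deciding which name to use.
$validStrict = mb_check_encoding($euc, 'EUC-JP');
$validWin    = mb_check_encoding($euc, 'eucJP-win');

// Cut by characters, naming the encoding instead of relying on php.ini.
$cut     = mb_substr($euc, 0, 5, 'EUC-JP');
$cutUtf8 = mb_convert_encoding($cut, 'UTF-8', 'EUC-JP');

echo $byteLen, " bytes, ", $charLen, " characters\n";
echo "valid as EUC-JP: ", var_export($validStrict, true), "\n";
echo "valid as eucJP-win: ", var_export($validWin, true), "\n";
echo $cutUtf8, "\n";
```

If the strict EUC-JP check fails on real data while the eucJP-win check passes, that would point to the situation described above: the bytes use the EUC-JP encoding but draw on the CP932 character set, and "eucJP-win" is the name to give mb_substr().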