#22041 [Opn]: mb_substr produces "mojibake" on certain strings ...

jc Wed, 05 Feb 2003 05:55:23 -0800

 ID:               22041
 User updated by:  [EMAIL PROTECTED]
 Reported By:      [EMAIL PROTECTED]
 Status:           Open
 Bug Type:         mbstring related
 Operating System: Red Hat Linux 7.2
 PHP Version:      4.3.0
 New Comment:


Wow, thanks for the long answer! I didn't realize that EUC-JP was not a
single character set ...

I tried what you suggested and that fixed the problem ...

But now it makes me wonder what character set my data is in?? And I set my
PostgreSQL database to be EUC-JP, but since you say that could mean
more than one thing, I wonder which one PostgreSQL uses??

Since I am so confused as to what format my data is in, I ended up
using the database's substr() function instead of PHP's ... I figure
that is safer ...

So I guess there is no problem with mb_substr() then ... just that even
though the DB says the data is in EUC-JP it really is in eucJP-win?

Thanks!

PS You can close the report if you agree that there is no error in
mb_substr()

PPS I love PHP's mb functions, thanks for your work. I just wish the
world would agree on ONE Japanese encoding =) It would save me a lot of
headaches ...


Previous Comments:
------------------------------------------------------------------------

[2003-02-05 07:47:27] [EMAIL PROTECTED]

Since mb_substr() internally converts input strings to the Unicode
character set representation, if it finds an "illegal" character
that is not a member of the input character set, it simply ends up
returning wrong results. eucJP-win is provided as a convenience so that
users can handle strings whose characters come from the CP932 character
set but are encoded as EUC-JP.
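
To make the point about CP932 characters encoded as EUC-JP concrete, here
is a minimal sketch. The sample character (U+2460, circled digit one) is
chosen only for illustration and is not taken from the reporter's data; it
lives in row 13 of CP932, a row that JIS X 0208 leaves unassigned.

```php
<?php
// U+2460 (circled digit one) written as UTF-8 bytes. It sits in row 13 of
// CP932 (NEC special characters), which JIS X 0208 itself does not assign.
$utf8 = "\xE2\x91\xA0";

// Encode it the eucJP-win way, then decode it again.
$euc  = mb_convert_encoding($utf8, 'eucJP-win', 'UTF-8');
$back = mb_convert_encoding($euc, 'UTF-8', 'eucJP-win');

echo bin2hex($euc), "\n";     // the two-byte EUC form of the character
var_dump($back === $utf8);    // whether the round trip preserved it
```

A string containing such bytes is shaped like EUC-JP but uses a character
outside the JIS X 0208 repertoire, which is the situation eucJP-win exists
to cover.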

In practice, there are several EUC-JP variants, because EUC-JP itself
originally denotes just an encoding rather than a whole character
set.

I think this practice is quite confusing too, but please keep in mind
that an encoding doesn't always have a single corresponding character
set even though their names are the same. In this context, it could be
said that EUC-JP is really the name of an encoding that is often
mistaken for a character set name, whereas the actual names of the
character sets which EUC-JP _can_ represent are ISO646, JISX0201-1976,
JISX0208-1990, JISX0212-1990, JISX0213-2000, and so on.
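
One rough way to see which label a given byte string is valid under is to
ask mbstring directly. The sketch below assumes a PHP version that has
mb_check_encoding() and scalar type hints (both are newer than 4.3.0); the
helper name candidate_encodings() is hypothetical, introduced only for this
illustration.

```php
<?php
/**
 * Report, for each candidate label, whether $bytes is valid under it.
 * Hypothetical helper for illustration; not part of mbstring.
 */
function candidate_encodings(string $bytes, array $labels): array
{
    $result = [];
    foreach ($labels as $label) {
        $result[$label] = mb_check_encoding($bytes, $label);
    }
    return $result;
}

// Three kanji ("Japanese language") as EUC-JP bytes. All three are plain
// JIS X 0208 characters, so both labels should accept them.
$plain = "\xC6\xFC\xCB\xDC\xB8\xEC";
print_r(candidate_encodings($plain, ['EUC-JP', 'eucJP-win']));
```

Data that passes under eucJP-win but not under EUC-JP would point to
vendor extension characters. How strictly each label is checked may differ
between PHP versions, so treat the result as a hint rather than proof.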

Anyway, did you try it out?


------------------------------------------------------------------------

[2003-02-04 21:20:49] [EMAIL PROTECTED]

Glad you could see the funny side of this bug report :) I did try very
hard to find a better example ... but couldn't get mb_substr to break
on anything else.

Why set internal encoding to eucJP-win? The data is from a database and
is in EUC-JP ...

When I entered the data into the DB, if there were any illegal EUC-JP
characters it should have complained ...

And as you can see I can display the whole string as EUC-JP perfectly.
It's only *after* I use mb_substr() that the string becomes mojibake
...

Thanks!

------------------------------------------------------------------------

[2003-02-04 07:36:19] [EMAIL PROTECTED]

LOL! It's indeed so OFFENSIVE I have no idea how to translate those
words to English. But perhaps you know what that means?

Ehm, first try setting the internal encoding to "eucJP-win".
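
A minimal sketch of that suggestion, set at runtime rather than in php.ini.
The sample bytes are an assumption made for illustration (three JIS X 0208
kanji followed by 0xADA1, a CP932 row 13 character), not the reporter's
data.

```php
<?php
// Remember the current setting so it can be restored afterwards.
$previous = mb_internal_encoding();

// Treat the data as eucJP-win rather than strict EUC-JP.
mb_internal_encoding('eucJP-win');

// Assumed sample: three kanji, then 0xADA1 (circled digit one in eucJP-win).
$sample = "\xC6\xFC\xCB\xDC\xB8\xEC\xAD\xA1";

// With no encoding argument, the mb_* functions use the internal encoding.
$length   = mb_strlen($sample);
$lastChar = mb_substr($sample, 3, 1);

echo $length, "\n";
echo bin2hex($lastChar), "\n";

// Put the previous internal encoding back.
mb_internal_encoding($previous);
```

Passing 'eucJP-win' as the fourth argument of mb_substr() would have the
same effect for a single call without touching the global setting.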

------------------------------------------------------------------------

[2003-02-04 05:28:48] [EMAIL PROTECTED]

First, sorry for any offensive Japanese words. I can't read/write
Japanese very well, and the error in mb_substr occurs on data from a
list of video titles ... I tried to find another less offensive example
but couldn't. I'm just posting this bug report in order to help ...

I am trying to use mb_substr on data I get from a PostgreSQL DB and in
some cases mb_substr seems to cut the string in the middle of a
multibyte char ... which turns the "cut" char into mojibake ...
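
For reference, this is what a cut in the middle of a multibyte character
looks like at the byte level. It is a sketch with assumed sample bytes
(three two-byte kanji in EUC-JP), not the data from the report.

```php
<?php
// Three kanji as EUC-JP bytes: three characters, two bytes each.
$text = "\xC6\xFC\xCB\xDC\xB8\xEC";

// Byte-oriented cut: 3 bytes is one whole character plus half of the next.
$byteCut = substr($text, 0, 3);

// Character-oriented cut: 2 characters is 4 bytes, boundary respected.
$charCut = mb_substr($text, 0, 2, 'EUC-JP');

echo bin2hex($byteCut), "\n"; // c6fccb   (dangling lead byte 0xCB)
echo bin2hex($charCut), "\n"; // c6fccbdc (two complete characters)
```

The dangling lead byte at the end of the first result is what a browser
renders as a broken character.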

The DB is in EUC-JP and my internal encoding is set to EUC-JP in my
php.ini file ...

As you can see the last character of the string has been improperly cut
...

Here is my test program and output:

CODE:

<?php
require_once("db_functions/sql_query.inc");

$sql = "select maker_comment from products where id=12802";
$res = sql_query($sql);
$dat = pg_fetch_object($res);
$c = $dat->maker_comment; // column selected by the query above

echo "String: <BR>";
echo $c ."<BR>";

$c = mb_substr($c, 0, 80);

echo "<BR> After cutting it ... <BR>";
echo $c ."<BR>";
?>

OUTPUT:

COMMENT2:
[Japanese EUC-JP text; the original bytes are not recoverable from this archive copy]


AFTER cutting it ...
[the same text truncated by mb_substr(), ending in a broken character (&#65533;)]

------------------------------------------------------------------------


