ID: 22041
User updated by: [EMAIL PROTECTED]
Reported By: [EMAIL PROTECTED]
Status: Open
Bug Type: mbstring related
Operating System: Red Hat Linux 7.2
PHP Version: 4.3.0
New Comment:
Wow, thanks for the long answer! I didn't realize that EUC-JP was not a
single character set ...
I tried what you suggested and that fixed the problem ...
But now it makes me wonder what character set my data is in?? I set my
PostgreSQL database to EUC-JP, but since you say that could mean
more than one thing, I wonder which one PostgreSQL uses??
Since I am so confused as to what format my data is in, I ended up
using the database's substr() function instead of PHP's ... I figure
that is safer ...
So I guess there is no problem with mb_substr() then ... just that even
though the DB says the data is in EUC-JP, it really is in eucJP-win?
Thanks!
PS You can close the report if you agree that there is no error in
mb_substr()
PPS I love PHP's mb functions, thanks for your work. I just wish the
world would agree on ONE Japanese encoding =) It would save me a lot of
headaches ...
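One way to approach the "which variant is my data in?" question is to ask mbstring which of the two encodings accepts the bytes. This is only a rough sketch: guess_eucjp_variant() is a hypothetical helper, not an mbstring function; it assumes a PHP version that ships mb_check_encoding() and supports type declarations; the sample strings are made up and the file is assumed to be saved as UTF-8. Bytes that validate as plain EUC-JP also validate as eucJP-win, so a result of "EUC-JP" only means no vendor extension characters were found in that particular string.

```php
<?php
// Hypothetical helper (not part of mbstring): report which EUC-JP flavour,
// if any, accepts the given bytes. Plain EUC-JP is tried first.
function guess_eucjp_variant(string $bytes): ?string
{
    foreach (['EUC-JP', 'eucJP-win'] as $encoding) {
        if (mb_check_encoding($bytes, $encoding)) {
            return $encoding;
        }
    }
    return null; // not valid in either variant
}

// Sample data built from UTF-8 literals so the script is self-contained.
$plain  = mb_convert_encoding('日本語', 'EUC-JP', 'UTF-8');
// The circled digit is a CP932 vendor extension character.
$vendor = mb_convert_encoding('①番', 'eucJP-win', 'UTF-8');

var_dump(guess_eucjp_variant($plain));
var_dump(guess_eucjp_variant($vendor));
```

Running the helper over a sample of rows fetched from the database would show whether any of them need eucJP-win.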
Previous Comments:
------------------------------------------------------------------------
[2003-02-05 07:47:27] [EMAIL PROTECTED]
Since mb_substr() internally converts input strings to the Unicode
character set representation, if it finds an "illegal" character
that is not a member of the input character set, it
simply ends up returning wrong results. eucJP-win is provided as a
convenience so that users can handle strings whose characters come
from the CP932 character set but are encoded in EUC-JP.
In practice, there are several EUC-JP variants because EUC-JP itself
originally names just an encoding rather than a whole character
set.
I think this practice is quite confusing too, but please keep in
mind that an encoding doesn't always have a single corresponding
character set, even when the two share a name. In this context,
EUC-JP is really the name of an encoding that is often mistaken
for a character set name, whereas the character sets which
EUC-JP _can_ represent include ISO646, JISX0201-1976, JISX0208-1990,
JISX0212-1990, JISX0213-2000, and so on.
Anyway, did you try it out?
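To illustrate the point above, here is a minimal sketch that passes the encoding name to mb_substr() explicitly, so the bytes are decoded against the matching character set. The sample text is made up (it is not the data from this report), the circled digit stands in for a CP932 vendor extension character, and the file is assumed to be saved as UTF-8.

```php
<?php
// Sample text starting with a CP932 vendor extension character (circled 1).
$utf8 = '①日本語テキスト';
$euc  = mb_convert_encoding($utf8, 'eucJP-win', 'UTF-8');

// Name the encoding explicitly so mb_substr() counts characters
// using the eucJP-win tables rather than the internal encoding.
$cut = mb_substr($euc, 0, 4, 'eucJP-win');

// Show the raw bytes that were kept, then the same text as UTF-8.
echo bin2hex($cut), "\n";
echo mb_convert_encoding($cut, 'UTF-8', 'eucJP-win'), "\n";
```

The fourth argument only affects this one call, which can be handy when a script handles data in more than one encoding.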
------------------------------------------------------------------------
[2003-02-04 21:20:49] [EMAIL PROTECTED]
Glad you could see the funny side of this bug report :) I did try very
hard to find a better example ... but couldn't get mb_substr to break
on anything else.
Why set internal encoding to eucJP-win? The data is from a database and
is in EUC-JP ...
When I entered the data into the DB, if there were any illegal EUC-JP
characters it should have complained ...
And as you can see I can display the whole string as EUC-JP perfectly.
It's only *after* I use mb_substr() that the string becomes mojibake
...
Thanks!
------------------------------------------------------------------------
[2003-02-04 07:36:19] [EMAIL PROTECTED]
LOL! It's indeed so OFFENSIVE I have no idea how to translate those
words to English. But perhaps you know what that means?
Ehm, first try setting the internal encoding to "eucJP-win".
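A minimal sketch of that suggestion, assuming the script sets the encoding at runtime with mb_internal_encoding() instead of editing php.ini. The sample title is made up and the file is assumed to be saved as UTF-8.

```php
<?php
// Set the script-wide default so that mb_* calls without an explicit
// encoding argument decode their input as eucJP-win.
mb_internal_encoding('eucJP-win');

// Build eucJP-win bytes from a UTF-8 literal to keep the script self-contained.
$euc = mb_convert_encoding('日本語のタイトル', 'eucJP-win', 'UTF-8');

// No fourth argument here: the internal encoding set above applies.
$cut = mb_substr($euc, 0, 3);

echo mb_internal_encoding(), "\n";
echo strlen($cut), " bytes kept\n";
```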
------------------------------------------------------------------------
[2003-02-04 05:28:48] [EMAIL PROTECTED]
First, sorry for any offensive Japanese words. I can't read/write
Japanese very well, and the error in mb_substr occurs on data from a
list of video titles ... I tried to find another less offensive example
but couldn't. I'm just posting this bug report in order to help ...
I am trying to use mb_substr on data I get from a PostgreSQL DB and in
some cases mb_substr seems to cut the string in the middle of a
multibyte char .. which turns the "cut" char into mojibake ...
The DB is in EUC-JP and my internal encoding is set to EUC-JP in my
php.ini file ...
As you can see the last character of the string has been improperly cut
...
Here is my test program and output:
CODE:
<?php
require_once("db_functions/sql_query.inc");
$sql = "select maker_comment from products where id=12802";
$res = sql_query($sql);
$dat = pg_fetch_object($res);
$c = $dat->maker_comment; // property name must match the selected column
echo "String: <BR>";
echo $c ."<BR>";
$c = mb_substr($c, 0, 80);
echo "<BR> After cutting it ... <BR>";
echo $c ."<BR>";
?>
OUTPUT:
COMMENT2:
����Ρ�Ķ-�Դ֤Υ���ץ�����ģء�³���о졪�����κ��Ԥ�ΤǤϤ���ޤ���;ʬ�ʲ褬�ʤ�
AFTER cutting it ...
����Ρ�Ķ-�Դ֤Υ���ץ�����ģء�³���о졪�����κ��Ԥ�ΤǤϤ�����
------------------------------------------------------------------------