php-i18n Digest 28 Sep 2002 12:16:39 -0000 Issue 126
Topics (messages 329 through 337):
Multi-byte strings and Japanese characters
329 by: Jonas Koch Bentzen
332 by: Yasuo Ohgaki
333 by: Jonas Koch Bentzen
334 by: Jonas Koch Bentzen
335 by: Yasuo Ohgaki
336 by: Jonas Koch Bentzen
337 by: Yasuo Ohgaki
Re: problem: php + oracle + utf8
330 by: Jonas Koch Bentzen
!! ImageTTFText not displaying Japanese text !!
331 by: Deepak Karunakaran
----------------------------------------------------------------------
--- Begin Message ---
http://dk.php.net/manual/en/ref.mbstring.php claims that "a multi-byte
character string may be destroyed when it is divided and/or counted
unless multi-byte character encoding safe method is used". I've just run
some tests with Unicode and Japanese characters (copied from
http://unicode.org/unicode/standard/translations/japanese.html). I used
functions like preg_match(), strlen(), and substr(), and no matter what
I can't seem to break the Japanese strings. Which leads to my question:
Is it really necessary to use functions like mb_substr() instead of
substr(), mb_strlen() instead of strlen(), etc.? Does anyone have any
examples of strings that would actually break if you use preg_match(),
substr(), strlen() or similar functions on them?
--- End Message ---
--- Begin Message ---
Jonas Koch Bentzen wrote:
> http://dk.php.net/manual/en/ref.mbstring.php claims that "a multi-byte
> character string may be destroyed when it is divided and/or counted
> unless multi-byte character encoding safe method is used". I've just run
> some tests with Unicode and Japanese characters (copied from
> http://unicode.org/unicode/standard/translations/japanese.html). I used
> functions like preg_match(), strlen(), and substr(), and no matter what
> I can't seem to break the Japanese strings. Which leads to my question:
What encoding are you using?
You aren't using func_overload, are you?
For instance, a UTF-8 character can be up to 6 bytes long, and inserting
a newline, etc. into the middle of a multibyte sequence obviously breaks
the character.
> Is it really necessary to use functions like mb_substr() instead of
> substr(), mb_strlen() instead of strlen(), etc.? Does anyone have any
> examples of strings that would actually break if you use preg_match(),
> substr(), strlen() or similar functions on them?
Of course they are needed.
We'll make all the default string functions multibyte-aware someday.
If you use PCRE and UTF-8, it works.
How did you conclude that strlen() is actually returning the number of
characters? How did you check whether the multibyte sequence is broken
or not?
See also encodings like ISO 2022 or EUC.
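One way to check this at the byte level rather than by eye is sketched below. It assumes the mbstring extension is loaded; the sample string, the variable names and the cut position of 4 bytes are illustrative choices, not something from the original test.

```php
<?php
// "日本語" (three characters) written as raw UTF-8 bytes, so the example
// does not depend on the encoding of this source file.
$str = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

$bytes = strlen($str);              // counts bytes
$chars = mb_strlen($str, 'UTF-8');  // counts characters

// Cutting after 4 bytes leaves one whole character plus a stray lead byte.
$cut = substr($str, 0, 4);
// Cutting after 2 characters keeps whole characters only.
$mb_cut = mb_substr($str, 0, 2, 'UTF-8');

echo "strlen: $bytes, mb_strlen: $chars\n";
echo "substr() result as hex: ", bin2hex($cut), "\n";
echo "substr() result valid UTF-8? ",
     var_export(mb_check_encoding($cut, 'UTF-8'), true), "\n";
echo "mb_substr() result valid UTF-8? ",
     var_export(mb_check_encoding($mb_cut, 'UTF-8'), true), "\n";
```

A browser may show a broken sequence as a replacement glyph or drop it silently, so comparing the hex dump or calling mb_check_encoding() is a more reliable test than looking at the rendered page.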
--
Yasuo Ohgaki
--- End Message ---
--- Begin Message ---
Yasuo Ohgaki wrote:
>
>> http://dk.php.net/manual/en/ref.mbstring.php claims that "a multi-byte
>> character string may be destroyed when it is divided and/or counted
>> unless multi-byte character encoding safe method is used". I've just
>> run some tests with Unicode and Japanese characters (copied from
>> http://unicode.org/unicode/standard/translations/japanese.html). I
>> used functions like preg_match(), strlen(), and substr(), and no
>> matter what I can't seem to break the Japanese strings. Which leads to
>> my question:
>
>
> What encoding are you using?
UTF-8.
> Don't you use func_overload, right?
Not at the moment.
> For instance, UTF-8 can be 6 bytes at most, and inserting newline,
> etc to middle of multibyte sequence breaks multi-byte chars obviously.
Obviously - but the functions I'm going to use will not change the
string. E.g., substr() doesn't change the string, and neither does strlen().
>> Is it really necessary to use functions like mb_substr() instead of
>> substr(), mb_strlen() instead of strlen(), etc.? Does anyone have any
>> examples of strings that would actually break if you use preg_match(),
>> substr(), strlen() or similar functions on them?
>
>
> Of course, they need it.
> We'll make all default string functions multibyte aware someday.
>
> If you use PCRE and UTF-8, it works.
I do. So what you're saying is that it's OK to use preg_*, but not
strlen(), substr(), etc. (unless I use function overloading)?
> How did you check if the multibyte sequence is broken
> or not?
Well, I just used my browser (Mozilla 1.1). If the unmodified string
looked exactly like the string that I ran through functions like
strlen(), then I concluded they were the same. But obviously you have
much more experience with Japanese characters, so if you say that
functions like strlen() might break the strings, I'll take your word for it.
> See also encodings like ISO 2022 or EUC.
Well, the site I'm making is not going to be Japanese - it's going to be
international, and so I want it to work for everybody (including the
Japanese).
--- End Message ---
--- Begin Message ---
Yasuo Ohgaki wrote:
>
> For instance, UTF-8 can be 6 bytes at most, and inserting newline,
> etc to middle of multibyte sequence breaks multi-byte chars obviously.
What about functions like wordwrap()? Would a line like
wordwrap($_POST["body"], 76, "\r\n"); break a Japanese string?
Thanks for your help.
--- End Message ---
--- Begin Message ---
Jonas Koch Bentzen wrote:
> Yasuo Ohgaki wrote:
>
>>
>> For instance, UTF-8 can be 6 bytes at most, and inserting newline,
>> etc to middle of multibyte sequence breaks multi-byte chars obviously.
>
>
> What about functions like wordwrap()? Would a line like
> wordwrap($_POST["body"], 76, "\r\n"); break a Japanese string?
>
IIRC, wordwrap() wraps where there is a space. I guess you
would like to try something like this:
wordwrap($str, 1, "\r\n", TRUE);
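A small sketch of what a forced cut does to a UTF-8 string without spaces. The sample string and the width of 4 are assumptions made for illustration; the validity check relies on preg_match() with the /u modifier, which fails on invalid UTF-8.

```php
<?php
// "日本語" as raw UTF-8 bytes: 9 bytes, 3 characters, no spaces.
$str = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// Without the cut flag there is no space to break at, so nothing changes.
$no_cut = wordwrap($str, 4, "\r\n");

// With the cut flag, wordwrap() counts bytes and cuts every 4 of them.
$forced = wordwrap($str, 4, "\r\n", true);

$pieces = explode("\r\n", $forced);
$valid_pieces = 0;
foreach ($pieces as $piece) {
    // preg_match() with /u returns false when the subject is not valid UTF-8.
    if (preg_match('//u', $piece) === 1) {
        $valid_pieces++;
    }
    echo bin2hex($piece), "\n";
}
echo "unchanged without cut: ", var_export($no_cut === $str, true), "\n";
echo "valid pieces after forced cut: $valid_pieces of ", count($pieces), "\n";
```

In this particular case every piece ends or starts in the middle of a character, so none of the resulting lines is valid UTF-8 on its own.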
--
Yasuo Ohgaki
--- End Message ---
--- Begin Message ---
Yasuo Ohgaki wrote:
> Jonas Koch Bentzen wrote:
>
>> Yasuo Ohgaki wrote:
>>
>>>
>>> For instance, UTF-8 can be 6 bytes at most, and inserting newline,
>>> etc to middle of multibyte sequence breaks multi-byte chars obviously.
>>
>>
>>
>> What about functions like wordwrap()? Would a line like
>> wordwrap($_POST["body"], 76, "\r\n"); break a Japanese string?
>>
>
> IIRC, wordwrap wraps when there is space, I guess you
> would like to try something like this.
>
> wordwrap($str, 1, "\r\n", TRUE);
I'm sorry, I don't understand. Why would I want to break at just 1
character? But I guess since wordwrap() wraps at space, it can't break a
Japanese string?
--- End Message ---
--- Begin Message ---
Jonas Koch Bentzen wrote:
>> IIRC, wordwrap wraps when there is space, I guess you
>> would like to try something like this.
>>
>> wordwrap($str, 1, "\r\n", TRUE);
>
>
> I'm sorry, I don't understand. Why would I want to break at just 1
> character? But I guess since wordwrap() wraps at space, it can't break a
> Japanese string?
I guess you don't know that Japanese text does not have spaces between
words. wordwrap() is useless for Japanese text. strlen(), substr(), etc.
are almost useless as well.
Believe me. I'm a native Japanese speaker, and I'm telling you they are
useless for Japanese text. They are useless for Korean, Chinese, etc.
text as well.
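For text without spaces, wrapping has to count characters instead of bytes. The helper below is a hypothetical sketch, not a PHP built-in: the name wrap_utf8_chars() and its parameters are made up for illustration, it assumes UTF-8 input and a width of at least 1, and it ignores real Japanese line-breaking rules.

```php
<?php
// Hypothetical helper (not part of PHP): insert $break after every
// $width characters of a UTF-8 string, regardless of spaces.
function wrap_utf8_chars($str, $width, $break = "\r\n")
{
    // With the /u modifier PCRE splits between UTF-8 characters, not bytes.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return $str; // input was not valid UTF-8; leave it untouched
    }
    $lines = array();
    foreach (array_chunk($chars, $width) as $chunk) {
        $lines[] = implode('', $chunk);
    }
    return implode($break, $lines);
}

// "日本語" as raw UTF-8 bytes, wrapped after every 2 characters.
$str = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";
$wrapped = wrap_utf8_chars($str, 2, "\n");
echo bin2hex($wrapped), "\n";
```

Because the split happens on character boundaries, every line of the result is still valid UTF-8.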
--
Yasuo Ohgaki
--- End Message ---
--- Begin Message ---
Alexandre wrote:
>
> i'm unable write & read correct UTF-8 data from fields (varchar2) from
> Oracle database...
I don't know how strongly you feel about using Oracle, but PostgreSQL
seems to support Unicode well (and there are no problems with
reading/writing from/to PHP-scripts).
--- End Message ---
--- Begin Message ---
Hello Experts,
I have been trying very hard to get Japanese characters displayed in images
using GD library from PHP. I use:
PHP 4.1.2
FreeType 2.0.9
GD 1.8.4
But the images are displayed with weird strokes instead of text. The
text consists of English and Japanese characters. I am using the
"watanabe-mincho.ttf" font available on Linux.
Please help! Thanks in advance.
--- End Message ---