php-i18n Digest 28 Sep 2002 12:16:39 -0000 Issue 126

Topics (messages 329 through 337):

Multi-byte strings and Japanese characters
        329 by: Jonas Koch Bentzen
        332 by: Yasuo Ohgaki
        333 by: Jonas Koch Bentzen
        334 by: Jonas Koch Bentzen
        335 by: Yasuo Ohgaki
        336 by: Jonas Koch Bentzen
        337 by: Yasuo Ohgaki

Re: problem: php + oracle + utf8
        330 by: Jonas Koch Bentzen

!! ImageTTFText not displaying Japanese text !!
        331 by: Deepak Karunakaran


----------------------------------------------------------------------
--- Begin Message ---
http://dk.php.net/manual/en/ref.mbstring.php claims that "a multi-byte 
character string may be destroyed when it is divided and/or counted 
unless multi-byte character encoding safe method is used". I've just run 
some tests with Unicode and Japanese characters (copied from 
http://unicode.org/unicode/standard/translations/japanese.html). I used 
functions like preg_match(), strlen(), and substr(), and no matter what 
I can't seem to break the Japanese strings. Which leads to my question: 
Is it really necessary to use functions like mb_substr() instead of 
substr(), mb_strlen() instead of strlen(), etc.? Does anyone have any 
examples of strings that would actually break if you use preg_match(), 
substr(), strlen() or similar functions on them?

--- End Message ---
--- Begin Message ---
Jonas Koch Bentzen wrote:
> http://dk.php.net/manual/en/ref.mbstring.php claims that "a multi-byte 
> character string may be destroyed when it is divided and/or counted 
> unless multi-byte character encoding safe method is used". I've just run 
> some tests with Unicode and Japanese characters (copied from 
> http://unicode.org/unicode/standard/translations/japanese.html). I used 
> functions like preg_match(), strlen(), and substr(), and no matter what 
> I can't seem to break the Japanese strings. Which leads to my question: 

What encoding are you using?
You aren't using mbstring.func_overload, are you?

For instance, a UTF-8 character can be up to 6 bytes long, and inserting
a newline etc. into the middle of a multibyte sequence obviously breaks
the character.
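
A minimal sketch of the difference between counting bytes and counting
characters, assuming the mbstring extension is loaded. The sample string
and the variable names are only illustrative; the bytes are written as
escapes so the result does not depend on the encoding of the script file.

```php
<?php
// "日本語" as explicit UTF-8 bytes: 3 characters, 3 bytes each.
$str = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

$byteCount = strlen($str);              // strlen() counts bytes
$charCount = mb_strlen($str, 'UTF-8');  // mb_strlen() counts characters

// substr() cuts at a byte offset: 4 bytes is one whole character
// plus the first byte of the next one.
$byteCut = substr($str, 0, 4);

// mb_substr() cuts at a character offset: two whole characters.
$charCut = mb_substr($str, 0, 2, 'UTF-8');

printf("strlen: %d, mb_strlen: %d\n", $byteCount, $charCount);
printf("substr bytes:    %s\n", bin2hex($byteCut));  // e697a5e6
printf("mb_substr bytes: %s\n", bin2hex($charCut));  // e697a5e69cac
```

The trailing e6 in the substr() result is a lead byte with nothing after
it, which is the kind of broken sequence being described here.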

> Is it really necessary to use functions like mb_substr() instead of 
> substr(), mb_strlen() instead of strlen(), etc.? Does anyone have any 
> examples of strings that would actually break if you use preg_match(), 
> substr(), strlen() or similar functions on them?

Yes, of course it is necessary.
We'll make all default string functions multibyte-aware someday.

If you use PCRE with UTF-8 (the /u pattern modifier), it works.

How did you conclude that strlen() is actually returning the number of
characters? How did you check whether the multibyte sequence was broken
or not?
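
One way to check this in code rather than by eye, sketched under the
assumption that the data is supposed to be UTF-8 and that mbstring is
available. The helper name is_valid_utf8() is made up for this example;
mb_check_encoding() and the PCRE /u modifier are the real pieces.

```php
<?php
// Hypothetical helper: true only if $s is a well-formed UTF-8 string.
function is_valid_utf8(string $s): bool
{
    // preg_match() with /u returns false (not 0) on malformed UTF-8,
    // so an empty pattern doubles as a validity test.
    return mb_check_encoding($s, 'UTF-8') && preg_match('//u', $s) === 1;
}

$original = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"; // "日本語" in UTF-8
$cut      = substr($original, 0, 4);                // ends mid-character

$originalOk = is_valid_utf8($original);
$cutOk      = is_valid_utf8($cut);

var_dump($originalOk); // bool(true)
var_dump($cutOk);      // bool(false)
```

Note that strlen() only reads the string, so its input always stays
valid; the damage appears in the output of functions that cut or insert.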

See also encodings like ISO 2022 or EUC.

--
Yasuo Ohgaki

--- End Message ---
--- Begin Message ---
Yasuo Ohgaki wrote:
>
>> http://dk.php.net/manual/en/ref.mbstring.php claims that "a multi-byte 
>> character string may be destroyed when it is divided and/or counted 
>> unless multi-byte character encoding safe method is used". I've just 
>> run some tests with Unicode and Japanese characters (copied from 
>> http://unicode.org/unicode/standard/translations/japanese.html). I 
>> used functions like preg_match(), strlen(), and substr(), and no 
>> matter what I can't seem to break the Japanese strings. Which leads to 
>> my question: 
> 
> 
> What encoding are you using?

UTF-8.

> Don't you use func_overload, right?

Not at the moment.

> For instance, UTF-8 can be 6 bytes at most, and inserting newline,
> etc to middle of multibyte sequence breaks multi-byte chars obviously.

Obviously - but the functions I'm going to use will not change the 
string. E.g., substr() doesn't change the string, and neither does strlen().

>> Is it really necessary to use functions like mb_substr() instead of 
>> substr(), mb_strlen() instead of strlen(), etc.? Does anyone have any 
>> examples of strings that would actually break if you use preg_match(), 
>> substr(), strlen() or similar functions on them?
> 
> 
> Of course, they need it.
> We'll make all default string functions multibyte aware someday.
> 
> If you use PCRE and UTF-8, it works.

I do. So what you're saying is that it's OK to use preg_*, but not 
strlen(), substr(), etc. (unless I use function overloading)?
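
For reference, a small sketch of what changes when a preg_* pattern
carries the /u modifier. It assumes the subject string is valid UTF-8;
the sample string and variable names are only illustrative.

```php
<?php
$str = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"; // "日本語" in UTF-8

// Without /u, "." matches one byte at a time.
$byteMatches = preg_match_all('/./', $str, $perByte);

// With /u, "." matches one whole UTF-8 character.
$charMatches = preg_match_all('/./u', $str, $perChar);

// The first match under /u is the complete first character.
$firstChar = $perChar[0][0];

// "Exactly three characters" is only true when counting with /u.
$threeChars = preg_match('/^.{3}$/u', $str);
$threeBytes = preg_match('/^.{3}$/', $str);

printf("without /u: %d matches, with /u: %d matches\n",
    $byteMatches, $charMatches);
printf("first character bytes: %s\n", bin2hex($firstChar)); // e697a5
```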

> How did you check if the multibyte sequence is broken
> or not?

Well, I just used my browser (Mozilla 1.1). If the unmodified string 
looked exactly like the string that I ran through functions like 
strlen(), then I concluded they were the same. But obviously you have 
much more experience with Japanese characters, so if you say that 
functions like strlen() might break the strings, I'll take your word for it.

> See also encodings like ISO 2022 or EUC.

Well, the site I'm making is not going to be Japanese - it's going to be 
international, and so I want it to work for everybody (including the 
Japanese).

--- End Message ---
--- Begin Message ---
Yasuo Ohgaki wrote:
>
> For instance, UTF-8 can be 6 bytes at most, and inserting newline,
> etc to middle of multibyte sequence breaks multi-byte chars obviously.

What about functions like wordwrap()? Would a line like 
wordwrap($_POST["body"], 76, "\r\n"); break a Japanese string?

Thanks for your help.

--- End Message ---
--- Begin Message ---
Jonas Koch Bentzen wrote:
> Yasuo Ohgaki wrote:
> 
>>
>> For instance, UTF-8 can be 6 bytes at most, and inserting newline,
>> etc to middle of multibyte sequence breaks multi-byte chars obviously.
> 
> 
> What about functions like wordwrap()? Would a line like 
> wordwrap($_POST["body"], 76, "\r\n"); break a Japanese string?
> 

IIRC, wordwrap() only wraps where there is a space. I guess you
would want to try something like this:

wordwrap($str, 1, "\r\n", TRUE);
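
A rough sketch of what happens with and without the fourth argument on
text that contains no spaces, assuming UTF-8 input and the mbstring
extension. wrap_by_chars() is a made-up helper for this example, and it
counts characters, not display columns.

```php
<?php
// Hypothetical helper: break $text every $width characters (not bytes).
function wrap_by_chars(string $text, int $width, string $break = "\r\n"): string
{
    $lines = [];
    $total = mb_strlen($text, 'UTF-8');
    for ($i = 0; $i < $total; $i += $width) {
        $lines[] = mb_substr($text, $i, $width, 'UTF-8');
    }
    return implode($break, $lines);
}

// 90 characters (270 bytes) of "日本語" with no spaces anywhere.
$long = str_repeat("\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E", 30);

// No spaces, no forced cut: wordwrap() finds nowhere to wrap.
$uncut = wordwrap($long, 76, "\r\n");

// Forced cut: the break lands every 76 bytes, inside a character.
$cut = wordwrap($long, 76, "\r\n", true);

// Cutting on character boundaries keeps every sequence intact.
$safe = wrap_by_chars($long, 38);

var_dump($uncut === $long);                  // bool(true)
var_dump(mb_check_encoding($cut, 'UTF-8'));  // bool(false)
var_dump(mb_check_encoding($safe, 'UTF-8')); // bool(true)
```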

--
Yasuo Ohgaki

--- End Message ---
--- Begin Message ---
Yasuo Ohgaki wrote:
> Jonas Koch Bentzen wrote:
> 
>> Yasuo Ohgaki wrote:
>>
>>>
>>> For instance, UTF-8 can be 6 bytes at most, and inserting newline,
>>> etc to middle of multibyte sequence breaks multi-byte chars obviously.
>>
>>
>>
>> What about functions like wordwrap()? Would a line like 
>> wordwrap($_POST["body"], 76, "\r\n"); break a Japanese string?
>>
> 
> IIRC, wordwrap wraps when there is space, I guess you
> would like to try something like this.
> 
> wordwrap($str, 1, "\r\n", TRUE);

I'm sorry, I don't understand. Why would I want to break at just 1 
character? But I guess since wordwrap() wraps at space, it can't break a 
Japanese string?

--- End Message ---
--- Begin Message ---
Jonas Koch Bentzen wrote:
>> IIRC, wordwrap wraps when there is space, I guess you
>> would like to try something like this.
>>
>> wordwrap($str, 1, "\r\n", TRUE);
> 
> 
> I'm sorry, I don't understand. Why would I want to break at just 1 
> character? But I guess since wordwrap() wraps at space, it can't break a 
> Japanese string?

I guess you don't know that Japanese text does not have spaces between
words. wordwrap() is useless for Japanese text. strlen(), substr(), etc.
are almost useless as well.

Believe me. I'm a native Japanese speaker, and I'm telling you they are
useless for Japanese text. They are also useless for Korean, Chinese,
etc. text.

--
Yasuo Ohgaki

--- End Message ---
--- Begin Message ---
Alexandre wrote:
>
> I'm unable to write and read correct UTF-8 data from (varchar2) fields
> in an Oracle database...

I don't know how strongly you feel about using Oracle, but PostgreSQL 
seems to support Unicode well (and there are no problems with 
reading/writing from/to PHP-scripts).

--- End Message ---
--- Begin Message ---
Hello Experts,

I have been trying very hard to get Japanese characters displayed in images
using the GD library from PHP. I use:
PHP 4.1.2
FreeType 2.0.9
GD 1.8.4

But the images are displayed with weird strokes instead of text. The text
consists of English and Japanese characters. I am using the
"watanabe-mincho.ttf" font available on Linux.

Please help! Thanks in advance.


--- End Message ---
