Jonas Koch Bentzen wrote:
> http://dk.php.net/manual/en/ref.mbstring.php claims that "a multi-byte 
> character string may be destroyed when it is divided and/or counted 
> unless multi-byte character encoding safe method is used". I've just run 
> some tests with Unicode and Japanese characters (copied from 
> http://unicode.org/unicode/standard/translations/japanese.html). I used 
> functions like preg_match(), strlen(), and substr(), and no matter what 
> I can't seem to break the Japanese strings. Which leads to my question: 

What encoding are you using?
You aren't using mbstring.func_overload, are you?

For instance, a single UTF-8 character can take several bytes (up to 4
under RFC 3629; the original definition allowed up to 6), and inserting
a newline, etc. into the middle of a multibyte sequence obviously
breaks the character.
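
A minimal sketch of that kind of breakage, assuming the mbstring
extension is loaded and mbstring.func_overload is off. The sample
string and variable names are only illustrative:

```php
<?php
// "日本語" written as explicit UTF-8 bytes (3 bytes per character).
$s = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// substr() counts bytes: 4 bytes is one whole character plus a stray lead byte.
$cut = substr($s, 0, 4);
var_dump(mb_check_encoding($cut, 'UTF-8'));     // bool(false)

// mb_substr() counts characters, so the result stays well-formed.
$safe = mb_substr($s, 0, 2, 'UTF-8');
var_dump(mb_check_encoding($safe, 'UTF-8'));    // bool(true)

// A newline inserted at a fixed byte offset splits a character as well.
$wrapped = substr($s, 0, 4) . "\n" . substr($s, 4);
var_dump(mb_check_encoding($wrapped, 'UTF-8')); // bool(false)
```

Cutting at byte offsets 3 or 6 would happen to land on a character
boundary here, which is one way a quick test can look fine by accident.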

> Is it really necessary to use functions like mb_substr() instead of 
> substr(), mb_strlen() instead of strlen(), etc.? Does anyone have any 
> examples of strings that would actually break if you use preg_match(), 
> substr(), strlen() or similar functions on them?

Yes, of course they are needed.
We'll make all the default string functions multibyte-aware someday.

If you use PCRE with UTF-8 (the /u pattern modifier), preg_match() works.
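
As a rough illustration of the difference the UTF-8 mode makes,
assuming a PCRE build with UTF-8 support (the sample string is only an
example):

```php
<?php
// "日本語" written as explicit UTF-8 bytes.
$s = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// Without /u, "." matches a single byte, so the match is only a fragment.
preg_match('/^./', $s, $m);
var_dump(strlen($m[0]));             // int(1)

// With /u, "." matches one whole UTF-8 character (3 bytes here).
preg_match('/^./u', $s, $mu);
var_dump($mu[0] === "\xE6\x97\xA5"); // bool(true)
```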

How did you conclude that strlen() is actually returning the number
of characters? How did you check whether the multibyte sequence was
broken or not?
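
One possible way to check both, sketched under the assumption that the
data is UTF-8 and mbstring is available without func_overload:

```php
<?php
// "日本語" written as explicit UTF-8 bytes.
$s = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

// strlen() reports bytes, mb_strlen() reports characters.
var_dump(strlen($s));                          // int(9)
var_dump(mb_strlen($s, 'UTF-8'));              // int(3)

// Look at the raw bytes rather than at how a browser or terminal renders them.
$piece = substr($s, 0, 2);
echo bin2hex($piece), "\n";                    // e697

// A truncated sequence fails validation even if the output "looks" fine.
var_dump(mb_check_encoding($piece, 'UTF-8'));  // bool(false)
```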

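See also encodings such as ISO 2022 or EUC. ISO-2022-JP, for example,
switches character sets with escape sequences, so a cut in the wrong
place can garble more than one character.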

--
Yasuo Ohgaki


-- 
PHP Internationalization Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
