Re: [PHP-I18N] PHP + UTF-8 + mbstring extension issue.

Norbert Lindenberg Wed, 21 Mar 2007 10:25:09 -0800

Hello,

mb_strlen, when run in UTF-8 mode, counts the Unicode characters in the string. ઝાલા is \u0a9d\u0abe\u0ab2\u0abe (using Java notation), i.e., 4 Unicode characters. It's actually 12 bytes in UTF-8.
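
To make the two counts concrete, here is a minimal sketch. It assumes PHP 7 or later (for the "\u{...}" escape syntax) and a build with the mbstring extension loaded; the variable names are only illustrative.

```php
<?php
// Assumption: PHP 7+ and the mbstring extension.
// The word is spelled out as code points so the result does not
// depend on how this source file is encoded.
$word = "\u{0A9D}\u{0ABE}\u{0AB2}\u{0ABE}"; // ઝાલા

$bytes      = strlen($word);             // strlen counts bytes
$codePoints = mb_strlen($word, 'UTF-8'); // mb_strlen counts Unicode characters

// Each of these four characters takes 3 bytes in UTF-8.
echo "bytes: $bytes, Unicode characters: $codePoints\n";
// prints "bytes: 12, Unicode characters: 4"
```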

It's not unusual for the user perception of a character and the definition used in the computer representation to be different. For example, users perceive zälä as 4 characters, but in Unicode the string can be represented either using precomposed forms, z\u00e4l\u00e4, or using combining marks, za\u0308la\u0308. The first representation would count as 4 (Unicode) characters, the second as 6. For Gujarati, where Unicode doesn't have precomposed forms, the problem is just visible more often than with Latin characters.
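
The two representations of zälä can be compared directly. This is only a sketch, assuming PHP 7 or later with mbstring; the normalization step additionally assumes the intl extension (its Normalizer class) and is skipped when that is not available.

```php
<?php
// Assumption: PHP 7+ with mbstring; intl is optional here.
$precomposed = "z\u{00E4}l\u{00E4}";   // ä as a single code point
$combining   = "za\u{0308}la\u{0308}"; // a followed by COMBINING DIAERESIS

$lenPrecomposed = mb_strlen($precomposed, 'UTF-8');
$lenCombining   = mb_strlen($combining, 'UTF-8');

echo "precomposed: $lenPrecomposed, combining: $lenCombining\n";
// prints "precomposed: 4, combining: 6"

// Both strings look identical on screen but are different byte sequences.
var_dump($precomposed === $combining); // bool(false)

// With intl available, normalizing to NFC composes a + U+0308 into U+00E4,
// so the two spellings become equal.
if (class_exists('Normalizer')) {
    $composed = Normalizer::normalize($combining, Normalizer::FORM_C);
    var_dump($composed === $precomposed); // bool(true)
}
```

Normalizing to one form before counting or comparing helps for Latin text, but it does not change the Gujarati count, since there is no precomposed form to compose into.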

The ICU library has character break iterators that better approximate the user perception of characters. If this is important for your project, you may want to take a look at it:

http://icu-project.org/userguide/boundaryAnalysis.html
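
As a rough illustration of what such a character break iterator gives you: PHP's intl extension wraps ICU and offers grapheme_strlen, which counts user-perceived characters (grapheme clusters). The sketch below assumes PHP 7 or later with both intl and mbstring loaded. The intl extension is newer than this message, so treat it as one possible way to reach ICU from PHP rather than the only one.

```php
<?php
// Assumption: PHP 7+ with the intl (ICU) and mbstring extensions.
$gujarati = "\u{0A9D}\u{0ABE}\u{0AB2}\u{0ABE}"; // ઝાલા
$finnish  = "za\u{0308}la\u{0308}";             // zälä with combining marks

// Code point counts, as mb_strlen reports them.
$gujaratiCodePoints = mb_strlen($gujarati, 'UTF-8'); // 4
$finnishCodePoints  = mb_strlen($finnish, 'UTF-8');  // 6

// Grapheme cluster counts, computed by ICU's character boundary rules.
// The Gujarati result of 2 relies on ICU attaching the vowel sign to its
// consonant (extended grapheme clusters), which current ICU versions do.
$gujaratiClusters = grapheme_strlen($gujarati);
$finnishClusters  = grapheme_strlen($finnish);

echo "Gujarati: $gujaratiCodePoints code points, $gujaratiClusters clusters\n";
echo "Finnish:  $finnishCodePoints code points, $finnishClusters clusters\n";
```

On such a setup the cluster counts should come out as 2 and 4, matching what a reader of either script would call the length of the word.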

Norbert


On Mar 21, 2007, at 1:37 AM, Anirudh Zala wrote:

Hello Everybody,
While building a truly multilingual project, I am running into an interesting problem with php5 + utf-8 + mbstring functions. Please study the table below carefully. I have taken one word in three different languages, English, Finnish (of Finland) and Gujarati (of India), to test PHP's Unicode character set handling with single-byte and multibyte strings using the mbstring extension.
The word appearing to the left of the "=" sign is the actual string whose length is to be counted. What I have tried here is to count the length of the word in each language. For English and Finnish I got correct results, but for Gujarati it seems that the mbstring functions(?) are not working properly.

=======================================================
zala = 1 word; 4 bytes; 4 characters (z, a, l, a); 4 key-strokes (z, a, l, a); "strlen" should be 4 and is 4 also.
zälä = 1 word; 4 bytes; 4 characters (z, ä, l, ä); 4 key-strokes (z, ä, l, ä); "strlen" should be 4 and is 4 also.
ઝાલા = 1 word; 4 bytes; 2 characters (ઝા, લા); 4 key-strokes (ઝ, ા, લ, ા); "strlen" should be 2 but is 4.
=======================================================
The question is why PHP is not able to count the length of the given string in a practical way. I am aware that current PHP versions are not aware of strings; instead they just deal with bytes. In that case the output is correct, but this is not a practical solution, as the length of the word in Gujarati is only "2" and not "4", even if it requires 4 bytes to store the data. (In Indic languages we have primary characters like "ઝ" and secondary characters like "ા", but secondary characters should not be counted while calculating length.)
I am sure that I am not missing any settings to be done at the server, PHP or client level to make this work correctly. English and Finnish are different languages, but they are part of the same character set (i.e. Latin) and their glyphs are also the same, while Gujarati has a different character set and its glyphs are also different. But this should not create this problem if the "mbstring functions" are capable of handling strings in a proper way.
I have tested the same thing using the "iconv" extension, with the same results. It looks like this is the behavior of php + mb_* functions.

Thanks,
Anirudh Zala

--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


-------------------------------------
Norbert Lindenberg
Yahoo! Internationalization Architect
