Re: [PHP-I18N] PHP + UTF-8 + mbstring extension issue.

Anirudh Zala Wed, 21 Mar 2007 19:44:42 -0800

On Wednesday 21 March 2007 23:54, you wrote:
> Hello,
>
> mb_strlen, when run in UTF-8 mode, counts the Unicode characters in
> the string. ઝાલા is \u0a9d\u0abe\u0ab2\u0abe (using Java
> notation), i.e., 4 Unicode characters. It's actually 12 bytes in UTF-8.
>
> It's not unusual for the user perception of a character and the
> definition used in the computer representation to be different. For
> example, users perceive zälä as 4 characters, but in Unicode the
> string can be represented either using precomposed forms,
> z\u00e4l\u00e4, or using combining marks, za\u0308la\u0308. The first
> representation would count as 4 (Unicode) characters, the second as
> 6. For Gujarati, where Unicode doesn't have precomposed forms, the
> problem is just visible more often than with Latin characters.
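
The counts quoted above can be reproduced in a few lines. This is a minimal
sketch in Python rather than PHP, chosen only because its standard library
exposes code points, UTF-8 encoding, and normalization directly. The variable
names are illustrative.

```python
import unicodedata

# "ઝાલા" as explicit code points: JHA, vowel sign AA, LA, vowel sign AA.
zala = "\u0a9d\u0abe\u0ab2\u0abe"
code_points = len(zala)                  # Unicode characters (code points)
utf8_bytes = len(zala.encode("utf-8"))   # bytes in the UTF-8 encoding

# "zälä" in precomposed form (NFC) and with combining marks (NFD).
precomposed = unicodedata.normalize("NFC", "z\u00e4l\u00e4")
decomposed = unicodedata.normalize("NFD", precomposed)

print(code_points, utf8_bytes)             # 4 12
print(len(precomposed), len(decomposed))   # 4 6
```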


It seems that Gujarati and other Indic scripts, unlike Latin, have no 
precomposed forms in Unicode. But then the question is: why not? My 
apologies if you are not the right person to ask.

As you may know, Indic scripts, unlike English, have separate tables for 
vowels and consonants. When a vowel is attached to a consonant, the vowel 
should therefore not be counted when calculating the length of a string. 
So "ઝ" should count as 1 and "ઝા" should also count as 1, even though they 
require 1 and 2 Unicode characters respectively.
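
One rough way to get the count described here is to skip combining marks 
(Unicode general categories Mn, Mc and Me) while counting. The sketch below is 
in Python, and `perceived_length` is a hypothetical helper, not a library 
function. It is only an approximation of user-perceived characters: it does 
not handle conjuncts or the other cases covered by full text segmentation 
rules.

```python
import unicodedata

def perceived_length(text):
    """Count code points that are not combining marks (an approximation)."""
    return sum(1 for ch in text
               if not unicodedata.category(ch).startswith("M"))

print(perceived_length("\u0a9d"))                    # "ઝ"    -> 1
print(perceived_length("\u0a9d\u0abe"))              # "ઝા"   -> 1
print(perceived_length("\u0a9d\u0abe\u0ab2\u0abe"))  # "ઝાલા" -> 2
```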

To cope with this problem, strings would have to be represented in 
precomposed forms. We have 11 vowels and almost 60 consonants, so there 
could be up to about 660 (11 x 60) precomposed forms. I assume that vowels 
and consonants are encoded separately in Unicode to save space.

I also wonder why nobody has faced this problem until now. When PHP 6 
arrives, how is it going to deal with this situation? If the problem lies 
at the Unicode level, then I assume PHP can't do much.

>
> The ICU library has character break iterators that better approximate
> the user perception of characters. If this is important for your
> project, you may want to take a look at it:
> http://icu-project.org/userguide/boundaryAnalysis.html

Thanks for the suggestion, but this library is available for C/C++ and 
Java, so it can't easily be used with PHP. I suggest that such a library be 
provided as an extension for PHP and other scripting languages.

Moreover, this only solves problems at the string comparison level. There 
are also higher-level problems when storing strings in a database. For 
example, if the length of a field (in MySQL specifically) is 12 characters, 
a Latin string of that length is stored without any problem. For Indic 
languages, if the user enters 7 consonants with 7 vowels, the string is 14 
Unicode characters long, so the last 2 characters get truncated even though 
the length in human perception is just 7.
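
This truncation can be illustrated without a database. The sketch below, in 
Python, assumes the column limit is applied per Unicode character (code point) 
and simulates it with a slice. `COLUMN_LIMIT`, `syllables` and the sample 
string are made up for illustration.

```python
import unicodedata

COLUMN_LIMIT = 12  # hypothetical field length, in characters

# 7 syllables, each a consonant plus a vowel sign: 14 code points.
name = "\u0a9d\u0abe" * 7
stored = name[:COLUMN_LIMIT]  # simulate truncation at the column limit

def syllables(text):
    # Approximation: count code points that are not combining marks.
    return sum(1 for ch in text
               if not unicodedata.category(ch).startswith("M"))

print(len(name), syllables(name))      # 14 7
print(len(stored), syllables(stored))  # 12 6
```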

There may be other areas where this problem is even more severe.

Anirudh Zala
