On Wednesday 21 March 2007 23:54, you wrote:
> Hello,
>
> mb_strlen, when run in UTF-8 mode, counts the Unicode characters in
> the string. ઝાલા is \u0a9d\u0abe\u0ab2\u0abe (using Java
> notation), i.e., 4 Unicode characters. It's actually 12 bytes in UTF-8.
>
> It's not unusual for the user perception of a character and the
> definition used in the computer representation to be different. For
> example, users perceive zälä as 4 characters, but in Unicode the
> string can be represented either using precomposed forms,
> z\u00e4l\u00e4, or using combining marks, za\u0308la\u0308. The first
> representation would count as 4 (Unicode) characters, the second as
> 6. For Gujarati, where Unicode doesn't have precomposed forms, the
> problem is just visible more often than with Latin characters.
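To make the counts quoted above concrete, here is a minimal sketch in Python (not PHP; I chose it only because its standard unicodedata module is convenient for this). The variable names and the describe() helper are my own, purely illustrative.

```python
import unicodedata

# The strings from the quoted message, written with escapes.
gujarati = "\u0a9d\u0abe\u0ab2\u0abe"   # ઝાલા
precomposed = "z\u00e4l\u00e4"           # zälä, precomposed ä
# NFD splits each ä into 'a' followed by U+0308 (combining diaeresis).
combining = unicodedata.normalize("NFD", precomposed)


def describe(s):
    """Return (number of code points, number of UTF-8 bytes)."""
    return len(s), len(s.encode("utf-8"))


print(describe(gujarati))     # (4, 12)
print(describe(precomposed))  # (4, 6)
print(describe(combining))    # (6, 8)

# NFC recomposes the Latin string, but leaves the Gujarati one unchanged
# because Unicode defines no precomposed consonant+vowel forms for it.
print(unicodedata.normalize("NFC", combining) == precomposed)
print(unicodedata.normalize("NFC", gujarati) == gujarati)
```

So normalization can shrink the Latin string back to 4 code points, while the Gujarati string stays at 4 code points however it is normalized.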
It seems that the characters of Gujarati and other Indic languages,
unlike Latin characters, do not have precomposed forms. But then the
question is: why don't they? I am sorry if you are not the right person
to ask.

As you may know, Indic languages, unlike English, have separate tables
for vowels and consonants. When a vowel is attached to a consonant, the
vowel should not be counted when calculating the length of a string. So
"ઝ" should count as 1 and "ઝા" should also count as 1, even though they
require 1 and 2 Unicode characters respectively. To cope with this
problem, strings would have to be represented in precomposed forms. We
have 11 vowels and almost 60 consonants, so there could be roughly 660
(11 x 60) precomposed forms. I assume that vowels and consonants are
encoded separately to save space in Unicode.

I wonder why nobody has faced this problem until now. When PHP6
arrives, how is it going to deal with this situation? If the problem
lies at the Unicode level, then I assume PHP can't do much.

> The ICU library has character break iterators that better approximate
> the user perception of characters. If this is important for your
> project, you may want to take a look at it:
> http://icu-project.org/userguide/boundaryAnalysis.html

Thanks for the suggestion, but this library is for C/C++ and Java, so
it can't easily be used from PHP. I suggest that such a library be
provided as an extension for PHP and other scripting languages.

Moreover, this solves the problem only at the string comparison level.
There are also higher-level problems when storing strings in a
database. For example, if a field (in MySQL specifically) is 12
characters long, a Latin string of that length is stored without any
problem. But for Indic languages, if the user enters 7 consonants with
7 vowels, then even though the string is just 7 characters long in
human perception, the last 2 characters will be truncated.
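To illustrate the counting and truncation problem, here is a rough sketch in Python (not PHP, and not ICU). It approximates a user-perceived character as a base character plus any combining marks that follow it. That is a simplification of what a real break iterator does: it ignores conjuncts formed with a virama and other special cases. The function names approx_clusters and truncate_clusters are my own, purely illustrative.

```python
import unicodedata


def approx_clusters(text):
    """Group each base character with the combining marks that follow it.

    Assumption: any code point whose general category starts with "M"
    (Mn, Mc, Me) belongs to the preceding base character.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def truncate_clusters(text, max_code_points):
    """Cut text to fit max_code_points without splitting a cluster."""
    out, used = [], 0
    for cluster in approx_clusters(text):
        if used + len(cluster) > max_code_points:
            break
        out.append(cluster)
        used += len(cluster)
    return "".join(out)


# 7 consonants, each followed by a vowel sign: 14 code points.
word = "\u0a9d\u0abe" * 7
print(len(word))                   # 14
print(len(approx_clusters(word)))  # 7

# A naive cut at 11 code points leaves a bare consonant at the end;
# the cluster-aware cut stops after 5 complete syllables instead.
print(len(approx_clusters(word[:11])))  # 6
print(len(truncate_clusters(word, 11)))  # 10
```

With a limit of 12 code points, as in the database example, both cuts keep 6 of the 7 perceived characters, so the field holds fewer syllables than its nominal length suggests.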
There may also be other areas where this problem is even more severe.

Anirudh Zala

--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php