Hello everybody,

While building a truly multilingual project, I have run into an interesting problem with PHP 5, UTF-8 and the mbstring functions. Please study the table below carefully. I have taken one word in three languages, English, Finnish (Finland) and Gujarati (India), to test PHP's Unicode handling of single-byte and multibyte strings using the mbstring extension.
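
A minimal sketch of how such a comparison could be set up is shown here. It is not the exact script behind the results in this post. The `count_graphemes()` helper is hypothetical (it is not an mbstring function) and relies on PCRE's `\X` with the `/u` modifier, which assumes a PCRE build with Unicode support. The strings are written as explicit UTF-8 byte escapes so the outcome does not depend on how the file is saved.

```php
<?php
// Test words as raw UTF-8 bytes (assumption: precomposed U+00E4 for "ä").
$words = array(
    'English'  => "zala",
    'Finnish'  => "z\xC3\xA4l\xC3\xA4", // zälä
    'Gujarati' => "\xE0\xAA\x9D\xE0\xAA\xBE\xE0\xAA\xB2\xE0\xAA\xBE", // ઝાલા
);

// Hypothetical helper: counts grapheme clusters, i.e. a base character
// together with any dependent marks attached to it.
function count_graphemes($s)
{
    // \X matches one grapheme cluster; /u switches PCRE to UTF-8 mode.
    return preg_match_all('/\X/u', $s, $matches);
}

$results = array();
foreach ($words as $lang => $word) {
    $results[$lang] = array(
        'bytes'       => strlen($word),             // raw byte count
        'code_points' => mb_strlen($word, 'UTF-8'), // Unicode code points
        'graphemes'   => count_graphemes($word),    // user-perceived characters
    );
    printf(
        "%-8s bytes=%d code_points=%d graphemes=%d\n",
        $lang,
        $results[$lang]['bytes'],
        $results[$lang]['code_points'],
        $results[$lang]['graphemes']
    );
}
```

For non-ASCII text the three counts need not agree: `strlen` counts bytes, `mb_strlen` counts code points, and only the grapheme count treats a dependent vowel sign such as "ા" as part of the preceding letter, which gives 2 for the Gujarati word.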
The word on the left of the "=" sign is the actual string whose length is to be counted. What I have tried here is to count the length of the word in each language. For English and Finnish I get correct results, but for Gujarati it seems that the mbstring functions(?) are not working properly.

=======================================================
zala = 1 word; 4 bytes; 4 characters (z, a, l, a); 4 key-strokes (z, a, l, a); "strlen" should be 4 and is 4.
zälä = 1 word; 6 bytes in UTF-8; 4 characters (z, ä, l, ä); 4 key-strokes (z, ä, l, ä); "strlen" should be 4 and is 4.
ઝાલા = 1 word; 12 bytes in UTF-8; 2 characters (ઝા, લા); 4 key-strokes (ઝ, ા, લ, ા); "strlen" should be 2 but is 4.
=======================================================

The question is why PHP is not able to count the length of a given string in a practical way. I am aware that current PHP versions have no real notion of a character: the core string functions deal only with bytes, and the mbstring functions count code points. In that sense the output of 4 is technically correct, but it is not a practical answer, because the length of the word in Gujarati is only 2 and not 4, even though it takes four code points (12 bytes in UTF-8) to store. In Indic languages we have primary characters like "ઝ" and secondary characters like "ા", and secondary characters should not be counted when calculating length.

I am sure that I am not missing any setting at the server, PHP or client level that would make this work correctly. English and Finnish are different languages, but they share the same script (i.e. Latin) and the same glyphs, while Gujarati has a different script with different glyphs. This should not cause a problem if the mbstring functions are capable of handling strings properly.

I have tested the same thing using the "iconv" extension, with the same results. It looks like this is simply the behavior of PHP and the mb_* functions.

Thanks,
Anirudh Zala

--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php