Edit report at https://bugs.php.net/bug.php?id=63663&edit=1

 ID:                 63663
 Updated by:         ahar...@php.net
 Reported by:        kobrien at kiva dot org
 Summary:            str_word_count does not properly handle non-latin characters
 Status:             Analyzed
 Type:               Bug
 Package:            Strings related
 Operating System:   Ubuntu 12.04
 PHP Version:        5.3.20-dev
 Block user comment: N
 Private report:     N

 New Comment:

Yeah, a feature request for mb_str_word_count() might be a good idea.

The isalpha() issue isn't really PHP-specific: the underlying C function simply takes a single byte as its input, so it can't ascertain whether a multibyte character is actually alphabetic or not (since it only ever sees one byte of the sequence at a time). There's an iswalpha() function that would do the right thing, but PHP was written before it was widely available, and using it in str_word_count() alone would be inconsistent with the rest of the language: it's something we'd need to think about as part of making the whole language more multibyte-aware.

Previous Comments:
------------------------------------------------------------------------
[2012-12-03 02:36:37] kobrien at kiva dot org

Thanks for the reply. Given your comments about the problems, would it be helpful for me to also file a feature request for newer versions of PHP to have an mb_str_word_count() function which could properly handle this case?

I haven't dug into the C code enough to understand why isalpha() fails on multibyte input, but I'd have to imagine there is an alternative available that will handle multibyte characters properly. I could potentially even create a patch if pointed in the right direction.

------------------------------------------------------------------------
[2012-12-03 02:29:16] ahar...@php.net

This is due to the use of isalpha() internally, which doesn't play well with multibyte encodings like UTF-8, regardless of the locale setting.
Fundamentally, this is the same issue as bug #27668. I'm not sure there's a lot we can do about this in PHP 5.x, but it's worth noting if and when we revisit Unicode string handling internally.

------------------------------------------------------------------------
[2012-12-01 02:29:17] kobrien at kiva dot org

Description:
------------
The function str_word_count() does not work properly on non-latin characters: it returns zero. On latin characters it works as expected and returns the number of words in the string.

Test script:
---------------
<?php
print str_word_count("PHP function str_word_count does not properly handle non-latin characters") . "\n"; // returns 11
print str_word_count("Хабилло житель Яванского района. Ему 70 лет. Он женат. У него четверо детей. Хабилло филолог. Он более двадцати лет работает по профессии. Также Хабилло занимается виноградарством. У него имеется небольшой виноградник. Этим видом деятельности Хабилло занимается 15 лет."); // returns 0, but should return 37

Expected result:
----------------
The second instruction should return 37

Actual result:
--------------
The second instruction returns 0

------------------------------------------------------------------------