 ID:               43896
 User updated by:  arnaud dot lb at gmail dot com
 Reported By:      arnaud dot lb at gmail dot com
 Status:           Open
-Bug Type:         Unicode Function Upgrades relate
+Bug Type:         *General Issues
 Operating System: Any
 PHP Version:      5.2.5
 New Comment:

I made a patch for this bug:
http://s3.amazonaws.com/arnaud.lb/php_htmlentities_utf.patch

The internal get_next_char() function returns a status of FAILURE when
it encounters an invalid or incomplete sequence, which causes
htmlspecialchars() and htmlentities() to return an empty string.

This patch modifies the behavior of these functions so that they skip
invalid sequences instead of discarding the whole string. It involves
very few changes and makes the behavior of these functions more
consistent with previous PHP versions. It also adds a few tests to
htmlentities-utf.phpt.


Previous Comments:
------------------------------------------------------------------------

[2008-01-20 02:12:01] arnaud dot lb at gmail dot com

Description:
------------
htmlspecialchars()/htmlentities() return an empty string when the input
contains an invalid Unicode sequence.

I think these functions should just skip the invalid sequences or
encode them byte by byte (e.g. 0xE9 => &#xE9;), instead of discarding
the whole string.

Sometimes you have to display arbitrary strings of unknown encoding, so
you make them safer with htmlspecialchars($string, ENT_COMPAT, "utf-8")
(utf-8 being the site encoding in my case). But if the string contains
even one invalid sequence, the function returns an empty string :/

Reproduce code:
---------------
$string = "Voil\xE0"; // "Voilà" in ISO-8859-15
var_dump(htmlspecialchars($string, ENT_COMPAT, "utf-8"));

Expected result:
----------------
string(4) "Voil"
OR
string(10) "Voil&#xE0;"

Actual result:
--------------
string(0) ""

------------------------------------------------------------------------
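
Until a fix like the patch above is available, the "skip invalid sequences" behavior can be approximated in userland before calling htmlspecialchars(). The following is only a sketch under stated assumptions: strip_invalid_utf8() is a hypothetical helper name, not part of PHP or of the patch, and it assumes the target encoding is UTF-8. It keeps every well-formed UTF-8 sequence and drops any other byte.

```php
<?php
// Hypothetical helper (an assumption for illustration, not a PHP built-in
// and not the code from the patch): returns $string with every byte that
// is not part of a well-formed UTF-8 sequence removed.
function strip_invalid_utf8($string)
{
    // Byte patterns of well-formed UTF-8. The pattern is matched without
    // the /u modifier, so PCRE works on raw bytes. Overlong forms,
    // surrogates and code points above U+10FFFF are excluded.
    $valid = '/[\x00-\x7F]'                       // ASCII
           . '|[\xC2-\xDF][\x80-\xBF]'            // 2-byte sequences
           . '|\xE0[\xA0-\xBF][\x80-\xBF]'        // 3-byte, no overlongs
           . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}' // 3-byte, general
           . '|\xED[\x80-\x9F][\x80-\xBF]'        // 3-byte, no surrogates
           . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'     // 4-byte, no overlongs
           . '|[\xF1-\xF3][\x80-\xBF]{3}'         // 4-byte, general
           . '|\xF4[\x80-\x8F][\x80-\xBF]{2}/';   // 4-byte, up to U+10FFFF

    // preg_match_all() steps over bytes that match no alternative,
    // so only the valid sequences end up in $matches[0].
    preg_match_all($valid, $string, $matches);

    return implode('', $matches[0]);
}

// Same input as in the reproduce code: the lone 0xE0 byte is dropped.
$string = "Voil\xE0";
var_dump(htmlspecialchars(strip_invalid_utf8($string), ENT_COMPAT, "utf-8"));
// string(4) "Voil"
```

This corresponds to the first of the two expected results (dropping the invalid byte), not to the byte-by-byte numeric entity variant.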
