Edit report at https://bugs.php.net/bug.php?id=62861&edit=1

 ID:                 62861
 Updated by:         ras...@php.net
 Reported by:        soapergem at gmail dot com
 Summary:            htmlentities returns empty string when it shouldn't
 Status:             Not a bug
 Type:               Bug
 Package:            *General Issues
 Operating System:   Windows
 PHP Version:        5.4.6
 Block user comment: N
 Private report:     N

 New Comment:

I think you are confusing CP-1252 with ISO-8859-1. The default on Windows
internally is actually UTF-16, but there is a library call named
IsTextUnicode() which most apps use to determine which encoding something is
in, and it tends towards CP-1252 if it can't figure it out. I assume that is
what you mean when you say everyone saves things in ISO-8859-1 on Windows.

Every editor I know of has a very simple encoding setting to force the editor
to a specific encoding. Set it to UTF-8 and all your problems will go away.
Note also that CP-1252 is not used in most of the world, so this assertion
that most pages are saved in ISO-8859-1 is obviously not true.

Regardless, this is not something that will be reverted. CP-1252 is
disappearing and I think you will find much less of it in Windows 8, as it
really doesn't play well with HTML5.


Previous Comments:
------------------------------------------------------------------------
[2012-08-19 05:02:02] soapergem at gmail dot com

With respect, the 72% figure you cited is misleading at best. The character
encoding listed in the HTML gives no indication of what encoding the files
were actually saved in. All it is is a <meta> tag in the <head> that says
UTF-8. I would suspect the vast majority of those files are still saved in
ISO-8859-1, though.

My prediction is that you're going to get A LOT of complaints over the
switch -- especially from Windows users, who almost always save things in
ISO-8859-1, since that is the default encoding in Windows. With PHP on
Windows ever growing, fighting the Windows users is just shooting yourself in
the foot.

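The point above, that a declared charset says nothing about the bytes actually
stored, can be checked directly. The following is only a minimal sketch:
looks_like_utf8() is a hypothetical helper name that does not appear in this
report, and it relies on the PCRE /u modifier rejecting a subject that is not
valid UTF-8.

```php
<?php
// Hypothetical helper for illustration; not part of PHP or of this report.
function looks_like_utf8($bytes)
{
    // With the /u modifier PCRE validates the subject as UTF-8 first.
    // preg_match() returns false (not 0) when that validation fails,
    // and 1 here otherwise, because the empty pattern always matches.
    return preg_match('//u', $bytes) === 1;
}

$latin1 = "Copyright \xA9 2012";     // a lone 0xA9 byte is not valid UTF-8
$utf8   = "Copyright \xC2\xA9 2012"; // 0xC2 0xA9 is U+00A9 encoded as UTF-8

var_dump(looks_like_utf8($latin1));  // bool(false)
var_dump(looks_like_utf8($utf8));    // bool(true)
```

A true result only means the bytes are well-formed UTF-8 (plain ASCII passes
as well). A false result is the useful one: that data would make
htmlentities() return an empty string under the UTF-8 default.
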
------------------------------------------------------------------------
[2012-08-19 04:38:50] ras...@php.net

UTF-8 is only compatible with low-ascii, not with high. The copyright symbol
in ISO-8859-1 is character code (in hex) <A9>. In UTF-8 the copyright symbol
is represented by two bytes, <C2><A9>.

The world has gone UTF-8. If your editor is in UTF-8 mode and you enter/paste
a copyright symbol and pass it to htmlentities() you will get "&copy;" back.
So rather than change the code to hardcode ISO-8859-1 you should convert your
datasources to UTF-8. Most of them are probably already UTF-8, which means
that your current code was likely not handling these correctly since it
assumed ISO-8859-1 before.

For some perspective:

http://w3techs.com/technologies/overview/character_encoding/all

which shows that 72% of the top-million sites on the Web are using UTF-8. And
this number is growing.

------------------------------------------------------------------------
[2012-08-19 04:14:03] soapergem at gmail dot com

Description:
------------
Doesn't UTF-8 include basic ASCII characters, too? Right now when I try to
encode the copyright symbol (©) using htmlentities (it should encode to
&copy;), it doesn't work. I discovered this since the default encoding for
htmlentities() was switched from ISO-8859-1 to UTF-8 in version 5.4.

I have plenty of places where I rely on basic symbols, such as the copyright
symbol, being encoded properly with htmlentities(). Having to go in and
change all the instances of htmlentities($string) to htmlentities($string,
ENT_COMPAT | ENT_HTML401, 'ISO-8859-1') is not practical (there are MANY).
And with the whole output of the function being blank, it just makes my
scripts completely unusable now. Help!

Test script:
---------------
<?php
echo htmlentities('©', ENT_COMPAT | ENT_HTML401, 'UTF-8');
?>

Expected result:
----------------
&copy;

Actual result:
--------------
(Nothing - an empty string)

------------------------------------------------------------------------
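
The byte-level explanation in the thread (<A9> in ISO-8859-1 versus <C2><A9>
in UTF-8) can be reproduced without depending on how the script file itself
was saved. This is a rough sketch, not the reporter's code: the variable names
are made up for illustration, the bytes are written as escapes on the
assumption that the reporter's file was saved as ISO-8859-1, and the
conversion step assumes the iconv extension is available.

```php
<?php
$flags = ENT_COMPAT | ENT_HTML401;

// The copyright sign as raw bytes, so the file's own encoding is irrelevant.
$latin1 = "\xA9";      // ISO-8859-1: one byte
$utf8   = "\xC2\xA9";  // UTF-8: two bytes

// A lone 0xA9 is an invalid UTF-8 sequence, so with these flags
// htmlentities() returns an empty string instead of an entity.
$asUtf8 = htmlentities($latin1, $flags, 'UTF-8');

// Naming the encoding the bytes are really in gives the entity.
$asLatin1 = htmlentities($latin1, $flags, 'ISO-8859-1');

// The suggestion in the thread: convert the data to UTF-8 first.
// Assumption: iconv is available in this PHP build.
$converted = iconv('ISO-8859-1', 'UTF-8', $latin1);
$fixed     = htmlentities($converted, $flags, 'UTF-8');

var_dump($asUtf8);              // string(0) ""
var_dump($asLatin1);            // string(6) "&copy;"
var_dump(bin2hex($converted));  // string(4) "c2a9"
var_dump($fixed);               // string(6) "&copy;"
var_dump($converted === $utf8); // bool(true)
```

The first result matches the "Actual result" reported above, and the last
three show the two ways out discussed in the thread: pass the real encoding
explicitly, or convert the source data to UTF-8 once.
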