Edit report at https://bugs.php.net/bug.php?id=62861&edit=1
ID:                 62861
Updated by:         ni...@php.net
Reported by:        soapergem at gmail dot com
Summary:            htmlentities returns empty string when it shouldn't
Status:             Not a bug
Type:               Bug
Package:            *General Issues
Operating System:   Windows
PHP Version:        5.4.6
Block user comment: N
Private report:     N

New Comment:

Save your document as UTF-8 *without* BOM. The ï»¿ is just what the UTF-8 Byte Order Mark (BOM) looks like when it is output (which is probably something you don't want, so save the file without it).

Previous Comments:
------------------------------------------------------------------------
[2012-08-19 13:49:39] ras...@php.net

From my command line:

php > echo htmlentities('©', ENT_COMPAT | ENT_HTML401, 'UTF-8');
&copy;

it works fine. If you are actually providing the correct UTF-8 char it will work fine. You can verify that by doing this:

php > $a = chr(0xC2).chr(0xA9);
php > echo htmlentities($a, ENT_COMPAT | ENT_HTML401, 'UTF-8');
&copy;

Here I am explicitly passing C2A9 in and I get &copy; back out. So I have no idea what your Windows Notepad is doing. Look at the output with a hex editor and see what it is converting that copyright character to.

------------------------------------------------------------------------
[2012-08-19 13:30:07] soapergem at gmail dot com

Yes, your assumptions about what I was meaning to say were correct. I really meant "ANSI," which you know as CP-1252. But there is definitely still a bug with this. I just followed your instructions by saving my test script specifically in the "UTF-8" encoding hoping that, as you said, "all my problems will go away." They didn't.

My test script is exactly the same one that I have listed on this bug report. I saved it in Windows Notepad, using the "UTF-8" encoding. I am no longer getting an empty string -- which is progress. But now I am getting the following output:

ï»¿&copy;

This is definitely NOT the expected result here.
It did finally convert the copyright symbol, but it prepended not one, not two, but THREE junk characters in front of it. This is even worse than before.

If I'm not mistaken, wasn't the whole reason PHP6 was abandoned because the idea of converting everything to Unicode was deemed too ambitious? I've already spent far more time dealing with this than is practical, as I'm sure you have much better things to do, as well. It just seems to me that you guys had a wonderful hammer -- a wonderful tool for the job -- and you went and broke off the hammer head for no apparent reason.

If I might make a humble suggestion, why not let htmlentities() default to whatever the default_charset option is in php.ini? Right now you can only do that by explicitly passing an empty string as the third parameter to htmlentities, which is very messy and counterintuitive. Shouldn't the default_charset actually be, you know, the _default character set_?

------------------------------------------------------------------------
[2012-08-19 05:22:03] ras...@php.net

I think you are confusing CP-1252 with ISO-8859-1. And the default on Windows internally is actually UTF-16, but there is a library call named IsTextUnicode() which most apps use to determine which encoding something is in, and it tends towards CP-1252 if it can't figure it out, so I assume that is what you mean when you say everyone saves things in ISO-8859-1 on Windows. Every editor I know of has a very simple encoding setting to force the editor to a specific encoding. Set it to UTF-8 and all your problems will go away. Note also that CP-1252 is not used in most of the world, so this assertion that most pages are saved in ISO-8859-1 is obviously not true.

Regardless, this is not something that will be reverted. CP-1252 is disappearing and I think you will find much less of it in Windows 8 as it really doesn't play well with HTML5.
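
The three stray characters reported in the 13:30:07 comment are consistent with a UTF-8 byte order mark (bytes EF BB BF) at the start of the saved file. For a script file the fix is to save it without a BOM; for data read from elsewhere, a minimal sketch of stripping a leading BOM before calling htmlentities() might look like the following. The helper name strip_utf8_bom() is hypothetical (not a PHP built-in), the input string only simulates a file saved with a BOM, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper (not a PHP built-in): drop a leading UTF-8 BOM if present.
function strip_utf8_bom(string $text): string
{
    $bom = "\xEF\xBB\xBF"; // UTF-8 encoding of U+FEFF
    if (strncmp($text, $bom, 3) === 0) {
        return substr($text, 3);
    }
    return $text;
}

// Simulated contents of a file saved as "UTF-8" with a BOM:
// the BOM followed by the two-byte UTF-8 copyright sign.
$withBom = "\xEF\xBB\xBF" . "\xC2\xA9";

$flags = ENT_COMPAT | ENT_HTML401;
$raw   = htmlentities($withBom, $flags, 'UTF-8');
$clean = htmlentities(strip_utf8_bom($withBom), $flags, 'UTF-8');

echo bin2hex(substr($raw, 0, 3)), "\n"; // efbbbf: the BOM bytes pass through untouched
echo $clean, "\n";                       // &copy;
```

Viewed as ISO-8859-1 or CP-1252, the bytes EF BB BF render as three separate characters, which matches the "THREE junk characters" described in that comment.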
------------------------------------------------------------------------
[2012-08-19 05:02:02] soapergem at gmail dot com

With respect, the 72% figure you cited is misleading at best. The character encoding listed in the HTML gives no indication of what encoding the files were actually saved in. All it is is a <meta> tag in the <head> that says UTF-8. I would suspect the vast majority of those files are still saved in ISO-8859-1, though.

My prediction is that you're going to get A LOT of complaints over the switch -- especially from Windows users, who almost always save things in ISO-8859-1, since that is the default encoding in Windows. With PHP on Windows ever growing, fighting the Windows users is just shooting yourself in the foot.

------------------------------------------------------------------------
[2012-08-19 04:38:50] ras...@php.net

UTF-8 is only compatible with low-ascii, not with high. The copyright symbol in ISO-8859-1 is character code (in hex) <A9>. In UTF-8 the copyright symbol is represented by two bytes, <C2><A9>. The world has gone UTF-8. If your editor is in UTF-8 mode and you enter/paste a copyright symbol and pass it to htmlentities() you will get "&copy;" back. So rather than change the code to hardcode ISO-8859-1 you should convert your datasources to UTF-8. Most of them are probably already UTF-8, which means that your current code was likely not handling these correctly since it assumed ISO-8859-1 before.

For some perspective:

http://w3techs.com/technologies/overview/character_encoding/all

which shows that 72% of the top-million sites on the Web are using UTF-8. And this number is growing.

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view the rest of the comments, please view the bug report online at

    https://bugs.php.net/bug.php?id=62861
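
The byte-level difference described in the 04:38:50 comment can be checked directly. The following is a minimal sketch, assuming the iconv extension is available (an assumption about the reader's build); the variable names are illustrative only. It shows the single ISO-8859-1 byte A9, its conversion to the UTF-8 sequence C2 A9, and how htmlentities() treats each when the flags and encoding are passed explicitly.

```php
<?php
$latin1 = "\xA9";                                 // copyright sign in ISO-8859-1: one byte
$utf8   = iconv('ISO-8859-1', 'UTF-8', $latin1);  // convert the data to UTF-8

echo bin2hex($utf8), "\n";                        // c2a9

$flags = ENT_COMPAT | ENT_HTML401;

// A lone 0xA9 byte is not valid UTF-8, so htmlentities() returns an empty string.
var_dump(htmlentities($latin1, $flags, 'UTF-8'));

// The same byte converts fine when the encoding is declared as ISO-8859-1 ...
echo htmlentities($latin1, $flags, 'ISO-8859-1'), "\n";  // &copy;

// ... and so does the converted two-byte UTF-8 sequence.
echo htmlentities($utf8, $flags, 'UTF-8'), "\n";         // &copy;
```

Passing the flags explicitly keeps the result independent of the defaults, which have changed between PHP releases.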