Bug #62861 [Nab]: htmlentities returns empty string when it shouldn't

rasmus Sat, 18 Aug 2012 22:22:16 -0700

Edit report at https://bugs.php.net/bug.php?id=62861&edit=1


 ID:                 62861
 Updated by:         ras...@php.net
 Reported by:        soapergem at gmail dot com
 Summary:            htmlentities returns empty string when it shouldn't
 Status:             Not a bug
 Type:               Bug
 Package:            *General Issues
 Operating System:   Windows
 PHP Version:        5.4.6
 Block user comment: N
 Private report:     N

 New Comment:

I think you are confusing CP-1252 with ISO-8859-1. The default on Windows 
internally is actually UTF-16, but there is a library call named 
IsTextUnicode() which most apps use to determine which encoding something is 
in, and it tends towards CP-1252 if it can't figure it out, so I assume that 
is what you mean when you say everyone saves things in ISO-8859-1 on Windows. 
Every editor I know of has a very simple encoding setting to force the editor 
to a specific encoding. Set it to UTF-8 and all your problems will go away. 
Note also that CP-1252 is not used in most of the world, so this assertion 
that most pages are saved in ISO-8859-1 is obviously not true. Regardless, 
this is not something that will be reverted. CP-1252 is disappearing and I 
think you will find much less of it in Windows 8, as it really doesn't play 
well with HTML5.


Previous Comments:
------------------------------------------------------------------------
[2012-08-19 05:02:02] soapergem at gmail dot com

With respect, the 72% figure you cited is misleading at best. The character 
encoding declared in the HTML gives no indication of what encoding the files 
were actually saved in. All it amounts to is a <meta> tag in the <head> that 
says UTF-8. I suspect the vast majority of those files are still saved in 
ISO-8859-1, though.

My prediction is that you're going to get A LOT of complaints over the 
switch, especially from Windows users, who almost always save things in 
ISO-8859-1, since that is the default encoding in Windows. With PHP on 
Windows ever growing, fighting the Windows users is just shooting yourself in 
the foot.

------------------------------------------------------------------------
[2012-08-19 04:38:50] ras...@php.net

UTF-8 is only byte-compatible with low ASCII (0x00-0x7F), not with the high 
range. The copyright symbol in ISO-8859-1 is character code (in hex) <A9>. In 
UTF-8 the copyright symbol is represented by two bytes, <C2><A9>. The world 
has gone UTF-8. If your editor is in UTF-8 mode and you enter/paste a 
copyright symbol and pass it to htmlentities(), you will get "&copy;" back. 
So rather than change the code to hardcode ISO-8859-1, you should convert 
your data sources to UTF-8. Most of them are probably already UTF-8, which 
means that your current code was likely not handling these correctly, since 
it assumed ISO-8859-1 before.
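
The byte-level difference described above can be checked with a short 
script. This is only a minimal sketch: the variable names are illustrative, 
the bytes are written as escapes so the result does not depend on how the 
file itself is saved, and iconv() is assumed to be available as the 
conversion function (it is one option among several).

```php
<?php
// The copyright sign as raw bytes, written with escapes so the result
// does not depend on the encoding this file is saved in.
$latin1 = "\xA9";     // ISO-8859-1: one byte
$utf8   = "\xC2\xA9"; // UTF-8: two bytes

$flags = ENT_COMPAT | ENT_HTML401;

// A lone 0xA9 byte is not a valid UTF-8 sequence, so with these flags
// htmlentities() returns an empty string.
$invalid = htmlentities($latin1, $flags, 'UTF-8');

// The same character as valid UTF-8, and as ISO-8859-1 with a matching
// charset argument, both encode as expected.
$fromUtf8   = htmlentities($utf8, $flags, 'UTF-8');
$fromLatin1 = htmlentities($latin1, $flags, 'ISO-8859-1');

// Converting the data source to UTF-8 (iconv is an assumption here).
$converted = iconv('ISO-8859-1', 'UTF-8', $latin1);

var_dump($invalid);            // string(0) ""
var_dump($fromUtf8);           // string(6) "&copy;"
var_dump($fromLatin1);         // string(6) "&copy;"
var_dump(bin2hex($converted)); // string(4) "c2a9"
```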

For some perspective: 
http://w3techs.com/technologies/overview/character_encoding/all
which shows that 72% of the top-million sites on the Web are using UTF-8. And 
this number is growing.

------------------------------------------------------------------------
[2012-08-19 04:14:03] soapergem at gmail dot com

Description:
------------
Doesn't UTF-8 include basic ASCII characters, too? Right now when I try to 
encode the copyright symbol (©) using htmlentities (it should encode to 
&copy;), it doesn't work. I discovered this since the default encoding for 
htmlentities() was switched from ISO-8859-1 to UTF-8 in version 5.4.

I have plenty of places where I rely on basic symbols, such as the copyright 
symbol, being encoded properly with htmlentities(). Having to go in and change 
all the instances of htmlentities($string) to htmlentities($string, ENT_COMPAT 
| ENT_HTML401, 'ISO-8859-1') is not practical (there are MANY). And with the 
whole output of the function being blank, it just makes my scripts completely 
unusable now.

Help!
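
For illustration only, the explicit-charset call quoted above could be 
wrapped once instead of being repeated at every call site. The helper name 
legacy_htmlentities() is hypothetical and not part of PHP, and the sketch 
assumes the input really is ISO-8859-1.

```php
<?php
/**
 * Hypothetical helper (not part of PHP): escapes a string that is
 * assumed to be ISO-8859-1 by passing the charset explicitly.
 */
function legacy_htmlentities($string)
{
    return htmlentities($string, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1');
}

// "\xA9" is the ISO-8859-1 byte for the copyright sign.
echo legacy_htmlentities("\xA9 2012"), "\n"; // &copy; 2012
```

Call sites would still need a one-time rename, so this only narrows the 
change; it does not remove it.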

Test script:
---------------
<?php

echo htmlentities('©', ENT_COMPAT | ENT_HTML401, 'UTF-8');

?>

Expected result:
----------------
&copy;

Actual result:
--------------
(Nothing - an empty string)


------------------------------------------------------------------------



