ID: 39269
User updated by: arturm at union dot com dot pl
Reported By: arturm at union dot com dot pl
Status: Bogus
Bug Type: DOM XML related
Operating System: Windows
PHP Version: 5.1.6
New Comment:
Below is a corrected example. It still generates wrong output. If I
remove the title tag, the output is correct. The HTML, HEAD and META
charset tags are all present, as the online comment advises.
<?php
header("Content-type: text/plain; charset=UTF-8");
$doc = new DOMDocument();
# title contains a-ogonek (U+0105) as UTF-8 bytes
# p contains some Polish small accented characters as UTF-8 bytes
$doc->loadHTML("<html><head><title>\xC4\x85</title>"
.'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
.'</head><body>'
."<p>\xC4\x85\xC4\x99\xC3\xB3\xC5\x82\xC5\x9B\xC4\x87</p></body></html>");
echo "Encoding=".$doc->encoding;
echo " Text=".$doc->textContent;
?>
Previous Comments:
------------------------------------------------------------------------
[2006-10-26 17:42:12] [EMAIL PROTECTED]
The answer is in the very first user note of DOMDocument->loadHTML():
http://php.net/manual/en/function.dom-domdocument-loadhtml.php
You must specify the character set in the <HEAD> tag so that libxml2
uses it.
We can't change this behaviour, as this is how libxml2 works.
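
A minimal sketch of one possible workaround, not a confirmed fix: the
report says the problem only affects characters that appear before the
charset declaration, so the declaration can be made the first element
inside <head>. The helper name loadUtf8Html() is hypothetical and is not
part of the DOM extension.

```php
<?php
// Hypothetical helper: builds a document in which the charset declaration
// comes before any non-ASCII content, then loads it with loadHTML().
// $title and $bodyHtml are inserted as-is (no escaping) and are assumed
// to be UTF-8 encoded.
function loadUtf8Html($title, $bodyHtml)
{
    $html = '<html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
        . '<title>' . $title . '</title>'
        . '</head><body>' . $bodyHtml . '</body></html>';

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    return $doc;
}

$doc = loadUtf8Html("\xC4\x85", "<p>\xC4\x85\xC4\x99</p>");
$titleText = $doc->getElementsByTagName('title')->item(0)->textContent;
// Print the title as hex so the result does not depend on the
// terminal or page charset.
echo bin2hex($titleText), "\n";
```

Comparing the printed hex with c485 (the UTF-8 bytes of U+0105) shows
whether the title was decoded as UTF-8.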
------------------------------------------------------------------------
[2006-10-26 17:23:27] arturm at union dot com dot pl
Sorry, the charset on bugs.php.net is not UTF-8. Please see the original
thread on pl.comp.lang.php for the source code:
http://groups.google.pl/group/pl.comp.lang.php/browse_frm/thread/e0de8a41d687aef3/d2c602e5ac1d40cb?hl=pl#d2c602e5ac1d40cb
------------------------------------------------------------------------
[2006-10-26 17:17:56] arturm at union dot com dot pl
Description:
------------
If you load HTML using DOMDocument::loadHTML(), the wrong charset is
used for non-US-ASCII characters that appear in the source before the
charset declaration in the meta tag.
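
To illustrate what a "wrong charset" looks like at the byte level, here
is a small sketch. It assumes (this is not confirmed anywhere in this
report) that bytes seen before the declaration are read as ISO-8859-1
and then re-encoded as UTF-8. The function latin1ToUtf8() is a
hypothetical helper written for this illustration only.

```php
<?php
// Hypothetical helper: treats every input byte as one ISO-8859-1
// character and re-encodes it as UTF-8.
function latin1ToUtf8($bytes)
{
    $out = '';
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            // ASCII is identical in both encodings.
            $out .= chr($b);
        } else {
            // U+0080..U+00FF become a two-byte UTF-8 sequence.
            $out .= chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
        }
    }
    return $out;
}

// a-ogonek (U+0105) is the two bytes C4 85 in UTF-8.
echo bin2hex(latin1ToUtf8("\xC4\x85")), "\n"; // prints c384c285
```

Under that assumption the single character in the title turns into two
unrelated characters, while text after the declaration is left intact.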
Reproduce code:
---------------
<?php
header("Content-type: text/plain; charset=UTF-8");
$doc = new DOMDocument();
$doc->loadHTML('<title>ą</title>'
.'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
.'<p>ąęółść</p>');
echo $doc->encoding;
echo $doc->textContent;
?>
Expected result:
----------------
UTF-8ąęółść
Actual result:
--------------
UTF-8ąąęółść
------------------------------------------------------------------------