ID: 41980
User updated by: borys dot forytarz at gmail dot com
Reported By: borys dot forytarz at gmail dot com
Status: Open
Bug Type: DOM XML related
Operating System: Linux
PHP Version: 5.2.0
New Comment:
I have also found that if I add this line to contents.tpl:
<meta http-equiv="content-type" content="text/html; charset=iso-8859-2" />
before <content>, the Polish characters are displayed correctly. Strangely, if
I set:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
they are broken again. Strangest of all, main.tpl has
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
and the characters from that file are displayed correctly. The server
also sends an HTTP header telling the browser that the content is UTF-8.
And if I change the declaration in main.tpl to:
<meta http-equiv="content-type" content="text/html; charset=iso-8859-2" />
those characters are broken again.
Previous Comments:
------------------------------------------------------------------------
[2007-07-12 20:48:10] borys dot forytarz at gmail dot com
I have checked the encodings of the files.
mb_detect_encoding() reports that they are ASCII-encoded (!?). So I
wrote a simple script to convert them to UTF-8:
<?php
$cont = file_get_contents('login.php.tpl');
$f = fopen('login.php.tpl','w');
echo "\n".mb_detect_encoding('login.php.tpl').' > ';
fwrite($f,mb_convert_encoding($cont,'utf-8'));
echo mb_detect_encoding('login.php.tpl')."\n";
fclose($f);
?>
and the output is: ASCII > ASCII (I expected ASCII > UTF-8).
Using iconv instead of mb_convert_encoding gives the same result.
What's going on?
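
One possible reading of the "ASCII > ASCII" output: the script passes the file name 'login.php.tpl' to mb_detect_encoding(), not the bytes read from the file, and the name contains only ASCII characters. The sketch below runs detection on the contents instead. The helper name detect_bytes_encoding() and the sample string are hypothetical, and the candidate list (ASCII, UTF-8) is an assumption:

```php
<?php
// Hypothetical helper: detect the encoding of a byte string in strict
// mode, considering only ASCII and UTF-8 as candidates (an assumption).
function detect_bytes_encoding($bytes)
{
    return mb_detect_encoding($bytes, array('ASCII', 'UTF-8'), true);
}

// The file name contains only ASCII characters, so running detection
// on the name says nothing about what is stored inside the file.
$name = 'login.php.tpl';
echo detect_bytes_encoding($name), "\n";

// Stand-in for file_get_contents($name): the letters "ę ó ą" written
// out as UTF-8 bytes, so the example does not need a real file.
$cont = "\xC4\x99 \xC3\xB3 \xC4\x85";
echo detect_bytes_encoding($cont), "\n";

// mb_check_encoding() tells whether the bytes are valid UTF-8.
var_dump(mb_check_encoding($cont, 'UTF-8'));
```

In the real script, $cont would come from file_get_contents('login.php.tpl'), and the same variable would be passed to both the detection and the conversion calls.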
------------------------------------------------------------------------
[2007-07-12 20:38:33] [EMAIL PROTECTED]
Please try using this CVS snapshot:
http://snaps.php.net/php5.2-latest.tar.gz
For Windows (zip):
http://snaps.php.net/win32/php5.2-win32-latest.zip
For Windows (installer):
http://snaps.php.net/win32/php5.2-win32-installer-latest.msi
------------------------------------------------------------------------
[2007-07-12 19:58:58] borys dot forytarz at gmail dot com
there should be:
...
foreach($content->childNodes as $child) {
...
sorry
------------------------------------------------------------------------
[2007-07-12 19:55:58] borys dot forytarz at gmail dot com
Here is an example.
First, the source files (both encoded in UTF-8).
First file (main.tpl):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>Some title</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body>
Some Polish letters: ę ó ą ś ć ż ź ń
- they are encoded correctly and display correctly.
</body>
</html>
Second file (contents.tpl):
<content>
<h1>some Polish letters, like: ę ó ł ą ś ć
ź ń ż - they are not encoded correctly and do not
display correctly.</h1>
</content>
PHP file:
<?php
$dom = new DOMDocument('1.0','UTF-8');
$dom->loadHTMLFile('main.tpl');
$dom2 = new DOMDocument('1.0','UTF-8');
$dom2->loadHTMLFile('contents.tpl');
$contents = $dom2->getElementsByTagName('content');
// item(), not items(), returns a single node from the list
$body = $dom->getElementsByTagName('body')->item(0);
foreach($contents as $content) {
    // iterate over the children of each <content> element
    foreach($content->childNodes as $child) {
        // import a deep copy into the main document before appending
        $imp = $dom->importNode($child,true);
        $body->appendChild($imp);
    }
}
echo $dom->saveXML();
?>
It is roughly like the above. I wrote it from memory because the real
script is huge, but it demonstrates the idea and what is going
wrong.
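
A possible workaround, sketched under assumptions: the HTML parser behind loadHTML() and loadHTMLFile() does not assume UTF-8 when the input carries no charset declaration, so a bare fragment such as contents.tpl may be decoded as a single-byte encoding. The sketch prepends a charset hint before parsing. The helper name load_utf8_fragment() and the sample fragment are hypothetical, and it only applies if the fragment's bytes really are UTF-8:

```php
<?php
// Hypothetical helper: parse an HTML fragment that is assumed to be
// UTF-8, telling the parser so with a meta element placed in front.
function load_utf8_fragment($html)
{
    // Collect parser warnings (e.g. for the unknown <content> tag)
    // internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument('1.0', 'UTF-8');
    // The hint is an addition for parsing only; it is not part of
    // the original template.
    $hint = '<meta http-equiv="content-type" content="text/html; charset=utf-8" />';
    $doc->loadHTML($hint . $html);
    libxml_clear_errors();
    return $doc;
}

// Sample fragment: "ę ó ł" written out as UTF-8 bytes.
$fragment = "<content><h1>\xC4\x99 \xC3\xB3 \xC5\x82</h1></content>";
$doc = load_utf8_fragment($fragment);
$h1 = $doc->getElementsByTagName('h1')->item(0);
echo $h1->textContent, "\n";
```

Nodes taken from a document parsed this way can then be passed to importNode() as in the example. If the fragment is actually stored in another encoding such as ISO-8859-2, the hint would need to name that encoding instead.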
------------------------------------------------------------------------
[2007-07-12 19:24:45] borys dot forytarz at gmail dot com
Description:
------------
There is a problem with DOM and encoding. I have two separate files.
The first is a full XHTML document (DTD, head, meta, body and more content)
saved in UTF-8. Its meta declaration says UTF-8, and the server sends the
page as UTF-8 too. The second is a simple file without any DTD, head, meta
or body, also saved in UTF-8. The problem is that when I import nodes from
the second file using importNode(), the output contains incorrectly encoded
characters (the ones that came from the second file). This is strange
because, as far as I have read, DOM works in UTF-8 internally, so there
should be no such problem.
What is more, I inspected properties such as actualEncoding while
debugging, and they showed UTF-8...
If it's not a bug (though I think it is), how can I fix it? I can't declare
the DTD, head and body elements in the second file.
Reproduce code:
---------------
$this->dom = new DOMDocument('1.0','UTF-8');
$this->dom->encoding = 'UTF-8';
$this->dom->formatOutput = self::$formatOutput;
$this->dom->preserveWhiteSpace = self::$preserveWhiteSpace;
@$this->dom->loadHtmlFile($html);
...
echo $this->dom->saveXML();
The above works well for the complete XHTML file. But when I load an
incomplete file (encoded in UTF-8), the characters are not encoded
properly after I import nodes from the second document into the first
one.
I tried converting the whole output with iconv() and
mb_convert_encoding(), but it made no difference at all.
Expected result:
----------------
Properly encoded characters from both the complete XHTML file and the
second, "poor" file. The second file looks like this:
<content id="something">
<h1>some string</h1>
</content>
Actual result:
--------------
Incorrectly encoded characters inside the <content> tag.
------------------------------------------------------------------------