Edit report at https://bugs.php.net/bug.php?id=47875&edit=1
ID: 47875
Comment by: julien at go-on-web dot com
Reported by: thomas dot koch at ymc dot ch
Summary: No option to set HTML input encoding
Status: Open
Type: Feature/Change Request
Package: DOM XML related
Operating System: Debian Lenny
PHP Version: 5.2.9
Block user comment: N
Private report: N
New Comment:
I have another test case for you, using HTML5:
<?php
// -----
// FAIL CASE
$html = <<<HTML
<!DOCTYPE html>
<html lang="fr">
<head>
<meta charset="UTF-8"/>
</head>
<body>
<p id="accent">Test case with simple accent (é) : é</p>
</body>
</html>
HTML;
$doc = new DOMDocument( '1.0', 'UTF-8' );
$doc->loadHTML( $html );
var_dump( $doc->getElementById('accent')->textContent );
//=> string(40) "Test case with simple accent (é) : é"
// ----
// -----
// SUCCESS CASE (uses the older http-equiv declaration instead of <meta charset>)
$html = <<<HTML
<!DOCTYPE html>
<html lang="fr">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8"/>
</head>
<body>
<p id="accent">Test case with simple accent (é) : é</p>
</body>
</html>
HTML;
$doc = new DOMDocument( '1.0', 'UTF-8' );
$doc->loadHTML( $html );
var_dump( $doc->getElementById('accent')->textContent );
//=> string(38) "Test case with simple accent (é) : é"
// -----
?>
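The garbled output in the fail case looks like UTF-8 bytes being read as ISO-8859-1 and then re-encoded as UTF-8. The following is only a sketch of that suspected mechanism, not a trace of what libxml2 actually does internally; it assumes the mbstring extension is available and does not touch DOMDocument at all.

```php
<?php
// Illustration only: reproduce the suspected mis-decoding without DOMDocument.

// "é" (U+00E9) written as its two UTF-8 bytes, so the example does not
// depend on the encoding this file is saved in.
$utf8 = "\xC3\xA9";

// Treat those two bytes as two ISO-8859-1 characters (U+00C3 and U+00A9)
// and re-encode each of them as UTF-8.
$misread = mb_convert_encoding( $utf8, 'UTF-8', 'ISO-8859-1' );

var_dump( strlen( $utf8 ) );      // byte length of the correct string
var_dump( strlen( $misread ) );   // byte length after the wrong round trip
var_dump( bin2hex( $misread ) );  // raw bytes of the garbled result
```

Each character mis-decoded this way doubles in byte length, which is why the reported string length grows when the charset declaration is not honoured.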
Regards,
Julien
Previous Comments:
------------------------------------------------------------------------
[2009-04-02 09:07:32] thomas dot koch at ymc dot ch
Description:
------------
Enhancement request.
I need a way to specify the HTML input encoding (as parsed from the
HTTP headers) when parsing an HTML string with DOMDocument::loadHTML. Using
loadHTMLFile is not always an option.
libxml2 honors the content-type meta tag, but it may not always be present.
How should the input encoding be specified? In DOMDocument::__construct() or in
DOMDocument::encoding, or are the two the same?
One could look in libxml2/HTMLparser.c#5580, function
htmlCreateFileParserCtxt(const char *filename, const char *encoding)
There the encoding is set by first building a "charset=$encoding" string and
passing it to htmlCheckEncoding, which in turn parses the encoding out of the
string again. This may be worth cleaning up together with upstream.
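Until such an option exists, one possible userland workaround is to make the parser input pure ASCII, so that no encoding guess is needed. The sketch below assumes the mbstring extension and a PHP version with scalar type declarations. The function name loadHtmlWithCharset() is hypothetical and not part of the DOM extension, and $charset stands for whatever value was taken from the HTTP headers.

```php
<?php
/**
 * Hypothetical helper (not part of the DOM extension): load an HTML string
 * whose encoding is known from outside the document, e.g. an HTTP header.
 */
function loadHtmlWithCharset( string $html, string $charset ): DOMDocument
{
    // Normalise the input to UTF-8 first.
    $utf8 = mb_convert_encoding( $html, 'UTF-8', $charset );

    // Replace every non-ASCII character with a numeric character reference,
    // so the bytes handed to the parser are plain ASCII.
    $ascii = mb_encode_numericentity( $utf8, [ 0x80, 0x10FFFF, 0, 0x1FFFFF ], 'UTF-8' );

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors( true );
    $doc->loadHTML( $ascii );
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    return $doc;
}

// Usage with the text from the reproduce code, written as UTF-8 bytes
// ("s", U+00FC, U+00DF) and without any meta tag.
$html = "<html><head></head><body id=\"umlaut\">s\xC3\xBC\xC3\x9F</body></html>";
$doc  = loadHtmlWithCharset( $html, 'UTF-8' );
$text = $doc->getElementById( 'umlaut' )->textContent;
var_dump( bin2hex( $text ) );
```

This is a stopgap rather than the requested feature. Character references are not decoded inside script or style elements, so non-ASCII text there would need separate handling.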
Reproduce code:
---------------
<?php
$html = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<!--meta http-equiv="content-type" content="text/html; charset=utf-8" -->
</head>
<body id="umlaut">süß</body>
</html>
EOT;
$dom = new DOMDocument;
var_dump( $dom->loadHTML( $html ) );
$elem = $dom->getElementById( 'umlaut' );
echo $elem->textContent;
------------------------------------------------------------------------