Edit report at https://bugs.php.net/bug.php?id=47875&edit=1
ID:                 47875
Comment by:         julien at go-on-web dot com
Reported by:        thomas dot koch at ymc dot ch
Summary:            No option to set HTML input encoding
Status:             Open
Type:               Feature/Change Request
Package:            DOM XML related
Operating System:   Debian Lenny
PHP Version:        5.2.9
Block user comment: N
Private report:     N

 New Comment:

I have another test case for you, using HTML5:

<?php
// -----
// FAIL CASE
$html = <<<HTML
<!DOCTYPE html>
<html lang="fr">
  <head>
    <meta charset="UTF-8"/>
  </head>
  <body>
    <p id="accent">Test case with simple accent (é) : é</p>
  </body>
</html>
HTML;

$doc = new DomDocument( 1.0, 'UTF-8' );
$doc->loadHTML( $html );
var_dump( $doc->getElementById('accent')->textContent );
//=> string(40) "Test case with simple accent (é) : é"
// -----

// -----
// SUCCESS CASE (but invalid html5)
$html = <<<HTML
<!DOCTYPE html>
<html lang="fr">
  <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8"/>
  </head>
  <body>
    <p id="accent">Test case with simple accent (é) : é</p>
  </body>
</html>
HTML;

$doc = new DomDocument( 1.0, 'UTF-8' );
$doc->loadHTML( $html );
var_dump( $doc->getElementById('accent')->textContent );
//=> string(38) "Test case with simple accent (é) : é"
// -----
?>

Regards,
Julien


Previous Comments:
------------------------------------------------------------------------
[2009-04-02 09:07:32] thomas dot koch at ymc dot ch

Description:
------------
Enhancement request.

I need a way to specify the HTML input encoding (as parsed from the
HTTP headers) when parsing an HTML string with DOMDocument::loadHTML.
Using loadHTMLFile is not always an option.

libxml2 honors the content-type meta tag, but that tag may not always be
present.

How should the input encoding be indicated? In
DOMDocument::__construct(), in DOMDocument::encoding, or are the two
the same?

One could look in libxml2/HTMLparser.c#5580, function
htmlCreateFileParserCtxt(const char *filename, const char *encoding).
There the encoding is set by first building a "charset=$encoding" string
and passing it to htmlCheckEncoding, which in turn parses the encoding
out of the string again. This may be worth cleaning up together with
upstream.

Reproduce code:
---------------
<?php

$html = <<<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<!--meta http-equiv="content-type" content="text/html; charset=utf-8" -->
</head>
<body id="umlaut">süß</body>
</html>
EOT;

$dom = new DOMDocument;
var_dump( $dom->loadHTML( $html ) );
$elem = $dom->getElementById( 'umlaut' );
echo $elem->textContent;

------------------------------------------------------------------------
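
Possible workaround, as a sketch only: since the report states that libxml2 honors the content-type meta tag, the input encoding known from the HTTP headers can be declared by prepending such a tag before calling loadHTML. The helper name loadHtmlAs and its parameters are hypothetical and not part of ext/dom; the behavior assumed here is only what the test cases in this report show, and it may differ between libxml2 versions.

```php
<?php
// Hypothetical helper (not part of ext/dom): declares the input encoding by
// prepending the content-type meta tag that libxml2 is reported to honor.
function loadHtmlAs(string $html, string $encoding): DOMDocument
{
    $meta = '<meta http-equiv="content-type" content="text/html; charset='
          . htmlspecialchars($encoding, ENT_QUOTES) . '">';

    $doc = new DOMDocument();
    // Collect parser warnings instead of emitting them; a DOCTYPE that
    // follows the injected tag may be reported as misplaced.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($meta . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Usage: the encoding would normally come from the HTTP Content-Type header.
$doc = loadHtmlAs('<p id="accent">é</p>', 'UTF-8');
var_dump($doc->getElementsByTagName('p')->item(0)->textContent);
```

Note that the injected meta element stays in the resulting DOM, so code that serializes the document again may want to remove it first. This does not replace the requested feature; it only avoids having to edit the markup by hand.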