 ID:               47108
 Updated by:       [email protected]
 Reported By:      [email protected]
-Status:           Open
+Status:           Bogus
 Bug Type:         DOM XML related
 Operating System: Windows XP
 PHP Version:      5.2.8
 New Comment:

Sorry, but your problem does not imply a bug in PHP itself. For a
list of more appropriate places to ask for help using PHP, please
visit http://www.php.net/support.php as this bug system is not the
appropriate forum for asking support questions. Due to the volume
of reports we cannot explain in detail here why your report is not
a bug. The support channels will be able to provide an explanation
for you.

Thank you for your interest in PHP.

That's how it's handled by libxml2.


Previous Comments:
------------------------------------------------------------------------

[2009-01-15 17:54:01] [email protected]

That makes sense. I updated the script to iterate through the problem
characters, and the ones you mentioned are included. Other problem
characters include 0x26, 0x3C, 0x3E, 0xA4, 0xA5 and 0xAA. The first
three make sense - they correspond to &, <, and >, respectively. The
latter three don't make as much sense to me.

Also, it seems to me that it ought to fail more gracefully than it
does. You wouldn't expect your browser to ignore all HTML after an
invalid character is encountered, and it seems to me that this
shouldn't, either.

Per your suggestion, I've filed a bug report on libxml2 here:

http://bugzilla.gnome.org/show_activity.cgi?id=567885

Not sure if that's the appropriate bug tracker, though. Also, it would
be prudent to reproduce the bug in the language libxml2 is intended
for, but alas, I don't have any C/C++ compilers on this computer.

------------------------------------------------------------------------

[2009-01-15 02:53:45] typoon at gmail dot com

The explanation for this might be that ISO-8859-7 does not have the
character 0xAE. When libxml tries to convert it, an error is thrown
because of this.

References:
http://www.itscj.ipsj.or.jp/ISO-IR/227.pdf
http://en.wikipedia.org/wiki/ISO_8859-7

Checking the PDF you will see 0xAE is not assigned.
Quoting Wikipedia:
"Code values 00–1F, 7F, 80–9F, AE, D2 and FF are not assigned to
characters by ISO/IEC 8859-7."

More information and other references can also be found on Google.

My 2 cents then are that this is not a bug at all. If you still think
it is, then we might need to open a bug report for the libxml team, as
this is an error generated inside libxml, not PHP.

Regards,

Henrique

------------------------------------------------------------------------

[2009-01-14 20:08:27] [email protected]

Description:
------------
All HTML after chr(0xAE) (if present) is ignored by DOMDocument's
loadHTML(), even if chr(0xAE) is a valid character per the HTML's
charset.

In the Reproduce code, replace chr(0xAE) with chr(0xAF) or chr(0xAD),
or just remove it altogether, and it works.

Further, if you echo out $str and copy / paste the HTML into
validator.w3.org, it's valid HTML, even with the chr(0xAE).

Reproduce code:
---------------
<?php

$str = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=iso-8859-7">
<title>test</title>
</head>
<body><p>aaaaa' . chr(0xAE) .
'zzzzz</p></body>
</html>';

$xml = new DOMDocument();
$xml->loadHTML($str);
echo $xml->saveHTML();

Expected result:
----------------
aaaaa�zzzzz

Actual result:
--------------
Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: input
conversion failed due to input error, bytes 0xAE 0x7A 0x7A 0x7A in
C:\htdocs\test.php on line 14

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: input
conversion failed due to input error, bytes 0xAE 0x7A 0x7A 0x7A in
C:\htdocs\test.php on line 14

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]:
htmlCheckEncoding: encoder error in Entity, line: 4 in
C:\htdocs\test.php on line 14

Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: input
conversion failed due to input error, bytes 0xAE 0x7A 0x7A 0x7A in
C:\htdocs\test.php on line 14

aaaaa

------------------------------------------------------------------------
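
If the explanation in the 2009-01-15 comment is right (0xAE is simply
not assigned in ISO-8859-7), the offending bytes can be located before
the markup is handed to loadHTML(). The following is only a minimal
sketch under that assumption: findUnassignedIso88597Bytes() is a
hypothetical helper, not part of PHP or libxml2, and its byte list is
taken from the Wikipedia quote (AE, D2, FF), leaving out the control
ranges. It does not cover 0xA4, 0xA5 and 0xAA, which the reporter also
saw failing but which are not in that quote.

```php
<?php
// Hypothetical helper: returns [offset => byte value] for every byte in
// $bytes that the quoted Wikipedia sentence lists as unassigned in the
// printable range of ISO-8859-7. Assumption: these are the bytes that
// the charset conversion inside libxml2 rejects.
function findUnassignedIso88597Bytes(string $bytes): array
{
    $unassigned = [0xAE, 0xD2, 0xFF];
    $hits = [];
    $length = strlen($bytes);

    for ($offset = 0; $offset < $length; $offset++) {
        $value = ord($bytes[$offset]);
        if (in_array($value, $unassigned, true)) {
            $hits[$offset] = $value;
        }
    }

    return $hits;
}

// Same paragraph text as in the Reproduce code.
$sample = 'aaaaa' . chr(0xAE) . 'zzzzz';

foreach (findUnassignedIso88597Bytes($sample) as $offset => $value) {
    printf("offset %d: 0x%02X\n", $offset, $value);
}
// prints "offset 5: 0xAE"
```

Whether to strip, replace or reject such bytes once they are found is
left open here; the sketch only reports where they are.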
