Hello,
I’m a little puzzled by the behavior of the lxml.html.tostring() function, and
would appreciate it if somebody could shed some light on this.
The test code is as follows: first we parse a small HTML document (derived from
an actual real-world document!)
s = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body>
</body>
</html>
"""
This reads ok as XML:
lxml.etree.XML(s.encode())  # <Element {http://www.w3.org/1999/xhtml}html at 0x10e837d00>
lxml.etree.fromstring(s.encode())  # <Element {http://www.w3.org/1999/xhtml}html at 0x10e848980>
and HTML:
elm = lxml.html.fromstring(s.encode()) # <Element html at 0x10e7d00f0>
root = elm.getroottree()
root.docinfo.doctype  # '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'
Serializing this back with lxml.html.tostring() produces an unexpected string,
though:
lxml.html.tostring(elm.getroottree(), method="xml", encoding="unicode")
For lxml v5.3.0 this produces:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<?xml version="1.0" encoding="UTF-8"??><html xml:lang="en"
xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body>
</body>
</html>
and for lxml v6.0.2:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<!--?xml version="1.0" encoding="UTF-8"?--><html xml:lang="en"
xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body>
</body>
</html>
The latter parses ok with both lxml.etree.XML() and lxml.html.fromstring(),
whereas the former fails to parse as XML using lxml.etree.XML(). So it
seems that *some* behavior was changed/fixed, but I was unable to find that
mentioned in the changelog.
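To make the comparison easy to repeat under both versions, the steps above can
be collected into one script. This is only a sketch: parses_as_xml() is a
helper name made up for this example, and the libxml2 version is printed on the
assumption (which I have not verified) that the difference might come from the
bundled libxml2 rather than from lxml itself.

```python
import lxml.etree
import lxml.html

s = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body>
</body>
</html>
"""


def parses_as_xml(text):
    """Hypothetical helper: True if lxml.etree.XML() accepts the text."""
    try:
        # Encode to bytes so an encoding declaration in the text is allowed.
        lxml.etree.XML(text.encode())
        return True
    except lxml.etree.XMLSyntaxError:
        return False


# Parse with the HTML parser, then serialize the whole tree with method="xml".
elm = lxml.html.fromstring(s.encode())
out = lxml.html.tostring(elm.getroottree(), method="xml", encoding="unicode")

print("lxml:", lxml.etree.LXML_VERSION)
print("libxml2:", lxml.etree.LIBXML_VERSION)
print(out)
print("re-parses as XML:", parses_as_xml(out))
```

Running this once per installed version should show the two outputs quoted
above, together with whether each one round-trips through lxml.etree.XML().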
Both serialized documents, though, differ from the original in that the
<!DOCTYPE> and <?xml?> declarations are swapped, and the XML declaration is
either malformed (v5.3.0) or commented out (v6.0.2). Why?
Also, is there a way to generate both declarations in the original order?
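For comparison, here is a minimal sketch of one possible workaround that avoids
lxml.html entirely. It assumes the input is well-formed XML (as this XHTML
sample is) and lets lxml.etree.tostring() write the declaration itself via
xml_declaration=True. It does not explain the lxml.html.tostring() behavior; it
only shows that the XML route keeps the declaration ahead of the doctype.

```python
import lxml.etree

s = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body>
</body>
</html>
"""

# Parse with the XML parser instead of the HTML parser.
root = lxml.etree.fromstring(s.encode())
tree = root.getroottree()

# Serializing the tree (not just the root element) keeps the doctype;
# xml_declaration=True asks lxml to emit the <?xml ...?> line first.
out = lxml.etree.tostring(
    tree, xml_declaration=True, encoding="UTF-8"
).decode("utf-8")
print(out)
```

As far as I can tell, the declaration is regenerated rather than copied, so its
quoting may differ from the original, but the order (declaration, doctype, root
element) matches the input.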
Many thanks!
Jens