In more recent versions of lxml the tostring() method can return extra text 
after the closing tag of the node I've passed to it. So instead of returning
b'<form action="action1">\n</form>\n'
it returns
b'<form action="action1">\n</form>\n</body>\n</html>\n'

Here's a (python3) script along with two outputs, one from a machine running 
lxml 4.6.5 and one running 4.8.0. NOTE the output only changes if the DOCTYPE 
line is left in the "html" variable.

import sys
from lxml import etree
print("%-20s: %s" % ('Python',           sys.version_info))
print("%-20s: %s" % ('lxml.etree',       etree.LXML_VERSION))
print("%-20s: %s" % ('libxml used',      etree.LIBXML_VERSION))
print("%-20s: %s" % ('libxml compiled',  etree.LIBXML_COMPILED_VERSION))
print("%-20s: %s" % ('libxslt used',     etree.LIBXSLT_VERSION))
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<body>
<form action="action1">
</form>
</body>
</html>
"""
parser = etree.XMLParser()
doc = etree.fromstring(html, parser=parser)
nodeList = doc.xpath("//form")
print(etree.tostring(nodeList[0]))


This is the output I would expect to see:
Python              : sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
lxml.etree          : (4, 6, 5, 0)
libxml used         : (2, 9, 10)
libxml compiled     : (2, 9, 10)
libxslt used        : (1, 1, 34)
libxslt compiled    : (1, 1, 34)
b'<form action="action1">\n</form>\n'
(Note: tostring() has returned only the <form> node, from its opening tag to its closing tag, as I expected.)


This is the output I get when I upgrade:
Python              : sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
lxml.etree          : (4, 8, 0, 0)
libxml used         : (2, 9, 12)
libxml compiled     : (2, 9, 12)
libxslt used        : (1, 1, 34)
libxslt compiled    : (1, 1, 34)
b'<form action="action1">\n</form>\n</body>\n</html>\n'
(Note: tostring() has returned extra text after the closing </form> tag.)

Is this a bug? Or is this expected behaviour if the DOCTYPE is defined in the
html passed to etree.fromstring()? That is, is there a valid reason why
tostring() might return a byte string that is not well-formed XML?
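
To show that the second byte string really cannot be parsed on its own, both outputs can be fed back through a parser. This is only a minimal sketch: it uses the standard library's xml.etree.ElementTree rather than lxml, and the helper name is_well_formed is made up for illustration and is not part of either library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(data: bytes) -> bool:
    """Return True if data parses as one well-formed XML element."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        # Raised for stray closing tags, unclosed elements, and similar errors.
        return False
    return True


# The two byte strings from this report, copied verbatim.
expected = b'<form action="action1">\n</form>\n'
after_upgrade = b'<form action="action1">\n</form>\n</body>\n</html>\n'

print(is_well_formed(expected))       # True
print(is_well_formed(after_upgrade))  # False
```

A check like this could sit next to the tostring() calls in legacy code as a guard. It only detects the problem; it does not work around it.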

I've seen the same behaviour in lxml 4.7.1 but I've not tried 4.9.0 as it's not 
in my repo yet.

Any help appreciated! If this is a deliberate change I've got quite a lot of 
legacy code that will need updating to cope.