In more recent versions of lxml the tostring() method can return extra text after the closing tag of the node I've passed to it. So instead of returning b'<form action="action1">\n</form>\n' it returns b'<form action="action1">\n</form>\n</body>\n</html>\n'
Here's a (python3) script along with two outputs, one from a machine running lxml 4.6.5 and one running 4.8.0. NOTE the output only changes if the DOCTYPE line is left in the "html" variable.

    import sys
    from lxml import etree

    print("%-20s: %s" % ('Python', sys.version_info))
    print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
    print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
    print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
    print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
    print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))

    html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html>
    <body>
    <form action="action1">
    </form>
    </body>
    </html>
    """

    parser = etree.XMLParser()
    doc = etree.fromstring(html, parser=parser)
    nodeList = doc.xpath("//form")
    print(etree.tostring(nodeList[0]))

This is the output I would expect to see:

    Python              : sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
    lxml.etree          : (4, 6, 5, 0)
    libxml used         : (2, 9, 10)
    libxml compiled     : (2, 9, 10)
    libxslt used        : (1, 1, 34)
    libxslt compiled    : (1, 1, 34)
    b'<form action="action1">\n</form>\n'
    #<------ Notice how the tostring() has returned the opening and closing <form> node (as I expected)

This is the output I get when I upgrade:

    Python              : sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
    lxml.etree          : (4, 8, 0, 0)
    libxml used         : (2, 9, 12)
    libxml compiled     : (2, 9, 12)
    libxslt used        : (1, 1, 34)
    libxslt compiled    : (1, 1, 34)
    b'<form action="action1">\n</form>\n</body>\n</html>\n'
    #<-------- Notice how the tostring() has returned extra text after the closing </form> tag

Is this a bug? Or is this expected behaviour if the DOCTYPE is defined in the html passed to etree.fromstring()? i.e. is there a valid reason why tostring() might return an invalid XML byte string?
I've seen the same behaviour in lxml 4.7.1 but I've not tried 4.9.0 as it's not in my repo yet.

Any help appreciated! If this is a deliberate change I've got quite a lot of legacy code that will need updating to cope.