[lxml] etree.tostring returns all content after element in XHTML 1.0 Transitional?

Jim Wisniewski Thu, 10 Aug 2023 00:21:30 -0700

I recently noticed what seems like an odd behavior of etree.tostring, and I'm
trying to figure out if this is a bug or some subtlety of the API or of X(HT)ML
processing that I'm not aware of.


Given the following document (1_1.xhtml):

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
     "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>Title</title>
      </head>
      <body>
        <p>One</p>
        <p>Two</p>
        <p>Three</p>
      </body>
    </html>

and the following script (test-tostring.py):

    import sys
    import lxml.etree
    import lxml.html

    doc = lxml.etree.parse(sys.argv[1], parser=lxml.html.XHTMLParser())
    body = doc.find(".//{*}body")
    for elt in body:
        print(lxml.etree.tostring(elt))

running "python3 test-tostring.py 1_1.xhtml" produces this output:

    b'<p xmlns="http://www.w3.org/1999/xhtml">One</p>\n    '
    b'<p xmlns="http://www.w3.org/1999/xhtml">Two</p>\n    '
    b'<p xmlns="http://www.w3.org/1999/xhtml">Three</p>\n  '

So far, so good.  However, if I copy 1_1.xhtml to 1_0-transitional.xhtml and
change its doctype to

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

then the output of "python3 test-tostring.py 1_0-transitional.xhtml" is:

    b'<p xmlns="http://www.w3.org/1999/xhtml">One</p>\n    <p>Two</p>\n    <p>Three</p>\n  </body>\n</html>\n    '
    b'<p xmlns="http://www.w3.org/1999/xhtml">Two</p>\n    <p>Three</p>\n  </body>\n</html>\n    '
    b'<p xmlns="http://www.w3.org/1999/xhtml">Three</p>\n  </body>\n</html>\n  '

In other words, the text of each element as serialized with tostring() includes
the entire rest of the document after it, not just its own subtree!

This is, at the very least, not what I was expecting to see.  Both XHTML
documents pass the checks on validator.w3.org, so I don't think it's a matter of
bad formatting, and I haven't been able to find anything in the lxml
documentation or recent changelogs that would explain it.  Using an
lxml.etree.XMLParser as the parser produces the same results.  Setting
with_tail=False removes the trailing whitespace from each line, but not the
content after the element in the 1.0 Transitional doc.

Any idea what might be causing this?

Version information:
Python              : sys.version_info(major=3, minor=11, micro=2, releaselevel='final', serial=0)
lxml.etree          : (4, 9, 2, 0)
libxml used         : (2, 9, 14)
libxml compiled     : (2, 9, 14)
libxslt used        : (1, 1, 35)
libxslt compiled    : (1, 1, 35)
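
In case it helps with comparing setups, a version block like the one above can
be printed with a short script along these lines.  It is a minimal sketch that
relies only on the version tuples exposed by lxml.etree; the labels are simply
chosen to match the listing.

```python
import sys

from lxml import etree

# Each attribute is a tuple of integers describing one component's version.
print("Python              :", sys.version_info)
print("lxml.etree          :", etree.LXML_VERSION)
print("libxml used         :", etree.LIBXML_VERSION)
print("libxml compiled     :", etree.LIBXML_COMPILED_VERSION)
print("libxslt used        :", etree.LIBXSLT_VERSION)
print("libxslt compiled    :", etree.LIBXSLT_COMPILED_VERSION)
```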