Hello,

I’m a little puzzled by the behavior of the lxml.html.tostring() function, and 
would appreciate it if somebody could shed some light on this.

The test code is as follows: first we parse a small HTML document (derived from 
an actual real-world document!)

    s = """<?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
      <head>
      </head>
      <body>
      </body>
    </html>
    """

This reads ok as XML:

    lxml.etree.XML(s.encode())  # <Element {http://www.w3.org/1999/xhtml}html at 0x10e837d00>
    lxml.etree.fromstring(s.encode())  # <Element {http://www.w3.org/1999/xhtml}html at 0x10e848980>

and HTML:

    elm = lxml.html.fromstring(s.encode())  # <Element html at 0x10e7d00f0>
    root = elm.getroottree()
    root.docinfo.doctype  # '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'

Serializing this back to HTML creates an unexpected string, though:

    lxml.html.tostring(elm.getroottree(), method="xml", encoding="unicode") 

This produces, for lxml v5.3.0:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <?xml version="1.0" encoding="UTF-8"??><html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
      <head>
      </head>
      <body>
      </body>
    </html>

and for lxml v6.0.2:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <!--?xml version="1.0" encoding="UTF-8"?--><html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
      <head>
      </head>
      <body>
      </body>
    </html>

The latter parses ok with both lxml.etree.XML() and lxml.html.fromstring(), 
whereas the former fails to parse as an XML file using lxml.etree.XML(). So it 
seems that *some* behavior was changed/fixed, but I was unable to find that 
mentioned in the changelog.
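
For anyone who wants to check the well-formedness difference without installing 
two lxml versions, here is a minimal sketch. It uses the standard library's 
xml.etree.ElementTree as a stand-in for lxml.etree.XML() (that substitution is 
my assumption, and the two parsers may not agree in every detail); the names 
out_530, out_602 and is_well_formed are made up for illustration, and the 
strings are the serialized outputs quoted above.

```python
# Sketch: test the two serialized outputs for XML well-formedness.
# Assumption: the stdlib expat-based parser stands in for lxml.etree.XML().
import xml.etree.ElementTree as ET

DOCTYPE = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
           '"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\n')
BODY = ('<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">\n'
        '  <head>\n  </head>\n  <body>\n  </body>\n</html>\n')

# Serialized output as reported for each lxml version.
out_530 = DOCTYPE + '<?xml version="1.0" encoding="UTF-8"??>' + BODY
out_602 = DOCTYPE + '<!--?xml version="1.0" encoding="UTF-8"?-->' + BODY


def is_well_formed(text):
    """Return True if `text` parses as XML, False on a parse error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print("5.3.0 output well-formed:", is_well_formed(out_530))
print("6.0.2 output well-formed:", is_well_formed(out_602))
```

The v5.3.0 output is rejected because an XML declaration is only allowed at the 
very start of a document, while the v6.0.2 output merely carries a comment 
between the DOCTYPE and the root element.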

Both serialized documents, though, differ from the original: the <!DOCTYPE> and 
<?xml?> declarations are swapped, and the <?xml?> declaration is either mangled 
(v5.3.0, note the doubled "??>") or commented out (v6.0.2). Why?

Also, is there a way to generate both elements in the original order?

Much thanks!
Jens