[lxml] lxml.html.fromstring() doesn’t seem to get the doctype right?

Jens Tröger via lxml - The Python XML Toolkit Mon, 09 Feb 2026 14:40:43 -0800

Hello,

Following up on my previous post
( https://mail.python.org/archives/list/[email protected]/thread/NT7GNLORN676BMSAXKNZLXDWYMS76Z4A/ ),
I also noticed that parsing an (X)HTML file without a doctype produces an
incorrect/unexpected doctype. For example:


    import lxml.html

    b = b"""<?xml version="1.0" encoding="UTF-8"?>
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>
    """

parses ok into an element and element tree:

    elm = lxml.html.fromstring(b)  # <Element html at 0x10fbea530>

but the doctype for that document is — I believe — incorrect:

    root = elm.getroottree()
    root.docinfo.doctype
    # '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">'

Considering the xml declaration and the html element’s namespace, I would have 
expected the derived doctype to be

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

for an xhtml file.

Also, the DocInfo ( https://lxml.de/apidoc/lxml.etree.html#lxml.etree.DocInfo ) 
doesn’t actually denote whether the original document contained an xml 
declaration; wouldn’t a flag be useful?
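
To illustrate what such a flag would report, here is a minimal sketch that checks the raw input bytes before parsing. The helper name `has_xml_declaration` is hypothetical and not part of lxml, and the check assumes an ASCII-compatible encoding such as UTF-8 (optionally preceded by a UTF-8 byte order mark).

```python
import re

# An XML declaration, if present, must be the very first thing in the
# document. Allow an optional UTF-8 BOM in front of it. The trailing \s
# keeps processing instructions like <?xml-stylesheet ...?> from matching.
_XML_DECL = re.compile(rb"^(?:\xef\xbb\xbf)?<\?xml\s")


def has_xml_declaration(data: bytes) -> bool:
    """Hypothetical helper: True if the raw bytes start with an XML declaration."""
    return _XML_DECL.match(data) is not None


with_decl = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>'
)
without_decl = b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>'

flag_with = has_xml_declaration(with_decl)
flag_without = has_xml_declaration(without_decl)
```

A caller could keep this flag next to the parsed tree and use it to decide whether to emit a declaration when serializing.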

I ask because ideally round-tripping a document should produce that same 
document, but that is currently not the case:

    b = b"""<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>"""
    elm = lxml.html.fromstring(b)  # <Element html at 0x10fbea670>
    lxml.html.tostring(elm)
    # b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>'
    lxml.html.tostring(elm.getroottree())
    # b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\n<!--?xml version="1.0" encoding="UTF-8"?--><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>'
    lxml.html.tostring(elm.getroottree(), method="xml")
    # b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\n<!--?xml version="1.0" encoding="UTF-8"?--><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"/>'
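
For comparison, a minimal sketch of the same round trip using the XML parser in `lxml.etree` instead of `lxml.html`. It assumes the input is well-formed XHTML and is offered only as a possible workaround, not as a statement of how `lxml.html` is meant to behave. The serialized output is not guaranteed to be byte-identical to the input (quoting in the declaration and empty-element syntax may differ).

```python
from lxml import etree

data = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>'
)

# Parse with the XML parser rather than the HTML parser, so the XML
# declaration is consumed as a declaration instead of becoming a comment.
root = etree.fromstring(data)
tree = root.getroottree()

# The input has no doctype, so DocInfo reports none.
doctype = tree.docinfo.doctype

# Version and encoding from the declaration are available on DocInfo.
xml_version = tree.docinfo.xml_version
encoding = tree.docinfo.encoding

# Explicitly request an XML declaration when serializing.
out = etree.tostring(tree, xml_declaration=True, encoding=encoding)
```

Here the namespace is kept on the root element, and no HTML 4.0 doctype is invented for the document.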

Cheers,
Jens
