Hello,
Following up on my previous post (
https://mail.python.org/archives/list/[email protected]/thread/NT7GNLORN676BMSAXKNZLXDWYMS76Z4A/
) I also noticed that reading an XHTML/HTML file without a doctype produces an
incorrect/unexpected doctype. For example:
b = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>
"""
parses ok into an element and element tree:
elm = lxml.html.fromstring(b) # <Element html at 0x10fbea530>
but the doctype for that document is, I believe, incorrect:
root = elm.getroottree()
root.docinfo.doctype
# '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">'
Considering the XML declaration and the html element’s namespace, I would have
expected the derived doctype to be
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
for an XHTML file.
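To make that expectation concrete, here is a minimal sketch of the rule I have in mind. It uses only the standard library. `expected_doctype` is a hypothetical helper of my own, not anything lxml provides, and falling back to the HTML 4.0 doctype for everything else is an assumption based on the behaviour shown above:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
XHTML_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
)
HTML_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" '
    '"http://www.w3.org/TR/REC-html40/loose.dtd">'
)


def expected_doctype(data: bytes) -> str:
    """Hypothetical helper: guess a default doctype from the root namespace."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        # Not well-formed XML, so it cannot be XHTML; treat it as plain HTML.
        return HTML_DOCTYPE
    # ElementTree spells namespaced tags as "{namespace}localname".
    if root.tag == "{%s}html" % XHTML_NS:
        return XHTML_DOCTYPE
    return HTML_DOCTYPE
```

For the document above, `expected_doctype(b)` picks the XHTML doctype because the root element is in the XHTML namespace.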
Also, DocInfo ( https://lxml.de/apidoc/lxml.etree.html#lxml.etree.DocInfo )
doesn’t actually record whether the original document contained an XML
declaration; wouldn’t a flag be useful?
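Without such a flag, the only option I see is to inspect the raw bytes before parsing. A minimal sketch, assuming an ASCII-compatible encoding such as UTF-8; `has_xml_declaration` is a hypothetical name of my own, not an lxml function:

```python
import re

# An XML declaration, if present, must be the very first thing in the
# document; only an optional UTF-8 byte order mark may precede it.
# Assumption: the input uses an ASCII-compatible encoding (not UTF-16/32).
_XML_DECL = re.compile(rb"\A(?:\xef\xbb\xbf)?<\?xml\s+version\s*=")


def has_xml_declaration(data: bytes) -> bool:
    """Hypothetical helper: True if the raw bytes start with an XML declaration."""
    return _XML_DECL.match(data) is not None
```

The caller would have to carry the result alongside the parsed tree, which is exactly the bookkeeping a flag on DocInfo would avoid.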
I ask because ideally round-tripping a document should produce that same
document, but that is currently not the case:
b = b"""<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>"""
elm = lxml.html.fromstring(b) # <Element html at 0x10fbea670>
lxml.html.tostring(elm)
# b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>'
lxml.html.tostring(elm.getroottree())
# b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\n<!--?xml version="1.0" encoding="UTF-8"?--><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>'
lxml.html.tostring(elm.getroottree(), method="xml")
# b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\n<!--?xml version="1.0" encoding="UTF-8"?--><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"/>'
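For comparison, here is a minimal sketch of the same round trip through the generic XML parser in lxml.etree, assuming it is an acceptable substitute for lxml.html when the input is well-formed XHTML. The variable names are mine:

```python
from lxml import etree

data = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"></html>'
)

# Parse with the XML parser instead of the HTML parser, so the XML
# declaration is read as a declaration rather than turned into a comment.
tree = etree.fromstring(data).getroottree()

# The input had no doctype, and the XML parser does not invent one.
print(repr(tree.docinfo.doctype))

# The declaration is not written by default; request it explicitly and
# reuse the encoding that was recorded while parsing.
out = etree.tostring(tree, xml_declaration=True, encoding=tree.docinfo.encoding)
print(out)
```

This keeps the declaration and adds no doctype, but the result is still not guaranteed to be byte-for-byte identical to the input: the serializer chooses its own quoting and empty-element form. The caller also still has to know that the original had a declaration.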
Cheers,
Jens