Hi,

Good question! The answer is in libxml2, the C library that lxml wraps: see the function htmlParseDocument in HTMLparser.c, lines 4430-4452 (at commit 54824911, the latest at the time of writing). As you can see there, libxml2 expects a doctype declaration to always begin with <!DOCTYPE. When it does, libxml2 calls htmlParseDocTypeDecl(ctxt) and the doctype is parsed. However, lines 4450-4452 show that an XML or XML-like declaration beginning with <? is instead recorded as a "bogus" comment, which is essentially a malformed comment.

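To make that branching concrete, here is a rough Python sketch of it. It is only a simplified model of the behaviour described above, not a translation of the libxml2 code. The function name classify_prolog_construct and the three labels are made up for illustration, and the case-insensitive match on the keyword is an assumption on my part.

```python
def classify_prolog_construct(markup: str) -> str:
    """Label the construct at the start of `markup`.

    Hypothetical helper: a toy model of the branching described above,
    not the actual libxml2 logic.
    """
    text = markup.lstrip()
    # Assumption: the DOCTYPE keyword is matched case-insensitively.
    if text[:9].upper() == "<!DOCTYPE":
        return "doctype"        # the real parser would go on to parse the declaration
    if text.startswith("<?"):
        return "bogus comment"  # an XML(-like) declaration is not treated as a doctype
    return "other"


if __name__ == "__main__":
    for sample in ("<!DOCTYPE html>", '<?xml version="1.0"?>', "<html>"):
        print(sample, "->", classify_prolog_construct(sample))
```

The point is just that a document starting with <? never reaches the doctype branch for that construct.
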
When libxml2 finishes parsing, it adds a doctype if none was parsed. In libxml2, the doctype is stored in ctxt->myDoc->intSubset (not sure why). In SAX2.c, the function xmlSAX2EndDocument runs when an HTML document finishes parsing. In lines 869-874 you can see that intSubset on the document is set if it is still NULL at that point, and the hard-coded doctype there matches what you see in your testing.

As for the change in behaviour between lxml versions, I'm not sure where it comes from, but a libxml2 commit from seven months ago, b424bae7, modifies this code.

As for fixing this, I doubt there's a way without changing those lines of code in libxml2.

Links to the code:
https://gitlab.gnome.org/GNOME/libxml2/-/blob/54824911cd8a5f6918d2ca74cfd86538ee4b4d05/HTMLparser.c#L4430
https://gitlab.gnome.org/GNOME/libxml2/-/blob/54824911cd8a5f6918d2ca74cfd86538ee4b4d05/SAX2.c#L869

Link to the commit:
https://gitlab.gnome.org/GNOME/libxml2/-/commit/b424bae705180a2d6df2db1767e33eeec73ac029

Best,
Abe
