Hi,

Good question! The answer is in libxml2's HTMLparser.c, in the function 
htmlParseDocument, lines 4430-4452 (as of commit 54824911, the latest at the 
time of writing). (lxml wraps libxml2, a C library.) As you can see, libxml2 
expects a doctype declaration to always begin with <!DOCTYPE. When it does, 
libxml2 calls htmlParseDocTypeDecl(ctxt) and the doctype is parsed. However, 
in lines 4450-4452, you can see that an XML or XML-like declaration beginning 
with <? is instead recorded as a "bogus" comment, essentially a malformed 
comment.
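
As a rough illustration of that dispatch, here is a toy Python model. It is 
only a sketch of the behavior described above, not a port of 
htmlParseDocument: the function name classify_prolog and the labels it 
returns are made up for this example, and skipping leading whitespace and 
matching DOCTYPE case-insensitively are assumptions.

```python
def classify_prolog(markup: str) -> str:
    """Toy model: report how the construct at the start of `markup` is treated."""
    text = markup.lstrip()  # assumption: leading whitespace is skipped first
    if text[:9].upper() == "<!DOCTYPE":  # assumption: case-insensitive match
        return "doctype"  # the case handed to htmlParseDocTypeDecl(ctxt)
    if text.startswith("<?"):
        return "bogus comment"  # XML or XML-like declaration
    return "other"


for sample in (
    "<!DOCTYPE html><html></html>",
    '<?xml version="1.0" encoding="utf-8"?><html></html>',
):
    print(classify_prolog(sample))
```

The second sample is the interesting one: it starts with <?, so in this model 
no doctype is ever recorded for it.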

When libxml2 finishes parsing, it adds a doctype if none was parsed. In 
libxml2, the doctype is stored in ctxt->myDoc->intSubset (not sure why). In 
SAX2.c, the function xmlSAX2EndDocument runs when an HTML document finishes 
parsing. You can see in lines 869-874 that intSubset on the document is set 
if it is still NULL. The hard-coded doctype there matches what you see in 
your testing.
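
The end-of-parse step can be pictured the same way. This is a minimal sketch 
under stated assumptions, not libxml2's code: the Document class, the 
int_subset attribute and end_document are hypothetical stand-ins for 
ctxt->myDoc, its intSubset field and xmlSAX2EndDocument, and DEFAULT_DOCTYPE 
is a placeholder rather than the real hard-coded string.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder only: the real value is hard-coded in SAX2.c and not copied here.
DEFAULT_DOCTYPE = "<hard-coded default doctype>"


@dataclass
class Document:
    """Hypothetical stand-in for the parsed document (ctxt->myDoc)."""
    int_subset: Optional[str] = None  # models the intSubset field


def end_document(doc: Document) -> Document:
    """Toy model: fill in a doctype only when parsing recorded none."""
    if doc.int_subset is None:
        doc.int_subset = DEFAULT_DOCTYPE
    return doc


# No doctype was parsed (e.g. the input began with <?), so the default is used.
print(end_document(Document()).int_subset)
# A doctype that was parsed from the input is left untouched.
print(end_document(Document(int_subset="<!DOCTYPE html>")).int_subset)
```

In this model a document that opened with an XML declaration never gets 
int_subset set during parsing, which is why it ends up with the default.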

As for the recent changes between lxml versions, I'm not sure where they 
come from, but a libxml2 commit from seven months ago, b424bae7, modifies 
this code.

As for fixing this, I doubt there's a way without changing those lines of 
code in libxml2.

Links to the code:
https://gitlab.gnome.org/GNOME/libxml2/-/blob/54824911cd8a5f6918d2ca74cfd86538ee4b4d05/HTMLparser.c#L4430
https://gitlab.gnome.org/GNOME/libxml2/-/blob/54824911cd8a5f6918d2ca74cfd86538ee4b4d05/SAX2.c#L869

Link to the commit:
https://gitlab.gnome.org/GNOME/libxml2/-/commit/b424bae705180a2d6df2db1767e33eeec73ac029

Best,
Abe