[lxml] Re: How to best deal with HTML entities in an XML file?

Stefan Behnel Fri, 26 Mar 2021 09:44:24 -0700

Jens Tröger wrote on 26.03.21 at 08:34:
> I received a bunch of XML files that contain HTML entities (so far, I’ve seen 
> only &nbsp; used). I can’t parse these files with an XML parser because of 
> these HTML entities:
> 
>  >>> parser = lxml.etree.XMLParser(huge_tree=True, remove_comments=True, schema=None, dtd_validation=False)
>  >>> xml = lxml.etree.parse("test.xml", parser)
>   Traceback (most recent call last):
>     File "<stdin>", line 1, in <module>
>     File "src/lxml/etree.pyx", line 3521, in lxml.etree.parse
>     File "src/lxml/parser.pxi", line 1859, in lxml.etree._parseDocument
>     File "src/lxml/parser.pxi", line 1885, in lxml.etree._parseDocumentFromURL
>     File "src/lxml/parser.pxi", line 1789, in lxml.etree._parseDocFromFile
>     File "src/lxml/parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
>     File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
>     File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
>     File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
>     File "test.xml", line 66
>   lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 66, column 25
> 
> The `resolve_entities` parameter for XMLParser unfortunately doesn’t seem to 
> resolve HTML entities.


`resolve_entities` is on by default. You can only disable it, in which case
the entities do not get resolved and stay in the tree as entity references.
That makes the processing a bit more tedious, but it allows passing the
entities through into the output as they came in.


> If I parse the file using an HTMLParser it works:
> 
>  >>> parser = lxml.etree.HTMLParser(huge_tree=True, remove_comments=True)
>  >>> xml = lxml.etree.parse("test.xml", parser)
>  >>> xml
>   <lxml.etree._ElementTree object at 0x10589baa0>
> 
> but then the upper/lower case of all tags is lost because HTML is 
> case-insensitive (XML is not) and it seems that the HTML parser turns all tag 
> names to lower case:
> 
>  >>> xml.getroot().find("body/*")
>   <Element docxml at 0x1059cb730>
> 
> This should be a `DocXML` tag name. Now my original XML file is broken and 
> fails schema validation…
> 
> So, what now? I feel very hesitant to treat the original XML file as a string 
> and replace HTML entities (except &amp; &lt; &gt;) on a string level. I think 
> a better approach would be to make the XML parser aware of HTML entities but 
> that may be a libxml2 issue rather than lxml? (Haven’t looked at the source 
> yet.)

If the XML contains entities, then it probably starts with a DOCTYPE
declaration. That would refer to a DTD that defines the entities. If that's
the case, then load_dtd=True would tell the parser to read the entity
definitions from the DTD, so that they can be resolved.
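
A minimal sketch of that setup, assuming the DOCTYPE points to a DTD file
that sits next to the document. The file names `entities.dtd` and `test.xml`,
the `Para` element and the one-line DTD are placeholders for the example; I
don't know what your files actually declare:

```python
import os
import tempfile

from lxml import etree

# Hypothetical DTD that defines the entity used in the document.
DTD_TEXT = '<!ENTITY nbsp "&#160;">\n'

# Hypothetical document whose DOCTYPE refers to that DTD by relative path.
XML_TEXT = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE DocXML SYSTEM "entities.dtd">\n'
    '<DocXML><Para>a&nbsp;b</Para></DocXML>\n'
)


def parse_with_dtd(xml_text, dtd_text):
    """Write both files to a temporary directory and parse the XML file."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "entities.dtd"), "w", encoding="utf-8") as f:
            f.write(dtd_text)
        xml_path = os.path.join(tmp, "test.xml")
        with open(xml_path, "w", encoding="utf-8") as f:
            f.write(xml_text)

        # load_dtd=True makes the parser read the DTD named in the DOCTYPE,
        # which gives it the entity definitions it needs.
        parser = etree.XMLParser(load_dtd=True)
        return etree.parse(xml_path, parser)


root = parse_with_dtd(XML_TEXT, DTD_TEXT).getroot()
print(root.tag)               # tag name as written in the file
print(repr(root[0].text))     # entity replaced by its text value
```

Since this is still the XML parser, tag names keep their case, which avoids
the problem you saw with the HTML parser.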

Note that, ideally, you should configure your locally installed XML
catalogues to include that DTD, so that it doesn't have to be loaded from
the network on each use.
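
If setting up a catalogue is not practical, a custom resolver is another way
to serve the DTD from a local copy. Note that this is a different technique
from the catalogue configuration, shown only as a sketch: the system ID
`entities.dtd`, the class name and the DTD content are assumptions for the
example.

```python
from io import BytesIO

from lxml import etree

# Hypothetical local copy of the DTD content.
LOCAL_DTD = '<!ENTITY nbsp "&#160;">'


class LocalDTDResolver(etree.Resolver):
    """Answer requests for the (assumed) DTD system ID from memory."""

    def resolve(self, system_url, public_id, context):
        if system_url and system_url.endswith("entities.dtd"):
            return self.resolve_string(LOCAL_DTD, context)
        return None  # let the parser handle everything else as usual


parser = etree.XMLParser(load_dtd=True)
parser.resolvers.add(LocalDTDResolver())

xml = b'<!DOCTYPE DocXML SYSTEM "entities.dtd"><DocXML>a&nbsp;b</DocXML>'
root = etree.parse(BytesIO(xml), parser).getroot()
print(root.tag, repr(root.text))
```

A catalogue has the advantage of working for every libxml2 based tool on the
machine, whereas a resolver only affects the parser it is registered on.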

Stefan
