[lxml] How to best deal with HTML entities in an XML file?

Jens Tröger Fri, 26 Mar 2021 00:47:03 -0700

Hello,

I received a bunch of XML files that contain HTML entities (so far, I’ve seen 
only &nbsp; used). I can’t parse these files with an XML parser because of 
these HTML entities:


 >>> import lxml.etree
 >>> parser = lxml.etree.XMLParser(huge_tree=True, remove_comments=True,
 ...                               schema=None, dtd_validation=False)
 >>> xml = lxml.etree.parse("test.xml", parser)
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "src/lxml/etree.pyx", line 3521, in lxml.etree.parse
    File "src/lxml/parser.pxi", line 1859, in lxml.etree._parseDocument
    File "src/lxml/parser.pxi", line 1885, in lxml.etree._parseDocumentFromURL
    File "src/lxml/parser.pxi", line 1789, in lxml.etree._parseDocFromFile
    File "src/lxml/parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
    File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
    File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
    File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
    File "test.xml", line 66
  lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 66, column 25

The `resolve_entities` parameter of XMLParser unfortunately doesn’t help here: 
it doesn’t seem to resolve HTML entities, which are not declared anywhere in 
these files.

If I parse the file using an HTMLParser it works:

 >>> parser = lxml.etree.HTMLParser(huge_tree=True, remove_comments=True)
 >>> xml = lxml.etree.parse("test.xml", parser)
 >>> xml
  <lxml.etree._ElementTree object at 0x10589baa0>

but then the case of all tag names is lost: HTML is case-insensitive (XML is 
not), and the HTML parser seems to convert all tag names to lower case:

 >>> xml.getroot().find("body/*")
  <Element docxml at 0x1059cb730>

The tag name should be `DocXML`. The parsed tree no longer matches my original 
XML file and fails schema validation…

So, what now? I am hesitant to treat the original XML file as a string and 
replace the HTML entities (except &amp; &lt; &gt;) at the string level. I think 
a better approach would be to make the XML parser aware of HTML entities, but 
that may be a libxml2 issue rather than an lxml one. (I haven’t looked at the 
source yet.)
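
For illustration, here is a minimal sketch of one way to make the XML parser 
aware of HTML entities: declare them in an internal DTD subset that is injected 
ahead of the root element, so the document text itself is left untouched. The 
helper names `build_doctype` and `parse_with_html_entities` are hypothetical, 
not part of lxml. The sketch assumes the files use an ASCII-compatible encoding 
such as UTF-8, have no DOCTYPE of their own, and use only the HTML 4 entity 
names listed in Python's `html.entities.name2codepoint`.

```python
import html.entities
from lxml import etree

# The five entities XML already predefines; they need no declaration.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def build_doctype(root_name: str) -> bytes:
    """Return a DOCTYPE whose internal subset declares the HTML 4 named entities."""
    decls = "".join(
        f'<!ENTITY {name} "&#{codepoint};">'
        for name, codepoint in sorted(html.entities.name2codepoint.items())
        if name not in XML_PREDEFINED
    )
    return f"<!DOCTYPE {root_name} [{decls}]>".encode("ascii")


def parse_with_html_entities(data: bytes, root_name: str, parser=None):
    """Parse XML bytes after injecting the entity declarations.

    Assumption: ASCII-compatible encoding and no existing DOCTYPE in `data`.
    """
    doctype = build_doctype(root_name)
    if data.startswith(b"<?xml"):
        # The DOCTYPE has to come after the XML declaration.
        end = data.index(b"?>") + 2
        data = data[:end] + doctype + data[end:]
    else:
        data = doctype + data
    return etree.fromstring(data, parser)
```

Something like `parse_with_html_entities(open("test.xml", "rb").read(), "DocXML")` 
would then return the root element with tag names in their original case. The 
`root_name` argument should be the name of the document's root element; it only 
ends up in the injected DOCTYPE.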

Would you have any other recommendations? How else could I work with this issue?

Many thanks!
Jens

--
Jens Tröger
https://savage.light-speed.de/