Hello,

I received a bunch of XML files that contain HTML entities (so far, I’ve only seen `&nbsp;` used). I can’t parse these files with an XML parser because of these HTML entities:

>>> parser = lxml.etree.XMLParser(huge_tree=True, remove_comments=True,
...                               schema=None, dtd_validation=False)
>>> xml = lxml.etree.parse("test.xml", parser)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/etree.pyx", line 3521, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1859, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1885, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1789, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "test.xml", line 66
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 66, column 25

The `resolve_entities` parameter for XMLParser unfortunately doesn’t seem to resolve HTML entities. If I parse the file using an HTMLParser, it works:

>>> parser = lxml.etree.HTMLParser(huge_tree=True, remove_comments=True)
>>> xml = lxml.etree.parse("test.xml", parser)
>>> xml
<lxml.etree._ElementTree object at 0x10589baa0>

but then the upper/lower case of all tags is lost, because HTML is case-insensitive (XML is not) and the HTML parser seems to turn all tag names to lower case:

>>> xml.getroot().find("body/*")
<Element docxml at 0x1059cb730>

This should be a `DocXML` tag name. The parsed tree no longer matches my original XML file and fails schema validation…

So, what now?

I am very hesitant to treat the original XML file as a string and replace the HTML entities (except `&amp;`, `&lt;`, and `&gt;`) on a string level. I think a better approach would be to make the XML parser aware of HTML entities, but that may be a libxml2 issue rather than an lxml one? (I haven’t looked at the source yet.)

Would you have any other recommendations? How else could I work with this issue?

Many thanks!
Jens

--
Jens Tröger
https://savage.light-speed.de/
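
P.S. To make the idea of "making the XML parser aware of HTML entities" a bit more concrete: one approach I could imagine is to prepend an internal DTD subset that declares the HTML entity names, and then let the regular XMLParser resolve them. Below is only a rough sketch of that idea, not a vetted solution. The helper names `html_entity_doctype` and `parse_with_html_entities` are made up for illustration, and the sketch assumes that the files are UTF-8 without a BOM and do not already carry a DOCTYPE of their own.

```python
import re
from html.entities import name2codepoint  # HTML 4 entity names -> code points

from lxml import etree

# The five entities every XML parser already knows; they are left alone.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# An XML declaration, if present, must stay at the very start of the document.
XML_DECL = re.compile(rb"<\?xml[^>]*\?>")


def html_entity_doctype(root_name):
    """Build a DOCTYPE whose internal subset declares the HTML 4 entities."""
    decls = "".join(
        f'<!ENTITY {name} "&#{codepoint};">'
        for name, codepoint in sorted(name2codepoint.items())
        if name not in PREDEFINED
    )
    return f"<!DOCTYPE {root_name} [{decls}]>"


def parse_with_html_entities(data, root_name):
    """Parse XML bytes after injecting HTML entity declarations.

    Assumption: `data` has no DOCTYPE yet and is UTF-8 without a BOM.
    """
    doctype = html_entity_doctype(root_name).encode("ascii")
    match = XML_DECL.match(data)
    pos = match.end() if match else 0
    patched = data[:pos] + doctype + data[pos:]
    parser = etree.XMLParser(
        huge_tree=True, remove_comments=True, resolve_entities=True
    )
    return etree.fromstring(patched, parser)


sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b"<Doc><body><DocXML>a&nbsp;b &amp; c</DocXML></body></Doc>"
)
root = parse_with_html_entities(sample, "Doc")
element = root.find("body/*")
print(element.tag, repr(element.text))
```

This still modifies the input bytes before parsing, but only by adding declarations in front of the root element; the entity references in the content are left untouched and are resolved by the XML parser itself, so the case of tag names like `DocXML` should be preserved.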