Hello,

> If the XML contains entities, then it probably starts with a DOCTYPE
> declaration.

After the <?xml> declaration, yes.

> That would refer to a DTD that defines the entities. If that's
> the case, then load_dtd=True would tell the parser to read the entity
> definitions from the DTD, so that they can be resolved.

I tried with `load_dtd=True` and `dtd_validation=True` and received this error:

    lxml.etree.XMLSyntaxError: failed to load external entity "https://some.domain/xml/dtd/some.dtd", line 2, column 97

although that file exists and lxml should be able to access the network. That
error sent me on the wild goose chase that triggered my initial email…

> Note that you should best configure your locally installed catalogues to
> include that DTD, so that it won't have to be loaded from the network on
> each use.

Oh, I wasn’t aware of catalogues and resolvers
(https://lxml.de/resolvers.html); that’s great to know! What I tried now is this:

    import lxml.etree

    class DTDResolver(lxml.etree.Resolver):
        def resolve(self, url, id, context):
            # Map the remote DTD URL to a local copy.
            if url == "https://some.domain/xml/dtd/some.dtd":
                return self.resolve_filename("/path/to/local/some.dtd", context)
            # Returning None hands the lookup to the next resolver.
            return None

    parser = lxml.etree.XMLParser(
        huge_tree=True, dtd_validation=True, load_dtd=True)
    parser.resolvers.add(DTDResolver())
    lxml.etree.parse("test.xml", parser)

This loads the XML but I then get an error:

    lxml.etree.XMLSyntaxError: Content model of div is not determinist: ((argument | byline … ))

which is independent of the original problem of resolving the entities and
loading the XML. I can read the XML file by loading the DTD and disabling
validation using `dtd_validation=False`. Not pretty, and it needs to be
resolved (pun intended) by the document owners, but this unblocks me.

Looks like this is the proper way to go about this. Many thanks!
Jens

--
Jens Tröger
https://savage.light-speed.de/
