I think that Launchpad ticket is what I need to understand the issue better! Great :) I will look into it over the weekend.

I did try setting other XML catalogs, and I did manage to set up a catalog and local files for my use case such that nothing is downloaded from the Internet. So that's not my mission right now.

In my first version, though, I _thought_ I had changed everything so that nothing was downloaded, but in fact two files were still being downloaded from the W3C servers. There was no noticeable delay, so everything seemed fine, until at one point a bunch of files were validated in succession. After around 20-60 successful validations, the rest would fail.

W3C has some kind of filter/firewall. If you download the resources in rapid succession (e.g. roughly 1 per second, for 10-40 seconds) it will start rejecting requests. It only takes 5-20 seconds for the firewall to forgive you and let you download again. This meant that I got random, intermittent failures.

That's why I want to _know_ that I have disabled networking, so that any mistake in the catalog setup gives an error now, and not later.

The above happened with xmllint. With lxml I can load the schema once and use it for validating hundreds of XML files, so I can easily circumvent the W3C filter. But in any case, I would like to set up my lxml code such that any attempt to download resources will result in an error now, and not when that resource is one day unavailable :)

Thanks for helping out!

On Mon, Mar 4, 2024 at 10:47 PM <holger.jo...@lbbw.de> wrote:
> (cc-ing the mailing list)
>
> > Thanks for all the feedback :) For now, I will stick to just one part of
> > your feedback.
> >
> > Consider your example:
> >
> >     if not no_network:
> >         parse_options = parse_options ^ xmlparser.XML_PARSE_NONET
> >
> > Won't that always "negate" the XML_PARSE_NONET bit? If 0, it will change
> > to 1. If 1 it will change to 0. Right? So using --no_network will always
> > pick the opposite of the default. Am I wrong?
>
> That's probably a bit unintuitive from the snippet I gave.
> It works like this: Since XML_PARSE_NONET is also part of
> _XML_DEFAULT_PARSE_OPTIONS, the XOR logic here switches it off again when
> no_network has been explicitly set to False.
>
> > And regarding the default behavior. When it comes to validating
> > according to an xml schema then the default is to download the xsd's that
> > are imported, at least on my system
> > (Ubuntu 22.04 with libxml 2.9.12 installed via APT). I tried with both
> > xmllint, xmlstarlet and lxml. Perhaps the default is for
> > something other than downloading xsd's? I guess
> > there can be references / entities stuff in the target xml document, and
> > those references will not be downloaded. Could that be it?
>
> Maybe the parser used for parsing XML Schemas is set up to ignore the
> "normal default" and in the case of "lxml" also set up to ignore the
> options set with no_network.
>
> It's been a long time since I experimented with includes/imports in XML
> Schemas in lxml. Can't really remember
> the workings but in my case it was rather the other way round, i.e. no
> (external/remote) network access and wanting
> to load included/imported schemas from a local catalog.
>
> Have you tried running with XML_CATALOG_FILES set to an empty value to
> suppress default catalog settings?
> E.g. XML_CATALOG_FILES= python myprog.py
>
> I found this: https://bugs.launchpad.net/lxml/+bug/1234114
>
> So it seems like it is indeed not "simply" possible to suppress XMLSchema
> network access (but maybe through catalog setup:
> "[...] and external imports should always be covered by catalogues
> (otherwise, that's a configuration problem on the user side)[...]",
> see the issue conversation).
>
> Another thought might be custom URI resolvers but I don't know how they
> tie into XML Schema handling
> (https://lxml.de/resolvers.html#uri-resolvers).
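
On the XOR point quoted above, a small pure-Python sketch may make the behaviour easier to see. The flag values and DEFAULT_OPTIONS below are stand-ins chosen for illustration (not taken from lxml's sources), and options_xor / options_mask are hypothetical helper names. The sketch only shows that XOR clears the bit when the starting value already has it set, and that masking with "& ~" gives the same result without depending on the starting value.

```python
# Illustrative flag values; DEFAULT_OPTIONS is a stand-in for
# _XML_DEFAULT_PARSE_OPTIONS, assumed here to include the NONET bit.
XML_PARSE_NONET = 1 << 11
XML_PARSE_NOCDATA = 1 << 14
DEFAULT_OPTIONS = XML_PARSE_NONET | XML_PARSE_NOCDATA


def options_xor(no_network, start=DEFAULT_OPTIONS):
    """Mirror the quoted snippet: toggle the NONET bit when no_network is False."""
    opts = start
    if not no_network:
        opts = opts ^ XML_PARSE_NONET
    return opts


def options_mask(no_network, start=DEFAULT_OPTIONS):
    """Alternative: clear the NONET bit explicitly, whatever the starting value."""
    opts = start
    if not no_network:
        opts = opts & ~XML_PARSE_NONET
    return opts


for flag in (True, False):
    nonet_set = bool(options_xor(flag) & XML_PARSE_NONET)
    print(f"no_network={flag}: NONET bit set = {nonet_set}")
```

Starting from defaults that contain the bit, both helpers agree. They only differ if the starting options do not contain the bit: XOR would then switch it on, while the mask leaves it off.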
>
> Holger
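
As a complement to the catalog approach discussed in this thread, here is a minimal sketch of a pre-flight check that lists which imports in a schema point at the network, so a gap in the catalog can be spotted before any validation runs. It is not an lxml feature and does not disable networking; it uses only the standard library, inspects a single schema document (not its imports transitively), and does not consult any catalog. The function name remote_schema_locations and the example schema are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

XSD_NS = "http://www.w3.org/2001/XMLSchema"
REMOTE_SCHEMES = {"http", "https", "ftp"}


def remote_schema_locations(xsd_text):
    """Return schemaLocation values of xs:import/xs:include/xs:redefine
    elements that use a network URL scheme."""
    root = ET.fromstring(xsd_text)
    remote = []
    for tag in ("import", "include", "redefine"):
        for elem in root.iter(f"{{{XSD_NS}}}{tag}"):
            location = elem.get("schemaLocation")
            if location and urlparse(location).scheme.lower() in REMOTE_SCHEMES:
                remote.append(location)
    return remote


# Hypothetical schema: one remote import, one local include.
EXAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:import namespace="http://www.w3.org/XML/1998/namespace"
             schemaLocation="http://www.w3.org/2001/xml.xsd"/>
  <xs:include schemaLocation="local-types.xsd"/>
</xs:schema>"""

print(remote_schema_locations(EXAMPLE_XSD))
# prints ['http://www.w3.org/2001/xml.xsd']
```

Every URL this returns is one the catalog has to cover; a test could fail the build whenever the list contains an entry that is not in the catalog.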
_______________________________________________
lxml - The Python XML Toolkit mailing list -- lxml@python.org
To unsubscribe send an email to lxml-le...@python.org
https://mail.python.org/mailman3/lists/lxml.python.org/
Member address: arch...@mail-archive.com