I think that Launchpad ticket is what I need to understand the issue
better! Great :) I will look into it over the weekend.

I did try setting other XML catalogs, and I did manage to set up a catalog
and local files for my use case such that nothing is downloaded from the
Internet. So that's not my mission right now.

But in my first version I _thought_ I had changed everything so that
nothing was downloaded, when in fact two files were still being fetched
from w3.org. There was no noticeable delay, so everything seemed fine,
until a bunch of files were validated in succession. After around 20-60
successful validations, the rest would fail. W3C has some kind of
filter/firewall: if you download the resources in rapid succession (e.g.
roughly 1 per second, for 10-40 seconds), it starts rejecting requests.
It only takes 5-20 seconds for the firewall to forgive you and let you
download again, so I got random, intermittent failures.

That's why I want to _know_ that I have disabled networking, so that an
incorrectly set up catalog gives an error now, and not later.

The above happened with xmllint. With lxml I can load the schema once and
use it to validate hundreds of XML files, so I can easily sidestep the
W3C filter. But in any case, I would like to set up my lxml code so that
any attempt to download resources results in an error now, and not when
that resource is one day unavailable :)
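
Independent of lxml, a rough pre-flight check can at least list which
schemaLocation values in a schema point at the network, so they can be
compared against the catalog. This is only a sketch using the standard
library: the function name and the sample schema are made up for
illustration, it does not consult any catalog, and it does not follow
includes into other files.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

XS = "{http://www.w3.org/2001/XMLSchema}"
REMOTE_SCHEMES = {"http", "https", "ftp"}


def remote_schema_locations(xsd_text):
    """Return schemaLocation values of import/include/redefine that use a network scheme."""
    root = ET.fromstring(xsd_text)
    found = []
    for tag in ("import", "include", "redefine"):
        for elem in root.iter(XS + tag):
            location = elem.get("schemaLocation")
            if location and urlparse(location).scheme in REMOTE_SCHEMES:
                found.append(location)
    return found


# Hypothetical schema: one remote import, one local include.
SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:import namespace="http://www.w3.org/XML/1998/namespace"
             schemaLocation="http://www.w3.org/2001/xml.xsd"/>
  <xs:include schemaLocation="local-types.xsd"/>
</xs:schema>"""

print(remote_schema_locations(SAMPLE_XSD))
```

Every URL it reports is one that the catalog has to cover; anything
reported but not mapped is a candidate for a later download.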

Thanks for helping out!

On Mon, Mar 4, 2024 at 10:47 PM <holger.jo...@lbbw.de> wrote:

> (cc-ing the mailing list)
>
> > Thanks for all the feedback :) For now, I will stick to just one part of
> > your feedback.
> >
> > Consider your example:
> >
> >        if not no_network:
> >            parse_options = parse_options ^ xmlparser.XML_PARSE_NONET
> >
> > Won't that always "negate" the XML_PARSE_NONET bit? If 0, it will change
> > to 1. If 1, it will change to 0. Right? So using --no_network will always
> > pick the opposite of the default. Am I wrong?
>
> That's probably a bit unintuitive from the snippet I gave. It works like
> this: Since XML_PARSE_NONET is also part of
> _XML_DEFAULT_PARSE_OPTIONS, the XOR logic here switches it off again when
> no_network has been explicitly set to False.
>
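
The XOR behaviour discussed above can be checked with plain integers. This
is a minimal sketch, not lxml's actual code: XML_PARSE_NONET and
XML_PARSE_NOENT use libxml2's bit values, but DEFAULT_OPTIONS is a
simplified stand-in for _XML_DEFAULT_PARSE_OPTIONS, and both function names
are made up for illustration.

```python
XML_PARSE_NOENT = 1 << 1    # stands in for "the rest of the defaults"
XML_PARSE_NONET = 1 << 11   # forbid network access

# Simplified stand-in for the default option set, which includes NONET.
DEFAULT_OPTIONS = XML_PARSE_NOENT | XML_PARSE_NONET


def options_xor(no_network, defaults=DEFAULT_OPTIONS):
    """Mirror the quoted snippet: toggle NONET when no_network is False."""
    options = defaults
    if not no_network:
        options ^= XML_PARSE_NONET
    return options


def options_explicit(no_network, defaults=DEFAULT_OPTIONS):
    """Set or clear NONET explicitly, whatever the defaults contain."""
    if no_network:
        return defaults | XML_PARSE_NONET
    return defaults & ~XML_PARSE_NONET


for flag in (True, False):
    print(flag, bool(options_xor(flag) & XML_PARSE_NONET))
```

With NONET in the defaults, the XOR clears the bit only when no_network is
False, which is the intended result. It would flip the wrong way if the
defaults ever stopped including NONET; the explicit set/clear variant does
not depend on that.
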
>
> > And regarding the default behavior. When it comes to validating
> > according to an XML schema, the default is to download the XSDs that
> > are imported, at least on my system
> > (Ubuntu 22.04 with libxml 2.9.12 installed via APT). I tried with
> > xmllint, xmlstarlet and lxml. Perhaps the default is for
> > something other than downloading XSDs? I guess
> > there can be references / entities in the target XML document, and
> > those references will not be downloaded. Could that be it?
> >
> > Maybe the parser used for parsing XML Schemas is set up to ignore the
> > "normal default" and, in the case of lxml, also set up to ignore the
> > options set with no_network.
>
> It's been a long time since I experimented with includes/imports in XML
> Schemas in lxml. Can't really remember
> the workings but in my case it was rather the other way round, i.e. no
> (external/remote) network access and wanting
> to load included/imported schemas from a local catalog.
>
> Have you tried running with XML_CATALOG_FILES set to an empty value to
> suppress default catalog settings?
> E.g. XML_CATALOG_FILES= python myprog.py
>
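
The same shell idiom can be driven from Python when the validation runs as
a child process. A minimal sketch, assuming the variable should be empty
for the child only: the helper name is made up, and the child here just
echoes the variable instead of running a real myprog.py.

```python
import os
import subprocess
import sys


def run_without_default_catalogs(argv):
    """Run argv with XML_CATALOG_FILES set to an empty value for the child only."""
    env = dict(os.environ, XML_CATALOG_FILES="")
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)


# Stand-in child: report what it sees in its own environment.
child = "import os; print(repr(os.environ.get('XML_CATALOG_FILES')))"
result = run_without_default_catalogs([sys.executable, "-c", child])
print(result.stdout.strip())
```

Changing os.environ inside an already running process may come too late if
libxml2 has already initialised its catalogs, which is why this sketch sets
the variable before the child starts.
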
> I found this: https://bugs.launchpad.net/lxml/+bug/1234114
>
> So it seems like it is indeed not "simply" possible to suppress XMLSchema
> network access (but maybe through catalog setup:
> "[...] and external imports should always be covered by catalogues
> (otherwise, that's a configuration problem on the user side)[...]",
> see the issue conversation).
>
> Another thought might be custom URI resolvers but I don't know how they
> tie into XML Schema handling
> (https://lxml.de/resolvers.html#uri-resolvers).
>
> Holger
>
> _______________________________________________
> lxml - The Python XML Toolkit mailing list -- lxml@python.org
> To unsubscribe send an email to lxml-le...@python.org
> https://mail.python.org/mailman3/lists/lxml.python.org/
> Member address: mrve...@gmail.com
>