Hi all,

Thanks Stefan! I also looked into this, and it appears that input.buf (or
input->buf in C) is not getting set on either the lxml side or the libxml2
side. The call on line 487 of parser.pxi, c_input =
xmlparser.xmlNewInputStream(c_context), goes to a function in libxml2's
parserInternals.c that has been deprecated for at least 11 months. So
Stefan's fix probably just updates lxml to use the newer libxml2 API, which
*does* set buf.
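
If you want to check whether your own install has the combination discussed
in this thread, here is a rough sketch. The "affected" rule (lxml 6.0.0
together with libxml2 2.14 or later) is only my reading of Stefan's message
quoted below, not an official statement, and the helper names are made up
for illustration.

```python
def is_at_least(version, minimum):
    """Compare version tuples, e.g. (2, 14, 4) against (2, 14)."""
    return tuple(version[:len(minimum)]) >= tuple(minimum)


def possibly_affected(lxml_version, libxml_version):
    # Assumption taken from this thread: the problem shows up with
    # lxml 6.0.0 when libxml2 is 2.14 or later.
    return tuple(lxml_version[:3]) == (6, 0, 0) and is_at_least(
        libxml_version, (2, 14)
    )


def installed_versions():
    # Imported lazily so the helpers above also work without lxml installed.
    from lxml import etree

    return {
        "lxml": etree.LXML_VERSION,
        "libxml2 (runtime)": etree.LIBXML_VERSION,
        "libxml2 (compiled)": etree.LIBXML_COMPILED_VERSION,
    }
```

With lxml installed, possibly_affected(etree.LXML_VERSION,
etree.LIBXML_VERSION) gives a quick yes/no for the running interpreter.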

For those who want more details:
xmlNewInputStream was probably deprecated in favor of the new functions
whose names start with xmlNewInputFrom. The containing function in lxml,
_local_resolver, is passed into libxml2's xmlSetExternalEntityLoader in
_register_document_loader. xmlSetExternalEntityLoader itself replaces
xmlDefaultExternalEntityLoader with a custom callback. For reference,
xmlDefaultExternalEntityLoader *does* in fact set input->buf: if you follow
a few function calls down to xmlNewInputFromUrl, there is a call to
xmlParserInputBufferCreateUrl, which creates the buffer. However, the calls
in lxml and the deprecated function leave the buffer as NULL.
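
For context, the Python-level hook that ends up being called through
_local_resolver is a Resolver registered on the parser. Below is a rough
sketch of one that serves local copies of schemas instead of downloading
them. It does not reproduce the bug; it only shows where user resolvers plug
in. The cache layout and the names local_path_for and make_local_resolver
are hypothetical.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def local_path_for(url, cache_dir):
    """Map a schema URL to a file path inside cache_dir (made-up layout)."""
    parsed = urlparse(url)
    name = Path(parsed.path).name or "index"
    # A short hash keeps files with the same name but different URLs apart.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    return Path(cache_dir) / parsed.netloc / f"{digest}-{name}"


def make_local_resolver(cache_dir):
    """Build an lxml Resolver that answers only from local copies."""
    # Imported lazily so the path helper above needs only the stdlib.
    from lxml import etree

    class LocalCopyResolver(etree.Resolver):
        def resolve(self, system_url, public_id, context):
            if not system_url:
                return None
            path = local_path_for(system_url, cache_dir)
            if path.is_file():
                return self.resolve_filename(str(path), context)
            # Returning None lets the next resolver (or the default
            # loader) handle the request.
            return None

    return LocalCopyResolver()
```

Usage would be along the lines of parser = etree.XMLParser() followed by
parser.resolvers.add(make_local_resolver("schemas")), then parsing the
schema document with that parser.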

Best,
Abe

On Sun, Jul 6, 2025 at 3:59 AM Stefan Behnel via lxml - The Python XML
Toolkit <lxml@python.org> wrote:

> Stefan Behnel schrieb am 06.07.25 um 08:38:
> > Austin Matherne schrieb am 01.07.25 um 04:01:
> >> I’m upgrading a project from lxml 5.4.0 to the newly released lxml 6.0.0
> >> and encountering an unexpected XMLSchemaParseError. I’ve distilled the
> >> problem into a minimal, self-contained example and uploaded it as a
> >> GitHub gist:
> >>
> >> https://gist.github.com/AustinMatherne/533a4b6a31a63e11bfd8c09c03c05183
> >>
> >> * The same XML and XSD files parse and schema validate cleanly with lxml
> >> 5.4.0.
> >> * With lxml 6.0.0, calling XMLSchema() raises an XMLSchemaParseError with
> >> no obvious culprit.
> >>
> >> Is this a bug in libxml, lxml, or am I doing something unsupported with
> >> the API?
> >
> > So, I added a print(system_url) to your resolver and where the working
> > version downloads a whole pack of schema files transitively, the failing
> > version only gives the following output:
> >
> > """
> > READ http://www.w3.org/2001/xml.xsd
> > READ http://www.xbrl.org/2013/inlineXBRL/xhtml-inlinexbrl-1_1-modules.xsd
> > Traceback (most recent call last):
> >    File "/home/stefan/source/Python/lxml/lxml-hg/TEST/schema_error_ml_20250701/lxml.test.py", line 45, in <module>
> >      schema = etree.XMLSchema(schema_tree)
> >               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >    File "src/lxml/xmlschema.pxi", line 90, in lxml.etree.XMLSchema.__init__
> > lxml.etree.XMLSchemaParseError: Invalid argument, line 1, column 37
> > """
> >
> > First of all, I highly recommend setting up XML catalogues on your system
> > to avoid downloading the schemas over and over again. It's really a lot of
> > useless network back and forth, server usage, waiting time etc. going on
> > here that can be avoided entirely by installing local copies of the
> > schemas. libxml2 will search the usual system directories automatically
> > when asked to use a schema and thus avoid any network traffic.
> >
> > Then, it seems to fail immediately at the first included schema file, at a
> > suspicious position of 37 characters, which is right after the XML
> > declaration. That hints more at something going wrong in libxml2 than lxml
> > but is so surprisingly obviously not working that it's unlikely to go
> > undetected in libxml2 releases. I recommend bringing this to the attention
> > of the libxml2 developers.
>
> Actually, it *was* something that lxml can resolve on its own side. libxml2
> got a new API for passing data from resolvers into the parser and lxml
> didn't use that yet but had to resort to some manual setup that apparently
> no longer works in libxml2 2.14+.
>
> There is a test for this, so I'm not sure why it didn't fail when switching
> to libxml2 2.14, but in any case, I pushed a fix to the 6.0 branch that
> resolves it on my side:
>
>
> https://github.com/lxml/lxml/commit/2aae3a9625fcb858f83715a81b4d7182d2529a09
>
> I'll release a bug fix version soon.
>
> Stefan
>
> _______________________________________________
> lxml - The Python XML Toolkit mailing list -- lxml@python.org
> To unsubscribe send an email to lxml-le...@python.org
> https://mail.python.org/mailman3//lists/lxml.python.org
> Member address: abep...@gmail.com
>