Hi all,

Thanks Stefan! I also looked into this, and it appears that input.buf (input->buf in C) is not getting set on either the lxml side or the libxml2 side. The call on line 487 of parser.pxi, c_input = xmlparser.xmlNewInputStream(c_context), invokes a function in libxml2's parserInternals.c that has been deprecated for at least 11 months. So Stefan's fix probably just updates lxml to use the newer libxml2 API, which *does* set buf.
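
To make the code path concrete, here is a minimal sketch of a Python-level resolver that hands a string back to the parser. It is not taken from Austin's gist: the class name, the entity name, and the DTD URL are made up for illustration. As far as I can tell, a result returned this way is what lxml's resolver callback has to turn into a libxml2 parser input, which is where the buffer setup matters. The resolved text here deliberately has no XML declaration, so it only shows the mechanism and is not a reproduction of the failure.

```python
from io import StringIO

from lxml import etree


class RecordingResolver(etree.Resolver):
    """Hypothetical resolver: serves a DTD from memory and logs each request."""

    def __init__(self):
        super().__init__()
        self.requested = []

    def resolve(self, system_url, public_id, context):
        # Remember which URL the parser asked for.
        self.requested.append(system_url)
        # Return in-memory content instead of letting libxml2 open the URL.
        return self.resolve_string(
            '<!ENTITY greeting "hello from the resolver">', context
        )


resolver = RecordingResolver()
parser = etree.XMLParser(load_dtd=True)
parser.resolvers.add(resolver)

# The DOCTYPE points at a DTD that does not exist; the resolver supplies it.
xml = '<!DOCTYPE doc SYSTEM "hypothetical.dtd"><doc>&greeting;</doc>'
root = etree.parse(StringIO(xml), parser).getroot()

print(resolver.requested)
print(root.text)
```

Printing the requested URLs is the same trick Stefan used on the gist to see how far schema loading got before the error.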

For those who want more details: xmlNewInputStream was probably deprecated in favor of the new functions whose names start with xmlNewInputFrom. The containing function in lxml, _local_resolver, is passed to libxml2's xmlSetExternalEntityLoader in _register_document_loader. xmlSetExternalEntityLoader replaces xmlDefaultExternalEntityLoader with a custom callback. For reference, xmlDefaultExternalEntityLoader *does* in fact set input->buf: if you follow a few function calls down to xmlNewInputFromUrl, there is a call to xmlParserInputBufferCreateUrl, which creates the buffer. The calls in lxml and the deprecated function, however, leave the buffer as NULL.

Best,
Abe

On Sun, Jul 6, 2025 at 3:59 AM Stefan Behnel via lxml - The Python XML Toolkit <lxml@python.org> wrote:

> Stefan Behnel wrote on 06.07.25 at 08:38:
> > Austin Matherne wrote on 01.07.25 at 04:01:
> >> I’m upgrading a project from lxml 5.4.0 to the newly released lxml 6.0.0
> >> and encountering an unexpected XMLSchemaParseError. I’ve distilled the
> >> problem into a minimal, self-contained example and uploaded it as a
> >> GitHub gist:
> >>
> >> https://gist.github.com/AustinMatherne/533a4b6a31a63e11bfd8c09c03c05183
> >>
> >> * The same XML and XSD files parse and schema validate cleanly with lxml
> >>   5.4.0.
> >> * With lxml 6.0.0, calling XMLSchema() raises an XMLSchemaParseError with
> >>   no obvious culprit.
> >>
> >> Is this a bug in libxml, lxml, or am I doing something unsupported with
> >> the API?
> >
> > So, I added a print(system_url) to your resolver and where the working
> > version downloads a whole pack of schema files transitively, the failing
> > version only gives the following output:
> >
> > """
> > READ http://www.w3.org/2001/xml.xsd
> > READ http://www.xbrl.org/2013/inlineXBRL/xhtml-inlinexbrl-1_1-modules.xsd
> > Traceback (most recent call last):
> >   File "/home/stefan/source/Python/lxml/lxml-hg/TEST/schema_error_ml_20250701/lxml.test.py", line 45, in <module>
> >     schema = etree.XMLSchema(schema_tree)
> >              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >   File "src/lxml/xmlschema.pxi", line 90, in lxml.etree.XMLSchema.__init__
> > lxml.etree.XMLSchemaParseError: Invalid argument, line 1, column 37
> > """
> >
> > First of all, I highly recommend setting up XML catalogues on your system
> > to avoid downloading the schemas over and over again. It's really a lot of
> > useless network back and forth, server usage, waiting time etc. going on
> > here that can be avoided entirely by installing local copies of the
> > schemas. libxml2 will search the usual system directories automatically
> > when asked to use a schema and thus avoid any network traffic.
> >
> > Then, it seems to fail immediately at the first included schema file, at a
> > suspicious position of 37 characters, which is right after the XML
> > declaration. That hints more at something going wrong in libxml2 than lxml
> > but is so surprisingly obviously not working that it's unlikely to go
> > undetected in libxml2 releases. I recommend bringing this to the attention
> > of the libxml2 developers.
>
> Actually, it *was* something that lxml can resolve on its own side. libxml2
> got a new API for passing data from resolvers into the parser and lxml
> didn't use that yet but had to resort to some manual setup that apparently
> no longer works in libxml2 2.14+.
>
> There is a test for this, so I'm not sure why it didn't fail when switching
> to libxml2 2.14, but in any case, I pushed a fix to the 6.0 branch that
> resolves it on my side:
>
> https://github.com/lxml/lxml/commit/2aae3a9625fcb858f83715a81b4d7182d2529a09
>
> I'll release a bug fix version soon.
>
> Stefan
>
> _______________________________________________
> lxml - The Python XML Toolkit mailing list -- lxml@python.org
> To unsubscribe send an email to lxml-le...@python.org
> https://mail.python.org/mailman3//lists/lxml.python.org
> Member address: abep...@gmail.com