Hi,

Noorulamry Daud wrote on 13.02.25 at 12:28:
> I've been cracking my head about this performance issue I'm having and I
> could use some help.
>
> At my work we have to parse extremely large XML files - 20GB and even
> larger. The basic algorithm is as follows:
>
> with open(file, "rb") as reader:
>     context = etree.iterparse(reader, events=('start', 'end'))
>     for ev, el in context:
>         (processing)
>         el.clear()

I guess this is not your actual code but just meant as a rough sketch, right? Clearing the elements on both start and end events seems useless, and clearing them on start is probably outright dangerous, since the element is still being built by the parser at that point. You would at least want to clear them only on the "end" event and pass the "keep_tail=True" option to clear(), so that each element's tail text is preserved.

https://lxml.de/parsing.html#modifying-the-tree

https://lxml.de/parsing.html#incremental-event-parsing
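
For illustration, here is a minimal sketch of what clearing only on "end" events with keep_tail=True could look like. The tag name "record" and the function name count_records are made up for the example, and the counter stands in for the real processing. Deleting the already processed siblings is an extra step that is not mentioned above; it is a commonly used pattern and assumes that earlier siblings are no longer needed.

```python
import io

from lxml import etree


def count_records(source, tag="record"):
    """Stream through `source` and handle each <record> once it is complete."""
    count = 0
    # Listen for 'end' events only: an element is complete when its end tag is seen.
    for _event, el in etree.iterparse(source, events=("end",), tag=tag):
        count += 1  # placeholder for the real processing
        # Free the element's content but keep its tail text.
        el.clear(keep_tail=True)
        # Assumption: earlier siblings are no longer needed, so remove them
        # to keep the parent from accumulating empty elements.
        while el.getprevious() is not None:
            del el.getparent()[0]
    return count


sample = io.BytesIO(
    b"<root><record>a</record><record>b</record><record>c</record></root>"
)
print(count_records(sample))
```

For a real file you would pass the object returned by open(path, "rb") instead of the BytesIO sample.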


> In Python 2.7, the processing time for a 20GB XML file is approximately 40
> minutes.
>
> In Python 3.13, it's 7 hours, more than ten times as long as in Python 2.

Are you using the same versions of lxml (and libxml2) in both?

There shouldn't be a difference in behaviour, except for the obvious language differences (bytes/unicode).
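
A quick way to compare the versions is to print them in both environments. This is only a small sketch; the function name describe_versions is made up for the example, and it should run unchanged under Python 2.7 and Python 3.

```python
import sys

from lxml import etree


def describe_versions():
    """Collect the interpreter, lxml and libxml2 versions as tuples."""
    return {
        "python": tuple(sys.version_info[:3]),
        "lxml": etree.LXML_VERSION,
        # libxml2 version that is loaded at runtime
        "libxml2 (runtime)": etree.LIBXML_VERSION,
        # libxml2 version that lxml was compiled against
        "libxml2 (compiled)": etree.LIBXML_COMPILED_VERSION,
    }


for name, version in sorted(describe_versions().items()):
    print("%s: %s" % (name, ".".join(str(part) for part in version)))
```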

Does the memory consumption stay constant over time or does it continuously grow as it parses?

Have you run a memory profiler on your code? Or a (statistical) line profiler to see where the time is spent?
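
As a starting point, the standard library can give a rough picture of both. The sketch below uses cProfile (a deterministic function-level profiler, not a statistical line profiler) together with tracemalloc. The names profile_call and parse_file_stub are made up for the example, and the stub only stands in for the real parsing loop. Note that tracemalloc only sees allocations made through Python's own allocator, so memory held by libxml2 itself would probably need an OS-level tool instead.

```python
import cProfile
import io
import pstats
import tracemalloc


def profile_call(func, *args, top=10):
    """Run func(*args); return (result, peak traced bytes, profile report)."""
    tracemalloc.start()
    profiler = cProfile.Profile()
    try:
        result = profiler.runcall(func, *args)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return result, peak, out.getvalue()


def parse_file_stub(n):
    # Hypothetical stand-in for the real iterparse loop.
    return sum(len(str(i)) for i in range(n))


result, peak, report = profile_call(parse_file_stub, 10000)
print("peak traced memory: %d bytes" % peak)
print(report)
```

Running this on a smaller sample file (a few hundred MB, say) under both Python versions should already show whether the time goes into the parser or into the processing code.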

Stefan
