Hi,
Noorulamry Daud wrote on 13.02.25 at 12:28:
> I've been cracking my head about this performance issue I'm having and I
> could use some help.
>
> At my work we have to parse extremely large XML files - 20GB and even
> larger. The basic algorithm is as follows:
>
>     with open(file, "rb") as reader:
>         context = etree.iterparse(reader, events=('start', 'end'))
>         for ev, el in context:
>             (processing)
>             el.clear()
I take it this is not a literal copy of your code but just meant as a hint, right?
Clearing the elements on both start and end events seems useless, clearing
them on start is probably outright dangerous, etc. You would at least want
to pass the "keep_tail=True" option and clear them only at the end.
https://lxml.de/parsing.html#modifying-the-tree
https://lxml.de/parsing.html#incremental-event-parsing
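Something along these lines would be closer to what I mean - just a sketch, with the tag name "record" and the process() function as placeholders for whatever your code actually does:

    from lxml import etree

    def process(el):
        # placeholder for the real per-element work
        pass

    with open("huge.xml", "rb") as reader:
        # only request 'end' events; the element is fully built at that point
        for ev, el in etree.iterparse(reader, events=('end',), tag='record'):
            process(el)
            # drop the element's content but keep its tail text
            el.clear(keep_tail=True)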
> In Python 2.7, the processing time for a 20GB XML file is approximately 40
> minutes.
>
> In Python 3.13, it's 7 hours, more than ten times as long as in Python 2.
Are you using the same versions of lxml (and libxml2) in both?
There shouldn't be a difference in behaviour, except for the obvious
language differences (bytes/unicode).
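A quick way to check, assuming you can run a couple of lines in each environment:

    import sys
    from lxml import etree

    # interpreter version, lxml version, and the libxml2 version it runs against
    print(sys.version)
    print(etree.LXML_VERSION)
    print(etree.LIBXML_VERSION)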
Does the memory consumption stay constant over time or does it continuously
grow as it parses?
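A crude way to see that, assuming a Unix-like system, is to print the peak RSS from inside the parse loop every few hundred thousand elements:

    import resource

    # ru_maxrss is the peak resident set size (KiB on Linux, bytes on macOS)
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("peak RSS so far:", peak)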
Have you run a memory profiler on your code? Or a (statistical) line
profiler to see where the time is spent?
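If not, even the stdlib cProfile would be a start - here with a hypothetical main() entry point wrapping the parsing code:

    import cProfile
    import pstats

    # profile the run and dump the stats to a file
    cProfile.run("main()", "parse.prof")

    # show the 20 most expensive calls by cumulative time
    pstats.Stats("parse.prof").sort_stats("cumulative").print_stats(20)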
Stefan