Hi,
Noorulamry Daud wrote on 13.02.25 at 12:28:
> I've been cracking my head about this performance issue I'm having and I
> could use some help.
>
> At my work we have to parse extremely large XML files - 20GB and even
> larger. The basic algorithm is as follows:
>
>     with open(file, "rb") as reader:
>         context = etree.iterparse(reader, events=('start', 'end'))
>         for ev, el in context:
>             (processing)
>             el.clear()
I take it this is not a literal copy of your code but just meant as a hint, right?
Clearing the elements on both start and end events seems useless, clearing
them on start is probably outright dangerous, etc. You would at least want
to pass the "keep_tail=True" option and clear them only at the end.
https://lxml.de/parsing.html#modifying-the-tree
https://lxml.de/parsing.html#incremental-event-parsing
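Something along these lines would be closer to what I mean - just a sketch, with the tag name "record" and the process() function as placeholders for whatever your code actually does:

    from lxml import etree

    def process(el):
        # placeholder for the real per-element work
        pass

    with open("huge.xml", "rb") as reader:
        # only request 'end' events; the element is fully built at that point
        for ev, el in etree.iterparse(reader, events=('end',), tag='record'):
            process(el)
            # drop the element's content but keep its tail text
            el.clear(keep_tail=True)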
> In Python 2.7, the processing time for a 20GB XML file is approximately 40
> minutes.
>
> In Python 3.13, it's 7 hours, more than ten times as long as in Python 2.
Are you using the same versions of lxml (and libxml2) in both?
There shouldn't be a difference in behaviour, except for the obvious
language differences (bytes/unicode).
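A quick way to check, assuming you can run a couple of lines in each environment:

    import sys
    from lxml import etree

    # interpreter version, lxml version, and the libxml2 version it runs against
    print(sys.version)
    print(etree.LXML_VERSION)
    print(etree.LIBXML_VERSION)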
Does the memory consumption stay constant over time or does it continuously
grow as it parses?
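A crude way to see that, assuming a Unix-like system, is to print the peak RSS from inside the parse loop every few hundred thousand elements:

    import resource

    # ru_maxrss is the peak resident set size (KiB on Linux, bytes on macOS)
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("peak RSS so far:", peak)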
Have you run a memory profiler on your code? Or a (statistical) line
profiler to see where the time is spent?
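If not, even the stdlib cProfile would be a start - here with a hypothetical main() entry point wrapping the parsing code:

    import cProfile
    import pstats

    # profile the run and dump the stats to a file
    cProfile.run("main()", "parse.prof")

    # show the 20 most expensive calls by cumulative time
    pstats.Stats("parse.prof").sort_stats("cumulative").print_stats(20)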
Stefan