[lxml] Re: Performance issues when using element.clear() in Python 3.x

Stefan Behnel via lxml - The Python XML Toolkit Fri, 14 Feb 2025 02:19:59 -0800

Hi,

Noorulamry Daud schrieb am 14.02.25 um 09:56:

  Are you using the same versions of lxml (and libxml2) in both?


No, and that's what makes it so frustrating. I cannot tell management that
using the latest version of Python and lxml actually causes a significant
performance penalty. By rights using the latest versions should be at least
as good as, if not better, than the older version.


It should be. This seems more of a memory problem.

  Does the memory consumption stay constant over time or does it
  continuously grow as it parses?


It grows larger until it eventually crashes. My colleague expanded the page
file and managed to delay the crash, but it still happened eventually.

Then you're not cleaning up enough of the XML tree. Some of it remains in
memory after processing it, and thus leads to swapping and long waiting times.

Try to find out how the tree looks after a few iterations. You're
collecting "start" events, so grab the first returned element (that's the
root element) and print its tostring() after each ".clear()" call. That
should show you what data you're missing in the cleanup.
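
A minimal sketch of that check, assuming a made-up document with "record"
elements (the real data and tag names are not shown in this thread). It
falls back to the standard library's ElementTree if lxml is not installed,
purely so that the sketch runs anywhere:

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical sample input; replace with the real file.
SAMPLE = b"<root>" + b"".join(
    b"<record id='%d'><value>x</value></record>" % i for i in range(3)
) + b"</root>"


def leftover_after_clear(xml_bytes, tag="record"):
    """Clear each finished element and record what the root still holds."""
    context = etree.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
    _, root = next(context)  # the first "start" event delivers the root
    snapshots = []
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            elem.clear()
            # Serialise the root to see what is still kept in memory.
            snapshots.append(etree.tostring(root))
    return root, snapshots


root, snapshots = leftover_after_clear(SAMPLE)
for snapshot in snapshots:
    print(snapshot)
print("children still attached to the root:", len(root))
```

With clear() alone, the emptied elements stay attached to the root, so
the tree keeps growing by one empty element per record. If that is what the
output shows, one commonly used follow-up in lxml is to also delete the
already processed siblings from the parent, e.g. "del elem.getparent()[0]"
while "elem.getprevious()" is not None.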

  Have you run a memory profiler on your code? Or a (statistical) line
  profiler to see where the time is spent

I used Python's cProfile to find the bottlenecks, but unfortunately the
results weren't making sense. It identified which functions were taking the
most time, but when I did a line-by-line analysis the times didn't add up.

That's not unusual. Line profiling takes additional time *per line*, so the
results are often different from simple *per function* timings. Statistical
profilers are much better than cProfile here since they add less overhead.
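
For reference, the *per function* view looks roughly like this. This is
only a sketch: "parse_records" is a hypothetical stand-in for the real
parsing function, and the numbers it produces say nothing about lxml.

```python
import cProfile
import io
import pstats


def parse_records(n):
    """Hypothetical stand-in for the real parsing function."""
    return sum(len(str(i)) for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
parse_records(100_000)
profiler.disable()

# Collect the report in a string and show the five most expensive entries,
# sorted by cumulative time (time spent in a function and everything it calls).
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The totals here are attributed to whole functions, which is why they need
not match what a line-by-line run reports for the same code.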

Since commenting out the element.clear() lines did bring the result close
to Python 2.7's performance, the rest of the team decided that this is
where the issue is.


Sort-of, but probably for other reasons.

  the standard library's etree module is often significantly faster,


I have not considered that angle since what I can find on Google indicated
that lxml is the fastest; but I'll give this a try.

I can second that. It uses a different parser and in-memory model, so the
performance is different – better for some things, worse for others. Try it
to see where your code ends up. Note that the feature set is also very
different, though. lxml adds lots of functionality that "xml.etree" cannot
provide.
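
A rough way to try this, assuming the processing loop only relies on calls
that both modules provide (iterparse() and clear() here). The sample
document and the "record" tag are made up for illustration, and the timings
depend entirely on the real data:

```python
import io
import time
import xml.etree.ElementTree as std_etree

try:
    from lxml import etree as lxml_etree
except ImportError:  # lxml not installed; only the stdlib parser gets timed
    lxml_etree = None

# Hypothetical test input, not the data from this thread.
SAMPLE = b"<root>" + b"<record><value>x</value></record>" * 10000 + b"</root>"


def count_records(etree_module, xml_bytes, tag="record"):
    """Stream through the document, clearing each finished record."""
    count = 0
    for _, elem in etree_module.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == tag:
            count += 1
            elem.clear()
    return count


def timed(etree_module, xml_bytes):
    """Return (record count, elapsed seconds) for one parsing run."""
    start = time.perf_counter()
    count = count_records(etree_module, xml_bytes)
    return count, time.perf_counter() - start


for name, module in (("xml.etree", std_etree), ("lxml", lxml_etree)):
    if module is not None:
        count, seconds = timed(module, SAMPLE)
        print(f"{name}: {count} records in {seconds:.3f}s")
```

Passing the module in as a parameter keeps the loop identical for both
parsers, so any difference in the timings comes from the library rather
than from the surrounding code.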


Stefan

