Hi,

Noorulamry Daud wrote on 14.02.25 at 09:56:
  Are you using the same versions of lxml (and libxml2) in both?

No, and that's what makes it so frustrating. I cannot tell management that
using the latest version of Python and lxml actually causes a significant
performance penalty. By rights, the latest versions should be at least
as good as the older ones, if not better.

It should be. This seems more of a memory problem.


  Does the memory consumption stay constant over time or does it
  continuously grow as it parses?

It grows until the process eventually crashes. My colleague enlarged the page
file, which delayed the crash but did not prevent it.

Then you're not cleaning up enough of the XML tree. Some of it remains in memory after processing it, and thus leads to swapping and long waiting times.

Try to find out how the tree looks after a few iterations. You're collecting "start" events, so grab the first returned element (that's the root element) and print its tostring() after each ".clear()" call. That should show you what data you're missing in the cleanup.
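
A minimal sketch of that check, assuming the records are elements named "record" (a made-up tag, as are the sample document and the function name) and that already-processed siblings can be dropped from their parent. It is not your actual code, just an illustration of the idea:

```python
import io
from lxml import etree

# Hypothetical sample input; the "record" tag is an assumption.
SAMPLE = (
    "<root>"
    + "".join(f"<record id='{i}'><value>x</value></record>" for i in range(5))
    + "</root>"
).encode()


def parse_and_inspect(source, tag="record", report_every=2):
    """Stream-parse `source`, clean up processed elements, and take
    serialised snapshots of the root to see what stays in memory."""
    root = None
    processed = 0
    snapshots = []
    for event, elem in etree.iterparse(source, events=("start", "end")):
        if root is None:
            root = elem  # the first "start" event delivers the root element
        if event == "end" and elem.tag == tag:
            processed += 1
            elem.clear()
            # clear() empties the element but leaves it attached to its
            # parent, so also delete the siblings that came before it.
            while elem.getprevious() is not None:
                del elem.getparent()[0]
            if processed % report_every == 0:
                snapshots.append(etree.tostring(root))
    return root, snapshots


if __name__ == "__main__":
    root, snapshots = parse_and_inspect(io.BytesIO(SAMPLE))
    for snapshot in snapshots:
        print(snapshot)
```

If the printed tree keeps growing between snapshots, whatever accumulates there is what the cleanup is missing.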


Have you run a memory profiler on your code? Or a (statistical) line
profiler to see where the time is spent?

I used Python's cProfile to find the bottlenecks, but unfortunately the
results weren't making sense. It identified which functions were taking the
most time, but when I did a line-by-line analysis the times didn't add up.

That's not unusual. Line profiling takes additional time *per line*, so the results are often different from simple *per function* timings. Statistical profilers are much better than cProfile here since they add less overhead.
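
For the per-function view, a small sketch using only the standard library. The helper name "profile_call" is made up for illustration; the absolute numbers it reports include cProfile's own overhead, so treat them as relative only:

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func(*args, **kwargs) under cProfile and return the result
    together with a text report sorted by cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


if __name__ == "__main__":
    # Stand-in workload; replace it with the real parsing function.
    result, report = profile_call(sum, range(1000))
    print(report)
```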


Since commenting out the element.clear() lines did bring the result close
to Python 2.7's performance, the rest of the team decided that this is
where the issue is.

Sort of, but probably for other reasons.


the standard library's etree module is often significantly faster,

I have not considered that angle since what I can find on Google indicated
that lxml is the fastest; but I'll give this a try.

I can second that. It uses a different parser and in-memory model, so the performance is different – better for some things, worse for others. Try it to see where your code ends up. Note that the feature set is also very different, though. lxml adds lots of functionality that "xml.etree" cannot provide.
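
A rough sketch of the same kind of streaming loop with the standard library, again assuming a made-up "record" tag and assuming that the records are direct children of the root. "xml.etree" elements have no getparent() or getprevious(), so the cleanup has to go through a reference to the parent that you keep yourself:

```python
import io
import xml.etree.ElementTree as ET


def count_records(source, tag="record"):
    """Stream-parse `source` with xml.etree and drop each record from
    the root once it has been processed."""
    root = None
    count = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if root is None:
            root = elem  # the first "start" event delivers the root element
        if event == "end" and elem.tag == tag:
            count += 1
            # Assumption: records sit directly below the root, so they
            # can be removed from it as soon as they are done.
            root.remove(elem)
    return count, len(root)


if __name__ == "__main__":
    data = io.BytesIO(b"<root><record/><record/><record/></root>")
    print(count_records(data))
```

Timing this against the lxml version on a realistic input file is the only way to tell which one wins for your data.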

Stefan

_______________________________________________
lxml - The Python XML Toolkit mailing list -- lxml@python.org
To unsubscribe send an email to lxml-le...@python.org
https://mail.python.org/mailman3/lists/lxml.python.org/
Member address: arch...@mail-archive.com