Hi,
Noorulamry Daud wrote on 14.02.25 at 09:56:
Are you using the same versions of lxml (and libxml2) in both?
No, and that's what makes it so frustrating. I cannot tell management that
using the latest version of Python and lxml actually causes a significant
performance penalty. By rights, using the latest versions should be at least
as good as, if not better than, the older versions.
It should be. This seems more of a memory problem.
Does the memory consumption stay constant over time or does it
continuously grow as it parses?
It grows larger until it eventually crashes. My colleague expanded the page
file and managed to delay the crash, but it still happened eventually.
Then you're not cleaning up enough of the XML tree. Some of it remains in
memory after processing it, and thus leads to swapping and long waiting times.
Try to find out how the tree looks after a few iterations. You're
collecting "start" events, so grab the first returned element (that's the
root element) and print its tostring() after each ".clear()" call. That
should show you what data you're missing in the cleanup.
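
As a rough sketch of that check (not your actual code): the input layout, the
tag names "record" and "value", and the helpers make_sample() and parse() are
all made up for illustration. It assumes the records are direct children of
the root element, and it falls back to the standard library if lxml is not
installed.

```python
import io

try:
    from lxml import etree  # assumption: lxml is installed
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same calls used below


def make_sample(n):
    """Build a small in-memory document with n hypothetical <record> elements."""
    records = "".join(
        f"<record id='{i}'><value>{i * i}</value></record>" for i in range(n)
    )
    return io.BytesIO(f"<root>{records}</root>".encode("utf-8"))


def parse(source, drop_siblings, show=False):
    """Sum the <value> fields and report how many children the root still holds."""
    context = etree.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first "start" event delivers the root element
    total = 0
    for event, elem in context:
        if event != "end" or elem.tag != "record":
            continue
        total += int(elem.findtext("value"))
        elem.clear()  # drops children, text and attributes, but not the element itself
        if drop_siblings:
            # Also delete already processed records that precede the current one.
            while len(root) and root[0] is not elem:
                del root[0]
        if show:
            print(etree.tostring(root))  # inspect what is still attached to the root
    return total, len(root)


if __name__ == "__main__":
    print(parse(make_sample(5), drop_siblings=False))
    print(parse(make_sample(5), drop_siblings=True))
```

With drop_siblings=False, the root keeps one emptied <record/> shell per
processed record, which is the kind of leftover that accumulates on a large
input. With drop_siblings=True, only the record currently being processed
stays attached. Pass show=True to see the serialised root after each clear().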
Have you run a memory profiler on your code? Or a (statistical) line
profiler to see where the time is spent?
I used Python's cProfile to find the bottlenecks, but unfortunately the
results weren't making sense. It identified which functions were taking the
most time, but when I did a line-by-line analysis the times didn't add up.
That's not unusual. Line profiling takes additional time *per line*, so the
results are often different from simple *per function* timings. Statistical
profilers are much better than cProfile here since they add less overhead.
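
For reference, a minimal sketch of function-level profiling with cProfile and
pstats from the standard library. The busy() workload and the profile_call()
helper are placeholders invented for this example. The report gives
per-function totals only, so it will not line up with per-line measurements.

```python
import cProfile
import io
import pstats


def busy(n):
    """Placeholder workload; stands in for the real parsing function."""
    return sum(i * i for i in range(n))


def profile_call(func, *args):
    """Run func under cProfile and return its result plus a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(10)  # top 10 by cumulative time
    return result, out.getvalue()


if __name__ == "__main__":
    result, report = profile_call(busy, 100_000)
    print(report)
```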
Since commenting out the element.clear() lines did bring the result close
to Python 2.7's performance, the rest of the team decided that this is
where the issue is.
Sort of, but probably for other reasons.
the standard library's etree module is often significantly faster,
I have not considered that angle since what I can find on Google indicated
that lxml is the fastest; but I'll give this a try.
I can second that. It uses a different parser and in-memory model, so the
performance is different – better for some things, worse for others. Try it
to see where your code ends up. Note that the feature set is also very
different, though. lxml adds lots of functionality that "xml.etree" cannot
provide.
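
A quick way to try this is to run the same iterparse loop against both
modules and time it. The following is only a sketch: count_records() and
compare() are hypothetical helpers, the "record" tag is made up, and lxml is
skipped if it is not installed. Timings on a toy input say little, so feed it
a realistic file before drawing conclusions.

```python
import io
import time
import xml.etree.ElementTree as stdlib_etree

try:
    from lxml import etree as lxml_etree  # assumption: lxml may or may not be installed
except ImportError:
    lxml_etree = None


def count_records(etree_module, data):
    """Count hypothetical <record> elements using the given etree implementation."""
    count = 0
    for _, elem in etree_module.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "record":
            count += 1
            elem.clear()
    return count


def compare(data):
    """Return {implementation name: (record count, seconds)} for each available module."""
    results = {}
    for name, module in (("xml.etree", stdlib_etree), ("lxml", lxml_etree)):
        if module is None:
            continue
        start = time.perf_counter()
        count = count_records(module, data)
        results[name] = (count, time.perf_counter() - start)
    return results


if __name__ == "__main__":
    sample = b"<root>" + b"<record><v>1</v></record>" * 100_000 + b"</root>"
    for name, (count, seconds) in compare(sample).items():
        print(f"{name}: {count} records in {seconds:.3f}s")
```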
Stefan