Hi everyone,

Thank you for your replies.

> I guess this is not a look-alike example but just meant as a hint, right?

Yes. My workplace is very protective of its source code, so I am only
allowed to sketch out a rough approximation.

The start events are used during some of the processing, and, as you
mentioned, element.clear() is not invoked on them.
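
Roughly, the loop has the following shape. handle() and the "record"
tag are placeholders, and the sibling-deletion idiom at the end is the
one from the lxml docs, included for completeness; so please read this
as a sketch of the pattern rather than our actual code:

    from lxml import etree

    def handle(element):
        # stand-in for the real per-record processing
        pass

    def process(path):
        for event, element in etree.iterparse(path, events=("start", "end")):
            if event == "start":
                # start events only drive some bookkeeping; no clear() here
                continue
            if element.tag == "record":
                handle(element)
                element.clear()  # the line we later commented out
                # lxml-docs idiom: drop already-processed siblings so the
                # partially built tree stops growing
                while element.getprevious() is not None:
                    del element.getparent()[0]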

>  Are you using the same versions of lxml (and libxml2) in both?

No, and that's what makes it so frustrating. I cannot tell management that
using the latest versions of Python and lxml actually incurs a significant
performance penalty. By rights, the latest versions should be at least as
good as, if not better than, the older ones.

> Does the memory consumption stay constant over time or does it
> continuously grow as it parses?

It grows until the process eventually crashes. My colleague expanded the
page file and managed to delay the crash, but it still happened eventually.
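
For anyone who wants to watch the growth from inside the process, the
simplest thing I know of is sampling the RSS, e.g. with psutil; note
that tracemalloc alone won't see libxml2's C-level allocations. This is
a sketch, not our actual instrumentation:

    import psutil  # third-party, but handy for whole-process numbers

    proc = psutil.Process()  # defaults to the current process

    def log_rss(label=""):
        # call this every N records inside the parse loop
        print(f"{label} rss={proc.memory_info().rss / 1e6:.1f} MB")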

> Have you run a memory profiler on your code? Or a (statistical) line
> profiler to see where the time is spent

I used Python's cProfile to find the bottlenecks, but unfortunately the
results didn't make sense. It identified which functions were taking the
most time, but when I did a line-by-line analysis the times didn't add up.
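
For completeness, the numbers came from something along these lines,
with process() standing in for our real entry point as in the sketch
above:

    import cProfile
    import pstats

    cProfile.run("process('big.xml')", "parse.prof")
    pstats.Stats("parse.prof").sort_stats("cumulative").print_stats(20)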

Since commenting out the element.clear() lines did bring the result close
to Python 2.7's performance, the rest of the team concluded that this is
where the issue lies.
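
The comparison itself needs nothing fancy; one timed pass per variant
is enough, toggling the element.clear() line in process() between runs
(again a sketch, not our harness):

    import timeit

    # number=1 because each full parse already takes a long time
    elapsed = timeit.timeit(lambda: process("big.xml"), number=1)
    print(f"full parse: {elapsed:.1f} s")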

> the standard library's etree module is often significantly faster,

I had not considered that angle, since everything I could find on Google
indicated that lxml is the fastest; but I'll give it a try.
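
If I read the iterparse APIs correctly, the swap should be mostly the
import line; note the stdlib tree has no getparent()/getprevious(), so
the sibling-deletion idiom above doesn't carry over:

    import xml.etree.ElementTree as ET

    def process_stdlib(path):
        for event, element in ET.iterparse(path, events=("start", "end")):
            if event == "end" and element.tag == "record":
                handle(element)  # same placeholder handler as before
                element.clear()  # stdlib clear() also frees the children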



On Fri, 14 Feb 2025 at 00:21, Charlie Clark <
charlie.cl...@clark-consulting.eu> wrote:

> On 13 Feb 2025, at 15:18, Stefan Behnel via lxml - The Python XML Toolkit
> wrote:
>
>
> > Are you using the same versions of lxml (and libxml2) in both?
> >
> > There shouldn't be a difference in behaviour, except for the obvious
> > language differences (bytes/unicode).
>
> Based on the parsing code we use in Openpyxl, I'd agree with this. NB., we
> discovered that, for pure parsing, ie. you just want to get at the data,
> the standard library's etree module is often significantly faster, but YMMV.
>
> > Does the memory consumption stay constant over time or does it
> > continuously grow as it parses?
> >
> > Have you run a memory profiler on your code? Or a (statistical) line
> > profiler to see where the time is spent
>
> Excellent suggestions: memory_profiler and pympler are useful tools for
> this.
>
> Charlie
>
> --
> Charlie Clark
> Managing Director
> Clark Consulting & Research
> German Office
> Sengelsweg 34
> Düsseldorf
> D- 40489
> Tel: +49-203-3925-0390
> Mobile: +49-178-782-6226