[lxml] Re: Performance issues when using element.clear() in Python 3.x

Noorulamry Daud via lxml - The Python XML Toolkit Mon, 03 Mar 2025 22:18:01 -0800

I see.

Then what I'm wondering is whether the "clearing the elements" step works
differently under Python 2.7 than under Python 3.8+. We run the same code
on both versions, yet the execution times differ vastly, and we have
narrowed the slowdown down to the clear() method. So what is the secret of
Py2.7's performance?


Out of desperation my manager is suggesting that we use Python's native xml
processing instead of lxml but I feel like I've lost the battle if I do
that.

On Mon, 3 Mar 2025, 16:04 Xavier Morel via lxml - The Python XML Toolkit
<lxml@python.org> wrote:

> You're clearing the subelements, the attributes, the text, and the tail
>
> https://github.com/lxml/lxml/blob/0eb4f0029497957e58a9f15280b3529bdb18d117/src/lxml/etree.pyx#L1008-L1038
>
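To make the four things listed above concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without lxml installed; the sample document and the variable names are made up for illustration, and lxml's own clear() is the method linked above.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in so the sketch runs without lxml

# Hypothetical sample: <item> has an attribute, text, one child and a tail.
root = ET.fromstring('<root><item id="1">text<child/></item>tail</root>')
item = root[0]

before = (len(item), dict(item.attrib), item.text, item.tail)
item.clear()
after = (len(item), dict(item.attrib), item.text, item.tail)

print(before)
print(after)
# The emptied element is still attached to its parent.
print(len(root))
```

Note that clear() empties the element but does not detach it from its parent, so an emptied element still occupies a slot in the tree.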
> By default sys.getsizeof only measures the "intrinsic" size of an
> object, it does not traverse pointers unless the object specifically
> overrode `__sizeof__` to expose this information.
>
> lxml does not, it always returns 56. That's trivially testable, just
> create an empty element, check its size, it's 56, then add a bunch of
> children, check the root element's size, still 56.
>
> An lxml element would be 56 bytes because (w/ GIL and GC) a Python
> object has a baseline of 4*8 = 32 bytes (class pointer, refcount, prev
> and next gc pointers) and to that it adds 3*8 = 24 bytes for pointers to
> the internal _Document, to the libxml2 xmlNode, and to the cached tag
> string (in Clark notation).
>
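The "intrinsic size only" behaviour of sys.getsizeof can be seen without lxml at all. The sketch below uses a hypothetical Proxy class with three pointer-sized slots as a loose analogy for an element proxy; the class and its field names are assumptions for illustration, not lxml's actual layout.

```python
import sys


class Proxy:
    """Hypothetical stand-in for an element proxy: three pointer-sized slots."""
    __slots__ = ("doc", "node", "tag")

    def __init__(self, doc, node, tag):
        self.doc = doc
        self.node = node
        self.tag = tag


children = []  # plays the role of the tree data that lives behind the proxy
root = Proxy(doc=None, node=children, tag="root")

size_empty = sys.getsizeof(root)
children.extend(str(i) * 100 for i in range(10_000))  # grow the referenced data
size_full = sys.getsizeof(root)

# The proxy's own size is unchanged; only the data it points to has grown.
print(size_empty, size_full, sys.getsizeof(children))
```

Measuring what clear() actually frees therefore needs a process-level tool (such as the memory_profiler approach quoted further down), not sys.getsizeof on the element.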
> On 3/03/25 07:41, Noorulamry Daud via lxml - The Python XML Toolkit wrote:
> > Hi everyone,
> >
> > Despite the brief respite, the issue my team is having with
> > element.clear() persists. I honestly have no idea why lxml 2.2.3 can do
> > it instantly while the latest version takes ages.
> >
> > I do wonder about something though; I used sys.getsizeof to see the size
> > of the elements before and after clear, but to my surprise the size
> > remained constant at 56 bytes. In that case what are we clearing?
> >
> >
> > On Fri, 14 Feb 2025, 21:08 Charlie Clark,
> > <charlie.cl...@clark-consulting.eu> wrote:
> >
> >     On 14 Feb 2025, at 11:12, Stefan Behnel via lxml - The Python XML
> >     Toolkit wrote:
> >
> >         Then you're not cleaning up enough of the XML tree. Some of it
> >         remains in memory after processing it, and thus leads to
> >         swapping and long waiting times.
> >
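As an illustration of "cleaning up enough of the tree", here is a minimal sketch of one common streaming pattern. It is not the original poster's code: the function name count_records and the "record" tag are hypothetical, records are assumed to be direct children of the root, and the standard library's xml.etree.ElementTree is used so the sketch runs without lxml (lxml.etree also provides iterparse() and clear()).

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in so the sketch runs without lxml


def count_records(source, tag="record"):
    """Stream-parse `source`, releasing each record once it has been handled.

    Returns the number of records seen and the number of children still
    attached to the root when parsing ends.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the "start" of the root element
    seen = 0
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            seen += 1      # real per-record processing would go here
            root.clear()   # detach finished records from the root, not just empty them
    return seen, len(root)


data = b"<root><record>a</record><record>b</record><record>c</record></root>"
print(count_records(io.BytesIO(data)))
```

Calling elem.clear() alone would empty each record but leave it attached to the root, so the list of emptied elements would keep growing over a large file; clearing from the root is one way to avoid that.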
> >     It's definitely a memory issue. You can write some code to catch
> >     memory use quickly. This is something we wrote for openpyxl while we
> >     were trying to "contain" memory use:
> >
> >     import os
> >     import openpyxl
> >     from memory_profiler import memory_usage
> >
> >
> >     def test_memory_use():
> >         """Naive test that assumes memory use will never be more than
> >         120 % of that for first 50 rows"""
> >         folder = os.path.split(__file__)[0]
> >         src = os.path.join(folder, "files", "very_large.xlsx")
> >         wb = openpyxl.load_workbook(src, read_only=True)
> >         ws = wb.active
> >
> >         initial_use = None
> >
> >         for n, line in enumerate(ws.iter_rows(values_only=True)):
> >             if n % 50 == 0:
> >                 use = memory_usage(proc=-1, interval=1)[0]
> >                 if initial_use is None:
> >                     initial_use = use
> >                 assert use/initial_use < 1.2
> >                 print(n, use)
> >
> >
> >     if __name__ == '__main__':
> >         test_memory_use()
> >
> >     You should be able to adapt this for your parser and it'll tell you
> >     soon enough how far in you get before your memory use balloons. If
> >     memory serves I had one problem where I was clearing in the wrong
> >     place, which meant that other elements were sticking around. Thanks
> >     to Stefan for helping me sort it. I think your code may be too
> >     aggressive. It might help to look at the openpyxl worksheet parser
> >     which has to handle what happens if you do additional processing
> >     within nodes.
> >
> >     Charlie
> >
> >     --
> >     Charlie Clark
> >     Managing Director
> >     Clark Consulting & Research
> >     German Office
> >     Sengelsweg 34
> >     Düsseldorf
> >     D- 40489
> >     Tel: +49-203-3925-0390
> >     Mobile: +49-178-782-6226
> >
> >     _______________________________________________
> >     lxml - The Python XML Toolkit mailing list -- lxml@python.org
> >     <mailto:lxml@python.org>
> >     To unsubscribe send an email to lxml-le...@python.org
> >     <mailto:lxml-le...@python.org>
> >     https://mail.python.org/mailman3/lists/lxml.python.org/
> >     <https://mail.python.org/mailman3/lists/lxml.python.org/>
> >     Member address: noorulamry.d...@gmail.com
> >     <mailto:noorulamry.d...@gmail.com>
> >

