[lxml] Re: Performance issues when using element.clear() in Python 3.x

Xavier Morel via lxml - The Python XML Toolkit Mon, 03 Mar 2025 00:05:44 -0800

You're clearing the subelements, the attributes, the text, and the tail:
https://github.com/lxml/lxml/blob/0eb4f0029497957e58a9f15280b3529bdb18d117/src/lxml/etree.pyx#L1008-L1038
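
To make those four things concrete, here is a minimal sketch. It uses the
standard library's xml.etree.ElementTree as a stand-in so that it runs
without lxml installed (that substitution is an assumption of this example;
its Element.clear() is documented to reset the same four things). The sample
XML and the describe() helper are made up for illustration.

```python
import xml.etree.ElementTree as ET


def describe(elem):
    """Return what clear() is expected to reset: children, attributes, text, tail."""
    return (len(elem), dict(elem.attrib), elem.text, elem.tail)


# Hypothetical sample document: <item> has one child, one attribute, text and a tail.
root = ET.fromstring('<root><item id="1">some text<child/></item>some tail</root>')
item = root[0]

before = describe(item)
item.clear()
after = describe(item)

print("before:", before)
print("after: ", after)
# The cleared element itself is still attached to its parent.
print("still in root:", len(root) == 1)
```

Note that the (now empty) element stays in its parent, so clearing alone does
not remove it from the tree.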

By default sys.getsizeof only measures the "intrinsic" size of an
object; it does not traverse pointers unless the object specifically
overrides `__sizeof__` to expose this information.

lxml does not, so it always returns 56. That's trivially testable: just
create an empty element and check its size (56), then add a bunch of
children and check the root element's size again (still 56).
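
The same shallow behaviour can be seen without lxml. The sketch below uses a
hypothetical Node class (not an lxml type) that, like an element proxy, only
holds a fixed set of pointers; the exact byte counts depend on the Python
build, so only the "unchanged" part is the point.

```python
import sys


class Node:
    """Hypothetical stand-in for an element proxy: a fixed set of pointers."""

    __slots__ = ("children",)

    def __init__(self):
        self.children = []


root = Node()
size_empty = sys.getsizeof(root)

# Attach a large number of children through the pointer held by the object.
root.children.extend(Node() for _ in range(10_000))
size_full = sys.getsizeof(root)

# getsizeof does not follow the pointer, so both numbers are the same.
print(size_empty, size_full)
# The list it points to did grow, but that is a separate object.
print(sys.getsizeof(root.children))
```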

An lxml element would be 56 bytes because (w/ GIL and GC) a Python
object has a baseline of 4*8 = 32 bytes (class pointer, refcount, prev
and next gc pointers) and to that it adds 3*8 = 24 bytes for pointers to
the internal _Document, to the libxml2 xmlNode, and to the cached tag
string (in Clark's notation).
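
That arithmetic, written out as a small sketch. The function name and its
parameter are made up for illustration, and the field breakdown is simply
the one described above, assuming 8-byte pointers on a 64-bit build.

```python
import struct

# Size of a C pointer on this interpreter (8 on a typical 64-bit build).
POINTER_SIZE = struct.calcsize("P")


def element_sizeof(pointer_size=POINTER_SIZE):
    """Add up the pointer-sized fields described above (illustrative only)."""
    # Baseline: class pointer, refcount, prev and next gc pointers.
    baseline = 4 * pointer_size
    # Element fields: _Document, libxml2 xmlNode, cached tag string.
    fields = 3 * pointer_size
    return baseline + fields


print(element_sizeof(8))
```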


On 3/03/25 07:41, Noorulamry Daud via lxml - The Python XML Toolkit wrote:

Hi everyone,

Despite the brief respite, the issue my team is having with
element.clear() persists. I honestly have no idea why lxml 2.2.3 can do
it instantly while the latest version takes ages.

I do wonder about something though; I used sys.getsizeof to see the size
of the elements before and after clear, but to my surprise the size
remained constant at 56 bytes. In that case what are we clearing?

On Fri, 14 Feb 2025, 21:08 Charlie Clark <charlie.cl...@clark-consulting.eu> wrote:



    On 14 Feb 2025, at 11:12, Stefan Behnel via lxml - The Python XML
    Toolkit wrote:

        Then you're not cleaning up enough of the XML tree. Some of it
        remains in memory after processing it, and thus leads to
        swapping and long waiting times.

    It's definitely a memory issue. You can write some code to catch
    memory use quickly. This is something we wrote for openpyxl while we
    were trying to "contain" memory use:

    import os

    import openpyxl
    from memory_profiler import memory_usage


    def test_memory_use():
        """Naive test that assumes memory use will never be more than
        120 % of that for first 50 rows"""
        folder = os.path.split(__file__)[0]
        src = os.path.join(folder, "files", "very_large.xlsx")
        wb = openpyxl.load_workbook(src, read_only=True)
        ws = wb.active

        initial_use = None

        for n, line in enumerate(ws.iter_rows(values_only=True)):
            if n % 50 == 0:
                use = memory_usage(proc=-1, interval=1)[0]
                if initial_use is None:
                    initial_use = use
                assert use / initial_use < 1.2
                print(n, use)


    if __name__ == '__main__':
        test_memory_use()

    You should be able to adapt this for your parser and it'll tell you
    soon enough how far in you get before your memory use balloons. If
    memory serves I had one problem where I was clearing in the wrong
    place, which meant that other elements were sticking around. Thanks
    to Stefan for helping me sort it. I think your code may be too
    aggressive. It might help to look at the Openpyxl worksheet parser
    which has to handle what happens if you do additional processing
    within nodes.
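
A rough sketch of that kind of adaptation, under several assumptions: it uses
the standard library's xml.etree.ElementTree.iterparse and tracemalloc rather
than lxml and memory_profiler (so that it runs with no third-party packages),
and the XML layout and function names are made up. It compares peak traced
memory when each finished row is cleared against not clearing at all. For
lxml itself a process-level measure such as memory_profiler is likely the
better tool, since tracemalloc only sees memory allocated through Python's
allocator and generally would not account for what a C library like libxml2
allocates.

```python
import io
import tracemalloc
import xml.etree.ElementTree as ET


def make_xml(n_rows):
    """Build a hypothetical document: <sheet><row><c>0</c></row>...</sheet>."""
    rows = "".join(f"<row><c>{i}</c></row>" for i in range(n_rows))
    return io.BytesIO(f"<sheet>{rows}</sheet>".encode())


def peak_while_parsing(n_rows, clear):
    """Parse incrementally; return (sum of cell values, peak traced bytes)."""
    source = make_xml(n_rows)  # built before tracing starts
    tracemalloc.start()
    total = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            # Do the processing first, then clear the finished row.
            total += int(elem[0].text)
            if clear:
                elem.clear()
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return total, peak


total_kept, peak_kept = peak_while_parsing(5000, clear=False)
total_cleared, peak_cleared = peak_while_parsing(5000, clear=True)
print("peak without clear:", peak_kept)
print("peak with clear:   ", peak_cleared)
```

The clear happens only on the "end" event of a complete row and only after
its data has been read; clearing earlier, or clearing an element whose
content is still needed, is the "wrong place" kind of mistake described
above. The emptied rows are still attached to the root here, so memory is
reduced rather than flat.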

    Charlie

    --
    Charlie Clark
    Managing Director
    Clark Consulting & Research
    German Office
    Sengelsweg 34
    Düsseldorf
    D- 40489
    Tel: +49-203-3925-0390
    Mobile: +49-178-782-6226

    _______________________________________________
    lxml - The Python XML Toolkit mailing list -- lxml@python.org
    To unsubscribe send an email to lxml-le...@python.org
    https://mail.python.org/mailman3/lists/lxml.python.org/
    Member address: noorulamry.d...@gmail.com

